Word-Level Language Identification

Code-mixing is common in Indonesian social media, where Indonesian, Javanese, and English can appear in the same sentence. This project started from that challenge: identifying the language of each word, not just the language of the whole post.

I built a word-level language identification pipeline using a Conditional Random Field (CRF) model trained on a code-mixed Twitter dataset. The model learns token-level patterns from surrounding context, then assigns each token to Indonesian, Javanese, or English.
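To make "token-level patterns from surrounding context" concrete, here is a minimal sketch of the kind of feature extraction a CRF tagger typically consumes. The specific features (prefixes, suffixes, neighboring words), the function names `token_features` and `sentence_features`, and the sample sentence are illustrative assumptions, not the exact feature set used in this project.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the token and its neighbors.

    The feature choices here are hypothetical examples, not the project's set.
    """
    word = tokens[i]
    lower = word.lower()
    feats = {
        "bias": 1.0,
        "lower": lower,
        "prefix2": lower[:2],      # spelling cues at the start of the word
        "suffix2": lower[-2:],     # spelling cues at the end of the word
        "suffix3": lower[-3:],
        "is_title": word.istitle(),
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # Context features: the previous and next token, or sentence-boundary flags.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# A made-up code-mixed sentence, used only to show the output shape.
tokens = ["aku", "lagi", "meeting", "karo", "bos"]
features = sentence_features(tokens)
```

A list of per-token dicts like this is the input shape that CRF toolkits such as sklearn-crfsuite accept; which toolkit the project actually used is not stated here.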

The system achieved around 90% accuracy and performed best where the boundaries between languages were clear. The hardest cases were highly blended phrases, where context and spelling cues from different languages overlap.
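For reference, here is a small sketch of how such a score can be computed, assuming the reported figure is token-level accuracy (correctly labeled tokens divided by all tokens). The label codes `ID`, `JV`, `EN` and the toy gold/predicted sequences are invented for illustration and are not the project's data or results.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label matches the gold label."""
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    correct = sum(g == p for g, p in pairs)
    return correct / len(pairs)


def confusion_counts(gold, pred):
    """Count (gold, predicted) label pairs for misclassified tokens only."""
    return Counter(
        (g, p)
        for gs, ps in zip(gold, pred)
        for g, p in zip(gs, ps)
        if g != p
    )


# Toy data: two sentences of five tokens each (hypothetical labels).
gold = [["JV", "ID", "EN", "JV", "ID"], ["ID", "ID", "EN", "EN", "ID"]]
pred = [["JV", "ID", "EN", "ID", "ID"], ["ID", "ID", "EN", "ID", "ID"]]

accuracy = token_accuracy(gold, pred)
errors = confusion_counts(gold, pred)
```

Looking at the error pairs alongside the overall score shows which language pairs get confused, which is where blended phrases tend to surface.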

This work serves as a strong baseline for sequence labeling in low-resource, code-mixed settings, and it points naturally to transformer-based follow-ups such as BERT for better robustness on ambiguous text.
