We discuss language identification of noisy, romanized text, an unaddressed but critical problem in Indic text mining, and release a language-identification utility. We then measure geographic extents of language use in India. Summary of a WNUT 2020 paper.
Polyglot word embeddings, obtained by training a skip-gram model on a multilingual corpus, discover extremely high-quality language clusters.
These can be trivially retrieved using an algorithm like $k$-Means, giving us a fully unsupervised language identification system.
We have successfully used this technique in many situations involving several low-resource languages that are poorly supported by popular open-source models.
This blog post covers methods, intuition, and links to an implementation based on 100-dimensional FastText embeddings.
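
The clustering step described above can be illustrated with a small, dependency-free sketch. It is not the released implementation: it assumes each document has already been mapped to a vector (in practice this might come from a skip-gram model trained with `fasttext.train_unsupervised(..., model="skipgram", dim=100)`), and it uses 2-dimensional toy vectors as stand-ins for the 100-dimensional embeddings. The function names, the length normalization, and the farthest-point seeding are illustrative choices, not details taken from the paper.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def init_centroids(points, k, rng):
    """Farthest-point seeding: a random first centroid, then repeatedly
    the point farthest from all centroids chosen so far."""
    centroids = [points[rng.randrange(len(points))]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per input point."""
    rng = random.Random(seed)
    points = [normalize(p) for p in points]
    centroids = init_centroids(points, k, rng)
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for document embeddings (assumption: real vectors would be
# 100-dimensional and produced by the polyglot embedding model).
toy_vectors = [
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],  # documents in one language
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],  # documents in another language
]
labels = kmeans(toy_vectors, k=2)
print(labels)
```

The cluster ids themselves carry no meaning; mapping each cluster to a language name still requires inspecting a few documents per cluster. With real data, a library implementation of $k$-Means would normally replace the hand-written loop.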