Polyglot Word Embeddings Discover Language Clusters
Polyglot word embeddings obtained by training a skipgram model on a multilingual corpus discover extremely high-quality language clusters.
These can be trivially retrieved using an algorithm like $k$-Means, giving us a fully unsupervised language identification system.
Experiments show that these clusters are on par with results produced by popular open-source (FastText LangID) and commercial (Google Cloud Translation) models.
We have successfully used this technique in many situations involving several low-resource languages that are poorly supported by popular open-source models.
This blog post covers methods, intuition, and links to an implementation based on 100-dimensional FastText embeddings.
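
To make the clustering step concrete, here is a minimal sketch of $k$-Means (Lloyd's algorithm) in plain Python. It is not the implementation linked from this post: the vectors are synthetic 100-dimensional stand-ins for document embeddings, drawn around two made-up centres that play the role of two languages, and the helper names (`sq_dist`, `kmeans`, `synthetic_embeddings`) are hypothetical. In practice the input would be embeddings from a skipgram model trained on the multilingual corpus.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def synthetic_embeddings(n_per_cluster, dim=100, seed=1):
    """Assumption: two well-separated Gaussian blobs stand in for
    document embeddings of two languages. Real input would come from
    a trained embedding model, not from this generator."""
    rng = random.Random(seed)
    centres = [[1.0] * dim, [-1.0] * dim]
    return [
        [m + rng.gauss(0.0, 0.1) for m in centre]
        for centre in centres
        for _ in range(n_per_cluster)
    ]


vectors = synthetic_embeddings(20)
labels, centroids = kmeans(vectors, k=2)
print(labels)  # first 20 vectors share one cluster id, last 20 the other
```

The cluster ids themselves are arbitrary, so mapping a cluster to a language name still requires inspecting a few documents from each cluster. The number of clusters `k` is also an input here; this sketch assumes it is known in advance.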