Posts tagged 'ml'

Indic Language Identification

2020-11-19

paper, machine-learning, ml, nlp, natural-language

We discuss language identification of noisy, romanized text - an un-addressed but critical problem in Indic text mining, and release a language-identification utility. We then measure geographic extents of language use in India. Summary of a WNUT 2020 paper.

Polyglot Word Embeddings Discover Language Clusters

2020-02-03

machine-learning, ai, ml, nlp, language, linguistics, natural-language-processing, nlp

Polyglot word embeddings obtained by training a skipgram model on a multi-lingual corpus discover extremely high-quality language clusters.

These can be trivially retrieved using an algorithm like $k-$Means giving us a fully unsupervised language identification system.

Experiments show that these clusters are on-par with results produced by popular open source (FastText LangID) and commercial models (Google Cloud Translation).

We have successfully used this technique in many situations involving several low-resource languages that are poorly supported by popular open source models.

This blog post covers methods, intuition, and links to an implementation based on 100-dimensional FastText embeddings.