Using GPT-2 for Malagasy

I became interested in natural language processing a long time ago. From 2010 to 2014, I actively developed various programs to increase the content coverage of the Malagasy Wiktionary. The result today is 5.9 million words in 4,100 languages.

Since 2014, I have been researching ways to improve the quality of the translations provided by the bot. In February 2019, OpenAI released GPT-2, a language model that can generate news-like articles. The generated articles were so believable that OpenAI refrained from releasing the full model until the end of 2019, fearing that fine-tuning it could lead to fake news or dangerous propaganda being published en masse. The full model was only released once detection techniques were accurate enough to tell generated and human-written articles apart.

Once the full model was released, I began fine-tuning it on Malagasy text. The goal was to generate news-like articles from a corpus scraped from four major news websites, which amounted to 49 MB of training data. By comparison, the English language model was trained on 40 GB of data.
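To give an idea of what assembling such a corpus can look like, here is a minimal sketch. It assumes the scraped articles are already available as plain strings and that they are joined into one text with GPT-2's `<|endoftext|>` marker between documents. Both the layout and the function names are assumptions made for illustration, not a description of the actual project code.

```python
# GPT-2's end-of-document marker. Using it as a separator between articles
# is an assumption about the corpus layout, not something stated in this post.
END_OF_TEXT = "<|endoftext|>"


def join_articles(articles):
    """Join non-empty articles into one training text, one marker per article."""
    cleaned = [a.strip() for a in articles if a.strip()]
    return "".join(f"{a}\n{END_OF_TEXT}\n" for a in cleaned)


def size_in_mb(text):
    """Size of the text once encoded as UTF-8, in megabytes (10**6 bytes)."""
    return len(text.encode("utf-8")) / 1_000_000


# Placeholder articles; the real corpus came from the scraped news sites.
sample = ["Title one\n\nBody of article one.", "Title two\n\nBody of article two."]
corpus = join_articles(sample)
print(f"{len(sample)} articles, {size_in_mb(corpus):.6f} MB")
```

Measured this way, 49 MB against 40 GB is roughly a factor of 800, which gives a sense of how little Malagasy text there was to work with.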

Scraping Malagasy language sources

On the internet, data sources for Malagasy are scarce and less diverse than those for English or other European languages, mainly because most Malagasy sites publish in French. As a consequence, the sources used were daily newspapers: NewsMada, Madagascar Tribune, Aoraha and La Gazette de la Grande Ile. Two of these newspapers are bilingual, so their articles had to be filtered.

Filtering out French articles

The next task was to detect and remove French-language articles, since we are training the model to generate Malagasy, not French.

How?

Since both languages use the Latin alphabet, telling them apart by Unicode script won’t do the job. Language detection using machine learning, while attractive, is clearly overkill and would divert us further from our goal.

Instead, to keep things simple, I relied on the single biggest difference between written Malagasy and French. The Malagasy version of the Latin alphabet excludes the letters C, Q, U, W and X, as well as accented characters like É or È. In other words, no native Malagasy word contains any of these.

I also fetched a list of French words and their inflections so that detection would be spot on every single time. In less than 100 lines of code, I could filter out anything French.
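The idea can be sketched roughly as follows. This is not the actual filter: the tiny `FRENCH_WORDS` set stands in for the full list of French words and inflections, and the `threshold` value is a guess. A threshold is used, rather than rejecting on the first hit, because loanwords and proper nouns in Malagasy articles can still contain the excluded letters.

```python
import re

# Letters absent from the Malagasy alphabet, as listed above.
NON_MALAGASY_LETTERS = set("cquwx")
# The accented letters named above (É, È), lowercased.
NON_MALAGASY_ACCENTS = set("éè")

# Hypothetical stand-in for the full list of French words and inflections.
FRENCH_WORDS = {"le", "la", "les", "des", "et", "est", "dans", "pour", "par", "sur"}

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def looks_french(word):
    """True if the word is a known French word or uses a non-Malagasy letter."""
    lower = word.lower()
    if lower in FRENCH_WORDS:
        return True
    return any(ch in NON_MALAGASY_LETTERS or ch in NON_MALAGASY_ACCENTS for ch in lower)


def french_ratio(text):
    """Fraction of words in the text that look French (0.0 for empty text)."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(looks_french(w) for w in words) / len(words)


def is_french(text, threshold=0.3):
    """Classify an article as French; the threshold is an assumed value."""
    return french_ratio(text) >= threshold


print(is_french("Nandray ny asa famonoana ho faty ny zandary"))
print(is_french("Le gouvernement a annoncé une nouvelle mesure pour les écoles"))
```

In practice the threshold would need tuning on a handful of articles from each of the bilingual newspapers.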

Using GPT-2

As expected, training takes time and space. Lots of it. Each model checkpoint takes 1.3 GB and is saved to disk every 50 iterations. At 21,000 iterations, further progress seems hard, but this is what the model can generate (the article below does not exist):
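To put the space requirement in numbers, here is a small back-of-the-envelope sketch using the figures above. Whether every checkpoint was kept or only the most recent ones is not stated here, so the `keep_last` parameter is purely hypothetical.

```python
CHECKPOINT_SIZE_GB = 1.3   # size of one checkpoint, from the text
SAVE_EVERY = 50            # iterations between two checkpoints, from the text
TOTAL_ITERATIONS = 21_000  # iterations reached, from the text


def checkpoints_written(total_iterations, save_every):
    """Number of checkpoints saved over the whole run."""
    return total_iterations // save_every


def disk_usage_gb(n_checkpoints, size_gb, keep_last=None):
    """Disk space used if all checkpoints (or only the last few) are kept."""
    kept = n_checkpoints if keep_last is None else min(n_checkpoints, keep_last)
    return kept * size_gb


n = checkpoints_written(TOTAL_ITERATIONS, SAVE_EVERY)
print(f"{n} checkpoints written")
print(f"{disk_usage_gb(n, CHECKPOINT_SIZE_GB):.0f} GB if every checkpoint is kept")
print(f"{disk_usage_gb(n, CHECKPOINT_SIZE_GB, keep_last=3):.1f} GB if only the last 3 are kept")
```

That is 420 checkpoints, or about 546 GB if none are ever deleted, which is why pruning old checkpoints matters on a run of this length.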

ANTSIRABE: SARONA TANTERAKA NY FITAFIANA MPANAO SINTO-MAHERY | NEWSMADA

Par Taratra sur 08/12/2019

Nandray ny asa famonoana ho faty ny zandary nandray anjara tamin’ny fanafihana nitafiana mpanao
sy toeram-piantsonan’ny taxi-be nandritra ny fanarahan-dia, tao amin’ny kaompania Ambositra,
faran’ny herinandro teo, ka nanao ny fanarahan-dia.

Tsiahivina fa efa nisy ny nahafantarana fa nanafika mpandraharaha an’ilay mpandraharaha ny
tao Andranohazo Antsirabe. Raikitra ny fitifirana ka vokatry ny fanarahan-dia avy hatrany ity mpandraharaha
ity. Tsy fantatra mazava hatrany na ny sasany aza tambajotran-javatra malemy na koa raha tsy izany
mitohy na miaro ny kolikoly rehetra na mpanao sinto-mahery na manana ny anton-diany

Conclusions

Should we go further with our model, we would end up creating a “thismalagasynewsarticledoesnotexist.com” website to host the generated articles. The source code is available on GitHub, but the news articles used as training data cannot be made public for copyright reasons.

Another use for a good-enough model would be to illustrate the Malagasy Wiktionary with unique examples of word usage.