It’s been a while since I last wrote an article on this blog. This one is about the mass addition of content to the Malagasy Wiktionary. Its purpose is to explain why and how the Malagasy Wiktionary has become so big.
But first, allow me to introduce myself. My nickname on all Wikimedia projects is Jagwar. I have been a Wikimedia contributor since August 2008, and I am going to be 20 years old soon. I speak Malagasy as my mother tongue, French as a second language and English as a foreign language (soon my third language, since it is not quite perfect yet…).
When I discovered the Malagasy-language Wikipedia, completely by chance, the wiki was virtually dead: no one was adding interesting content, and the active community consisted mainly of non-native speakers. Without any knowledge of the wiki’s rules, and with almost no knowledge of how to write Malagasy correctly, I began an article. It grew to 20,000 characters, making it the biggest page of the wiki at that time. But unfortunately (or fortunately, for the sake of readers), a non-native-speaker administrator spotted the article’s lack of notability, and it was deleted.
I could have left the wiki, as tens of hours of work had literally vanished from it… But I didn’t. I still cannot figure out why, but deep in my mind a little voice told me to keep contributing. At that time, the Malagasy Wikipedia had 550 articles, maybe fewer, but not more.
So I continued this way for a while. To help me in my task, I wrote to potential volunteers. These people didn’t see the point of contributing to a wiki in their mother tongue: some were unable to spell Malagasy words correctly or didn’t have enough time to do good work, while others asked for money to start contributing (times are hard in Madagascar, I know), and even with money, I am not sure they would have stayed long once the money had been paid.
In October 2008, I discovered the Malagasy Wiktionary. At the beginning I didn’t actually know what to do there, so I continued to work on the Malagasy Wikipedia just to become more skilled and used to writing Malagasy.
In July 2009, I was on vacation in my fatherland, Madagascar. I took this occasion to learn written Malagasy more deeply, though my means were quite limited: reading newspapers and the Bible (I am Christian), and following news broadcasts on TV as well as on the radio… I almost forgot French (!), though it was present almost everywhere as the second official language.
Back in France, I decided to encourage potential volunteers who were able to write Malagasy to contribute to the Malagasy-language Wikimedia projects. But, you know, Madagascar was in crisis: some people asked for money to contribute, others blamed me for my spelling mistakes, and others simply ignored the request. I had less and less time to dedicate to the projects, and I had no money to give away like that. One day, I decided that I couldn’t wait any longer for someone to arrive: the progress of my skills in Malagasy and in programming languages, and the promise of a very busy future (implying a chronic lack of time), mentally forced me to do something for my mother tongue, even a tiny little thing.
In 2010, when I could write in my mother tongue without too many spelling mistakes, I started to write bots. Once they were written, I ran them at full speed: fifty thousand edits per day was the pace, the normal pace. At the beginning, the work was the import of foreign-language entries from other wikis, and it consisted mainly of importing verb forms, first through an import form, and later through a script that copy-pasted other wikis’ content pages to the equivalent Malagasy Wiktionary page. I went slowly at first, but I did it more and more often, until the wiki reached 200,000 content pages. About these possibly copyright-infringing imports, I received a warning from a user who had almost got his mother-tongue wiki closed due to the creation of thousands of useless pages.
In 2011, I went mad: after discovering the astonishing simplicity of Volapük, I wrote a script to upload the word forms of that language. At full speed, i.e. around 50,000 edits per day, three weeks were enough to make the Malagasy Wiktionary the third-biggest Wiktionary in the world. But months passed, and no one, absolutely no one, contributed: one day, the number of active users on the wiki dropped to two, for a wiki containing 1.19 million content pages (in comparison, the German Wikipedia, which had a comparable article count, had no fewer than 25,000 active users)!
In July of the same year, a new script was written. That script made it possible to create translations based on foreign-language entries. With it, up to 5,000 articles were created, mainly lemma entries. Just a few weeks later, the import of all Malagasy words was completed. But its effect on the article count was not visible, due to the mass deletion of Volapük entries. Why this mass deletion? Because many entries seemed to be wrong, as they were not conjugated verb forms but nouns (-.-‘), so the decision was taken to delete them all and re-create them later, with better quality if possible. Since then, my activity on the Malagasy Wikipedia has been put on hold so that I can dedicate all my wiki time to the renovation of the Malagasy Wiktionary.
During the summer vacation, I took the time to restructure the Malagasy Wiktionary. The article and category structures were inspired by those of the French Wiktionary: templates for languages and parts of speech allowed Malagasy Wiktionary entries to be categorised automatically. Time passed and a routine began to set in.
One night, I discovered an online Malagasy monolingual dictionary. Having no idea whether the content was copyrightable (the copyright seemed to apply only to the design), I decided to reuse the content of that dictionary to complete the entries on the Malagasy Wiktionary. The problem arrived just a few weeks later, when I received a mail from a Wikimedia Foundation staff member, P. Beaudette. In his mail, he asked me about the origin of the Malagasy-language entries; I answered that they came from various bilingual dictionaries and the online monolingual dictionary… A copyright infringement investigation was conducted, and my bot was blocked during the whole process. At the end of it, the staff member told me to remove the 30,000 entries that infringed the original dictionary’s copyright, which was done.
After this copyright infringement episode, I decided to orient my contributions towards adding Malagasy-language content to other wikis. But before that, I did some work on the Fijian and Tagalog Wiktionaries, which was more or less appreciated… In particular, an IP address was checking my contributions on the Fijian and Tagalog Wiktionaries. This IP user told me to stop mass-adding content in these languages, of which I speak not a word. I ceased to work on both wikis a few weeks later, as the work was finished.
But this mass addition of content, especially in languages I didn’t speak at all, seemed to annoy people, who decided to discuss the case on a Meta-Wiki forum. No conclusive result came out of it, and things stayed as they were before.
With most of the hard work removed, and my behaviour disapproved of by many users, I decided to take a break of indefinite duration. It actually lasted 5 months, during which I tried to work on my written Malagasy outside Wikimedia projects. The progress of my skills, in spelling as well as in programming, was honourable, allowing me to come back and make the Malagasy Wikimedia projects, and especially the Malagasy Wiktionary, evolve again. In July 2012, I built a new tool that lets me find entries/pages missing from the Malagasy Wiktionary by reading the daily online newspapers. Only two newspapers are currently supported, because they provide RSS feeds. But the ability to make the script read websites without RSS feeds is coming soon.
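To illustrate the idea behind the newspaper tool, here is a minimal Python sketch. It is not the actual script: the function names, the word-matching rule and the way existing titles are supplied (a plain set of strings) are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a "word" is a run of letters, optionally joined by
# apostrophes or hyphens (e.g. "amin'ny").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def words_from_rss(rss_xml):
    """Return the lowercased words found in item titles and descriptions."""
    root = ET.fromstring(rss_xml)
    words = set()
    for item in root.iter("item"):
        for tag in ("title", "description"):
            text = item.findtext(tag) or ""
            text = re.sub(r"<[^>]+>", " ", text)  # drop inline HTML tags
            words.update(w.lower() for w in WORD_RE.findall(text))
    return words


def missing_entries(rss_xml, existing_titles):
    """Return, in sorted order, the words of the feed that have no page yet."""
    existing = {title.lower() for title in existing_titles}
    return sorted(words_from_rss(rss_xml) - existing)
```

In practice the feed text would be downloaded from each newspaper, and the set of existing titles would come from the wiki itself or from a dump; both steps are left out here.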
In September, I developed a new, improved translation retriever that allows the script to get the translations in all languages on a given page (the previous version could only handle one language at a time), which increases the translation harvest almost tenfold. This function is embedded in an XML dump reader that amplifies the efficiency of the script: fast translation retrieval and no need to be connected to the server while processing. Done every month, the dump processing and uploading allowed the wiki to gain more than 100,000 lemmata in a few months. These lemmata may contain translation errors, but the error rate is low enough (<1%) to be disregarded. The hardest cases can be resolved by a single check on the source wiki (which is indicated by a template).
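The following Python sketch shows one way a dump-based retriever of this kind could be organised. It is only an illustration under stated assumptions: it supposes that the source wiki marks translations with templates of the form `{{t|lang|word}}` or `{{t+|lang|word}}`, and the function names are hypothetical, not those of the real script.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed template shape: {{t|lang|word}} or {{t+|lang|word|...}}
TRANSLATION_RE = re.compile(r"\{\{t\+?\|([^|{}]+)\|([^|{}]+)")


def extract_translations(wikitext):
    """Collect the translations of every language found in one page."""
    found = defaultdict(list)
    for lang, word in TRANSLATION_RE.findall(wikitext):
        lang, word = lang.strip(), word.strip()
        if word not in found[lang]:
            found[lang].append(word)
    return dict(found)


def local_name(tag):
    """Strip the XML namespace that MediaWiki dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def translations_from_dump(dump_file):
    """Yield (title, translations) for each page of an XML dump.

    The dump is read as a stream, so no connection to the wiki is
    needed and the whole file never has to fit in memory.
    """
    title, text = None, ""
    for _, elem in ET.iterparse(dump_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            translations = extract_translations(text)
            if translations:
                yield title, translations
            title, text = None, ""
            elem.clear()  # free the memory used by the finished page
```

Reading every language in a single pass over the page is what separates this from a one-language-at-a-time retriever: the page is parsed once, whatever the number of languages it contains.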
In October, I thought about building a bot that completes tasks as scheduled by a parameter file. This is particularly useful for keeping lists of wikis up to date. Currently, the list of wikis on the Malagasy Wiktionary is updated four times a day, i.e. every six hours.
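As an illustration, a scheduler driven by a parameter file could look like the Python sketch below. The JSON layout, the task names and the `due_tasks` function are hypothetical; the format of the real parameter file is not described here.

```python
import json
from datetime import datetime, timedelta

# Hypothetical parameter file: each task has a name and an interval in hours.
PARAMETERS = """
{
  "tasks": [
    {"name": "update_wiki_list", "every_hours": 6},
    {"name": "process_dump", "every_hours": 720}
  ]
}
"""


def due_tasks(parameters_json, last_runs, now):
    """Return the names of the tasks whose interval has elapsed.

    last_runs maps a task name to the datetime of its last run;
    a task that has never run is always due.
    """
    config = json.loads(parameters_json)
    due = []
    for task in config["tasks"]:
        last = last_runs.get(task["name"])
        interval = timedelta(hours=task["every_hours"])
        if last is None or now - last >= interval:
            due.append(task["name"])
    return due
```

With an interval of six hours, a task last run at 06:00 becomes due again at 12:00, which gives the four updates a day mentioned above.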
At the end of January 2013, I thought about a more efficient use of the translation retriever that I had written a few months earlier. Then came the IRC bot: it retrieves in real time all the edits made on selected wikis and does its best to translate the edited entry into Malagasy, in real time! When it was first developed, it only used the traditional translation retriever, but later, in March, it also gained a basic entry processor that allows the IRC bot to translate entries in foreign languages into Malagasy as well, using the same dictionary. This latter version of the IRC bot is currently in use, and it creates hundreds of entries and content pages on the Malagasy Wiktionary every day. I have no precise idea of the error rate, but I am pretty sure it is less than 5%. The positive side of the bot is its ability to keep pace when several edits are made in a minute; nevertheless, as it needs to be online and connected to Wikimedia servers, the processing frequency is limited to one page per second. I am thinking about ways to allow the bot to process more pages.
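The one-page-per-second limit can be pictured as a queue that is emptied at a fixed minimum interval. The Python sketch below is an assumption about how such throttling might be written, not the bot’s actual code; the class name and its parameters are invented for the example, and the IRC connection itself is left out.

```python
import time
from collections import deque


class ThrottledQueue:
    """Queue of page titles processed at most once per min_interval seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.pending = deque()
        self.last_processed = None

    def push(self, title):
        """Queue a title, ignoring it if it is already waiting."""
        if title not in self.pending:
            self.pending.append(title)

    def process_all(self, handler):
        """Run handler on every queued title, respecting the minimum interval."""
        results = []
        while self.pending:
            if self.last_processed is not None:
                wait = self.min_interval - (self.clock() - self.last_processed)
                if wait > 0:
                    self.sleep(wait)
            title = self.pending.popleft()
            self.last_processed = self.clock()
            results.append(handler(title))
        return results
```

Edits that arrive in a burst simply wait in the queue, and a page edited twice before its turn comes is processed only once, which is one way of keeping pace when several edits are made within a minute.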

19 thoughts on “My story on the Malagasy Wiktionary”
  1. Awesome article. You’re going to go far in life, there’s no doubt about that. And Madagascar is lucky to have you pouring your efforts into improving the availability of knowledge in the Malagasy language on the net.

  2. i found my way here by noticing how huge is the Malagasy wiktionary, and wanting to find out why!
    i applaud your efforts in giving web visibility to your – and other – “minor” languages, this i mean appreciatively, the world is beautiful because it is so diverse, and the “major” languages and their political proponents make it more ugly by smothering smaller cultures.
    i wonder though, you evaluate translation errors at under 1%, but if an article is added every second that will still amount to quite a bit. how is the community? are there people on the malagasy wiktionary, in addition to you, active in correcting these mistakes, and perhaps improving entries with details and nuances a bot would be unable to gather?
    anyway. you are a very relentless and original person. 20 years old you say? to be continued then 🙂

    1. Hello,
      First of all, thank you for your input and your encouragement.
      To answer your question (very late, as I haven’t had much time these last months), the community of the Malagasy Wiktionary is very, very small (currently 2 members) and I am currently the steadiest contributor over there.
      To check the error rate, there are two solutions: checking the newest 200 entries on the wiki, or checking the database. The first solution is tedious, as you need to load a page every time you want to check an entry. The second is faster, as you can see the last 200 entries on a single page.
      Nuances and polysemy are the hardest cases. They may lead to anecdotal, incomplete, or even false translations (this last kind of mistake has been reduced by the use of a database in which English-Malagasy word pairs were checked one by one).

  3. Waow. What I can say is “misaotra”.. thank you for your work. Looking at what I wrote there at the beginning I am a bit ashamed 🙂
    I tried to do rakibolana.org but now I think you did the big job. If you didn’t yet, you can also take data from there.

    1. Hi, thanks for your comment.
      “I tried to do rakibolana.org but now I think you did the big job. If you didn’t yet, you can also take data from there.”
      I appreciate your generous offer. Can you please confirm that the publishers of “Rakibolana Malagasy” have authorised the copying and publishing of its content under a CC-by-SA licence? The content on your site seemingly comes from the Rakibolana Malagasy paper dictionary and is copyrighted.
      I learned this the hard way and got into serious trouble after doing it: my bot account was blocked and I had to revert all edits containing a copy of that dictionary… and there was a lawsuit threat against me and the WMF.
      It would be great if they could provide a document stating that they authorise the re-publishing of “Rakibolana Malagasy” content under a CC-by-SA licence, so that the WMF and I can be protected against future lawsuits.

  4. Hi!
    Thank you for your inspiring article. Is your bot written in Python? Can I humbly ask you to share your script under Open Source license?
    Thanks!

  5. Excellent work!
    It would be great if you could also do this for other African languages. Do you think it’s feasible? What would you need to make that happen?

    1. Hi Kasper!
      Given the progress of my skills and my deployments on smaller Wiktionaries, yes, it is feasible.
      To do this well, the best things to have are a relatively comprehensive bilingual dictionary and a good speaker, able to write the language, who is willing to collaborate with me, so I can combine their knowledge of their language with my knowledge of computer science (and linguistics :)).

        1. Hi Tegegne, normally particular scripts should not be a problem as long as they are normalised Unicode. The only problem is that I won’t be able to write easily in Amharic with my keyboard 🙂
          If you have the data and feel ready for the adventure, let me know by writing a message on my user page https://mg.wiktionary.org/w/index.php?title=Dinika_amin%27ny_mpikambana:Jagwar&action=edit&section=new or by mail at rado.kaontywiki at gmail.com.

  6. Is there any way you can help out with Amharic am.wiktionary.org to do the same thing? I have no experience with coding but I can do the language part since I am a native speaker. I appreciate all the help. I am sure you will find out that the script used for Amharic is not Latin. I hope that will not be a hindrance?

  7. […] Before we’ve got Google translate to translate almost anything in our language, including curse words, several websites have helped us Malagasy and other language enthusiasts to write corpora in a proper way in our mother tongue: many of us have already heard about Freelang, tenymalagasy.org and so on. The only drawback of these website is that they do not work in a collaborative way: they are not «crowdsourced». Wikibolana is a Malagasy language crowdsourced dictionary, but I have been so far the one that has generated most of its content. […]

  8. Amazing job, I was wondering if you can provide your code? I’ve been working on the Portuguese Wiktionary and am currently completing the verb inflections, which should provide about 1 million entries. I found it amazing how you worked to draw on other dictionaries and also on translations. Although not perfect, it will greatly help people!

  9. Hi Jagwar!

    I run a blog dedicated to stories about the internet and global affairs. I’ve been researching the Malagasy Wiktionary, and its immense size, for some time in preparation for an article. I’d like to ask you a few questions about your story, especially your involvement with the Wiktionary.

    I can be reached at sergeiovichs@gmail.com. Thank you!
