My story on the Malagasy Wiktionary

It’s been a while since I last wrote on this blog. This article is about the mass addition of content to the Malagasy Wiktionary. Its purpose is to explain why and how the Malagasy Wiktionary has become so big.
But first, allow me to introduce myself. My nickname on all Wikimedia projects is Jagwar. I have been a Wikimedia contributor since August 2008, and I will soon be 20 years old. I speak Malagasy as my mother tongue, French as a second language and English as a foreign language (soon a third language, since it is not quite perfect yet…).
When I discovered the Malagasy-language Wikipedia, quite by chance, the wiki was virtually dead, with no one adding interesting content and an active community made up mainly of non-native speakers. Knowing nothing about the rules of the wiki and almost nothing about how to write Malagasy correctly, I began an article. It grew to 20,000 characters, making it the biggest page on the wiki at that time. But unfortunately (or fortunately, for the sake of readers), an administrator who was not a native speaker spotted the article's lack of notability, and it was deleted.
I could have left the wiki, as tens of hours of work had literally vanished from it… But I didn’t. I still cannot figure out why, but deep in my mind a little voice told me to continue contributing. At that time, the Malagasy Wikipedia had 550 articles, maybe fewer, but not more.
So I continued this way for a while. To help me in my task I wrote to potential volunteers. These people didn’t see the point of contributing to a wiki in their mother tongue: some were unable to spell Malagasy words correctly, others didn’t have enough time to do good work, and still others asked for money to start contributing (times are hard in Madagascar, I know). Even with money, I am not sure the latter would have stayed long once paid.
In October 2008, I discovered the Malagasy Wiktionary. At the beginning I didn’t actually know what to do there, so I continued to work on the Malagasy Wikipedia just to become more skilled at writing Malagasy and more used to it.
In July 2009, I was on vacation in my fatherland, Madagascar. I took this occasion to learn written Malagasy more deeply, though my means were quite limited: reading newspapers and the Bible (I am a Christian), and following news broadcasts on TV as well as on the radio… I almost forgot French (!), though it was present almost everywhere as the second official language.
Back in France, I decided to encourage potential volunteers who were able to write to contribute to the Malagasy-language Wikimedia projects. But, you know, Madagascar was in crisis: some people asked for money to contribute, others blamed me for my spelling mistakes, and others simply ignored the request. I had less and less time to dedicate to the projects, and I had no money to give this way. One day, I decided that I couldn’t wait any longer for someone to arrive: the progress of my skills in Malagasy and in programming languages, and the promise of a very busy future (implying a chronic lack of time), mentally forced me to do something for my mother tongue, even a tiny little thing.
In 2010, when I could write in my mother tongue without too many spelling mistakes, I started to write bots. Once they were written, I ran them at full speed: fifty thousand edits per day was the normal pace. At the beginning, the work was importing foreign-language entries from other wikis, mainly verb forms, first through an import form and later through a script that copy-pasted other wikis’ content pages to the equivalent Malagasy Wiktionary page. I went slowly at the beginning, but I did it more and more often, until the wiki reached 200,000 content pages. About these possibly copyright-infringing imports, I received a warning from a user who had almost got his mother-tongue wiki closed due to the creation of thousands of useless pages.
In 2011, I went mad: after discovering how astonishingly easy Volapük is, I wrote a script to upload the word forms of that language. At full speed, i.e. around 50,000 edits per day, three weeks were enough to make the Malagasy Wiktionary the third biggest Wiktionary in the world. But months passed, and no one, absolutely no one, contributed: one day, the number of active users on the wiki dropped to two, for a wiki that contained 1.19 million content pages (in comparison, the German Wikipedia, which had a comparable article count, had no fewer than 25,000 active users)!
In July of the same year, I wrote a new script that created translations based on foreign-language entries. With that script, up to 5,000 articles were created, mainly lemma entries. Just a few weeks later, the import of all Malagasy words was completed. But its effect on the article count was not visible because of the mass deletion of Volapük entries. Why this mass deletion? Because many entries seemed to be wrong, as they were not verb conjugations but nouns (-.-‘), so the decision was taken to delete them all and re-create them later, with better quality if possible. Since then, my activity on the Malagasy Wikipedia has been put on hold so that I can dedicate my whole wiki time to the renovation of the Malagasy Wiktionary.
During the summer vacation, I took the time to restructure the Malagasy Wiktionary. The article and category structures were inspired by those of the French Wiktionary: using templates for languages and parts of speech allowed Malagasy Wiktionary entries to be categorised automatically. Time passed and a routine set in.
One night, I discovered an online Malagasy monolingual dictionary. Having no idea whether the content was copyrightable (the copyright seemed to apply only to the design), I decided to reuse the content of that dictionary to complete the entries on the Malagasy Wiktionary. The problem arrived just a few weeks later, when I received an e-mail from a Wikimedia Foundation staff member, P. Beaudette. In his e-mail, he asked me about the origin of the Malagasy-language entries; I answered that they came from various bilingual dictionaries and from the online monolingual dictionary… A copyright infringement investigation was conducted, and my bot was blocked during the whole process. At the end of it, the staff member told me to remove the 30,000 entries that infringed the original dictionary’s copyright, which I did.
After this copyright infringement episode, I decided to orient my contributions towards adding Malagasy-language content to other wikis. But before that, I did some work on the Fijian and Tagalog Wiktionaries, which was more or less appreciated… In particular, someone editing from an IP address was checking my contributions on the Fijian and Tagalog Wiktionaries. This user told me to stop mass-adding content in these languages, of which I speak not a word. I ceased work on both wikis a few weeks later, as the work was finished.
But this mass addition of content, especially in languages I didn’t speak at all, seemed to annoy people, who decided to discuss the case on a Meta-Wiki forum. No conclusive result came out of it, and things stayed as they were before.
With most of the hard work removed, and with behaviour that had been criticised by many users, I decided to take a break of indefinite duration. It actually lasted 5 months, during which I tried to work on my written Malagasy outside Wikimedia projects. The progress of my skills, in spelling as well as in programming, was honourable, allowing me to come back and make the Malagasy Wikimedia projects, and especially the Malagasy Wiktionary, evolve again. In July 2012, I built a new tool that tells me which entries/pages do not yet exist on the Malagasy Wiktionary by reading the daily online newspapers. Only two newspapers are currently supported, because they provide RSS feeds. But the ability to make the script read websites without RSS feeds is coming soon.
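The idea behind the newspaper tool can be sketched in a few lines of Python. This is only a minimal illustration, not the actual bot: the function names, the sample feed and the way existing entries are passed in (a plain set of titles) are all assumptions, and real feeds often carry HTML inside their descriptions, which this sketch does not strip.

```python
import re
import xml.etree.ElementTree as ET

# Runs of letters, optionally joined by a hyphen or an apostrophe
# (e.g. "teny-iditra", "amin'ny"). Hypothetical tokenisation rule.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def words_in_feed(rss_xml):
    """Return the set of lowercased words found in item titles and descriptions."""
    root = ET.fromstring(rss_xml)
    words = set()
    for item in root.iter("item"):
        for tag in ("title", "description"):
            text = item.findtext(tag) or ""
            words.update(w.lower() for w in WORD_RE.findall(text))
    return words


def missing_entries(rss_xml, existing_titles):
    """Words seen in the feed that have no page yet, in alphabetical order."""
    return sorted(words_in_feed(rss_xml) - set(existing_titles))


# Made-up feed, for illustration only.
SAMPLE_FEED = (
    "<rss><channel><item>"
    "<title>Vaovao androany</title>"
    "<description>Misy vaovao tsara</description>"
    "</item></channel></rss>"
)

print(missing_entries(SAMPLE_FEED, {"vaovao", "misy"}))  # ['androany', 'tsara']
```

In practice the set of existing titles would come from the wiki itself or from a dump, and the feed text would be downloaded rather than hard-coded.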
In September, I developed a new, improved translation retriever that allows the script to get the translations in all languages on a given page (the previous version could only handle one language at a time), which increases the translation harvest almost tenfold. This function is embedded in an XML dump reader that amplifies the efficiency of the script: fast translation retrieval and no need to be connected to the server while processing. Done every month, the dump processing and uploading let the wiki gain more than 100,000 lemmata in a few months. These lemmata may contain translation errors, but the error rate is low enough to be disregarded (<1%). The hardest cases can be resolved by a single check on the source wiki (which is indicated by a template).
In October, I thought about building a bot that completes tasks as scheduled by a parameter file. This is particularly useful for keeping the list of wikis up to date. Currently, the list of wikis on the Malagasy Wiktionary is updated four times a day, i.e. every six hours.
At the end of January 2013, I thought about a more efficient use of the translation retriever that I had written a few months earlier. Then came the IRC bot: it retrieves in real time all the edits made on selected wikis and does its best to translate the latest entry into Malagasy, in real time! When it was first developed, it only used the traditional translation retriever, but later, in March, it gained a basic entry processor that allows the IRC bot to also translate entries in foreign languages into Malagasy, using the same dictionary. This latter version of the IRC bot is currently in use, and it creates hundreds of entries and content pages on the Malagasy Wiktionary every day. I have no precise idea of the error rate, but I am pretty sure it is less than 5%. The positive side of the bot is its ability to keep pace when several edits are made in a minute; nevertheless, as it needs to be online and connected to Wikimedia servers, the processing frequency is limited to one page per second. I am thinking about ways of allowing the bot to process more pages.
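The one-page-per-second limit can be pictured as a queue that is drained no faster than a fixed interval. The sketch below is an assumption about how such a throttle might look, not the bot's real code: the class name, the injectable clock and sleep functions, and the handler callback are all hypothetical.

```python
import time
from collections import deque


class ThrottledQueue:
    """Queue of page titles processed at most once per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable, so the throttle can be tested without waiting
        self.sleep = sleep
        self.pending = deque()
        self.last_run = None    # time at which the previous page was started

    def push(self, page_title):
        """Remember a page that was just edited on a watched wiki."""
        self.pending.append(page_title)

    def drain(self, handler):
        """Process every pending page in arrival order; return how many were handled."""
        done = 0
        while self.pending:
            if self.last_run is not None:
                wait = self.min_interval - (self.clock() - self.last_run)
                if wait > 0:
                    self.sleep(wait)
            self.last_run = self.clock()
            handler(self.pending.popleft())
            done += 1
        return done
```

With this shape, a burst of edits simply accumulates in the queue and is worked through at a steady pace, which matches the behaviour described above: nothing is lost, but throughput is capped.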

English-Malagasy translation

After building a dictionary, I went on to create a translator.
This translator translates English into Malagasy; it is written in PHP and uses *.txt files for the dictionary and the parser.
I have already spent many hours building this translator. The first stage (that is, word-for-word translation) was finished long ago, but the Malagasy sentences it outputs are not very satisfying. I am starting to tackle the second stage, but the results are still not satisfying. Here is an example:

I am to speak of the American Vandal this evening, but I wish to say in advance that I do not use this term in derision or apply it as a reproach, but I use it because it is convenient; and duly and properly modified, it best describes the roving, independent, free-and-easy character of that class of traveling Americans who are not elaborately educated, cultivated, and refined, and gilded and filigreed with the ineffable graces of the first society. The best class of our countrymen who go abroad keep us well posted about their doings in foreign lands, but their brethren vandals cannot sing their own praises or publish their adventures. (from The American Vandal Abroad, by Mark Twain)

which the translator rendered as:

izaho manafatrafatra amin’ny ny ny amerikanina firintsy ity hariva izaho anefa faniriana hono amin’ny amy dingina izaho ilay tsy akory manao mampiasa term ity derision amy na mampihatra toy azy fananarana izaho anefa mampiasa fa azy azy manavanana ary duly ary properly modified faratampony azy describes ny dia mahaleotena free-and-easy fomba ny ilay class ny traveling American izay tsy akory elaborately avara-pianarana cultivated ary Showing or having good feelings or good taste. ary Endrika efa lasa an’ny matoanteny gild ary Having filigree ornamentation miaraka miaraka amy ny ineffable graces ny ny aloha fikambanana . faratampony ny ny class -ay andao izay mibodo andafy fantsakàna us mikasika posted doings their vahiny amy lands, their anefa vandals brethren manara-bava cannot manana their na praises their publish sendrasendra

As far as grammar and word order within the sentence are concerned, there is still a long way to go. Some progress has nevertheless been made in swapping the positions of modifiers and common nouns, as we can see in this sentence: “that beautiful woman is my wife” (vadiko ilay vehivavy mahafinaritra iny), which it translated as “ity vehivavy mahafinaritra -ko andefimandry”.
It should also be noted that the translator picks only a single translation from the dictionary, so it ignores the polysemy of words. The second problem with using files is the need to open them for every translation: once the total size of all the opened files exceeds 2 megabytes, the translation produces an error, which prevents it from displaying the translation in the box. Setting up a database may be what solves that problem.
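To make the first stage concrete, here is a minimal Python sketch of a word-for-word translator that, like the one described, keeps a single translation per word. It is not the PHP original: the tab-separated file format, the function names and the sample entries (guessed from the example sentence above) are assumptions made for illustration.

```python
def load_dictionary(lines):
    """Build an English -> Malagasy table from 'english<TAB>malagasy' lines.

    Only the first translation of each word is kept, so polysemy is ignored,
    which is exactly the limitation discussed in the text.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        source, target = line.split("\t", 1)
        table.setdefault(source.lower(), target)
    return table


def translate_word_for_word(sentence, table):
    """Replace each word by its dictionary entry; unknown words are left as-is."""
    return " ".join(table.get(word.lower(), word) for word in sentence.split())


# Hypothetical dictionary content; in the real tool this would be read from *.txt files.
SAMPLE_LINES = [
    "that\tity",
    "beautiful\tmahafinaritra",
    "woman\tvehivavy",
    "star\tkintana",
    "star\tolo-malaza",  # second sense of "star": silently dropped
]

table = load_dictionary(SAMPLE_LINES)
print(translate_word_for_word("that beautiful woman", table))
```

Loading the table once and keeping it in memory, or moving it to a database as suggested above, avoids reopening the files for every translation. The output also shows why a second stage is needed: the words come out in English order, with the adjective before the noun.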

Search on Google using Python scripts

What about a free, unlimited Google API? In the past, Google provided such a thing, but it has been definitively deprecated (due to abuse?). The new Search API costs money ($5 for 1,000 queries), and the free API is limited to 100 queries per day. Without any money, you won’t get far. After learning that, I dropped the project… until I started contributing to Wiktionary!

Extracting words from Malagasy daily newspapers for the Malagasy Wiktionary wasn’t actually an easy thing to program. The first version of the script can only parse RSS feeds, and it is very slow compared to what I was used to, because it loads approx. 400,000 words at each launch.
While doing that work, I noticed that plenty of words are actually compound words. This gave me an idea: use Google Search to check in advance whether a word exists or not. From the 1,300 roots contained in the Malagasy Wiktionary, I can potentially make 1.7 million combinations of two roots, 2.2 billion with three, and about 2.8 trillion with four. That is enormous, and even at full speed I will never be able to look them all up: at 5 queries per second (the fastest rate I’ve ever had) it would take 4 days with two roots, 14 years with three, and 177 centuries (17,700 years) with four. This is the first reason why I decided to try hacking Google Search to see whether a word combination has already been used.
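These orders of magnitude can be checked with a short Python calculation. It assumes that a compound is an ordered sequence of roots in which a root may repeat (so 1,300 to the power of the number of roots), which is the reading that matches the figures above; the function names are only illustrative. With the unrounded count, four roots come out at roughly 18,100 years; the 17,700 quoted above follows from the rounded 2.8 trillion.

```python
ROOTS = 1_300
QUERIES_PER_SECOND = 5
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def ordered_combinations(n_roots, length):
    """Number of sequences of `length` roots, repetition allowed (assumption)."""
    return n_roots ** length


def seconds_needed(n_queries, rate=QUERIES_PER_SECOND):
    """Time to send `n_queries` at `rate` queries per second."""
    return n_queries / rate


for length in (2, 3, 4):
    total = ordered_combinations(ROOTS, length)
    secs = seconds_needed(total)
    print(
        f"{length} roots: {total:,} combinations, "
        f"{secs / SECONDS_PER_DAY:,.1f} days, "
        f"{secs / SECONDS_PER_YEAR:,.1f} years"
    )
```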

First, I looked at the page source, and it is very, very complicated to understand. I even think the page was generated by a bot, as the HTML tag names are not written in a human language. I also tried to use the URL, but it is very, very long, with characters that look like hashes and keys (?), which I could not trace as they don’t explicitly appear in the main page form. At first sight, this kind of project is likely to fail…

I found a post on the Web describing how to use Google Search without any API. But there was a problem: the discussion is almost three years old, and the search engine has visibly changed since then. It is very probable that a Google employee reported that discussion, leading the company to take appropriate measures. When I ran the downloaded script, nothing worked: no results were returned for any search. I am still keeping an eye on the downloaded script, and I am trying to find something that can solve this problem. At least this script saved me from spending hours and hours reinventing a (square) wheel.

Once this problem is solved, at least temporarily, the source code will be released on SourceForge: Bot-Jagwar. It will rapidly become outdated, so if there are people willing to update the script, they’ll be welcome :).

Quality or quantity?

Quantity or quality? Some will say that quality is better than quantity, some say “quality in quantity”, and others say “quantity only matters for fundraising anyway”… So what is really true? Quantity or quality?…

I am wondering about this right now because there is a debate on Wiktionary about quantity and quality or, more precisely, about entries created by bot versus entries created by hand, that is, using the wiki’s editor rather than its API.

I experimented with creating content by bot on the Malagasy Wiktionary, then on the Fijian Wiktionary and finally on the Tagalog one, and here is my verdict: using a script is fine if you are going to add a great many English words, that is, around two or three thousand, and that is all. If you try to translate other languages, there is a risk, because the meaning of a word in English may not match its meaning in Indonesian. Just look at the English word star, which can be translated as «kintana» in Malagasy or «bintang» in Indonesian; but another sense of «star» can be rendered as «olo-malaza» in Malagasy or «pesohor» in Indonesian. The problem with a single word that can mean unrelated things is that the machine may mix up its definitions: «olo-malaza» might become «bintang», for example, or «kintana» might become «pesohor». What I have shown here is just a simple case; in reality the problem can go much further than that. So if this small problem affects more than five percent of the entries in a Wiktionary, it can cause a huge problem for those learning Malagasy and also for Malagasy people learning foreign languages. Such a thing, even if it is only one in ten thousand, still damages the quality of the dictionary, especially when the total number of entries approaches one million.
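The «star» example can be reproduced with a small Python sketch. It contrasts a naive approach that pairs every Malagasy translation of an English word with every Indonesian one, against an approach that pairs translations sense by sense. The data structure, the sense labels and the function names are assumptions made for illustration; only the four translations come from the example above.

```python
from itertools import product

# Translations of the English pivot word, grouped by sense.
# The sense labels are illustrative glosses, not dictionary data.
SENSES = {
    "star": {
        "celestial body": {"mg": "kintana", "id": "bintang"},
        "celebrity": {"mg": "olo-malaza", "id": "pesohor"},
    }
}


def naive_pairs(word, src, dst):
    """Pair every `src` translation with every `dst` translation, ignoring senses."""
    senses = list(SENSES[word].values())
    return sorted({(a[src], b[dst]) for a, b in product(senses, repeat=2)})


def sense_aware_pairs(word, src, dst):
    """Pair translations only when they belong to the same sense."""
    return sorted((s[src], s[dst]) for s in SENSES[word].values())


good = set(sense_aware_pairs("star", "mg", "id"))
wrong = sorted(set(naive_pairs("star", "mg", "id")) - good)
print(wrong)  # [('kintana', 'pesohor'), ('olo-malaza', 'bintang')]
```

With two senses, half of the naive pairs are wrong; the more senses a pivot word has, the larger the share of wrong pairs becomes.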

So we cannot talk about quality on this Wiktionary if very few articles have been written here (that is, if the number of articles is in the tens). Until the number of articles reaches the hundreds, we should not be debating the quality of an encyclopedia, and the same goes for a dictionary.

Nor can we talk about quality if there are only two or three articles, however well made they are. We can start talking about quality once the number of entries reaches around two or three thousand; that is when the content can be assessed. Below that, there are too few.

So here is my problem with the question of quality and quantity: I want to contribute to a Wiktionary in a language I do not know, which contains around forty entries, and not all of those entries are a complete mess. If you want to grow it, you do what is needed to become an administrator, and then you clean up the content of the wiki. After that, you start creating the entries that do not yet exist. That can be done, but it takes time, and a great many people «exalt their own language and help those of others». Others think like this: «exalt the language most used on the Internet, and leave the others to fend for themselves».

There are also others who want to grow other wikis, but they are few and have little time to do it, that is, to research the language and then fill the Wiktionary with the knowledge gathered. But that takes time, and where I live, here in Europe, and especially as I am still a student, I do not have the time to do all that. So when I do have time, I write a Python script to do the work in my place, even though the quality is lower when content is written by a bot, since strange errors may appear (just look at my talk page and the deletion log). These can still be corrected by anyone who wants to contribute through corrections… I know many Malagasy members who like to contribute by correcting articles, and this is where they can do that well.

I am one of those people who want to contribute a great deal but have no time. So I want other wikis to benefit from my ability to write software. That is part of what made two wikis grow; most of them are Wiktionaries, but there is also Wikipedia. The Malagasy Wikipedia and the Malagasy, Fijian and Tagalog Wiktionaries are where I tried the scripts I wrote (about thirteen of them), and they work well, even though they produce strange errors. Those errors, which number in the tens, are what other members criticise me for, even though my main aim is not to damage the wiki but to add content from sources that can be tapped: online dictionaries, paper dictionaries and other Wiktionaries (English above all). Now