GSoC/GCI Archive
Google Code-in 2010 The Apertium project

English-French: find the top 400 missing words from a set of Wikipedia articles

completed by: Narnian

mentors: Jimmy O'Regan

Choose a set of articles (ones that interest you work best), run them through the French analyser to find missing words, collect the 400 most frequent, and add translations. Many of these will be proper names; they still count. If possible, record the gender of each noun.

HINT: If you collect the article names and paste them into Special:Export, you can download all of the articles in a single XML file.
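As a rough illustration of working with such an export, the sketch below pulls the title and raw text of each page out of the XML. It assumes the usual MediaWiki export layout (page elements containing a title and a revision with a text element). The function names are made up for this example, and the export's XML namespace is stripped rather than hard-coded because its version differs between dumps.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_article_texts(xml_string):
    """Return {title: wikitext} for each <page> in a Special:Export dump.

    Hypothetical helper: assumes each page has a <title> and at least one
    <revision><text>; if several revisions exist, the last one wins.
    """
    root = ET.fromstring(xml_string)
    articles = {}
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, ""
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        if title is not None:
            articles[title] = text
    return articles
```

The text returned this way is still wiki markup, so links and templates would need to be stripped before it is passed to the analyser.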

You can find a script that will give you the top unknown words here: http://wiki.apertium.org/wiki/Calculating_coverage
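For orientation only, here is a minimal sketch of the counting step. It is not the script from the wiki page. It assumes the analyser output is in the Apertium stream format, where an unknown word appears as ^word/*word$, and it ignores escaped characters inside tokens. The function name and the default of 400 are illustrative choices.

```python
import re
from collections import Counter

# An unknown token looks like ^surface/*surface$ in analyser output.
# Simplification: escaped characters inside tokens are not handled.
UNKNOWN = re.compile(r"\^([^/$]+)/\*[^$]*\$")


def top_unknown_words(analysed_text, n=400):
    """Return the n most frequent unknown surface forms with their counts."""
    counts = Counter(m.group(1) for m in UNKNOWN.finditer(analysed_text))
    return counts.most_common(n)
```

Case is preserved on purpose, since many of the missing words are expected to be proper names.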