Train part-of-speech taggers for Dutch and Afrikaans
completed by: AureiAnimus
mentors: Francis Tyers
The aim of this task is to train part-of-speech taggers for Dutch and Afrikaans as part of the apertium-af-nl MT system. This involves writing a TSX file for each language[1] and then running the unsupervised training process as described on the wiki.[2] As a training corpus, use the Afrikaans Wikipedia[3] for Afrikaans and the EuroParl corpus[4] for Dutch. You should also write 5 to 10 forbid/enforce rules for each tagger, based on a brief survey of disambiguation errors.
1. http://wiki.apertium.org/wiki/TSX_format
2. http://wiki.apertium.org/wiki/Unsupervised_tagger_training
3. http://download.wikimedia.org/afwiki/20101104/
4. http://www.statmt.org/europarl/v5/nl-en.tgz
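
As a rough illustration of what a TSX file with forbid/enforce rules might look like, here is a minimal sketch following the structure described in the TSX format page.[1] The tagger name, the label names (DET, PREP, NOUN, ADJ, VERBFIN) and the tag patterns are assumptions made for the example; the real ones depend on the tags used in the apertium-af-nl dictionaries. The rules themselves are placeholders, not the result of an actual survey of disambiguation errors.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: label names and tag patterns are assumptions. -->
<tagger name="dutch">
  <tagset>
    <!-- Each label is a coarse tag grouping all analyses that match a pattern. -->
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="PREP" closed="true">
      <tags-item tags="pr"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="ADJ">
      <tags-item tags="adj"/>
      <tags-item tags="adj.*"/>
    </def-label>
    <def-label name="VERBFIN">
      <tags-item tags="vblex.pri.*"/>
      <tags-item tags="vblex.past.*"/>
    </def-label>
  </tagset>

  <!-- Forbid rule: a determiner may not be directly followed by a finite verb. -->
  <forbid>
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="VERBFIN"/>
    </label-sequence>
  </forbid>

  <!-- Enforce rule: after a preposition, only the listed labels are allowed.
       This set is deliberately too strict and is meant as a placeholder. -->
  <enforce-rules>
    <enforce-after label="PREP">
      <label-set>
        <label-item label="DET"/>
        <label-item label="NOUN"/>
        <label-item label="ADJ"/>
      </label-set>
    </enforce-after>
  </enforce-rules>
</tagger>
```

A forbid rule rules out one specific sequence of coarse tags, while an enforce rule restricts everything that may follow a given tag, so enforce rules are best kept for contexts where the set of possible followers is small and well understood.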