Scrape Mongolian noun paradigms into YAML files
completed by: Richard Tynan
mentors: Francis Tyers, Jonathan
There are charts of Mongolian (i.e., Khalkha) noun paradigms at the following URL:
http://wiki.firespeaker.org/Khalkha_noun_classes
Your job is to write a script (preferably in Python 3) that scrapes those paradigms into YAML files (used for testing morphological transducers) like those at https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-cv-tr/tests/
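As a starting point, here is a minimal sketch of the extraction step, assuming the charts are rendered as ordinary HTML tables (`<table>`, `<tr>`, `<th>`, `<td>`); the actual markup of the wiki page has not been checked. The names `TableRows` and `table_rows` are hypothetical. It uses only the standard library and works on an HTML string, so the page would still need to be downloaded first (e.g. with `urllib.request.urlopen`).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>.

    Assumption: paradigm charts are plain HTML tables without nesting.
    Inline tags such as <b> are skipped, so only the cell text is kept.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html: str) -> list[list[str]]:
    """Return all table rows found in an HTML string."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table><tr><th>Nom</th><th>Dat</th></tr>"
    "<tr><td><b>гар</b> [gar]</td><td>гарт</td></tr></table>"
)
for row in table_rows(sample):
    print(row)
```

The first row of each table can then serve as the list of case names, and each later row as one noun. Detecting cells highlighted in blue would need the cell attributes (`attrs` in `handle_starttag`), which this sketch ignores.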
The script should produce files according to the following guidelines:
- each sub-paradigm type should be a separate file, named e.g. "normal nouns - ending with consonants.yaml" and "normal nouns - ending with vowels.yaml" (it would be good to case-convert to all-lowercase),
- each word should be a section in the Tests section of the file, e.g. "гар = time:",
- transcriptions (in []s) should be ignored,
- empty case forms should be skipped (e.g., no "Pl" form for classroom),
- case forms highlighted in blue should be skipped,
- all formatting of individual letters should be ignored (e.g., bolded н is common; the ''' markers around the character should be removed),
- variable forms should include all (and only) the forms given (no "—"s) (this will probably be the hardest part of designing this script),
- the script should be able to deal with new sub-paradigms, but it may (it does not have to) ignore the "to sort" section,
- the entries for the forms should be tagged as <n> with other tags coming from the form given, and the base form should be the Nom form for each noun; e.g.:
- гар<n><nom> : гар
- гар<n><gen> : гарын
- гар<n><dat> : гарт
- гар<n><nom><pl> : гарууд
- note that the Pl form is actually <nom><pl>
- the header of the yaml files should point to ../khk.autogen.hfst for Gen and ../khk.automorf.hfst for Morph.
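The guidelines above can be sketched roughly as follows. This is an illustration under several assumptions, not a finished solution: all function names are hypothetical, alternative forms in one cell are assumed to be separated by commas, slashes or "—", the case tag is assumed to be the lowercased column name, and the `Config`/`hfst` nesting of the header is assumed to match the example test files linked above. Skipping blue-highlighted cells is left out because it depends on the page markup.

```python
import re


def clean_cell(cell: str) -> list[str]:
    """Return the surface forms found in one table cell."""
    cell = re.sub(r"\[[^\]]*\]", "", cell)            # drop [transcriptions]
    cell = cell.replace("'''", "").replace("''", "")  # drop wiki bold/italic
    # Assumption: variants are separated by ",", "/" or "—".
    forms = [part.strip() for part in re.split(r"[,/—]", cell)]
    return [form for form in forms if form]           # empty cell -> []


def tags_for(column: str) -> str:
    """Map a column header to tags; the Pl column is really <nom><pl>."""
    if column == "Pl":
        return "<nom><pl>"
    return f"<{column.lower()}>"


def entries(row: dict[str, str]) -> list[tuple[str, list[str]]]:
    """Build (analysis, forms) pairs for one noun, skipping empty cells."""
    lemma = clean_cell(row["Nom"])[0]  # the base form is the Nom form
    result = []
    for column, cell in row.items():
        forms = clean_cell(cell)
        if forms:
            result.append((f"{lemma}<n>{tags_for(column)}", forms))
    return result


def file_name(paradigm: str, sub_paradigm: str) -> str:
    """One lowercase file name per sub-paradigm."""
    return f"{paradigm} - {sub_paradigm}.yaml".lower()


def render(sections: dict[str, list[tuple[str, list[str]]]]) -> str:
    """Render the header plus one Tests section per word."""
    lines = [
        "Config:",
        "  hfst:",
        "    Gen: ../khk.autogen.hfst",
        "    Morph: ../khk.automorf.hfst",
        "",
        "Tests:",
    ]
    for title, items in sections.items():
        lines.append(f'  "{title}":')
        for analysis, forms in items:
            # A single form is a scalar; several forms become a YAML list.
            value = forms[0] if len(forms) == 1 else "[" + ", ".join(forms) + "]"
            lines.append(f"    {analysis}: {value}")
    return "\n".join(lines) + "\n"


row = {"Nom": "гар [gar]", "Dat": "гар'''т'''", "Pl": ""}
print(file_name("Normal nouns", "Ending with consonants"))
print(render({"гар = time": entries(row)}))
```

Writing the YAML by hand keeps the sketch free of third-party dependencies; a YAML library could be used instead if exact quoting and escaping matter.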