Implement a feature extractor for a MaxEnt POS tagger
completed by: Stan K.
mentors: Kevin Brubeck Unhammer, Francis Tyers
The objective of this task is to write a Python program that extracts features from a tagged corpus.
The analysed (untagged) corpus looks like:
https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/turing1.tagged.txt
The hand-tagged corpus looks like:
https://svn.code.sf.net/p/apertium/svn/languages/apertium-eng/texts/turing1.handtagged.spectie.txt
The feature patterns should be described as regular expressions, e.g.
surface = \^([A-Za-z]+)/
lemma = /([A-Za-z]+)<
number = (<sg>|<pl>)
gender = (<f>|<m>|<nt>)
pos = (<n>|<adj>|<vblex>|<pr>)
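As a rough illustration of how such pattern definitions could be loaded and applied, here is a minimal Python sketch. It assumes the patterns are kept as "name = regex" lines, one per line. The names load_patterns and match_unit are hypothetical and not part of any existing Apertium tool. The surface pattern here escapes the caret, since each lexical unit starts with a literal "^".

```python
import re

# Pattern definitions in "name = regex" form (assumed file format).
# The caret in the surface pattern is escaped so that it matches the
# literal "^" that opens each lexical unit.
PATTERN_TEXT = r"""
surface = \^([A-Za-z]+)/
lemma = /([A-Za-z]+)<
number = (<sg>|<pl>)
gender = (<f>|<m>|<nt>)
pos = (<n>|<adj>|<vblex>|<pr>)
"""


def load_patterns(text):
    """Parse 'name = regex' lines into a dict of compiled patterns."""
    patterns = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, regex = line.partition("=")
        patterns[name.strip()] = re.compile(regex.strip())
    return patterns


def match_unit(patterns, unit):
    """Return {feature name: first captured value} for one lexical unit."""
    found = {}
    for name, pattern in patterns.items():
        m = pattern.search(unit)
        if m:
            found[name] = m.group(1)
    return found


patterns = load_patterns(PATTERN_TEXT)
print(match_unit(patterns, "^day/day<n><sg>$"))
# {'surface': 'day', 'lemma': 'day', 'number': '<sg>', 'pos': '<n>'}
```

Features whose pattern does not match (gender, for "day") are simply left out of the result.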
So in the example below, for "this" (the three analysed lines come first, followed by the three hand-tagged lines):
^to/to<pr>$
^this/this<det><dem><sg>/this<prn><tn><mf><sg>$
^day/day<n><sg>$
^to/to<pr>$
^this/this<det><dem><sg>$
^day/day<n><sg>$
Your features table might look like this:
Output | Input | Features
this<det><dem><sg> | this | (-1, "lemma", "to") (-1, "pos", "<pr>") (0, "number", "<sg>") (0, "lemma", "this") (1, "lemma", "day") (1, "pos", "<n>") (1, "number", "<sg>")
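One possible way to produce a row like the one above is sketched here in Python. It is only an illustration under several assumptions: a context window of one token on each side, features read from the analysed (possibly ambiguous) units, and duplicate values within a unit collapsed. The names unit_features and table_row are hypothetical.

```python
import re

# Assumed subset of the feature patterns; the window size is also an assumption.
PATTERNS = {
    "lemma": re.compile(r"/([A-Za-z]+)<"),
    "pos": re.compile(r"(<n>|<adj>|<vblex>|<pr>)"),
    "number": re.compile(r"(<sg>|<pl>)"),
}


def unit_features(unit):
    """Return (name, value) pairs for one analysed unit, without duplicates."""
    pairs = []
    for name, pattern in PATTERNS.items():
        for value in pattern.findall(unit):
            if (name, value) not in pairs:
                pairs.append((name, value))
    return pairs


def table_row(analysed, tagged, i, window=1):
    """Build (output, input, features) for the token at index i."""
    # Output: the hand-chosen analysis; input: the surface form.
    output = tagged[i].strip("^$").split("/", 1)[1]
    surface = analysed[i].strip("^$").split("/", 1)[0]
    features = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(analysed):
            features += [(offset, n, v) for n, v in unit_features(analysed[j])]
    return output, surface, features


analysed = [
    "^to/to<pr>$",
    "^this/this<det><dem><sg>/this<prn><tn><mf><sg>$",
    "^day/day<n><sg>$",
]
tagged = ["^to/to<pr>$", "^this/this<det><dem><sg>$", "^day/day<n><sg>$"]

output, surface, features = table_row(analysed, tagged, 1)
print(output, "|", surface, "|", " ".join(str(f) for f in features))
```

For "this", neither <det> nor <prn> matches the pos pattern, so no pos feature is emitted at position 0, which agrees with the row above. The order of features within a position may differ from the table.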
--------------------------------------------
Untagged:
^Turing/Turing<np><cog><sg>$
^machines/machine<n><pl>$
^are/be<vbser><pres>$
^to/to<pr>$
^this/this<det><dem><sg>/this<prn><tn><mf><sg>$
^day/day<n><sg>$
^a/a<det><ind><sg>$
^central/central<adj>$
^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$
^of/of<pr>$
^study/study<n><sg>/study<vblex><inf>/study<vblex><pres>$
^in/in<pr>$
^theory/theory<n><sg>$
^of/of<pr>$
^computation/computation<n><sg>$
^./.<sent>$
Tagged:
^Turing/Turing<np><cog><sg>$
^machines/machine<n><pl>$
^are/be<vbser><pres>$
^to/to<pr>$
^this/this<det><dem><sg>$
^day/day<n><sg>$
^a/a<det><ind><sg>$
^central/central<adj>$
^object/object<n><sg>$
^of/of<pr>$
^study/study<n><sg>$
^in/in<pr>$
^theory/theory<n><sg>$
^of/of<pr>$
^computation/computation<n><sg>$
^./.<sent>$
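To show how the untagged and tagged listings above might be read side by side, here is a small Python sketch. It assumes one lexical unit per line, as in the excerpts, and splits naively on "/", so surface forms containing an escaped slash are not handled. The function names parse_unit and pair_corpora are hypothetical.

```python
def parse_unit(line):
    """Split '^surface/analysis1/analysis2$' into (surface, [analyses])."""
    body = line.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {line!r}")
    surface, *analyses = body[1:-1].split("/")
    return surface, analyses


def pair_corpora(untagged_lines, tagged_lines):
    """Yield (surface, candidate analyses, chosen analysis) per token."""
    if len(untagged_lines) != len(tagged_lines):
        raise ValueError("corpora differ in length")
    for u_line, t_line in zip(untagged_lines, tagged_lines):
        surface, candidates = parse_unit(u_line)
        t_surface, chosen = parse_unit(t_line)
        # The tagged unit must carry exactly one analysis, and that
        # analysis must be one of the candidates in the untagged unit.
        if surface != t_surface or len(chosen) != 1 or chosen[0] not in candidates:
            raise ValueError(f"misaligned units: {u_line!r} vs {t_line!r}")
        yield surface, candidates, chosen[0]


untagged = [
    "^central/central<adj>$",
    "^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$",
    "^./.<sent>$",
]
tagged = ["^central/central<adj>$", "^object/object<n><sg>$", "^./.<sent>$"]

for surface, candidates, chosen in pair_corpora(untagged, tagged):
    print(surface, len(candidates), chosen)
```

Checking the alignment up front makes it easier to spot places where the two files have drifted apart before any features are extracted.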
Further reading on maximum entropy POS tagging:
http://www.aclweb.org/anthology/W96-0213