Support non-ASCII characters in flex lexers
completed by: Dalimil Hájek
mentors: Mikel L. Forcada, Francis Tyers, Kirill Krylov
Currently, flex lexers generated by the 'create-lexer.py' script[1] do not support non-ASCII characters. The objective of this task is to adjust the regular expressions so that they do.
Some ideas may be gleaned from the regular expressions in the Apertium code:
attr_items[L"lem"] = L"(([^<]|\"\\<\")+)";
attr_items[L"lemq"] = L"\\#[- _][^<]+";
attr_items[L"lemh"] = L"(([^<#]|\"\\<\"|\"\\#\")+)";
attr_items[L"whole"] = L"(.+)";
attr_items[L"tags"] = L"((<[^>]+>)+)";
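The expressions above avoid enumerating letters: a lemma is "anything that is not <", so accented characters and spaces are accepted without being listed. The following is a minimal Python sketch of that idea, not the code of create-lexer.py. The names ASCII_ONLY, LEMMA, TAGS and split_lexical_unit are hypothetical, and the patterns are simplified versions of "lem" and "tags" (the quoted "<" alternative is left out).

```python
import re

# Hypothetical patterns for illustration only.
# An explicit ASCII letter class stops at the first non-ASCII character.
ASCII_ONLY = re.compile(r"[A-Za-z]+")
# A negated class, as in the "lem" expression above, accepts any
# character except '<', including accented letters and spaces.
LEMMA = re.compile(r"[^<]+")
# One or more <tag> groups, as in the "tags" expression above.
TAGS = re.compile(r"(<[^>]+>)+")


def split_lexical_unit(unit):
    """Split a string such as 'lemma<tag1><tag2>' into (lemma, [tags])."""
    m = LEMMA.match(unit)
    if m is None:
        raise ValueError("no lemma found in %r" % unit)
    lemma = m.group(0)
    rest = unit[m.end():]
    # Anything after the lemma must consist of well-formed tags only.
    if rest and TAGS.fullmatch(rest) is None:
        raise ValueError("malformed tags in %r" % unit)
    tags = re.findall(r"<([^>]+)>", rest)
    return lemma, tags
```

For instance, ASCII_ONLY.match("Hájek") stops after "H", whereas split_lexical_unit("Hájek<np><ant>") keeps the whole lemma, and a lemma with spaces such as "a pesar de<pr>" is handled the same way. In a flex specification the equivalent would be a negated character class in the generated rules, assuming the scanner is built to accept 8-bit input.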
This task also involves making the lexers and the format-parse.py script handle spaces in lemmas properly (i.e. allow them). Lemmas must also be specified correctly: enclosed in ( ) only when they are optional, and in " " only when they contain spaces.
1. https://svn.code.sf.net/p/apertium/svn/branches/transfer4/scripts/create-lexer.py