Research existing MS Office text extractors
completed by: qxcv
mentors: Bastian Blank, Reimar Bauer, Thomas Waldmann, Prashant Kumar, Eugene Syromyatnikov
Abstract
Research existing solutions for extracting text from proprietary Microsoft file formats.
Details
For moin2, we already have a number of converters (including one for the Open Document Format used by OpenOffice / LibreOffice), but nothing for Microsoft Office formats. We now need a survey of existing code, under a GPL2+-compatible license, that can extract text from these proprietary file formats.
We need to know:
- is the license compatible with GPL2+?
  - for Python libraries, e.g. GPL, BSD, MIT, ... (not: Apache License 2)
  - in general: a free software license, not a proprietary one
- the programming language used
  - library code in Python is strongly preferred (we can just call it)
  - a command-line tool that we can call as a subprocess might also work (which platforms does it support?)
  - Windows-only solutions are not wanted
- compatibility with different file formats (mainly Word, but also Excel and PowerPoint)
- compatibility with different format versions (e.g. .DOC and .DOCX)
- reliability (is the code well maintained and recently updated?)
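To make the "command-line tool called as a subprocess" option concrete, here is a minimal sketch of how such a tool could be wrapped. It assumes the tool takes the file path as its last argument and prints the extracted text to stdout as UTF-8. The function name `extract_text_with_tool` and the tool name in the usage note are hypothetical, not part of moin2.

```python
import shutil
import subprocess


def extract_text_with_tool(command, path, timeout=30):
    """Run an external extractor on `path` and return its stdout as text.

    `command` is a list holding the program name plus any fixed options;
    the file path is appended as the last argument (an assumption about
    the tool's interface). Returns None if the tool is missing or fails.
    """
    if shutil.which(command[0]) is None:
        return None  # tool is not installed on this platform
    try:
        result = subprocess.run(
            command + [path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        return None
    # Assumes UTF-8 output; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

A call might look like `extract_text_with_tool(["some-doc-extractor"], "report.doc")`, where `some-doc-extractor` is a placeholder. Returning `None` when the tool is absent lets indexing continue on platforms where it is not available, which matters for the platform questions above.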
Deliverable: wiki page
Benefits
Many Moin users would like to have a platform-independent, pure Python way to extract text for indexing.
Researching the existing code base is a first step in this direction.
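As a rough illustration of what "pure Python" extraction can look like, the sketch below reads the main text of a .docx file using only the standard library. It relies on the fact that a .docx file is a ZIP archive whose body text is stored in `word/document.xml`. The function name `docx_to_text` is hypothetical, and the sketch ignores headers, footnotes and other parts. The older binary .DOC format is not covered, which is one reason the survey is needed.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespace of WordprocessingML, the markup used inside .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(file_or_path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(file_or_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph is split into runs; each run's text sits in <w:t>.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

For indexing, plain paragraph text like this is usually enough. Formatting, images and embedded objects are simply skipped.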
Skill Requirements
You'll need to do a lot of searching on the Web and discuss your findings with the moin devs online on IRC.
Links
This task refers to moin2 (http://moinmo.in/MoinMoin2.0)!
http://hg.moinmo.in/moin/2.0 or http://bitbucket.org/thomaswaldmann/moin-2.0 - repository of moin2
http://moinmo.in/MoinMoinChat - please join us on IRC #moin-dev
You can discuss this issue in the MoinMoin wiki: http://moinmo.in/EasyToDo/TextExtractors