Google Summer of Code 2014

Optical Character Recognition for Apache PDFBox

by Dimuthu Upeksha for Apache Software Foundation

Apache PDFBox is widely used to extract text from PDF files. However, its current approach cannot extract text from image content or from text with corrupted character encodings. This project introduces a new approach that extracts text from PDF files using Optical Character Recognition.