A Complete OCR System for Tamil Magazine Documents
Date Issued
2009
Author(s)
Kokku, A
Chakravarthy, S
Abstract
We present a complete optical character recognition (OCR) system for Tamil magazine documents. All the standard stages of the OCR process, such as de-skewing, preprocessing, segmentation, character recognition, and reconstruction, are implemented. Experience with OCR problems teaches that, for most OCR subtasks, no single technique gives perfect results for every type of document image. We exploit the ability of neural networks to learn from experience in solving the problems of segmentation and character recognition. Text segmentation of Tamil newsprint poses a new challenge owing to its italic-like font type; problems that arise in the recognition of touching and closely spaced characters are discussed. Character recognition efficiency varied from 94% to 97% for this type of font. Grouping blocks into logical units and determining the reading order within each logical unit helped us automatically reconstruct the document image in an editable format.
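The abstract does not say how reading order is determined. As an illustration only, here is a minimal sketch of one common heuristic: group text blocks into columns by horizontal overlap, then read the columns left to right and the blocks within each column top to bottom. The `Block` type, the `reading_order` function, and the coordinates are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: label plus bounding box in pixels."""
    text: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def reading_order(blocks):
    """Return blocks column by column (left to right), top to bottom in each.

    Assumption: a block belongs to an existing column when its left edge
    falls before that column's current right edge; otherwise it starts a
    new column. Real magazine layouts usually need a more robust rule.
    """
    columns = []  # each column: {"left": int, "right": int, "blocks": [...]}
    for block in sorted(blocks, key=lambda b: b.x):
        for col in columns:
            if block.x < col["right"]:
                col["blocks"].append(block)
                col["right"] = max(col["right"], block.x + block.w)
                break
        else:
            columns.append(
                {"left": block.x, "right": block.x + block.w, "blocks": [block]}
            )

    ordered = []
    for col in sorted(columns, key=lambda c: c["left"]):
        ordered.extend(sorted(col["blocks"], key=lambda b: b.y))
    return ordered


# Two-column page given in scrambled order.
page = [
    Block("D", 125, 60, 90, 40),
    Block("B", 0, 60, 100, 50),
    Block("C", 120, 0, 100, 50),
    Block("A", 0, 0, 100, 50),
]
order = [b.text for b in reading_order(page)]
```

In this sketch, a heading that spans both columns would merge them into a single column, so a practical system would have to handle such blocks separately.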