OCR Software: Optical Character Recognition or Optical Crud
Recognition?
Is it really possible to get high OCR accuracy from poor quality
documents?
Optical Character Recognition (OCR) is a software technology that
translates printed text in a scanned image into computer-searchable
text.
Done correctly, OCR enables users to search for and retrieve
individual words contained within a file or page. In addition,
when a set of files is indexed, users are able to search for
keywords across an entire document library and retrieve every
page that contains them. Searches that once took several hours
or days can now be completed in seconds.
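
To make the indexing step concrete, here is a minimal sketch of a
keyword index built over OCR text. It assumes the OCR output is
already available as plain text per page; the file names, page
contents and the function names build_index and search are
hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of (document, page) pairs containing it."""
    index = defaultdict(set)
    for (doc, page_no), text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((doc, page_no))
    return index

def search(index, query):
    """Return the pages that contain every word of the query, in sorted order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR text, keyed by (file name, page number).
pages = {
    ("deed_1952.pdf", 1): "Warranty deed recorded in the county of Marin",
    ("deed_1952.pdf", 2): "Signed before a notary public",
    ("survey_1960.pdf", 1): "Boundary survey for the county recorder",
}
index = build_index(pages)
print(search(index, "county"))
# [('deed_1952.pdf', 1), ('survey_1960.pdf', 1)]
```

The index is built once, so each later search is a set lookup
rather than a scan of every page, which is where the speed comes
from. A real system would also handle OCR errors in the indexed
words themselves.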
Historically, however, this technology has not worked well on
older or poor quality documents that contain mixed fonts or
combinations of text and graphics. Until now!
Thanks to several recent technological advances, it is now
possible to obtain six-sigma-level character accuracy from these
types of document collections.
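
To put that figure in perspective, the arithmetic can be sketched
as follows. This assumes "six-sigma" is meant in its conventional
quality-management sense of about 3.4 defects per million
opportunities, with each character counted as one opportunity;
the 2,000-character page is a hypothetical figure, not one taken
from any particular collection.

```python
SIX_SIGMA_ERRORS_PER_MILLION = 3.4  # conventional Six Sigma defect rate (assumed meaning)

def character_accuracy(errors_per_million):
    """Fraction of characters recognized correctly at a given error rate."""
    return 1.0 - errors_per_million / 1_000_000

def expected_errors(num_characters, errors_per_million):
    """Expected number of misrecognized characters in a text of the given length."""
    return num_characters * errors_per_million / 1_000_000

accuracy = character_accuracy(SIX_SIGMA_ERRORS_PER_MILLION)
print(f"{accuracy:.5%}")  # 99.99966%

# Hypothetical page of 2,000 characters.
per_page = expected_errors(2_000, SIX_SIGMA_ERRORS_PER_MILLION)
print(per_page)
```

Under these assumptions a page carries about 0.0068 expected
errors, or roughly one misread character every 147 pages.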
The quality and condition of the paper documents are still key
factors in a successful OCR conversion, but dramatically improved
results can be obtained by enhancing the quality of the scanned
image prior to processing.
Border removal, speckle removal and skew correction are now
common on the more advanced document scanners.
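
As an illustration of one of these clean-up steps, here is a
minimal speckle-removal sketch. It is not how any particular
scanner implements it: it assumes the page is already a binary
image held as a list of rows (1 for ink, 0 for background) and
simply clears ink pixels that have no ink neighbor. The function
name despeckle and the sample image are hypothetical.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated ink pixels cleared.

    A pixel counts as isolated when none of its (up to) eight
    neighbors is ink. Real despeckle filters usually also remove
    small clusters; this sketch handles single pixels only.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_ink_neighbor = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbor:
                cleaned[y][x] = 0
    return cleaned

# A 2x2 block of ink (part of a character) plus one stray speckle at the right edge.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
for row in despeckle(page):
    print(row)
```

The neighbor test reads from the original image and writes to a
copy, so clearing one pixel never changes the decision for the
next.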
Furthermore, advanced color filter technologies can be used to
reduce page background colors, and multi-light image capture
technologies can remove shadows cast by page creases, either of
which could otherwise impact image quality or recognition
accuracy.
Once document scanning and image processing are complete, an
additional orientation filter can be used to ensure that the
best image is presented to the OCR engines. The resulting OCR
text layer can then be added and hidden behind each image.
To achieve the highest conversion accuracy possible, the
characters in the image can be processed using multi-engine OCR
voting technologies, which rank the candidates that the engines
propose for each character to determine the best fit. Once a
word has been assembled, it is checked against a proprietary
lexicon to ensure the highest quality results.
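
The voting and lexicon steps can be sketched roughly as follows.
This is only an illustration under simplifying assumptions: each
engine is assumed to return a string of the same length for the
same word, simple majority voting stands in for whatever ranking
the engines actually use, and a small word set with fuzzy
matching from Python's difflib stands in for the proprietary
lexicon. The function names and sample strings are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches

def vote_characters(candidates):
    """Majority vote per character position across aligned engine outputs."""
    if len({len(c) for c in candidates}) != 1:
        raise ValueError("engine outputs must be aligned to the same length")
    voted = []
    for column in zip(*candidates):
        # most_common(1) gives the top-ranked character; on a tie the
        # character seen first (the earliest engine) wins.
        voted.append(Counter(column).most_common(1)[0][0])
    return "".join(voted)

def lexicon_filter(word, lexicon, cutoff=0.8):
    """Keep a known word; otherwise substitute the closest lexicon entry, if any."""
    lowered = word.lower()
    if lowered in lexicon:
        return word
    matches = get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical outputs from three engines reading the same word.
engine_outputs = ["recogniti0n", "rec0gnition", "recognitlon"]
lexicon = {"optical", "character", "recognition"}

word = vote_characters(engine_outputs)
print(word)                                   # recognition
print(lexicon_filter("recognitlon", lexicon)) # recognition
```

Each engine misreads a different character, so the vote recovers
the correct word even though no single engine did. The lexicon
pass then catches errors that survive voting, at the cost of
occasionally "correcting" a rare word or name that is not in the
lexicon.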
Finally, this text can be processed using sophisticated layout
retention technologies that preserve the layout of the text in
the original image, providing the best possible text
representation for precise search and retrieval. After all,
isn't that why they call it Optical Character Recognition?