We describe a system for indexing census records in tabular documents, with the goal of recognizing the content
of each cell, including both headers and handwritten entries. Each document is automatically rectified, registered,
and scaled to a known template, after which lines and fields are detected and delimited as cells in a tabular
form. Whole-word or whole-phrase recognition of noisy machine-printed text is performed using a glyph library,
providing greatly increased efficiency and accuracy (approaching 100%) while avoiding the problems inherent
in traditional OCR approaches. Constrained handwriting recognition results for a single author reach as high
as 98% and 94.5% for the Gender and Birthplace fields, respectively. Multi-author accuracy (currently 82%) can
be improved with a larger training set. Active integration of user feedback in the system will accelerate
the indexing of records while providing a tightly coupled learning mechanism for system improvement.
For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR
text unusable. Since image binarization is often the method used to identify foreground pixels, a body of research seeks
to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method
incorporates information from multiple simple thresholding binarizations of the same image to improve text output. Using
a new corpus of 19th century newspaper grayscale images for which the text transcription is known, we observe WERs of
13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines
the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives
from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two thresholded
images and an LWER of 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the
methods of Lund et al. (2011), achieving an improvement over the original OCR output and an 8.41% WER result on this
data set.
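The relationship between WER and LWER described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the abstract: the OCR outputs are already aligned word for word, the lattice is represented as a list of sets of alternatives (one set per word position), and LWER is taken to be the lowest word error rate achievable by any path through the lattice. The function names `word_error_rate`, `build_lattice`, and `lattice_word_error_rate` are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # reference word missing
                            curr[j - 1] + 1,          # extra hypothesis word
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def build_lattice(aligned_outputs):
    """Assumption: every OCR output has the same number of words, so
    position k of each output refers to the same word in the image."""
    token_lists = [text.split() for text in aligned_outputs]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("this sketch expects pre-aligned, equal-length outputs")
    return [set(column) for column in zip(*token_lists)]


def lattice_word_error_rate(reference, lattice):
    """Lowest WER over all paths through the lattice (assumed definition)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(lattice) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, slot in enumerate(lattice, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            # A slot matches if any alternative is correct.
                            prev[j - 1] + (r not in slot)))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox"
outputs = ["the quiek brown fox",   # e.g. OCR of one thresholded image
           "the quick hrown fox"]   # e.g. OCR of another thresholded image
lattice = build_lattice(outputs)
print([word_error_rate(reference, o) for o in outputs])
print(lattice_word_error_rate(reference, lattice))
```

In this toy case each individual output contains one wrong word, but the errors fall in different positions, so the lattice contains a fully correct path. That is the sense in which a lattice error rate can fall below the error rate of any single binarization; choosing one path from the lattice is a separate step that this sketch does not attempt.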
We present a method of interactive training for handwriting recognition in collections of documents. As the user transcribes (labels) the words in the training set, words are automatically skipped if they appear to match words that are already transcribed. By reducing the amount of redundant training, better coverage of the data is achieved, resulting in more accurate recognition. Using word-level features for training and recognition in a collection of George Washington's manuscripts, the recognition ratio is approximately 2%-8% higher after training with our interactive method than after training the same number of words sequentially. Using our approach, less training is required to achieve an equivalent recognition ratio. A slight improvement in recognition ratio is also observed when using our method on a second data set, which consists of several pages from a diary written by Jennie Leavitt Smith.