A technique for extracting textual information from documents with complex layouts, such as newspapers and journals, is presented. It combines a foreground analysis method with a text localization method. The former is used to segment the page into text and nontext blocks, whereas the latter is used to detect text that may be embedded inside images, charts, diagrams, tables, etc. Detailed experiments on two public databases showed that mixing layout analysis and text localization techniques can lead to improved page segmentation and text extraction results.
Slant removal is a necessary preprocessing task in many document image processing systems. In this paper, we describe a technique for removing the slant from the entire page, avoiding the segmentation procedure. The presented technique can be combined with most existing slant removal algorithms. Experimental results are presented on two databases.
In this paper, a classification-free word-spotting system appropriate for the retrieval of printed historical document images is proposed. The system skips many of the procedures of a common approach: it does not include segmentation, feature extraction, or classification. Instead, it treats the queries as compact shapes and uses image processing techniques to localize a query in the document images. Our system was tested on a historical document collection with many problems and on a Google book printed in 1675. Moreover, some comparative results are given for a traditional word-spotting system.
In this paper, we present a procedure for removing ruling lines from a handwritten document image that requires no preprocessing or postprocessing tasks and does not break existing characters. We take advantage of common ruling line properties such as uniform width, predictable spacing, and position relative to the text. The deletion procedure for a detected ruling line is based on the fact that the determinant formed from the coordinates of three collinear points is equal to zero. The system is evaluated on synthetic page images in five different languages and is compared to a previous methodology.
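The collinearity fact mentioned in this abstract can be illustrated with a minimal sketch. The function names, the integer pixel coordinates, and the `tol` parameter are assumptions made for illustration only; the abstract does not describe how the test is applied during deletion, so this shows the determinant test itself and not the authors' procedure.

```python
def collinearity_det(p, q, r):
    """Determinant of the 3x3 matrix [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]].

    Its absolute value is twice the area of the triangle p-q-r,
    so it is zero exactly when the three points lie on one line.
    """
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # Cofactor expansion along the first row.
    return x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)


def lies_on_line(p, q, r, tol=0):
    """Hypothetical helper: True if r is on the line through p and q.

    tol is an assumed tolerance; 0 is exact for integer coordinates.
    """
    return abs(collinearity_det(p, q, r)) <= tol


# Two points taken on an assumed horizontal ruling line at y = 5.
line_start, line_end = (0, 5), (10, 5)
print(lies_on_line(line_start, line_end, (4, 5)))  # pixel on the line
print(lies_on_line(line_start, line_end, (4, 7)))  # pixel off the line
```

With integer pixel coordinates the determinant is computed exactly, so a zero tolerance is enough; a small positive `tol` would be one way to accept pixels of a slightly thick or skewed line.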
Implementing word spotting is not easy, and it becomes harder still for historical documents, since it requires character recognition and indexing of the document images. A general technique for word spotting is presented that is independent of OCR: the user's text queries are automatically represented as word images and compared with the word images extracted from the document images. The proposed system does not require training. The only required preprocessing task is the alphabet determination. Global shape features are used to describe the words. They are kept very general in order to capture the form of the word, and they are appropriately normalized to cope with the usual problems of variance in resolution, word width, and fonts. A novel technique that makes use of interpolation is presented. In our experiments, we analyze the dependence of the system on its parameters and show that its performance is similar to that of trainable systems.
Conference Committee Involvement (2)
Document Recognition and Retrieval XXII
11 February 2015 | San Francisco, California, United States
Document Recognition and Retrieval XXI
5 February 2014 | San Francisco, California, United States