In this paper, we introduce the HoughToRadon Transform layer, a novel layer designed to improve the speed of neural networks that incorporate the Hough Transform to solve semantic image segmentation problems. Placed after a Hough Transform layer, it supplies the 'inner' convolutions with modified feature maps that have new beneficial properties, such as a smaller area of processed images and parameter-space linearity in angle and shift, properties absent from the Hough Transform alone. Furthermore, the HoughToRadon Transform layer allows us to adjust the size of intermediate feature maps using two new parameters, and thus to balance the speed and quality of the resulting neural network. Our experiments on the open MIDV-500 dataset show that this new approach leads to time savings in document segmentation tasks and achieves state-of-the-art 97.7% accuracy, outperforming HoughEncoder, which has larger computational complexity.
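As a rough illustration of the (angle, shift) parameter space such a layer exposes, the following NumPy/SciPy sketch maps an image into an angle-by-shift grid via a rotation-based Radon approximation. It is a hypothetical stand-in for intuition only, not the paper's layer, which operates on Hough Transform output inside a network.

```python
import numpy as np
from scipy.ndimage import rotate  # rotation-based Radon approximation

def radon_parameter_space(image, n_angles=64):
    """Map an image to an (angle, shift) parameter space.

    Illustrative stand-in for the Hough/Radon layer described above:
    each row is one projection angle, each column one shift, so the
    space is linear in both parameters (hypothetical helper, not the
    paper's implementation).
    """
    h, w = image.shape
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = np.empty((n_angles, w), dtype=np.float64)
    for i, theta in enumerate(angles):
        rotated = rotate(image, theta, reshape=False, order=1)
        sinogram[i] = rotated.sum(axis=0)  # line integrals at this angle
    return angles, sinogram

angles, sino = radon_parameter_space(np.random.rand(64, 64))
print(sino.shape)  # (64, 64): an angle x shift grid for 'inner' convolutions
```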
Virtual unrolling or unfolding, digital unwrapping, flattening, or unfurling: all these terms describe the process of straightening the surface of a tomographically reconstructed digital object. For many objects of historical heritage, tomography is the only way to obtain a hidden image of the original object without destroying it. Digital flattening is no longer considered a unique methodology and is applied by many research groups, yet AI-based methods play only a minor role in such projects, despite the remarkable success of AI in computer vision, in particular in optical text recognition. This can be explained by the fact that the success of AI depends on large, broad, and high-quality datasets, while very few published CT-based datasets are relevant to the task of digital flattening. Accumulating a sufficient amount of data for training models is a key prerequisite for the next technological breakthrough. In this paper, we present the open and cumulative dataset CT-OCR-2022. The dataset includes six data packages for different model objects that help to enrich tomographic solutions and to train machine learning models. Each package contains an optically scanned image of the model object, 400 measured X-ray projections, 2687 CT-reconstructed cross-sections of the 3D reconstructed image, and segmentation markups. We believe that the CT-OCR-2022 dataset will serve as a benchmark for digital flattening and recognition systems for reconstructed objects, and that it will prove invaluable for the advancement of CT reconstruction, symbol analysis, and recognition. The data presented are openly available in Zenodo at doi:10.5281/zenodo.7123495 and linked repositories.
For text line recognition, much attention is paid to the augmentation of training images. Yet the inner structure of the textual information in the images also affects the accuracy of the resulting model. In this paper, we propose an ANN-based method for generating textual data to be printed over backgrounds in a synthetic training sample. Our method avoids both completely random sequences and dictionary-based ones. As a result, we obtain data that preserves the basic properties of the target language model, such as the balance of vowels and consonants, but avoids lexicon-based properties, such as the prevalence of specific characters. Moreover, since our method focuses only on high-level features and does not try to generate real words, we can use a small training sample and a light-weight ANN for text generation. To evaluate our method, we train three ANNs with the same architecture but different training samples. We choose machine-readable zones as the target field because their structure does not correspond to an ordinary lexicon. The results of experiments on three public datasets of identity documents demonstrate the effectiveness of our method and improve on the state-of-the-art results for the target field.
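For intuition, here is a minimal statistical stand-in for such a generator: a character bigram sampler that preserves high-level character statistics (e.g., vowel/consonant balance) without reproducing a lexicon. The paper uses a light-weight ANN rather than bigram counts; the MRZ-style alphabet and the single specimen line below are purely illustrative.

```python
import numpy as np

def fit_bigram(lines, alphabet):
    """Estimate character bigram statistics from a small sample.

    Minimal statistical stand-in for the light-weight ANN generator
    described above: it keeps high-level properties (character
    frequencies, vowel/consonant balance) without memorizing words.
    """
    idx = {c: i for i, c in enumerate(alphabet)}
    counts = np.ones((len(alphabet), len(alphabet)))  # add-one smoothing
    for line in lines:
        for a, b in zip(line, line[1:]):
            if a in idx and b in idx:
                counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample(probs, alphabet, length, rng=np.random.default_rng(0)):
    """Sample a synthetic line character by character from the bigram model."""
    out = [rng.integers(len(alphabet))]
    for _ in range(length - 1):
        out.append(rng.choice(len(alphabet), p=probs[out[-1]]))
    return "".join(alphabet[i] for i in out)

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ<"            # MRZ-like alphabet
probs = fit_bigram(["P<UTOERIKSSON<<ANNA<MARIA<"], alphabet)  # ICAO specimen line
print(sample(probs, alphabet, 30))                  # non-word, MRZ-flavored text
```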
In this work, we present an auto-clustering method that can be used for pattern recognition tasks and applied to the training of a metric convolutional neural network. The main idea is that the algorithm creates clusters consisting of classes that are similar from the network's point of view. Using these clusters lets the network pay more attention to classes that are hard to distinguish. The method improves the generation of pairs during the training process, a live problem because the optimal generation of data significantly affects the quality of training. The algorithm works in parallel with the training process and is fully automatic. To evaluate the method, we chose the Korean alphabet with the corresponding PHD08 dataset and compared our auto-clustering with random mining, hard mining, and distance-based mining. The open-source framework Tesseract OCR 4.0.0 was also considered as a baseline.
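A schematic sketch of this idea follows: classes are clustered by their current mean embeddings, and negative pairs are drawn from within a cluster, where classes are confusable. The KMeans choice and sampling rule are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_classes(class_embeddings, n_clusters):
    """Group classes the network currently confuses (schematic sketch).

    class_embeddings: (n_classes, dim) mean embedding per class, taken
    from the metric network's current state. Classes whose centroids
    fall into one cluster are 'similar from the network's point of view'.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(class_embeddings)
    return [np.where(labels == k)[0] for k in range(n_clusters)]

def sample_hard_pair(clusters, rng=np.random.default_rng(0)):
    """Draw a negative pair from within one cluster of confusable classes."""
    cluster = clusters[rng.integers(len(clusters))]
    while len(cluster) < 2:
        cluster = clusters[rng.integers(len(clusters))]
    a, b = rng.choice(cluster, size=2, replace=False)
    return int(a), int(b)

emb = np.random.rand(100, 32)          # e.g. 100 Korean character classes
clusters = cluster_classes(emb, n_clusters=10)
print(sample_hard_pair(clusters))      # two confusable classes for a negative pair
```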
Image recognition includes problems where distinctive features can be found only in a specific area of an image. This suggests applying different filters to different areas of the input image. Convolutional networks offer only fully-connected and locally-connected layers for this purpose: a fully-connected layer erases the position factor for every output, while a locally-connected layer stores an enormous number of parameters. We need a layer that can apply different convolution kernels to different areas of an input image without carrying as many parameters as a locally-connected layer does for high-resolution images. This is why, in this paper, we introduce a new type of convolutional layer - the block layer - and a way to construct a neural network using block convolutional layers to achieve better performance in the image classification problem. The influence of block layers on the quality of the neural network classifier is shown in this paper, along with a comparison against the LeNet-5 architecture as a baseline. The research was conducted on open datasets: MNIST, CIFAR-10, and Fashion MNIST. The results prove that this layer can increase the accuracy of neural network classifiers without increasing the number of operations in the neural network.
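A minimal PyTorch sketch of such a block layer, reconstructed from the description above (the grid size and padding choices are ours, not the authors'): the input is split into a grid and each region gets its own convolution kernel, giving position-dependent filters with far fewer parameters than a locally-connected layer.

```python
import torch
import torch.nn as nn

class BlockConv2d(nn.Module):
    """Sketch of a 'block layer': a separate convolution per image region.

    The input is split into a grid_h x grid_w grid and each block gets
    its own kernel, unlike a shared-kernel Conv2d (position-dependent
    filters) and unlike a locally-connected layer (far fewer
    parameters). Schematic reconstruction, not the authors' code.
    """
    def __init__(self, in_ch, out_ch, kernel_size, grid_h, grid_w):
        super().__init__()
        self.grid_h, self.grid_w = grid_h, grid_w
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(grid_h * grid_w)
        )

    def forward(self, x):
        n, _, h, w = x.shape
        bh, bw = h // self.grid_h, w // self.grid_w
        rows = []
        for i in range(self.grid_h):
            cols = []
            for j in range(self.grid_w):
                block = x[:, :, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                cols.append(self.convs[i * self.grid_w + j](block))
            rows.append(torch.cat(cols, dim=3))
        return torch.cat(rows, dim=2)

layer = BlockConv2d(1, 8, kernel_size=3, grid_h=2, grid_w=2)
print(layer(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 8, 28, 28])
```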
In the field of document analysis and recognition using mobile capture, and in object recognition in a video stream, it is important to be able to combine the information received from different frames, since the quality of text recognition depends on how effectively the maximal amount of information about the target object is collected. This paper examines and compares the effectiveness of two combination approaches: combining images before recognition and combining recognition results. The combination methods are briefly described. The quality of the combined results obtained using the different methods was measured and compared on the MIDV-500 dataset. The results show that combining text string recognition results is more effective than preliminary image combination. It can be concluded that simple image stacking with projective alignment does not achieve comparable combination quality, and thus more sophisticated image combination algorithms need to be employed in order to account for per-frame changes of the text images.
Due to a noticeable expansion of document recognition applicability, there is a high demand for recognition on mobile devices. A mobile camera, unlike a scanner, cannot always ensure the absence of various image distortions, so the task of improving recognition precision is relevant. The advantage of mobile devices over scanners is the ability to use video stream input, which provides multiple images of the recognized document. Despite this, not enough attention is currently paid to combining recognition results obtained from different frames of a video stream. In this paper, we propose a weighted combination method for text string recognition results, along with weighting criteria, and provide experimental data verifying their validity and effectiveness. Based on the obtained results, we conclude that such a weighted combination is appropriate for improving the quality of video stream recognition results.
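The following sketch shows one plausible form of such a weighted combination: per-position character voting across frames, where each frame contributes with a quality weight. The per-position voting rule and the weights are illustrative; the paper studies specific weighting criteria.

```python
from collections import defaultdict

def combine_strings(frame_results):
    """Weighted per-position character voting across video frames (sketch).

    frame_results: list of (text, weight) pairs, where the weight is a
    per-frame quality estimate (e.g. focus score or recognition
    confidence; the weighting criteria here are illustrative).
    """
    length = max(len(text) for text, _ in frame_results)
    combined = []
    for pos in range(length):
        votes = defaultdict(float)
        for text, weight in frame_results:
            if pos < len(text):
                votes[text[pos]] += weight
        combined.append(max(votes, key=votes.get))
    return "".join(combined)

frames = [("SM1TH", 0.4), ("SMITH", 0.9), ("SMITH", 0.8), ("SMlTH", 0.3)]
print(combine_strings(frames))  # SMITH: low-weight frame errors are outvoted
```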
In this work we consider the problem of detecting fluorescent security fibers in images of identity documents captured under ultraviolet light. Using images of the second and third pages of the Russian passport as an example, we show the features that render known methods and approaches based on image binarization inapplicable. We propose a solution based on ridge detection in the grayscale image of the document after preliminary background normalization. The algorithm was tested on a private dataset consisting of both authentic and model passports. Abandoning binarization provides reliable and stable operation of the proposed detector on the target dataset.
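To make the scheme concrete, here is a Hessian-eigenvalue ridge detector with crude background normalization, following the general pipeline the abstract describes. The normalization method and the threshold are illustrative stand-ins, not the paper's specific procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def detect_ridges(gray, sigma=2.0):
    """Ridge detection on a grayscale image via Hessian eigenvalues (sketch).

    Normalize the background, then mark pixels where the smaller
    Hessian eigenvalue is strongly negative, i.e. thin bright lines
    such as fluorescent fibers. The threshold is illustrative.
    """
    # crude background normalization: divide by a heavily smoothed copy
    background = uniform_filter(gray, size=31) + 1e-6
    norm = gray / background

    # Hessian components as Gaussian derivatives
    hrr = gaussian_filter(norm, sigma, order=(2, 0))
    hcc = gaussian_filter(norm, sigma, order=(0, 2))
    hrc = gaussian_filter(norm, sigma, order=(1, 1))

    # smaller eigenvalue of [[hrr, hrc], [hrc, hcc]] per pixel
    tmp = np.sqrt(((hrr - hcc) / 2) ** 2 + hrc ** 2)
    lam_small = (hrr + hcc) / 2 - tmp   # most negative along bright ridges
    return lam_small < -0.02            # illustrative threshold

img = np.zeros((128, 128))
img[:, 64] = 1.0                        # thin bright vertical "fiber"
img = gaussian_filter(img, 1.0) + 0.1   # slight blur plus flat background
print(detect_ridges(img)[64, 64])       # True at the fiber center
```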
Despite significant success in the field of text recognition, complex and unsolved problems still remain. In recent years, the recognition accuracy for English has greatly increased, while the recognition of hieroglyphs has received much less attention. Hieroglyph recognition, i.e., recognition of images with Korean, Japanese, or Chinese characters, differs from the traditional text recognition task. This article discusses the main differences between hieroglyphic languages and the Latin alphabet in the context of image recognition. A light-weight method for recognizing images of hieroglyphs is proposed and tested on a public dataset of Korean hieroglyph images. Unlike existing solutions, the proposed method is suitable for mobile devices, and its recognition accuracy is better than that of an open-source OCR framework. The presented method of training the embedding network is based on similarities in the recognition data.
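One decision rule compatible with such an embedding network is nearest-centroid classification, which avoids a large softmax layer over thousands of hieroglyph classes. The sketch below is a plausible illustration consistent with the abstract; the class count and distance choice are examples, not the paper's specification.

```python
import numpy as np

def classify_by_embedding(query_vec, class_centroids):
    """Nearest-centroid classification in an embedding space (sketch).

    The network produces an embedding for the input image, and the
    class is the one whose reference vector is closest. Euclidean
    distance is an illustrative choice.
    """
    dists = np.linalg.norm(class_centroids - query_vec, axis=1)
    return int(np.argmin(dists))

centroids = np.random.rand(2350, 64)   # e.g. one vector per Korean syllable class
print(classify_by_embedding(np.random.rand(64), centroids))
```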
In this paper, we propose a new method for detecting monospaced fonts in text line images. Although many authors address the more complex problems of text recognition or font recognition, this problem is still challenging when dealing with camera-captured images of identity documents, which usually contain complex backgrounds and various distortions; moreover, such a font characteristic can be useful in document authentication. Our approach is based on a segmentation neural network and the Fourier Transform for detecting “strong” periodic components in the segmentor output. The experimental results show that the combination of a neural network and the Fourier Transform handles monospaced font detection more effectively than the same Fourier analysis applied to the results of an image-processing segmentation method. The main advantage of the neural network is that its output does not depend directly on background, font, and character characteristics.
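The Fourier stage can be illustrated directly: for monospaced text, the column-wise sum of the character segmentation map is near-periodic with the character pitch, so its spectrum has a dominant peak. The peak-to-mean ratio criterion below is an illustrative test, not the paper's exact one.

```python
import numpy as np

def has_strong_period(column_profile, min_peak_ratio=5.0):
    """Detect a 'strong' periodic component in a 1-D profile (sketch).

    column_profile: per-column sum of the segmentation map for a text
    line. A dominant spectral peak indicates a regular character pitch,
    i.e. a monospaced font.
    """
    profile = column_profile - column_profile.mean()
    spectrum = np.abs(np.fft.rfft(profile))[1:]  # drop the DC bin
    return spectrum.max() / (spectrum.mean() + 1e-9) > min_peak_ratio

x = np.arange(512)
monospaced = (np.sin(2 * np.pi * x / 16) > 0).astype(float)  # pitch of 16 px
print(has_strong_period(monospaced))  # True
```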
The paper presents an algorithm for document image recognition robust to projective distortions. The algorithm is based on a similarity metric learned with a Siamese architecture. The idea of training Siamese networks is to build a function that maps an image into a space where a distance function corresponding to a pre-defined metric approximates the similarity between objects of the initial space. During learning, the loss function tries to minimize the distance between pairs of objects from the same class and maximize it between pairs from different classes. A convolutional network is used to map the initial space to the target one; it allows constructing a feature vector in the target space for each class. Objects are classified by applying the mapping function and finding the nearest feature vector. The proposed algorithm achieved recognition quality comparable to a classifying convolutional network on the open dataset of document images MIDV-500 [1]. Another important advantage of this method is the possibility of one-shot learning, which is also shown in the paper.
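The training objective described above is the standard contrastive loss, shown here in PyTorch; the margin value and batch layout are illustrative.

```python
import torch

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Contrastive loss for Siamese training (standard formulation).

    Pulls embeddings of same-class pairs together and pushes
    different-class pairs at least `margin` apart, matching the
    training objective described above.
    """
    dist = torch.norm(emb_a - emb_b, dim=1)
    pos = same_class * dist.pow(2)
    neg = (1 - same_class) * torch.clamp(margin - dist, min=0).pow(2)
    return (pos + neg).mean()

a, b = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 2, (8,)).float()  # 1 = same class, 0 = different
print(contrastive_loss(a, b, labels).item())
```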
This paper proposes an improvement to an existing and widely used approach to panorama stitching for images of planar objects. The proposed method is based on the adjustment of a graph of projective transformations. Evaluation is performed on a heterogeneous dataset containing images of the surfaces of Earth and Mars, images taken with a microscope, and handwritten and printed text documents. The quality enhancement of the panorama stitching method is illustrated on this dataset, showing a more than twofold reduction in the accumulated error of the computed projective transformations.
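A minimal sketch of what such a graph adjustment can look like: global transforms are refined so that every pairwise homography is consistent with them on sample points, spreading the chaining error over the whole graph. The point-transfer objective and least-squares solver here are our assumptions, not necessarily the paper's formulation.

```python
import numpy as np
from scipy.optimize import least_squares

def project(H, pts):
    """Apply a 3x3 homography to an (n, 2) array of points."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def adjust(pairwise, n_images, pts):
    """Graph adjustment of projective transforms (minimal sketch).

    pairwise: {(i, j): H_ij} where H_ij maps image i to image j.
    Global transforms G_i (G_0 = identity) are refined so that
    G_j @ H_ij and G_i agree on the sample points `pts`.
    """
    def unpack(x):
        gs = [np.eye(3)]
        for k in range(n_images - 1):
            gs.append(np.append(x[8 * k:8 * k + 8], 1.0).reshape(3, 3))
        return gs

    def residuals(x):
        gs = unpack(x)
        res = [project(gs[j] @ h, pts) - project(gs[i], pts)
               for (i, j), h in pairwise.items()]
        return np.concatenate(res).ravel()

    x0 = np.tile(np.eye(3).ravel()[:8], n_images - 1)
    return unpack(least_squares(residuals, x0).x)

# toy chain 0->1->2 plus a loop-closing edge 0->2 with a noisy measurement
h01 = np.array([[1, 0, 5], [0, 1, 0], [0, 0, 1.0]])
h12 = np.array([[1, 0, 5], [0, 1, 0], [0, 0, 1.0]])
h02 = np.array([[1, 0, 9.5], [0, 1, 0], [0, 0, 1.0]])
pts = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
gs = adjust({(0, 1): h01, (1, 2): h12, (0, 2): h02}, 3, pts)
print(gs[2])  # x-translation settles near -9.7, between chained (-10) and direct (-9.5)
```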
An important part of a system for planar rectangular object analysis is localization: the estimation of the projective transform from the template image of an object to its photograph. The system also includes such subsystems as the selection and recognition of text fields, the usage of contexts, etc. In this paper, three localization algorithms are described. All of them use feature points, and two also analyze near-horizontal and near-vertical lines in the photograph. The algorithms and their combinations are tested on a dataset of real document photographs. A method of localization quality estimation is also proposed that allows configuring the localization subsystem independently of the quality of the other subsystems.
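The feature-point part of such a pipeline can be sketched with standard OpenCV primitives: detect keypoints on the template and the photograph, match descriptors, and estimate the homography with RANSAC. This is a generic baseline of the kind the abstract builds on; the paper's algorithms additionally analyze near-horizontal and near-vertical lines, which this sketch omits.

```python
import cv2
import numpy as np

def localize(template_gray, photo_gray):
    """Estimate the template->photo projective transform from feature points.

    Returns the 3x3 homography and the RANSAC inlier count; detector
    and threshold choices are illustrative.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(template_gray, None)
    kp2, des2 = orb.detectAndCompute(photo_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, int(inliers.sum())
```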
Textual block rectification, or slant correction, is an important stage of document image processing in OCR systems. This paper reviews existing methods and introduces an approach for constructing such algorithms based on Fast Hough Transform analysis. A quality measurement technique is proposed, and results are reported for both printed and handwritten textual blocks processed as part of an industrial system for identity document recognition on mobile devices.
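For intuition about what such an analysis estimates, the sketch below recovers text slant by a plain shear search: shear the block at candidate angles and pick the one maximizing the sharpness of the vertical projection profile, the same criterion a Fast Hough Transform column maximization expresses far more efficiently. This is a simple stand-in, not the paper's FHT-based algorithm.

```python
import numpy as np

def estimate_slant(binary_block, angles_deg=np.linspace(-30, 30, 61)):
    """Estimate text slant by a shear search (illustrative stand-in for FHT).

    At the correct counter-shear the near-vertical strokes align, so the
    variance of the column-sum profile is maximal.
    """
    h, w = binary_block.shape
    rows = np.arange(h)
    best_angle, best_score = 0.0, -np.inf
    for angle in angles_deg:
        shift = np.tan(np.radians(angle)) * (rows - h / 2)
        profile = np.zeros(w)
        for y in range(h):
            cols = (np.arange(w) + int(round(shift[y]))) % w  # cyclic shear
            profile += binary_block[y, cols]
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

block = np.zeros((32, 64))
for y in range(32):                       # synthetic slanted vertical strokes
    block[y, (np.arange(8) * 8 + y // 4) % 64] = 1.0
print(estimate_slant(block))              # recovers the stroke slant (~14 degrees)
```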
In this work we describe an approach to real-time image search in large databases that is robust to a variety of query distortions such as lighting alterations, projective distortions, and digital noise. The approach is based on the extraction of keypoints and their descriptors, random hierarchical clustering trees for the preliminary search, and RANSAC for refining the search and scoring results. The algorithm is implemented in the Snapscreen system, which determines a TV channel and a TV show from a picture acquired with a mobile device. The implementation is enhanced by preliminary localization of the screen region. Results on real-world data with different modifications of the system are presented.
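The preliminary tree-based search stage can be illustrated with FLANN's randomized KD-trees, a widely used instance of hierarchical-tree approximate search; the tree count, check budget, descriptor shapes, and ratio test below are illustrative parameters, and the RANSAC verification step would follow as the abstract describes.

```python
import cv2
import numpy as np

# FLANN with randomized KD-trees: a hierarchical-tree preliminary search
index_params = dict(algorithm=1, trees=8)           # 1 = FLANN_INDEX_KDTREE
search_params = dict(checks=64)
flann = cv2.FlannBasedMatcher(index_params, search_params)

db = np.random.rand(10000, 64).astype(np.float32)   # database descriptors
query = db[42:43] + 0.01                            # distorted query descriptor
matches = flann.knnMatch(query, db, k=2)
best, second = matches[0]
if best.distance < 0.7 * second.distance:           # Lowe-style ratio test
    print("candidate:", best.trainIdx)              # -> 42; then verify with RANSAC
```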