In document analysis and recognition with mobile capture devices, and in object recognition in a video stream, it is important to combine the information obtained from different frames, since text recognition quality depends on how effectively the maximum amount of information about the target object is collected. This paper examines and compares two combination approaches: combining the images before recognition and combining the recognition results. The combination methods are briefly described. The quality of the combined results obtained with each method was measured and compared on the MIDV-500 dataset. The results show that combining the recognition results of text strings is more effective than combining the images beforehand. It can be concluded that simple image stacking with projective alignment does not achieve a quality comparable to that of recognition results combination, and thus more sophisticated image combination algorithms are needed in order to exploit the information about per-frame changes of the text images.
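To make the distinction between the two approaches concrete, the following minimal sketch contrasts them under strong simplifying assumptions. It is not the implementation evaluated in this paper: the function names `stack_frames` and `vote_strings` are hypothetical, the frames are assumed to be already projectively aligned grayscale images of equal size, and the per-frame recognition results are assumed to be strings of equal length, so that a plain per-position majority vote can stand in for a more elaborate results combination scheme.

```python
from collections import Counter


def stack_frames(frames):
    """Image pre-combination: pixel-wise mean of the input frames.

    Assumption: every frame is a list of rows of grayscale values, and all
    frames were already projectively aligned to the same size.
    """
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[y][x] for frame in frames) / n for x in range(width)]
        for y in range(height)
    ]


def vote_strings(results):
    """Recognition results combination: per-position majority vote.

    Assumption: all per-frame strings have the same length, so no
    alignment between strings is needed in this sketch.
    """
    length = len(results[0])
    if any(len(r) != length for r in results):
        raise ValueError("this sketch assumes equal-length strings")
    # zip(*results) groups the characters recognized at each position.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*results)
    )


if __name__ == "__main__":
    # Three per-frame results with different single-character errors.
    print(vote_strings(["IVANOV", "1VANOV", "IVAN0V"]))
    # Two tiny aligned 1x2 "frames".
    print(stack_frames([[[0, 10]], [[10, 30]]]))
```

In the first approach, a single recognition pass would be run on the output of `stack_frames`; in the second, recognition is run on every frame and only the resulting strings are merged, which lets independent per-frame errors be outvoted.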