Historical Document Analysis and Recognition Project

First, we present methods for three tasks of recognizing anomalously deformed Kana in Japanese historical documents, which were set as a contest by IEICE PRMU in 2017. The tasks have three levels: single-character recognition, recognition of sequences of three Kana characters, and unrestricted Kana recognition. We compare several methods for each task. For level 1, we evaluate CNN-based and BLSTM-based methods. For level 2, we consider several variations of a combined CNN and BLSTM architecture. For level 3, we compare an extension of the level 2 method with a segmentation-based method. We achieve a single-character recognition accuracy of 96.8%, a three-Kana-character sequence recognition accuracy of 87.12%, and an unrestricted Kana recognition accuracy of 73.3%. These results demonstrate the effectiveness of CNN and BLSTM on these tasks.


Secondly, historical document analysis and recognition face many challenges, such as damage, fading, show-through, anomalous deformation, varied backgrounds, and limited resources. These challenges raise the demand for preprocessing historical document images. In this paper, we propose deep neural networks named Pixel Segmentation Networks (PSNet) for text segmentation from Pre-Modern Japanese text (PMJT) historical document images. The proposed networks segment text pixels from raw document images with various background styles and image sizes, which helps the later steps of historical document analysis and recognition. To prepare training patterns, we applied the Otsu local binarization method to every single character and extracted pixel-level labels for all training document images. To evaluate the proposed networks, we used the following two metrics: pixel-level accuracy (PlA) and the ratio of intersection over union of the true test region and its detected region (IoU). Since there is a great imbalance between the number of background pixels and the number of text pixels, we normalize the measurements by a weighting parameter based on the frequency of background and text pixels. We then ran experiments on the PMJT database, which is randomly split into a training set of 1,556 images, a validation set of 333 images, and a testing set of 333 images. The experiments show a best PlA of 98.75%, a frequency-weighted PlA of 95.27%, an IoU of 87.89%, and a frequency-weighted IoU of 97.68% when all 1,556 images are used for training. Moreover, the performance of CED-PSNet12 degrades by only around 2 percentage points even when fewer than 100 images, 1/16 of the original training set, are used.
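
As a rough illustration of the two metrics, the following Python sketch computes pixel-level accuracy and the IoU of the text region from a pair of binary masks. The function names and the toy masks are hypothetical. The frequency weighting shown is the form commonly used in semantic segmentation (per-class IoU weighted by each class's share of ground-truth pixels); this is an assumption, since the exact normalization used for PSNet is not stated here, and for the same reason no weighted variant of PlA is shown.

```python
from typing import List, Sequence

# A mask is a 2-D grid of labels: 1 = text pixel, 0 = background pixel.
Mask = Sequence[Sequence[int]]


def _flatten(mask: Mask) -> List[int]:
    """Turn a 2-D mask into a flat list of integer labels."""
    return [int(p) for row in mask for p in row]


def _paired(truth: Mask, pred: Mask):
    """Flatten both masks and check that they are non-empty and the same size."""
    t, p = _flatten(truth), _flatten(pred)
    if not t or len(t) != len(p):
        raise ValueError("masks must be non-empty and have the same size")
    return t, p


def pixel_accuracy(truth: Mask, pred: Mask) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    t, p = _paired(truth, pred)
    return sum(a == b for a, b in zip(t, p)) / len(t)


def class_iou(truth: Mask, pred: Mask, label: int = 1) -> float:
    """Intersection over union of the true and detected regions for one label."""
    t, p = _paired(truth, pred)
    inter = sum(a == label and b == label for a, b in zip(t, p))
    union = sum(a == label or b == label for a, b in zip(t, p))
    # Convention chosen for this sketch: a label absent from both masks scores 1.
    return inter / union if union else 1.0


def frequency_weighted_iou(truth: Mask, pred: Mask) -> float:
    """Assumed weighting: per-class IoU weighted by ground-truth pixel frequency."""
    t, _ = _paired(truth, pred)
    score = 0.0
    for label in (0, 1):
        freq = t.count(label) / len(t)
        score += freq * class_iou(truth, pred, label)
    return score


# Toy 3x4 example: 4 true text pixels, 4 predicted text pixels, 3 in common.
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 1, 1],
        [0, 1, 0, 0]]

print(pixel_accuracy(truth, pred))          # 10 of 12 pixels agree
print(class_iou(truth, pred, label=1))      # text IoU: 3 / 5
print(frequency_weighted_iou(truth, pred))  # weighted over background and text
```

In the toy example, the background class covers 8 of 12 pixels, so its IoU dominates the weighted score. This shows why a frequency-weighted IoU can sit well above the text-only IoU when background pixels are far more numerous.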

For future work, we will use end-to-end text-line recognizers to recognize text regions without character segmentation. This should be useful for researchers in historical document processing, since a trained model could process an enormous number of scanned images without requiring much human effort.