Authors:
Syed Saqib Bukhari 1; Ashutosh Gupta 2; Anil Kumar Tiwari 3 and Andreas Dengel 4
Affiliations:
1 German Research Center for Artificial Intelligence, Germany
2 German Research Center for Artificial Intelligence and IITJ-Indian Institute of Technology Jodhpur, Germany
3 IITJ-Indian Institute of Technology Jodhpur, India
4 German Research Center for Artificial Intelligence and Technical University Kaiserslautern, Germany
Keyword(s):
Document Analysis, Historical Document Analysis, Layout Analysis, Document Image Segmentation.
Related Ontology Subjects/Areas/Topics:
Applications; Computer Vision, Visualization and Computer Graphics; Image Understanding; Pattern Recognition
Abstract:
Layout analysis, which mainly comprises binarization and page segmentation, is one of the most important
performance-determining steps of an OCR system for complex medieval document images, which contain noise,
distortions and irregular layouts. In this paper, we present high-performance page segmentation techniques
for medieval European document images, which include a novel main-body and side-notes segregation and
an improved version of the OCRopus (OCRopus, ) based text line extraction. To complete the high-performance
layout analysis pipeline, we also present the application of percentile-based binarization
(Afzal et al., 2014) and multiresolution morphology-based text and non-text segmentation (Bukhari
et al., 2011) to historical document images. The presented layout analysis techniques are applied to a
collection of 15th-century Latin document images, achieving more than 90% accuracy for each of
the segmentation techniques.