DOI: 10.1145/3352631.3352637
research-article

Information Extraction in Handwritten Marriage Licenses Books

Published: 20 September 2019

Abstract

Handwritten marriage licenses books are characterized by records with a simple text structure and an evolving vocabulary, mainly composed of proper names that change over time. This distinctive vocabulary makes automatic transcription and semantic information extraction difficult. Previous work has shown that category-based language models and a Grammatical Inference technique known as MGGI can improve the accuracy of these tasks. However, the MGGI algorithm requires a priori knowledge to label the words of the training strings, and this knowledge is not always easy to obtain. In this paper we study how to automatically obtain the information required by the MGGI algorithm using a technique based on Confusion Networks. Using the resulting language model, we carried out full handwritten text recognition and information extraction experiments, and the results support the proposed approach.
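
To make the idea of a category-based language model concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common factorization P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), where c_i is the category of word w_i, and it uses unsmoothed maximum-likelihood counts. The function name `train_category_bigram`, the `NAME` category, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_category_bigram(sentences, word_to_cat):
    """Return prob(prev_word, word) for a category-based bigram model.

    Words missing from word_to_cat act as their own category.
    Estimates are plain relative frequencies (no smoothing).
    """
    cat_bigrams = Counter()   # counts of (previous category, category)
    cat_context = Counter()   # counts of a category used as left context
    word_in_cat = Counter()   # counts of (word, category)
    cat_total = Counter()     # counts of each category over all words

    def cat(word):
        return word_to_cat.get(word, word)

    for words in sentences:
        cats = ["<s>"] + [cat(w) for w in words] + ["</s>"]
        for prev, cur in zip(cats, cats[1:]):
            cat_bigrams[(prev, cur)] += 1
            cat_context[prev] += 1
        for w in words:
            word_in_cat[(w, cat(w))] += 1
            cat_total[cat(w)] += 1

    def prob(prev_word, word):
        prev_c = "<s>" if prev_word is None else cat(prev_word)
        c = cat(word)
        if cat_context[prev_c] == 0 or cat_total[c] == 0:
            return 0.0
        p_cat = cat_bigrams[(prev_c, c)] / cat_context[prev_c]
        p_word = word_in_cat[(word, c)] / cat_total[c]
        return p_cat * p_word

    return prob


# Toy, made-up records: proper names share one hypothetical category.
word_to_cat = {"Joan": "NAME", "Pere": "NAME"}
sentences = [
    ["Joan", "fill", "de", "Pere"],
    ["Pere", "fill", "de", "Pere"],
]
prob = train_category_bigram(sentences, word_to_cat)

# The word pair ("de", "Joan") never occurs in the toy data, yet the model
# assigns it non-zero probability because "Joan" shares the NAME category.
print(prob("de", "Joan"))
```

Grouping proper names into a category lets the model share statistics across names, which is the motivation usually given for category-based models when the vocabulary of names keeps changing. MGGI additionally relabels training words before inferring the model; that step is not shown here.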


Published In

HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing
September 2019
98 pages
ISBN:9781450376686
DOI:10.1145/3352631
In-Cooperation

  • FamilySearch

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Handwritten recognition
  2. Information extraction
  3. Marriage Licenses

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HIP '19

Acceptance Rates

HIP '19 Paper Acceptance Rate 15 of 26 submissions, 58%;
Overall Acceptance Rate 52 of 90 submissions, 58%

Cited By

  • (2023) Do bridges dream of water pollutants? Towards DreamsKG, a knowledge graph to make digital access for sustainable environmental assessment come true. Companion Proceedings of the ACM Web Conference 2023, 724-730. DOI: 10.1145/3543873.3587590. Online publication date: 30-Apr-2023
  • (2022) Date Recognition in Historical Parish Records. Frontiers in Handwriting Recognition, 49-64. DOI: 10.1007/978-3-031-21648-0_4. Online publication date: 25-Nov-2022
  • (2022) Combining Visual and Linguistic Models for a Robust Recipient Line Recognition in Historical Documents. Document Analysis Systems, 598-612. DOI: 10.1007/978-3-031-06555-2_40. Online publication date: 18-May-2022
  • (2022) Information Extraction from Handwritten Tables in Historical Documents. Document Analysis Systems, 184-198. DOI: 10.1007/978-3-031-06555-2_13. Online publication date: 18-May-2022
