Books of Hours. the First Liturgical Data Set for Text Segmentation.

Amir Hazem; Béatrice Daille; Christopher Kermorvant; Dominique Stutzmann; Marie-Laurence Bonhomme; Martin Maarand; Mélodie Boillet

Abstract

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has scarcely been studied because of its manuscript nature, its length, and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, their often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) their hierarchical, entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, their textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by HTR, a technique similar to Optical Character Recognition (OCR) but for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.

Anthology ID: 2020.lrec-1.97
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 776–784
Language: English
URL: https://aclanthology.org/2020.lrec-1.97
Cite (ACL): Amir Hazem, Beatrice Daille, Christopher Kermorvant, Dominique Stutzmann, Marie-Laurence Bonhomme, Martin Maarand, and Mélodie Boillet. 2020. Books of Hours. the First Liturgical Data Set for Text Segmentation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 776–784, Marseille, France. European Language Resources Association.
Cite (Informal): Books of Hours. the First Liturgical Data Set for Text Segmentation. (Hazem et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.97.pdf
