Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

Lazar, Koren; Saret, Benny; Yehudai, Asaf; Horowitz, Wayne; Wasserman, Nathan; Stanovsky, Gabriel

arXiv:2109.04513 (cs)

[Submitted on 9 Sep 2021 (v1), last revised 24 Oct 2021 (this version, v2)]

Abstract: We present models that complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE - 100 CE). Because the tablets have deteriorated, scholars often rely on contextual cues to fill in missing parts of the text manually, a subjective and time-consuming process. We observe that this challenge can be formulated as a masked language modelling task, which is used mostly as a pretraining objective for contextualized language models. We then develop several architectures focusing on the Akkadian language, the lingua franca of the time. We find that, despite data scarcity (1M tokens), we can achieve state-of-the-art performance on missing-token prediction (89% hit@5) using a greedy decoding scheme and pretraining on data from other languages and different time periods. Finally, we conduct human evaluations showing the applicability of our models in assisting experts to transcribe texts in extinct languages.
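
The abstract reports its headline result as hit@5 but does not define the metric. The sketch below assumes the common reading of hit@k: the fraction of masked positions whose gold token appears among the model's top k ranked candidates. The function name `hit_at_k` and the placeholder tokens are illustrative assumptions, not taken from the paper or its code.

```python
from typing import Sequence


def hit_at_k(
    ranked_candidates: Sequence[Sequence[str]],
    gold_tokens: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of masked positions whose gold token is in the top-k candidates.

    ranked_candidates[i] holds the candidates for masked position i,
    ordered from most to least likely. This is an assumed definition of
    hit@k; the abstract itself does not spell it out.
    """
    if len(ranked_candidates) != len(gold_tokens):
        raise ValueError("need one candidate list per gold token")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold_tokens:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_tokens)
        if gold in candidates[:k]
    )
    return hits / len(gold_tokens)


# Toy data: placeholder strings, not real Akkadian transliterations.
toy_candidates = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
toy_gold = ["b", "x", "g", "l"]

print(hit_at_k(toy_candidates, toy_gold, k=1))  # 0.25
print(hit_at_k(toy_candidates, toy_gold, k=3))  # 0.75
```

Under this reading, 89% hit@5 would mean that for roughly 89 of every 100 masked tokens, the correct token was among the five highest-ranked suggestions.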

Comments:	Accepted to EMNLP 2021 (Main Conference)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2109.04513 [cs.CL]
	(or arXiv:2109.04513v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.04513

Submission history

From: Gabriel Stanovsky
[v1] Thu, 9 Sep 2021 18:58:14 UTC (2,711 KB)
[v2] Sun, 24 Oct 2021 07:46:54 UTC (2,712 KB)
