
Comparing Heuristic Rules and Masked Language Models for Entity Alignment in the Literature Domain

Published: 09 August 2023

Abstract

The cultural world offers a staggering amount of rich and varied metadata on cultural heritage, accumulated by governmental, academic, and commercial players. However, the variety of institutions involved means that the data are stored in an equally large number of complex and often incompatible models and standards, which limits their availability and explorability by the greater public.
The adoption of Linked Open Data technologies allows strong interlinking of these various databases, as well as external connections with existing knowledge bases. However, since these databases often contain references to the same entities, the delicate issue of entity alignment becomes the central challenge, especially in the absence or scarcity of unique global identifiers.
To tackle this issue, we explored two approaches: one based on a set of heuristic rules and one based on masked language models (MLMs). We compare these two approaches, as well as different variations of MLMs, including models trained on a different language, and various levels of data cleaning and labeling. Our results show that heuristics are a solid approach, but also that MLM-based entity alignment achieves better performance while being robust to the data format and requiring no data preprocessing, which was not the case for the heuristic approach in our experiments.
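To make the MLM-based approach concrete, the following is a minimal, hypothetical sketch of the general recipe for transformer-based entity matching: two records are serialized into a text pair and scored by a sequence-pair classifier built on a pretrained MLM, in the style of Ditto-like matchers. The model name (camembert-base, a French MLM), the record fields, the "COL/VAL" serialization scheme, and the decision threshold are all illustrative assumptions, not the authors' exact pipeline.

    # Hypothetical sketch: entity alignment as sequence-pair classification
    # with a masked language model. All names and fields are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "camembert-base"  # assumed French MLM; the paper compares model languages
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    def serialize(record: dict) -> str:
        # Flatten a bibliographic record into "COL field VAL value" text,
        # so the model sees attribute structure without a fixed schema.
        return " ".join(f"COL {k} VAL {v}" for k, v in record.items() if v)

    def match_score(rec_a: dict, rec_b: dict) -> float:
        # Encode the two serialized records jointly as one input pair;
        # after fine-tuning on labeled match/non-match pairs, the
        # classifier logits yield a match probability.
        inputs = tokenizer(serialize(rec_a), serialize(rec_b),
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    a = {"title": "Les Misérables", "author": "Victor Hugo", "year": "1862"}
    b = {"title": "Les Miserables", "author": "Hugo, Victor", "year": "1862"}
    print(match_score(a, b) > 0.5)  # meaningful only after fine-tuning

Because the records are serialized as free text, this setup tolerates varying formats and noisy fields without schema-specific preprocessing, which matches the robustness the abstract attributes to the MLM-based approach.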



Published In

Journal on Computing and Cultural Heritage, Volume 16, Issue 3
September 2023
468 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/3615350

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 09 August 2023
Online AM: 15 July 2023
Accepted: 18 May 2023
Revised: 16 December 2022
Received: 01 June 2022
Published in JOCCH Volume 16, Issue 3


Author Tags

  1. Linked Open Data
  2. entity matching
  3. masked language models
  4. cultural heritage
  5. literature

Qualifiers

  • Research-article
