
Comparing Heuristic Rules and Masked Language Models for Entity Alignment in the Literature Domain

Published: 09 August 2023

Abstract

The cultural world offers a staggering amount of rich and varied metadata on cultural heritage, accumulated by governmental, academic, and commercial players. However, the variety of institutions involved means that the data are stored in an equally large number of complex and often incompatible models and standards, which limits their availability and explorability by the greater public.
The adoption of Linked Open Data technologies allows strong interlinking of these various databases, as well as external connections with existing knowledge bases. However, since these databases often contain references to the same entities, the delicate issue of entity alignment becomes the central challenge, especially in the absence or scarcity of unique global identifiers.
To tackle this issue, we explored two approaches: one based on a set of heuristic rules and one based on masked language models (MLMs). We compare these two approaches, as well as different variations of MLMs, including models trained on a different language, and various levels of data cleaning and labeling. Our results show that heuristics are a solid approach, but also that MLM-based entity alignment achieves better performance while being robust to the data format and requiring no data preprocessing, which was not the case for the heuristic approach in our experiments.
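To make the MLM-based approach concrete, the following is a minimal, hypothetical sketch of the general recipe for transformer-based entity matching: two records are serialized into a text pair and scored by a sequence-pair classifier built on a pretrained MLM, in the style of Ditto-like matchers. The model name (camembert-base, a French MLM), the record fields, the "COL/VAL" serialization scheme, and the decision threshold are all illustrative assumptions, not the authors' exact pipeline.

    # Hypothetical sketch: entity alignment as sequence-pair classification
    # with a masked language model. All names and fields are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "camembert-base"  # assumed French MLM; the paper compares model languages
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    def serialize(record: dict) -> str:
        # Flatten a bibliographic record into "COL field VAL value" text,
        # so the model sees attribute structure without a fixed schema.
        return " ".join(f"COL {k} VAL {v}" for k, v in record.items() if v)

    def match_score(rec_a: dict, rec_b: dict) -> float:
        # Encode the two serialized records jointly as one input pair;
        # after fine-tuning on labeled match/non-match pairs, the
        # classifier logits yield a match probability.
        inputs = tokenizer(serialize(rec_a), serialize(rec_b),
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    a = {"title": "Les Misérables", "author": "Victor Hugo", "year": "1862"}
    b = {"title": "Les Miserables", "author": "Hugo, Victor", "year": "1862"}
    print(match_score(a, b) > 0.5)  # meaningful only after fine-tuning

Because the records are serialized as free text, this setup tolerates varying formats and noisy fields without schema-specific preprocessing, which matches the robustness the abstract attributes to the MLM-based approach.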



Published In

Journal on Computing and Cultural Heritage, Volume 16, Issue 3
September 2023
468 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/3615350

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 09 August 2023
Online AM: 15 July 2023
Accepted: 18 May 2023
Revised: 16 December 2022
Received: 01 June 2022
Published in JOCCH Volume 16, Issue 3


Author Tags

  1. Linked Open Data
  2. entity matching
  3. masked language models
  4. cultural heritage
  5. literature

Qualifiers

  • Research-article
