Linking Entities across Relations and Graphs

Published: 28 February 2024
  • Abstract

    This article proposes a notion of parametric simulation to link entities across a relational database 𝒟 and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations, and important properties as parameters, parametric simulation identifies tuples t in 𝒟 and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation is in quadratic-time by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system to check whether (t, v) makes a match, find all vertex matches of t in G, and compute all matches across 𝒟 and G, all in quadratic-time; moreover, HER supports incremental computation of these in response to updates to 𝒟 and G. Using real-life and synthetic data, we empirically verify that HER is accurate with F-measure of 0.94 on average, and is able to scale with database 𝒟 and graph G for both batch and incremental computations.
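The accuracy figure in the abstract is an F-measure, which is conventionally the harmonic mean of precision and recall over the reported matches. The sketch below shows that standard computation over sets of (tuple, vertex) pairs. It is not taken from the article: the function name `f_measure`, the identifiers `t1`, `v1`, and so on, and the toy data are hypothetical, and the article may weight or average its scores differently.

```python
def f_measure(predicted, truth):
    """Return (precision, recall, F1) for two collections of (tuple_id, vertex_id) pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)  # pairs reported and actually correct
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical toy data: four true matches, three reported, two of them correct.
truth = {("t1", "v1"), ("t2", "v2"), ("t3", "v3"), ("t4", "v4")}
predicted = {("t1", "v1"), ("t2", "v2"), ("t3", "v9")}

precision, recall, f1 = f_measure(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

On this toy input precision is 2/3 and recall is 1/2, so the F-measure is 4/7, roughly 0.571.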

      Published In

      ACM Transactions on Database Systems  Volume 49, Issue 1
      March 2024
      176 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/3613511

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 February 2024
      Online AM: 03 January 2024
      Accepted: 05 December 2023
      Revised: 29 August 2023
      Received: 28 August 2022
      Published in TODS Volume 49, Issue 1

      Author Tags

      1. Entity resolution
      2. Knowledge graph
      3. Relational database
      4. Parallelization
      5. Incremental algorithm
      6. Relative boundedness

      Qualifiers

      • Research-article

      Funding Sources

      • Royal Society Wolfson Research Merit Award
      • NSFC
      • NUDT Youth Independent Innovation Science Fund Project

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 319
      • Downloads (Last 12 months): 319
      • Downloads (Last 6 weeks): 76
