Geospatial Entity Resolution

Authors:

Pasquale Balsebre,

Zhen Hai

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 3061 - 3070

https://doi.org/10.1145/3485447.3512026

Published: 25 April 2022

Abstract

A geospatial database is today at the core of an ever-increasing number of services. Building and maintaining it remains challenging due to the need to merge information from multiple providers. Entity Resolution (ER) consists of finding entity mentions from different sources that refer to the same real-world entity. In geospatial ER, entities are often represented using different schemes and are subject to incomplete information and inaccurate locations, making ER and deduplication daunting tasks. While tremendous advances have been made in traditional entity resolution and natural language processing, geospatial data integration approaches still rely heavily on static similarity measures and human-designed rules. To achieve automatic linking of geospatial data, a unified representation of entities with heterogeneous attributes and their geographical context is needed. To this end, we propose Geo-ER, a joint framework that combines Transformer-based language models, which have been successfully applied in ER, with a novel learning-based architecture to represent the geospatial character of the entity. Unlike existing solutions, Geo-ER does not rely on pre-defined rules and is able to capture information from surrounding entities in order to make context-based, accurate predictions. Extensive experiments on eight real-world datasets demonstrate the effectiveness of our solution over state-of-the-art methods. Moreover, Geo-ER proves to be robust in settings where there is no available training data for a specific city.
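
To make the contrast in the abstract concrete, the following is a minimal sketch of the kind of static, rule-based matcher that the abstract says existing geospatial integration approaches rely on. It is not Geo-ER and is not taken from the paper: the function names, the dictionary fields (name, lat, lon), and both thresholds (100 metres, 0.8 name similarity) are assumptions chosen purely for illustration.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_based_match(e1, e2, max_dist_m=100.0, min_name_sim=0.8):
    """Hand-written rule: match if the entities are close AND similarly named.

    Both thresholds are hypothetical; a fixed rule like this is exactly
    what a learned model is meant to replace.
    """
    dist = haversine_m(e1["lat"], e1["lon"], e2["lat"], e2["lon"])
    sim = name_similarity(e1["name"], e2["name"])
    return dist <= max_dist_m and sim >= min_name_sim
```

Calling rule_based_match on two records from different providers returns True only when both fixed thresholds are satisfied. A rule of this shape ignores missing attributes, inaccurate coordinates, and surrounding entities, which are the limitations the abstract highlights.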

Index Terms

Geospatial Entity Resolution
1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Geographic visualization
2. Information systems
  1. Information systems applications
    1. Spatial-temporal systems

Index terms have been assigned to the content through auto-classification.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN: 9781450390965

DOI: 10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Cited By

Jiang X, Xu C, Shen Y, Wang Y, Su F, Shi Z, Sun F, Li Z, Guo J, Shen H, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H. (2024). Toward Practical Entity Alignment Method Design: Insights from New Highly Heterogeneous Knowledge Graph Datasets. Proceedings of the ACM Web Conference 2024, 2325-2336. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589334.3645720

Mugeni J, Lynden S, Amagasa T, Matono A. (2024). MultiMatch: Low-Resource Generalized Entity Matching Using Task-Conditioned Hyperadapters in Multitask Learning. Big Data Analytics and Knowledge Discovery, 51-65. Online publication date: 26-Aug-2024.
https://dl.acm.org/doi/10.1007/978-3-031-68323-7_4

Mugeni J, Lynden S, Amagasa T, Matono A. (2023). AdapterEM: Pre-trained Language Model Adaptation for Generalized Entity Matching using Adapter-tuning. Proceedings of the 27th International Database Engineered Applications Symposium, 140-147. Online publication date: 5-May-2023.
https://dl.acm.org/doi/10.1145/3589462.3589498

Balsebre P, Yao D, Cong G, Huang W, Hai Z. (2023). Mining Geospatial Relationships from Text. Proceedings of the ACM on Management of Data 1:1, 1-26. Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588947

Wang P, Zeng X, Chen L, Ye F, Mao Y, Zhu J, Gao Y. (2022). PromptEM. Proceedings of the VLDB Endowment 16:2, 369-378. Online publication date: 1-Oct-2022.
https://dl.acm.org/doi/10.14778/3565816.3565836
