ZeroER: Entity Resolution using Zero Labeled Examples

Authors:

Saurabh Sawlani,

Saravanan Thirumuruganathan

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 1149 - 1164

https://doi.org/10.1145/3318464.3389743

Published: 31 May 2020

Abstract

Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve state-of-the-art results, they require a large number of labeled examples that are expensive and often infeasible to obtain. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet achieves performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach, dubbed ZeroER. Our approach is based on a simple observation: the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way, resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves performance comparable to supervised approaches.
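The core observation in the abstract can be illustrated with a minimal sketch. It is not the ZeroER implementation: it fits an off-the-shelf two-component Gaussian mixture from scikit-learn to synthetic similarity vectors and omits the adaptive regularization and transitivity components entirely. The function name `fit_match_model`, the synthetic data, the fixed `reg_covar` value, and the rule "the component with the higher mean similarity is the match component" are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_match_model(similarity_vectors, seed=0):
    """Fit a 2-component GMM and return P(match) for every record pair.

    Assumption: the component whose mean similarity is higher is the
    match component. ZeroER's adaptive regularization and transitivity
    handling are NOT modeled here; reg_covar is a fixed generic ridge.
    """
    gmm = GaussianMixture(
        n_components=2,
        covariance_type="full",
        reg_covar=1e-3,
        n_init=3,
        random_state=seed,
    )
    gmm.fit(similarity_vectors)
    # Pick the component with the larger average similarity across features.
    match_component = int(np.argmax(gmm.means_.mean(axis=1)))
    return gmm.predict_proba(similarity_vectors)[:, match_component]


# Synthetic similarity vectors (3 hypothetical similarity features per pair):
# most candidate pairs are unmatches with low similarity, a few are matches.
rng = np.random.default_rng(0)
unmatches = rng.normal(loc=0.2, scale=0.05, size=(900, 3))
matches = rng.normal(loc=0.8, scale=0.05, size=(100, 3))
X = np.clip(np.vstack([unmatches, matches]), 0.0, 1.0)

match_prob = fit_match_model(X)
predicted_match = match_prob > 0.5
```

No labels are used at any point; the two distributions are learned from the shape of the data alone. On real data the two groups overlap far more than in this synthetic example, which is the gap the regularization and transitivity techniques named in the abstract are meant to address.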




Index Terms

ZeroER: Entity Resolution using Zero Labeled Examples
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution



Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States



Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


