DOI: 10.1145/3318464.3389743
research-article

ZeroER: Entity Resolution using Zero Labeled Examples

Published: 31 May 2020

Abstract

Entity resolution (ER) is the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve state-of-the-art results, they require a large number of labeled examples, which are expensive and often infeasible to obtain. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires zero labeled examples, yet achieves performance comparable to supervised approaches? In this paper, we answer in the affirmative with our proposed approach, dubbed ZeroER. Our approach is based on a simple observation: the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way, resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves performance comparable to supervised approaches.
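
The core observation in the abstract, that similarity vectors of matches and unmatches follow different distributions, can be illustrated with a plain two-component Gaussian mixture fitted by EM. The sketch below is not the ZeroER implementation: it leaves out the adaptive regularization and the transitivity constraint, and it assumes diagonal covariances for brevity. The function name `fit_two_gaussians`, the variance floor, and the synthetic similarity values are all hypothetical choices made for illustration only.

```python
import math
import random


def fit_two_gaussians(vectors, iters=50, var_floor=1e-3):
    """Fit a 2-component diagonal-covariance Gaussian mixture with EM.

    Returns, for every input vector, the posterior probability of the
    component whose mean similarity is higher (treated here as "match").
    """
    n, dim = len(vectors), len(vectors[0])
    # Assumed initialisation: seed the two components from the vectors
    # with the lowest and the highest total similarity.
    ordered = sorted(vectors, key=sum)
    means = [list(ordered[0]), list(ordered[-1])]
    variances = [[0.05] * dim, [0.05] * dim]
    weights = [0.5, 0.5]

    def log_density(x, k):
        # Log of a diagonal Gaussian density for component k.
        return sum(
            -0.5 * math.log(2 * math.pi * variances[k][d])
            - (x[d] - means[k][d]) ** 2 / (2 * variances[k][d])
            for d in range(dim)
        )

    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each vector.
        resp = []
        for x in vectors:
            logs = [math.log(weights[k]) + log_density(x, k) for k in (0, 1)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate mixing weights, means and variances.
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = max(nk / n, 1e-12)
            for d in range(dim):
                means[k][d] = sum(r[k] * x[d] for r, x in zip(resp, vectors)) / nk
                spread = sum(
                    r[k] * (x[d] - means[k][d]) ** 2 for r, x in zip(resp, vectors)
                ) / nk
                variances[k][d] = max(var_floor, spread)

    # The component with the larger mean similarity plays the "match" role.
    match_k = 0 if sum(means[0]) > sum(means[1]) else 1
    return [r[match_k] for r in resp]


# Synthetic similarity vectors (two similarity features per record pair).
rng = random.Random(0)


def clip(v):
    return min(1.0, max(0.0, v))


non_matches = [[clip(rng.gauss(0.2, 0.08)), clip(rng.gauss(0.25, 0.08))]
               for _ in range(40)]
matches = [[clip(rng.gauss(0.9, 0.05)), clip(rng.gauss(0.85, 0.05))]
           for _ in range(5)]
pairs = non_matches + matches

match_prob = fit_two_gaussians(pairs)
predicted = [p > 0.5 for p in match_prob]
```

No labels are used at any point: the split into match and unmatch follows only from the shape of the similarity data. On real datasets the two groups overlap far more than in this toy example, which is where the regularization and transitivity techniques named in the abstract would come in.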

Supplementary Material

MP4 File (3318464.3389743.mp4)
Presentation Video

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity matching
  2. entity resolution
  3. unsupervised learning

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 100
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 30 Aug 2024

Citations

Cited By

  • (2024) How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses. Proceedings of the VLDB Endowment 17:6 (1391-1404). DOI: 10.14778/3648160.3648178. Online publication date: 1-Feb-2024
  • (2024) A Critical Re-evaluation of Record Linkage Benchmarks for Learning-Based Matching Algorithms. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3435-3448). DOI: 10.1109/ICDE60146.2024.00265. Online publication date: 13-May-2024
  • (2024) MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3421-3434). DOI: 10.1109/ICDE60146.2024.00264. Online publication date: 13-May-2024
  • (2024) A simple and efficient approach to unsupervised instance matching and its application to linked data of power plants. Journal of Web Semantics 80 (100815). DOI: 10.1016/j.websem.2024.100815. Online publication date: Apr-2024
  • (2024) RLclean: An Unsupervised Integrated Data Cleaning Framework Based on Deep Reinforcement Learning. Information Sciences (121281). DOI: 10.1016/j.ins.2024.121281. Online publication date: Jul-2024
  • (2024) ERABQS: entity resolution based on active machine learning and balancing query strategy. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00853-0. Online publication date: 26-Mar-2024
  • (2024) Data cleaning and machine learning: a systematic literature review. Automated Software Engineering 31:2. DOI: 10.1007/s10515-024-00453-w. Online publication date: 11-Jun-2024
  • (2024) Alfa: active learning for graph neural network-based semantic schema alignment. The VLDB Journal — The International Journal on Very Large Data Bases 33:4 (981-1011). DOI: 10.1007/s00778-023-00822-z. Online publication date: 1-Jul-2024
  • (2023) Pre-Trained Embeddings for Entity Resolution: An Experimental Analysis. Proceedings of the VLDB Endowment 16:9 (2225-2238). DOI: 10.14778/3598581.3598594. Online publication date: 1-May-2023
  • (2023) VersaMatch: Ontology Matching with Weak Supervision. Proceedings of the VLDB Endowment 16:6 (1305-1318). DOI: 10.14778/3583140.3583148. Online publication date: 20-Apr-2023
