research-article

On active learning of record matching packages

Authors:

Michaela Götz,

Raghav Kaushik

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 783 - 794

https://doi.org/10.1145/1807167.1807252

Published: 06 June 2010

Abstract

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching because manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
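To make the active versus passive distinction concrete, the following is a minimal sketch of a generic active learner for a single similarity threshold. It is not the algorithm proposed in the paper. The names `jaccard`, `active_learn_threshold`, `oracle`, and `budget` are hypothetical, and the sketch assumes that labels are monotone in similarity (every pair at or above some unknown threshold is a match), which real data need not satisfy.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def active_learn_threshold(
    pairs: List[Pair],
    oracle: Callable[[Pair], bool],
    budget: int,
) -> Tuple[float, int]:
    """Pick which pairs to label in order to locate a match threshold.

    Assumption (illustrative only): labels are monotone in similarity.
    Under that assumption the learner can binary-search the sorted pool,
    so it asks for about log2(len(pairs)) labels instead of labeling all.
    Returns (threshold, number_of_labels_requested).
    """
    sims = sorted((jaccard(a, b), (a, b)) for a, b in pairs)
    # Invariant: sims[:lo] are treated as non-matches, sims[hi:] as matches.
    lo, hi = 0, len(sims)
    asked = 0
    while lo < hi and asked < budget:
        mid = (lo + hi) // 2
        asked += 1
        if oracle(sims[mid][1]):  # the learner, not the user, chose this pair
            hi = mid
        else:
            lo = mid + 1
    # If the budget ran out early, sims[hi:] is still the region known to
    # match, so this choice favors precision over recall.
    threshold = sims[hi][0] if hi < len(sims) else float("inf")
    return threshold, asked
```

A pair is then classified as a match when `jaccard(a, b) >= threshold`. A passive learner would instead receive whatever labeled pairs the user happened to supply, with no control over how informative they are.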

Index Terms

On active learning of record matching packages
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN:9781450300322

DOI:10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '10: International Conference on Management of Data

Sponsor: SIGMOD

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
