Interactive deduplication using active learning

Authors:

Sunita Sarawagi,

Anuradha Bhamidipaty

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 269 - 278

https://doi.org/10.1145/775047.775087

Published: 23 July 2002

Abstract

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can determine when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that brings out the subtleties of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.

We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
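The abstract does not spell out how challenging pairs are chosen, so the following is only a rough illustration of one common form of active learning, committee-based uncertainty sampling: a small committee of classifiers is trained on the labeled pairs, and the unlabeled pair on which the committee disagrees most is shown to the user next. The single string-similarity feature, the threshold learner, the committee size, and every function name here are assumptions made for this sketch, not the authors' system.

```python
import random
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for richer per-field features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(examples):
    """Train one committee member: a cut-off on the similarity score.

    examples is a list of (score, is_duplicate). The cut-off is the midpoint
    between the lowest duplicate score and the highest non-duplicate score.
    """
    dup = [s for s, y in examples if y]
    non = [s for s, y in examples if not y]
    if not dup or not non:  # degenerate resample: fall back to a neutral cut-off
        return 0.5
    return (min(dup) + max(non)) / 2.0


def train_committee(labeled, size=11, seed=0):
    """Build a committee by fitting each member on a bootstrap resample."""
    rng = random.Random(seed)
    return [
        fit_threshold([rng.choice(labeled) for _ in labeled])
        for _ in range(size)
    ]


def disagreement(committee, score):
    """0.0 when all members agree; approaches 1.0 as the vote nears an even split."""
    yes = sum(score >= t for t in committee) / len(committee)
    return 1.0 - abs(2.0 * yes - 1.0)


def pick_query(committee, pool):
    """Return the unlabeled pair the committee is least sure about."""
    return max(pool, key=lambda pair: disagreement(committee, similarity(*pair)))


if __name__ == "__main__":
    # Hypothetical labeled seed pairs: (similarity score, is_duplicate).
    labeled = [
        (similarity("S. Sarawagi", "Sunita Sarawagi"), True),
        (similarity("A. McCallum", "Andrew McCallum"), True),
        (similarity("S. Sarawagi", "D. Koller"), False),
        (similarity("T. Mitchell", "G. Navarro"), False),
    ]
    # Hypothetical unlabeled pool of candidate record pairs.
    pool = [
        ("W. E. Winkler", "William Winkler"),
        ("J. R. Quinlan", "J. R. Quinlan"),
        ("C. Elkan", "S. Tong"),
    ]
    committee = train_committee(labeled)
    print("Ask the user about:", pick_query(committee, pool))
```

In an interactive loop, the user's answer for the selected pair would be appended to the labeled set and the committee retrained before the next query. Clear-cut pairs (near-identical or entirely different strings) get unanimous votes and are never queried, which is how this kind of selection keeps the labeling effort focused on the borderline cases.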

Published In

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

July 2002

719 pages

ISBN:158113567X

DOI:10.1145/775047

Conference Chair: Osmar R. Zaïane (University of Alberta, Canada)
General Chair: Randy Goebel (University of Alberta, Canada)
Program Chairs: David Hand (Imperial College, UK), Daniel Keim (AT&T), Raymond Ng (University of British Columbia, Canada)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD02

KDD02: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 23 - 26, 2002

Edmonton, Alberta, Canada

Acceptance Rates

KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics

Total Citations: 480
Total Downloads: 3,198
Downloads (Last 12 months): 62
Downloads (Last 6 weeks): 3

Reflects downloads up to 30 Aug 2024

Cited By

Han Y, Li C (2024). Entity Matching by Pool-Based Active Learning. Electronics 13:3 (559). Online publication date: 30-Jan-2024.
https://doi.org/10.3390/electronics13030559
Nananukul N, Sisaengsuwanchai K, Kejriwal M (2024). Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain. Discover Artificial Intelligence 4:1. Online publication date: 16-Aug-2024.
https://doi.org/10.1007/s44163-024-00159-8
Mourad J, Hiba T, Yassir R, Imad H (2024). ERABQS: entity resolution based on active machine learning and balancing query strategy. Journal of Intelligent Information Systems. Online publication date: 26-Mar-2024.
https://doi.org/10.1007/s10844-024-00853-0
Lu D, Han G, Zhao Y, Han Q (2024). Review of Deep Learning-Based Entity Alignment Methods. Green, Pervasive, and Cloud Computing (61-71). Online publication date: 23-Jan-2024.
https://doi.org/10.1007/978-981-99-9893-7_5
Zhu L, Liu H, Song X, Wei Y, Wang Y (2024). Entity Resolution Based on Pre-trained Language Models with Two Attentions. Web and Big Data (433-448). Online publication date: 28-Apr-2024.
https://doi.org/10.1007/978-981-97-2387-4_29
Goyle K, Xie Q, Goyle V (2024). DataAssist: A Machine Learning Approach to Data Cleaning and Preparation. Intelligent Systems and Applications (476-486). Online publication date: 31-Jul-2024.
https://doi.org/10.1007/978-3-031-66431-1_33
Rrustemi B, Baholli D, Balaj H (2023). Mobile Architecture for Version Control Systems. Designing and Developing Innovative Mobile Applications (38-55). Online publication date: 30-Jun-2023.
https://doi.org/10.4018/978-1-6684-8582-8.ch003
Ali Omar Z, Zamzuri Z, Mohd Ariff N, Abu Bakar M (2023). Training Data Selection for Record Linkage Classification. Symmetry 15:5 (1060). Online publication date: 10-May-2023.
https://doi.org/10.3390/sym15051060
姚荣 (2023). Attribute Augmentation Based Alignment Algorithm for Pairs of Dyadic Graph Entity Alignment. Computer Science and Application 13:05 (1166-1177). Online publication date: 2023.
https://doi.org/10.12677/CSA.2023.135114
Xu C, Guo R, Zhang Y, Luo X (2023). Toward an Efficient and Effective Credit Scorer for Cross-Border E-Commerce Enterprises. Scientific Programming 2023. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1155/2023/5281050
