Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/775047.775087acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Interactive deduplication using active learning

Published: 23 July 2002 Publication History

Abstract

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

References

[1]
S. Argamon-Engelson and I. Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335--360, 1999.
[2]
V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. In Proc. ACM SIGMOD International Conf. on Management of Data, Santa Barabara, USA, 2001.
[3]
C. Buckley, G. Salton, and J. Allan. The effect of adding relevance information in a relevance feedback environment. In Proc. of SIGIR, pages 292--300, 1994.
[4]
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998.
[5]
S. Chaudhuri, V. Narasayya, and S. Sarawagi. Efficient evaluation of queries with mining predicates. In Proc. of the 18th Int'l Conference on Data Engineering (ICDE), San Jose, USA, April 2002.
[6]
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201--221, 1994.
[7]
R. Collobert and S. Bengio. Svmtorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143--160, 2001. Software available from "http://www.idiap.ch/learning/SVMTorch.html".
[8]
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2--3):133--168, 1997.
[9]
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. In Proc. of the 27th Int'l Conference on Very Large Databases (VLDB), pages 307--316, Rome, Italy, 2001.
[10]
L. Gravano, Panagiotis, and H. V. Jagadish. Approximate string joins in a database (almost) for free. In Proc. of the 27th Int'l Conference on Very Large Databases (VLDB), Rome, Italy, 2001.
[11]
M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.
[12]
J. Hylton. Identifying and merging related bibliographic records. Master's thesis, MIT, 1996.
[13]
V. S. Iyengar, C. Apte, and T. Zhang. Active learning using adaptive resampling. In R. Ramakrishnan, S. Stolfo, R. Bayardo, and I. Parsa, editors, Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-00), pages 91--98, N. Y., Aug. 20--23 2000. ACM Press.
[14]
W. C. Jacob. Learning to match and cluster entity names. In ACM SIGIR' 01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.
[15]
R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. In Tools with Artificial Intelligence, pages 234--245. IEEE Computer Society Press, available from http://www.sgi.com/tech/mlc/, 1996.
[16]
S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67--71, 1999.
[17]
R. Liere and P. Tadepalli. Active learning with committees for text categorization. In Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence, pages 591--596, Providence, US, 1997. AAAI Press, Menlo Park, US.
[18]
A. McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore. Cora: Computer science research paper search engine, http://cora.whizbang.com/, 2000.
[19]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, pages 169--178, 2000.
[20]
A. K. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In J. W. Shavlik, editor, Proceedings of ICML-98, 15th International Conference on Machine Learning, pages 350--358, Madison, US, 1998. Morgan Kaufmann Publishers, San Francisco, US.
[21]
T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[22]
A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[23]
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001.
[24]
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, 1993. software available from http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz.
[25]
V. Raman and J. M. Hellerstein. Potters wheel: An interactive data cleaning system. In Proc. of the 27th Int'l Conference on Very Large Databases (VLDB), pages 307--316, Rome, Italy, 2001.
[26]
S. Sarawagi, editor. IEEE Data Engineering special issue on Data Cleaning. http://www.research.microsoft, com/research/db/debull/A00dec/issue.htm, December 2000.
[27]
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. 17th International Conf. on Machine Learning, pages 839--846. Morgan Kaufmann, San Francisco, CA, 2000.
[28]
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learing Theory, pages 287--294, 1992.
[29]
S. Toney. Cleanup and deduplication of an international deduplication function. Information Technology and libraries, 11(1):19--28, 1992.
[30]
S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45--66, Nov. 2001.
[31]
W. E. Winkler. Matching and record linkage. In B. G. C. et al, editor, Business Survey Methods, pages 355--384. New York: J. Wiley, 1995. available from http://www.census.gov/.
[32]
W. E. Winkler. The state of record linkage and current research problems. RR99/04, http://www.census.gov/srd/papers/pdf/rr99-04.pdf, 1999.
[33]
B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining (KDD), 2001.
[34]
T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. In Proc. 17th International Conf. on Machine Learning, pages 1191--1198. Morgan Kaufmann, San Francisco, CA, 2000.

Cited By

View all
  • (2024)Entity Matching by Pool-Based Active LearningElectronics10.3390/electronics1303055913:3(559)Online publication date: 30-Jan-2024
  • (2024)Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domainDiscover Artificial Intelligence10.1007/s44163-024-00159-84:1Online publication date: 16-Aug-2024
  • (2024)ERABQS: entity resolution based on active machine learning and balancing query strategyJournal of Intelligent Information Systems10.1007/s10844-024-00853-0Online publication date: 26-Mar-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN:158113567X
DOI:10.1145/775047
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2002

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

KDD02
Sponsor:

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)62
  • Downloads (Last 6 weeks)3
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Entity Matching by Pool-Based Active LearningElectronics10.3390/electronics1303055913:3(559)Online publication date: 30-Jan-2024
  • (2024)Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domainDiscover Artificial Intelligence10.1007/s44163-024-00159-84:1Online publication date: 16-Aug-2024
  • (2024)ERABQS: entity resolution based on active machine learning and balancing query strategyJournal of Intelligent Information Systems10.1007/s10844-024-00853-0Online publication date: 26-Mar-2024
  • (2024)Review of Deep Learning-Based Entity Alignment MethodsGreen, Pervasive, and Cloud Computing10.1007/978-981-99-9893-7_5(61-71)Online publication date: 23-Jan-2024
  • (2024)Entity Resolution Based on Pre-trained Language Models with Two AttentionsWeb and Big Data10.1007/978-981-97-2387-4_29(433-448)Online publication date: 28-Apr-2024
  • (2024)DataAssist: A Machine Learning Approach to Data Cleaning and PreparationIntelligent Systems and Applications10.1007/978-3-031-66431-1_33(476-486)Online publication date: 31-Jul-2024
  • (2023)Mobile Architecture for Version Control SystemsDesigning and Developing Innovative Mobile Applications10.4018/978-1-6684-8582-8.ch003(38-55)Online publication date: 30-Jun-2023
  • (2023)Training Data Selection for Record Linkage ClassificationSymmetry10.3390/sym1505106015:5(1060)Online publication date: 10-May-2023
  • (2023)Attribute Augmentation Based Alignment Algorithm for Pairs of Dyadic Graph Entity AlignmentComputer Science and Application10.12677/CSA.2023.13511413:05(1166-1177)Online publication date: 2023
  • (2023)Toward an Efficient and Effective Credit Scorer for Cross-Border E-Commerce EnterprisesScientific Programming10.1155/2023/52810502023Online publication date: 1-Jan-2023
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media