DOI: 10.1145/1807167.1807252
research-article

On active learning of record matching packages

Published: 06 June 2010 Publication History

Abstract

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching because manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
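To make the active learning setting concrete, the following is a minimal sketch of a generic loop in which the learner, not the user, chooses which record pairs to label. It is not the algorithm presented in the paper, which the abstract describes as fundamentally different from traditional approaches. The names `jaccard`, `active_learn_threshold`, and `oracle`, the token-based similarity, and the assumption that labels are noise-free and monotone in similarity are all illustrative assumptions.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def active_learn_threshold(pairs, oracle, budget):
    """Learn a similarity threshold by asking for labels on the most uncertain pairs.

    Assumption: a pair is a match exactly when its similarity is above some
    unknown cutoff, and the oracle (a human labeler) answers without noise.
    """
    scored = sorted((jaccard(a, b), (a, b)) for a, b in pairs)
    lo, hi = 0.0, 1.0  # highest known non-match score, lowest known match score
    labeled = {}
    for _ in range(budget):
        threshold = (lo + hi) / 2
        # Only pairs strictly between the known bounds are still uncertain.
        candidates = [
            (abs(score - threshold), score, pair)
            for score, pair in scored
            if pair not in labeled and lo < score < hi
        ]
        if not candidates:
            break
        _, score, pair = min(candidates)  # pair closest to the current threshold
        is_match = oracle(*pair)
        labeled[pair] = is_match
        if is_match:
            hi = score
        else:
            lo = score
    return (lo + hi) / 2, labeled


if __name__ == "__main__":
    pairs = [
        ("acme corp", "acme corp inc"),
        ("acme corp", "acme corporation"),
        ("a b c", "a b d"),
        ("x y", "z w"),
    ]
    # Stand-in for a human labeler; here a hidden cutoff of 0.5.
    threshold, labeled = active_learn_threshold(
        pairs, lambda a, b: jaccard(a, b) >= 0.5, budget=3
    )
    print(threshold, len(labeled))
```

In a passive setting the user would hand the learner a fixed labeled set up front; here the loop requests labels only for pairs whose similarity lies between the current bounds, so confidently classified pairs are never sent to the labeler.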


    Published In

    cover image ACM Conferences
    SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
    June 2010
    1286 pages
    ISBN:9781450300322
    DOI:10.1145/1807167
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. active learning
    2. data cleaning
    3. record matching

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '10: International Conference on Management of Data
    June 6 - 10, 2010
    Indianapolis, Indiana, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
