DOI: 10.1145/1807167.1807252
research-article

On active learning of record matching packages

Published: 06 June 2010

    Abstract

    We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
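
    To make the active learning setting concrete, the following is a minimal sketch in which the learner, not the user, chooses which record pairs to send for labeling. It is not the algorithm presented in the paper. It assumes a single word-level Jaccard similarity, a noise-free labeler, and that every pair at or above some unknown similarity cut-off is a true match. The names `jaccard`, `learn_threshold`, `oracle`, and the toy records are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def learn_threshold(pairs, oracle):
    """Actively learn a similarity threshold by binary search.

    Assumption (not from the paper): labels are monotone in similarity,
    so all pairs at or above an unknown cut-off are matches.
    Returns (threshold, number_of_labels_requested).
    """
    ranked = sorted(pairs, key=lambda p: jaccard(*p))
    lo, hi = 0, len(ranked)  # index of the first match lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1              # the learner picks this pair for labeling
        if oracle(ranked[mid]):   # match: first match is at mid or earlier
            hi = mid
        else:                     # non-match: first match is after mid
            lo = mid + 1
    threshold = jaccard(*ranked[lo]) if lo < len(ranked) else float("inf")
    return threshold, queries


# Toy data: eight candidate pairs, four of which are true matches.
pairs = [
    ("john smith", "john smith"),
    ("john a smith", "john smith"),
    ("john smith jr", "john smyth jr"),
    ("p q r s", "p q r t"),
    ("x y z", "x y w q"),
    ("mary jones", "mary johnson"),
    ("a b c d", "a e f g"),
    ("alice wong", "bob lee"),
]
truth = set(pairs[:4])            # hypothetical ground truth


def oracle(pair):
    """Stand-in for the human labeler."""
    return pair in truth


threshold, queries = learn_threshold(pairs, oracle)
predicted = {p for p in pairs if jaccard(*p) >= threshold}
print(threshold, queries, predicted == truth)
```

    Under these assumptions the learner recovers the cut-off while requesting labels for only a logarithmic number of pairs, whereas a passive learner would rely on whichever pairs the user happened to label. Real record matching involves several similarity functions and noisy labels, which is where this toy setup stops being adequate.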



      Published In

      SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
      June 2010
      1286 pages
      ISBN:9781450300322
      DOI:10.1145/1807167
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. active learning
      2. data cleaning
      3. record matching

      Conference

      SIGMOD/PODS '10: International Conference on Management of Data
      June 6 - 10, 2010
      Indianapolis, Indiana, USA

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

