DOI: 10.1145/775047.775116

Learning to match and cluster large high-dimensional data sets for data integration

Published: 23 July 2002

    Abstract

    Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
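
The idea of exploiting similarities in textual names can be illustrated with a small sketch. The code below is not the adaptive method described in the paper; it is a fixed, non-adaptive TF-IDF cosine-similarity matcher of the general kind the abstract treats as a baseline. The function names, the whitespace tokenizer, the smoothed IDF formula, and the 0.3 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(name):
    # Assumption: lowercase whitespace tokens are enough for illustration.
    return name.lower().split()


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict) per name."""
    docs = [Counter(tokenize(n)) for n in names]
    # Document frequency: number of names in which each token occurs.
    df = Counter(tok for d in docs for tok in d)
    n = len(docs)
    vectors = []
    for d in docs:
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vec = {t: tf * math.log((n + 1) / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def best_matches(left, right, threshold=0.3):
    """Pair each name in `left` with its most similar name in `right`."""
    if not right:
        return []
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    matches = []
    for i, u in enumerate(lv):
        score, j = max((cosine(u, v), j) for j, v in enumerate(rv))
        if score >= threshold:
            matches.append((left[i], right[j], score))
    return matches


if __name__ == "__main__":
    a = ["AT&T Labs Research", "Carnegie Mellon University"]
    b = ["Carnegie Mellon Univ.", "AT&T Labs"]
    for l, r, s in best_matches(a, b):
        print(f"{l!r} ~ {r!r} (cosine {s:.2f})")
```

This sketch compares every pair of names, which is quadratic in the number of names. An adaptive approach of the kind the abstract describes would instead learn the pairing decision from labeled examples in the target domain rather than rely on a hand-picked threshold.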


    Published In

    KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
    July 2002
    719 pages
    ISBN:158113567X
    DOI:10.1145/775047
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. clustering
    2. large datasets
    3. learning
    4. text mining

    Qualifiers

    • Article

    Conference

    KDD02

    Acceptance Rates

    KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Review of Deep Learning-Based Entity Alignment Methods. Green, Pervasive, and Cloud Computing, pp. 61-71. DOI: 10.1007/978-981-99-9893-7_5. Online publication date: 23-Jan-2024
    • (2023) A Domain-Oriented Entity Alignment Approach Based on Filtering Multi-Type Graph Neural Networks. Applied Sciences 13:16 (9237). DOI: 10.3390/app13169237. Online publication date: 14-Aug-2023
    • (2023) FlexER: Flexible Entity Resolution for Multiple Intents. Proceedings of the ACM on Management of Data 1:1 (1-27). DOI: 10.1145/3588722. Online publication date: 30-May-2023
    • (2023) MixER: linear interpolation of latent space for entity resolution. Complex & Intelligent Systems 10:1 (3-22). DOI: 10.1007/s40747-023-01018-2. Online publication date: 14-Mar-2023
    • (2023) Effective entity matching with transformers. The VLDB Journal 32:6 (1215-1235). DOI: 10.1007/s00778-023-00779-z. Online publication date: 17-Jan-2023
    • (2023) Unsupervised Deep Cross-Language Entity Alignment. Machine Learning and Knowledge Discovery in Databases: Research Track, pp. 3-19. DOI: 10.1007/978-3-031-43421-1_1. Online publication date: 18-Sep-2023
    • (2023) QA-Matcher: Unsupervised Entity Matching Using a Question Answering Model. Advances in Knowledge Discovery and Data Mining, pp. 174-185. DOI: 10.1007/978-3-031-33383-5_14. Online publication date: 26-May-2023
    • (2023) An Improved Active Machine Learning Query Strategy for Entity Matching Problem. Advances in Machine Intelligence and Computer Science Applications, pp. 317-327. DOI: 10.1007/978-3-031-29313-9_28. Online publication date: 7-Apr-2023
    • (2022) Exploring the use of topological data analysis to automatically detect data quality faults. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.931398. Online publication date: 5-Dec-2022
    • (2022) PromptEM. Proceedings of the VLDB Endowment 16:2 (369-378). DOI: 10.14778/3565816.3565836. Online publication date: 1-Oct-2022
