Article

Learning to match and cluster large high-dimensional data sets for data integration

Authors:

William W. Cohen,

Jacob Richman

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 475 - 480

https://doi.org/10.1145/775047.775116

Published: 23 July 2002

Abstract

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
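
The abstract does not say which similarity measure is used, so the following is only a minimal sketch of what a non-adaptive, name-similarity baseline might look like, assuming TF-IDF weighted word tokens compared by cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `best_matches`) and the sample names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(name):
    """Lowercase a name and split it into alphanumeric word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return cleaned.split()


def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (dict: token -> weight) per name."""
    docs = [Counter(tokenize(n)) for n in names]
    df = Counter(tok for doc in docs for tok in doc)  # document frequency
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumption of this sketch): always positive,
        # larger for tokens that occur in few names.
        vec = {t: (1 + math.log(c)) * math.log(1 + n_docs / df[t])
               for t, c in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def best_matches(names_a, names_b):
    """For each name in names_a, return (name_a, closest name_b, score).

    Assumes names_b is non-empty. Compares every pair, so it illustrates
    the similarity idea only and makes no attempt at scalability.
    """
    vecs = tfidf_vectors(names_a + names_b)
    va, vb = vecs[:len(names_a)], vecs[len(names_a):]
    result = []
    for i, a in enumerate(names_a):
        j = max(range(len(names_b)), key=lambda k: cosine(va[i], vb[k]))
        result.append((a, names_b[j], cosine(va[i], vb[j])))
    return result


if __name__ == "__main__":
    a = ["AT&T Labs Research", "Univ. of Alberta"]
    b = ["University of Alberta", "AT&T Research Labs, Inc.",
         "Carnegie Mellon University"]
    for left, right, score in best_matches(a, b):
        print(f"{left!r} -> {right!r} (similarity {score:.2f})")
```

A fixed measure like this cannot adapt to a domain; the adaptive approach described in the abstract would instead be trained on examples of matching and non-matching names.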

Published In

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

July 2002

719 pages

ISBN: 158113567X

DOI: 10.1145/775047

Conference Chair: Osmar R. Zaïane (University of Alberta, Canada)
General Chair: Randy Goebel (University of Alberta, Canada)
Program Chairs: David Hand (Imperial College, UK), Daniel Keim (AT&T), Raymond Ng (University of British Columbia, Canada)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD02: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 23 - 26, 2002

Edmonton, Alberta, Canada

Acceptance Rates

KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
