Duplicate Record Detection: A Survey

Authors:

Ahmed K. Elmagarmid,

Panagiotis G. Ipeirotis,

Vassilios S. Verykios

IEEE Transactions on Knowledge and Data Engineering, Volume 19, Issue 1

Pages 1 - 16

Published: 01 January 2007

Abstract

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
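
As an illustration of the kind of field-level similarity metric the abstract refers to, the following is a minimal sketch that compares two records with a normalized character edit distance. It is not the survey's own method: the function names (`edit_distance`, `field_similarity`, `likely_duplicates`), the lowercasing and trimming step, the plain averaging over shared fields, and the 0.8 threshold are all assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical.
    Lowercasing and trimming are illustrative normalization choices."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Flag two records as probable duplicates when the average
    similarity over their shared fields reaches the (assumed) threshold."""
    shared = r1.keys() & r2.keys()
    if not shared:
        return False
    avg = sum(field_similarity(r1[f], r2[f]) for f in shared) / len(shared)
    return avg >= threshold
```

With these assumptions, two records that differ only by a transcription error in one field, such as "Jon Smith" versus "John Smith" with the same city, score well above the threshold even though they share no common key. A production system would typically weight fields differently and avoid comparing every pair of records.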

Published In

IEEE Transactions on Knowledge and Data Engineering Volume 19, Issue 1

January 2007

144 pages

ISSN:1041-4347

Copyright © 2007.

Publisher

IEEE Educational Activities Department

United States
