research-article

Duplicate Record Detection: A Survey

Published: 01 January 2007
    Abstract

    Real-world entities often have two or more representations in databases. Duplicate records may not share a common key, may contain errors, or both, which makes matching them difficult. These errors result from transcription mistakes, incomplete information, a lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics commonly used to detect similar field entries, and we present an extensive set of algorithms for detecting approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and a brief discussion of the major open problems in the area.
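
To make the idea of a field similarity metric concrete, here is a minimal sketch of one character-based metric, the Levenshtein edit distance, normalized to a score between 0 and 1. The function names `edit_distance` and `field_similarity`, the lowercasing and trimming step, and the normalization by the longer string's length are illustrative assumptions, not definitions taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical fields, 0.0 when
    every character differs. Case and surrounding spaces are ignored."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty fields are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

Two field values such as "Jon Smith" and "John Smith" differ by a single inserted character, so a metric of this kind scores them as highly similar even though an exact-match comparison would treat them as different.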

    Published In

    IEEE Transactions on Knowledge and Data Engineering  Volume 19, Issue 1
    January 2007
    144 pages

    Publisher

    IEEE Educational Activities Department

    United States

    Author Tags

    1. Duplicate detection
    2. data cleaning
    3. data deduplication
    4. data integration
    5. database hardening
    6. entity matching
    7. entity resolution
    8. fuzzy duplicate detection
    9. identity uncertainty
    10. instance identification
    11. name matching
    12. record linkage
