
Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

The widespread adoption of new techniques and systems for generating, collecting and storing data has made growing volumes of information available. Large quantities of this information are stored as free text. The lack of explicit structure in free text is a major obstacle to categorizing this kind of data for more effective and efficient information retrieval, search and filtering. The abundance of structured data is problematic too. Several databases are available that contain data of the same type. Unfortunately, they often conform to different schemas, which prevents the unified management of even structured information. The Entity Resolution process plays a fundamental role in information integration and management: it aims to infer a uniform and common structure from various large-scale data collections, with which to suitably organize, match and consolidate the information of the individual repositories into one data set. De-duplication is a key step of the Entity Resolution process, whose goal is to discover duplicates within the integrated data, i.e., different tuples that, as a matter of fact, refer to the same real-world entity. This reduces the redundancy of the integrated data and also enables more effective information handling and knowledge extraction through unified access to reconciled and de-duplicated data. Duplicate detection is an active research area that benefits from contributions from diverse research fields, such as machine learning, data mining and knowledge discovery, databases, and information retrieval and extraction. This chapter presents an overview of research on data de-duplication, with the goal of providing a general understanding of, and useful references to, fundamental concepts concerning the recognition of similarities in very large data collections. For this purpose, a variety of state-of-the-art approaches to de-duplication are reviewed.
The discussion of the state of the art follows a taxonomy that, at the highest level, divides the existing approaches into two broad classes, i.e., unsupervised and supervised approaches. Both classes are further divided into sub-classes according to the characteristics shared by the approaches they contain. The strengths and weaknesses of each group of approaches are presented. Meaningful research developments that could further advance the current state of the art are covered as well.
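
To make the notion of a duplicate concrete, the following is a minimal sketch of one simple unsupervised strategy: comparing every pair of records by the Jaccard similarity of their word tokens and flagging pairs above a threshold. It is an illustration only, not a method taken from the chapter; the function names, the sample records and the threshold of 0.6 are all assumptions chosen for the example.

```python
from itertools import combinations


def tokens(record):
    """Lower-case the record and split it into a set of word tokens."""
    # Commas and periods are treated as separators (an assumption of this sketch).
    return set(record.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty records are treated as identical
    return len(a & b) / len(a | b)


def find_duplicates(records, threshold=0.6):
    """Return index pairs whose token-set similarity reaches the threshold."""
    token_sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    ]


# Hypothetical records: the first two describe the same person.
records = [
    "John A. Smith, 12 Main Street, Springfield",
    "Smith John, 12 Main Street Springfield",
    "Mary Jones, 4 Oak Avenue, Shelbyville",
]
pairs = find_duplicates(records)
```

Comparing all pairs costs time quadratic in the number of records, which is why a sketch like this would not scale to the very large collections the abstract has in mind without some way of limiting the candidate pairs.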




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Costa, G., Cuzzocrea, A., Manco, G., Ortale, R. (2011). Data De-duplication: A Review. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_18


  • DOI: https://doi.org/10.1007/978-3-642-22913-8_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22912-1

  • Online ISBN: 978-3-642-22913-8

  • eBook Packages: Engineering, Engineering (R0)
