An incremental clustering scheme for data de-duplication

Costa, Gianni; Manco, Giuseppe; Ortale, Riccardo

doi:10.1007/s10618-009-0155-0

Published: 28 October 2009

Data Mining and Knowledge Discovery, Volume 20, pages 152–187 (2010)

Abstract

We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a hash-based index, which maps each tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be identified efficiently by retrieving only those tuples that appear in the buckets associated with the query tuple, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
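The indexing idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the two key-generation schemes, so everything concrete here is an assumption made for illustration: tuples are treated as strings, similarity is Jaccard over character q-grams, and indexing keys come from banded min-hashing. The names `qgrams`, `index_keys` and `IncrementalDeduplicator`, and the parameters `bands`, `rows` and `threshold`, are hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a normalized tuple string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between two q-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _seeded_hash(seed, gram):
    """Deterministic seeded 64-bit hash of a q-gram (assumed hash family)."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def index_keys(grams, bands=8, rows=2):
    """Map a q-gram set to `bands` indexing keys via banded min-hashing."""
    keys = []
    for b in range(bands):
        sig = tuple(min(_seeded_hash(b * rows + r, g) for g in grams)
                    for r in range(rows))
        keys.append((b, sig))
    return keys


class IncrementalDeduplicator:
    """Assigns each arriving tuple to a cluster via its nearest indexed neighbor."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.buckets = defaultdict(set)  # indexing key -> ids of stored tuples
        self.grams = []                  # tuple id -> q-gram set
        self.cluster_of = []             # tuple id -> cluster id

    def add(self, text):
        g = qgrams(text)
        keys = index_keys(g)
        # Candidate neighbors are only the tuples sharing at least one bucket,
        # so the full database is never scanned.
        candidates = set()
        for k in keys:
            candidates |= self.buckets.get(k, set())
        best_id, best_sim = None, 0.0
        for c in sorted(candidates):
            sim = jaccard(g, self.grams[c])
            if sim > best_sim:
                best_id, best_sim = c, sim
        new_id = len(self.grams)
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]  # join the neighbor's cluster
        else:
            cluster = new_id                    # no close neighbor: new cluster
        self.grams.append(g)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].add(new_id)
        return cluster


dedup = IncrementalDeduplicator(threshold=0.5)
first = dedup.add("John Smith, 12 Main St")
second = dedup.add("john  smith, 12 main st")   # same tuple up to case and spacing
third = dedup.add("Zebra Quux 999")
print(first == second, first == third)
```

In this sketch, tuples that normalize to the same string always share every bucket and land in the same cluster, while near-duplicates share a bucket only with a probability that grows with their Jaccard similarity. Raising `bands` or lowering `rows` trades extra candidate comparisons for fewer missed neighbors.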

Author information

Authors and Affiliations

ICAR-CNR, Via P. Bucci 41c, 87036, Rende, CS, Italy
Gianni Costa, Giuseppe Manco & Riccardo Ortale

Corresponding author

Correspondence to Giuseppe Manco.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

Costa, G., Manco, G. & Ortale, R. An incremental clustering scheme for data de-duplication. Data Min Knowl Disc 20, 152–187 (2010). https://doi.org/10.1007/s10618-009-0155-0

Received: 10 November 2007
Accepted: 05 October 2009
Published: 28 October 2009
Issue Date: January 2010
DOI: https://doi.org/10.1007/s10618-009-0155-0
