Article

Eliminating fuzzy duplicates in data warehouses

Authors:

Rohit Ananthakrishna,

Surajit Chaudhuri,

Venkatesh Ganti

VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases

Pages 586 - 597

Published: 20 August 2002

Abstract

Duplicate elimination, the problem of detecting multiple tuples that describe the same real-world entity, is an important data cleaning problem. Previous domain-independent solutions relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in the dimensional tables of a data warehouse, which are usually associated with hierarchies. We exploit these hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and we evaluate it on real datasets from an operational data warehouse.
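To make the abstract's argument concrete, here is a minimal illustrative sketch in Python. It is not the paper's algorithm: the function names, the toy state-to-city hierarchy, and the use of Jaccard overlap between child sets are all assumptions chosen for illustration. It shows why a purely textual measure struggles with abbreviations, and how information from a hierarchy could, in principle, supply an additional signal.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def textual_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest


def child_overlap(children_a: set, children_b: set) -> float:
    """Jaccard overlap of the child sets of two values in a hierarchy.

    Assumption for illustration only: two parent values that share many
    children are more likely to denote the same entity.
    """
    union = children_a | children_b
    if not union:
        return 0.0
    return len(children_a & children_b) / len(union)


# Hypothetical toy hierarchy (state -> cities), invented for this sketch.
CHILDREN = {
    "WA": {"Seattle", "Redmond", "Tacoma"},
    "Washington": {"Seattle", "Redmond", "Spokane"},
    "LA": {"New Orleans", "Baton Rouge"},
}

if __name__ == "__main__":
    for x, y in [("WA", "Washington"), ("WA", "LA")]:
        print(x, y,
              round(textual_similarity(x, y), 2),
              round(child_overlap(CHILDREN[x], CHILDREN[y]), 2))
```

In this toy data, the abbreviation pair "WA" / "Washington" scores lower on textual similarity than the unrelated pair "WA" / "LA", so any threshold loose enough to accept the abbreviation also accepts the false positive. The child-set overlap separates the two pairs in the opposite direction, which is the kind of signal a hierarchy can contribute.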

Index Terms

Eliminating fuzzy duplicates in data warehouses
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Probabilistic reasoning
      2. Vagueness and fuzzy logic
2. Information systems

Published In

VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases

August 2002

1110 pages

Publisher

VLDB Endowment
