DOI: 10.5555/1287369.1287420

Eliminating fuzzy duplicates in data warehouses

Published: 20 August 2002

Abstract

The duplicate elimination problem of detecting multiple tuples that describe the same real-world entity is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
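
To make the limitation of purely textual similarity concrete, the following is a minimal sketch of a length-normalized edit-distance similarity between strings and between multi-attribute tuples. It is not the algorithm developed in the paper; the function names (edit_distance, similarity, tuple_similarity), the normalization by the longer string, and the unweighted averaging over attributes are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest


def tuple_similarity(t1: Sequence[str], t2: Sequence[str]) -> float:
    """Unweighted mean of attribute similarities (hypothetical combination rule)."""
    if len(t1) != len(t2) or not t1:
        raise ValueError("tuples must be non-empty and of equal length")
    return sum(similarity(x, y) for x, y in zip(t1, t2)) / len(t1)


if __name__ == "__main__":
    # An abbreviation of the same entity versus a different entity.
    print(similarity("USA", "United States"))
    print(similarity("USA", "UAE"))
```

Under this measure, "USA" scores lower against "United States" than against "UAE". A threshold loose enough to match the abbreviation would therefore also match the unrelated country, which is the kind of false positive the abstract describes.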


Published In

VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases
August 2002
1110 pages

Publisher

VLDB Endowment
