- Article, April 2005
Robust Identification of Fuzzy Duplicates
ICDE '05: Proceedings of the 21st International Conference on Data Engineering, Pages 865–876, https://doi.org/10.1109/ICDE.2005.125
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that ...
- Article, June 2003
Robust and efficient fuzzy match for online data cleaning
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, Pages 313–324, https://doi.org/10.1145/872757.872796
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a ...
- Article, May 2000
Towards estimation error guarantees for distinct values
PODS '00: Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Pages 268–279, https://doi.org/10.1145/335168.335230
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We ...
- Article, June 1999
On random sampling over joins
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Pages 263–274, https://doi.org/10.1145/304182.304206
A major bottleneck in implementing sampling as a primitive relational operation is the inefficiency of sampling the output of a query. It is not even known whether it is possible to generate a sample of a join tree without first evaluating the join tree ...
Also Published in:
ACM SIGMOD Record: Volume 28 Issue 2
- Article, June 1998
Random sampling for histogram construction: how much is enough?
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Pages 436–447, https://doi.org/10.1145/276304.276343
Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address ...
Also Published in:
ACM SIGMOD Record: Volume 27 Issue 2