Estimating the Number of Distinct Items in a Database by Sampling
Pages 2473 - 2476
Abstract
Counting the number of distinct items in a dataset is a well-known computational problem with numerous applications. Sometimes exact counting is infeasible, and one must use an approximation method. One approach is to estimate the number of distinct items from a random sample. This approach is useful, for example, when the dataset is too big, or when only a sample, rather than the entire dataset, is available. Moreover, it can considerably speed up the computation. In statistics, this problem is known as the Unseen Species Problem. In this paper, we propose an estimation method for this problem that is especially suitable for cases where the sample is much smaller than the entire set and the number of repetitions of each item is relatively small. Our method is simple in comparison to known methods, and its estimates are good enough to make it useful on certain real-life datasets that arise in data mining scenarios. We demonstrate our method on real data, where the task at hand is to estimate the number of duplicate URLs.
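The abstract does not spell out the proposed estimator, so the following is only an illustration of the general problem setting, not the authors' method. It sketches the classical bias-corrected Chao1 estimator, a standard baseline for the unseen species problem, which extrapolates the number of distinct items from how many items appear exactly once and exactly twice in the sample. The function name `chao1_estimate` and the example URLs are hypothetical.

```python
from collections import Counter


def chao1_estimate(sample):
    """Estimate the number of distinct items in the full dataset
    from a random sample, using the bias-corrected Chao1 formula.

    NOTE: this is a classical baseline chosen for illustration;
    it is NOT the estimator proposed in the paper.
    """
    counts = Counter(sample)                  # item -> occurrences in sample
    d = len(counts)                           # distinct items seen in sample
    freq_of_freq = Counter(counts.values())   # k -> number of items seen k times
    f1 = freq_of_freq.get(1, 0)               # items seen exactly once
    f2 = freq_of_freq.get(2, 0)               # items seen exactly twice
    # Bias-corrected form; well defined even when f2 == 0.
    return d + f1 * (f1 - 1) / (2 * (f2 + 1))


if __name__ == "__main__":
    # Hypothetical sample of URLs drawn at random from a larger crawl.
    urls = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/b",
        "http://example.com/c",
        "http://example.com/c",
        "http://example.com/d",
    ]
    print(chao1_estimate(urls))
```

Many singletons relative to doubletons suggest that a large part of the dataset has not been observed yet, so the estimate rises above the observed distinct count; when every sampled item repeats, the estimate equals the observed count.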
Published In

November 2019
3373 pages
ISBN:9781450369763
DOI:10.1145/3357384
- General Chairs:
- Wenwu Zhu,
- Dacheng Tao,
- Xueqi Cheng,
- Program Chairs:
- Peng Cui,
- Elke Rundensteiner,
- David Carmel,
- Qi He,
- Jeffrey Xu Yu
Copyright © 2019 ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 03 November 2019
Qualifiers
- Short-paper
Conference
CIKM '19: The 28th ACM International Conference on Information and Knowledge Management
November 3 - 7, 2019
Beijing, China
Acceptance Rates
CIKM '19 Paper Acceptance Rate 202 of 1,031 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%