Learning-based Support Estimation in Sublinear Time

Eden, Talya; Indyk, Piotr; Narayanan, Shyam; Rubinfeld, Ronitt; Silwal, Sandeep; Wagner, Tal

Computer Science > Machine Learning

arXiv:2106.08396 (cs)

[Submitted on 15 Jun 2021]

Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $\pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. (ICLR'19) as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm.
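To make the setting concrete, the following is a minimal sketch of how a frequency predictor can help with support estimation. It is not the algorithm from the paper. It assumes an idealized predictor that returns the exact number of occurrences of an element in the data set, whereas the paper handles predictors that are only correct up to a constant factor. Under that assumption, an element drawn uniformly from the data set has expected inverse count equal to (number of distinct elements) / n, so averaging inverse predicted counts over the sample and scaling by n gives an unbiased estimate. The names `estimate_support` and `predicted_count` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def estimate_support(sample, predicted_count, n):
    """Inverse-frequency estimate of the number of distinct elements.

    sample          -- elements drawn uniformly at random from the data set
    predicted_count -- hypothetical predictor: element -> predicted number of
                       occurrences in the data set (assumed exact here)
    n               -- size of the full data set
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    # For a uniformly drawn element x, E[1 / count(x)] = (#distinct) / n,
    # so the sample mean of 1 / count(x), scaled by n, estimates #distinct.
    inverse_sum = sum(1.0 / predicted_count(x) for x in sample)
    return n * inverse_sum / len(sample)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data set: element i appears (i % 5) + 1 times.
    data = [i for i in range(1000) for _ in range(i % 5 + 1)]
    true_counts = Counter(data)  # stands in for a perfect predictor
    sample = random.choices(data, k=500)
    estimate = estimate_support(sample, lambda x: true_counts[x], len(data))
    print("true support:", len(true_counts), "estimate:", round(estimate, 1))
```

When the predicted counts are off by a constant factor, this simple estimate is off by up to the same factor, which is the gap that an approximate-predictor analysis has to close.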

Comments:	17 pages. Published as a conference paper in ICLR 2021
Subjects:	Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)
Cite as:	arXiv:2106.08396 [cs.LG]
	(or arXiv:2106.08396v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2106.08396

Submission history

From: Shyam Narayanan
[v1] Tue, 15 Jun 2021 19:53:12 UTC (161 KB)
