A new approach for testing properties of discrete distributions

I Diakonikolas, DM Kane - 2016 IEEE 57th Annual Symposium …, 2016 - ieeexplore.ieee.org
2016 IEEE 57th Annual Symposium on Foundations of Computer Science …, 2016ieeexplore.ieee.org
We study problems in distribution property testing: Given sample access to one or more
unknown discrete distributions, we want to determine whether they have some global
property or are epsilon-far from having the property in L1 distance (equivalently, total
variation distance, or" statistical distance"). In this work, we give a novel general approach
for distribution testing. We describe two techniques: our first technique gives sample-optimal
testers, while our second technique gives matching sample lower bounds. As a …
We study problems in distribution property testing: Given sample access to one or more unknown discrete distributions, we want to determine whether they have some global property or are epsilon-far from having the property in L1 distance (equivalently, total variation distance, or "statistical distance").In this work, we give a novel general approach for distribution testing. We describe two techniques: our first technique gives sample-optimal testers, while our second technique gives matching sample lower bounds. As a consequence, we resolve the sample complexity of a wide variety of testing problems. Our upper bounds are obtained via a modular reduction-based approach. Our approach yields optimal testers for numerous problemsby using a standard L2-identity tester as a black-box. Using this recipe, we obtain simple estimators for a wide range of problems, encompassing many problems previously studied in the TCS literature, namely: (1) identity testing to a fixed distribution, (2) closeness testing between two unknown distributions (with equal/unequal sample sizes), (3) independence testing (in any number of dimensions), (4) closeness testing for collections of distributions, and(5) testing histograms. For all of these problems, our testers are sample-optimal, up to constant factors. With the exception of (1), ours are the first sample-optimal testers for the corresponding problems. Moreover, our estimators are significantly simpler to state and analyze compared to previous results. As an important application of our reduction-based technique, we obtain the first adaptive algorithm for testing equivalence betweentwo unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions - as opposed to merely their domain size -and is significantly better compared to the worst-case optimal L1-tester in many natural instances. Moreover, our technique naturally generalizes to other metrics beyond the L1-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence testerunder the Hellinger distance. Our lower bounds are obtained via a direct information-theoretic approach: Given a candidate hard instance, our proof proceeds by boundingthe mutual information between appropriate random variables. While this is a classical method in information theory, prior to our work, it had not been used in this context. Previous lower bounds relied either on the birthday paradox, oron moment-matching and were thus restricted to symmetric properties. Our lower bound approach does not suffer from any such restrictions and gives tight sample lower bounds for the aforementioned problems.
ieeexplore.ieee.org