Consistent Subset Sampling

Kutzkov, Konstantin; Pagh, Rasmus

doi:10.1007/978-3-319-08404-6_26

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8503)

Included in the following conference series:

Scandinavian Workshop on Algorithm Theory

Abstract

Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset $\mathcal{I}\subseteq U$ it should be possible to quickly compute $\mathcal{I}\cap S$, i.e., the elements in $\mathcal{I}$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream.
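A minimal sketch of standard consistent sampling may help fix the idea. It assumes the sampling condition is "the hash value of x is below p", which is one common choice and not necessarily the condition used in the paper; the names `unit_hash` and `consistent_sample` and the use of SHA-256 as the hash are illustrative assumptions.

```python
import hashlib


def unit_hash(x, seed=0):
    """Map x to a value in [0, 1) that is the same on every call (assumed hash choice)."""
    digest = hashlib.sha256(f"{seed}:{x!r}".encode()).digest()
    # Keep 53 bits so the quotient is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def consistent_sample(items, p, seed=0):
    """Return the elements of `items` that satisfy the sampling condition h(x) < p."""
    return {x for x in items if unit_hash(x, seed) < p}


if __name__ == "__main__":
    a = set(range(1000))
    b = set(range(500, 1500))
    # The same elements are sampled no matter which input set they arrive in.
    print(consistent_sample(a, 0.1) & b == consistent_sample(a & b, 0.1))
```

Only the seed and p need to be stored, so the subset S is specified in small space, and any two parties using the same seed agree on which elements are sampled.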

In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^k)$. Using a carefully designed hash function, for a given sampling probability $p \in (0,1]$, we show how to improve the time complexity to $\Theta(b^{\lceil k/2\rceil}\log\log b + pb^k)$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $\Theta(b^{\lceil k/4\rceil})$.
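The baseline mentioned above, standard consistent sampling applied to every k-subset, can be sketched as follows. This is only the straightforward approach that enumerates all k-subsets of a set of size b, not the paper's improved method, whose hash function is not described in the abstract. The names `subset_hash` and `sample_k_subsets_naive` and the SHA-256 based hash are assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def subset_hash(subset, seed=0):
    """Hash a sorted tuple of items to [0, 1), stable across calls (assumed hash choice)."""
    key = f"{seed}:" + ",".join(repr(x) for x in subset)
    digest = hashlib.sha256(key.encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sample_k_subsets_naive(s, k, p, seed=0):
    """Enumerate every k-subset of s and keep those whose hash is below p."""
    # Sorting gives each subset one canonical tuple, so the same subset
    # hashes identically regardless of which set it was found in.
    return {c for c in combinations(sorted(s), k) if subset_hash(c, seed) < p}


if __name__ == "__main__":
    s1 = {1, 2, 3, 4, 5, 6}
    s2 = {4, 5, 6, 7, 8}
    print(sample_k_subsets_naive(s1, 2, 0.5))
    print(sample_k_subsets_naive(s2, 2, 0.5))
```

Every k-subset is hashed even when p is tiny, which is where the enumeration cost of the baseline comes from.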

We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as an incidence stream. Further, building upon recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.

This work is supported by the Danish National Research Foundation under the Sapere Aude program.

References

Baran, I., Demaine, E.D., Pǎtraşcu, M.: Subquadratic Algorithms for 3SUM. Algorithmica 50(4), 584–596 (2008)
Boley, M., Grosskreutz, H.: A Randomized Approach for Approximating the Number of Frequent Sets. In: ICDM 2008, pp. 43–52 (2008)
Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: KDD 2008, pp. 16–24 (2008)
Bordino, I., Donato, D., Gionis, A., Leonardi, S.: Mining Large Networks with Subgraph Counting. In: ICDM 2008, pp. 737–742 (2008)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic Clustering of the Web. Computer Networks 29(8-13), 1157–1166 (1997)
Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: PODS 2006, pp. 253–262 (2006)
Buriol, L.S., Frahling, G., Leonardi, S., Sohler, C.: Estimating Clustering Indexes in Data Streams. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 618–632. Springer, Heidelberg (2007)
Campagna, A., Kutzkov, K., Pagh, R.: On Parallelizing Matrix Multiplication by the Column-Row Method. In: ALENEX 2013, pp. 122–132 (2013)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theor. Comput. Sci. 312(1), 3–15 (2004)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial Hash Functions Are Reliable (Extended Abstract). In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 235–246. Springer, Heidelberg (1992)
Dinur, I., Dunkelman, O., Keller, N., Shamir, A.: Efficient Dissection of Composite Problems, with Applications to Cryptanalysis, Knapsacks, and Combinatorial Search Problems. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 719–740. Springer, Heidelberg (2012)
Geerts, F., Goethals, B., Van den Bussche, J.: Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst. 30(2), 333–363 (2005)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)
Han, Y., Thorup, M.: Integer Sorting in $O(n \sqrt{\log \log n})$ Expected Time and Linear Space. In: FOCS 2002, pp. 135–144 (2002)
Impagliazzo, R., Paturi, R., Zane, F.: Which Problems Have Strongly Exponential Complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001)
Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. J. Algorithms 38(1), 84–90 (2001)
Jin, R., McCallen, S., Breitbart, Y., Fuhry, D., Wang, D.: Estimating the number of frequent itemsets in a large database. In: EDBT, pp. 505–516 (2009)
Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting Arbitrary Subgraphs in Data Streams. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012, Part II. LNCS, vol. 7392, pp. 598–609. Springer, Heidelberg (2012)
Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS 2010, pp. 41–52 (2010)
Pǎtraşcu, M., Williams, R.: On the Possibility of Faster SAT Algorithms. In: SODA 2010, pp. 1065–1075 (2010)
Schroeppel, R., Shamir, A.: A $T = O(2^{n/2})$, $S = O(2^{n/4})$ Algorithm for Certain NP-Complete Problems. SIAM J. Comput. 10(3), 456–464 (1981)
Willard, D.E.: Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)
Woeginger, G.J.: Space and Time Complexity of Exact Algorithms: Some Open Problems (Invited Talk). In: Downey, R.G., Fellows, M.R., Dehne, F. (eds.) IWPEC 2004. LNCS, vol. 3162, pp. 281–290. Springer, Heidelberg (2004)

Author information

Authors and Affiliations

IT University of Copenhagen, Denmark
Konstantin Kutzkov & Rasmus Pagh

Editor information

Editors and Affiliations

Tepper School of Business, Carnegie Mellon University, 15213, Pittsburgh, PA, USA
R. Ravi
DTU Informatics, 2800, Kongens Lyngby, Denmark
Inge Li Gørtz

About this paper

Cite this paper

Kutzkov, K., Pagh, R. (2014). Consistent Subset Sampling. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_26

DOI: https://doi.org/10.1007/978-3-319-08404-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08403-9
Online ISBN: 978-3-319-08404-6
eBook Packages: Computer Science, Computer Science (R0)
