Abstract
Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset \(\mathcal{I}\subseteq U\) it should be possible to quickly compute \(\mathcal{I}\cap S\), i.e., the elements in \(\mathcal{I}\) satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream.
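As an illustration of the standard technique, consistent sampling can be sketched with a fixed hash function: an element belongs to S exactly when its hash value falls below the sampling probability p. The names `unit_hash` and `consistent_sample`, and the use of SHA-256 as a stand-in for a random hash function, are assumptions made for this sketch and are not taken from the paper.

```python
import hashlib

def unit_hash(item, seed=0):
    """Deterministically map an item to a pseudo-random value in [0, 1).

    SHA-256 is only a convenient stand-in for a random hash function
    (an assumption of this sketch).
    """
    data = f"{seed}:{item!r}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Keep 53 bits so the division below is exact and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def consistent_sample(items, p, seed=0):
    """Return the elements of `items` that satisfy the condition h(x) < p."""
    return {x for x in items if unit_hash(x, seed) < p}

a = {"apple", "bread", "milk", "tea"}
b = {"bread", "milk", "salt"}
sample_a = consistent_sample(a, 0.5)
sample_b = consistent_sample(b, 0.5)

# Consistency: an element shared by both sets is sampled in both or in neither.
assert sample_a & b == sample_b & a
```

Only the seed and p need to be stored to describe S, and the sample of any set is computed in time linear in its size.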
In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time \(\Theta(b^k)\). Using a carefully designed hash function, for a given sampling probability \(p \in (0,1]\), we show how to improve the time complexity to \(\Theta(b^{\lceil k/2\rceil}\log\log b + pb^k)\) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is \(\Theta(b^{\lceil k/4\rceil})\).
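The abstract does not specify the hash function, so the following is only a sketch of the general idea for k = 2, under an assumed additive hash: a pair {a, b} is sampled when (h(a) + h(b)) mod M falls below pM. With such a hash, the partners of each element lie in a circular interval of hash values, which can be located by binary search in a sorted table instead of testing every pair. The modulus `M`, the SHA-256-based `h`, and all function names are hypothetical choices for illustration, not the paper's construction.

```python
import hashlib
from bisect import bisect_left
from itertools import combinations

M = 2**32  # size of the hash range (illustrative choice)

def h(item, seed=0):
    """Deterministic stand-in for a random hash function into [0, M)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def pair_sampled(a, b, threshold):
    """Sampling condition for a pair under the assumed additive hash."""
    return (h(a) + h(b)) % M < threshold

def sample_pairs_naive(items, p):
    """Baseline: apply the condition to every 2-subset."""
    threshold = int(p * M)
    return {frozenset(c) for c in combinations(items, 2)
            if pair_sampled(*c, threshold)}

def sample_pairs_sorted(items, p):
    """Same sample, found via sorted hash values (items assumed distinct)."""
    threshold = int(p * M)
    hashed = sorted(((h(x), x) for x in items), key=lambda t: t[0])
    keys = [hv for hv, _ in hashed]
    out = set()
    for ha, a in hashed:
        # Partners b need h(b) in the circular interval [lo, lo + threshold).
        lo = (-ha) % M
        hi = lo + threshold
        ranges = [(lo, min(hi, M))]
        if hi > M:  # the interval wraps around the end of the hash range
            ranges.append((0, hi - M))
        for start, end in ranges:
            for j in range(bisect_left(keys, start), bisect_left(keys, end)):
                b = hashed[j][1]
                if b != a:
                    out.add(frozenset((a, b)))
    return out

items = [f"item{i}" for i in range(40)]
assert sample_pairs_sorted(items, 0.1) == sample_pairs_naive(items, 0.1)
```

For b items the sorted variant spends O(b log b) time on sorting and searching plus time proportional to the number of sampled pairs, whereas the baseline always inspects all pairs.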
We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as an incidence stream. Further, building upon recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
This work is supported by the Danish National Research Foundation under the Sapere Aude program.
References
Baran, I., Demaine, E.D., Pǎtraşcu, M.: Subquadratic Algorithms for 3SUM. Algorithmica 50(4), 584–596 (2008)
Boley, M., Grosskreutz, H.: A Randomized Approach for Approximating the Number of Frequent Sets. In: ICDM 2008, pp. 43–52 (2008)
Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: KDD 2008, pp. 16–24 (2008)
Bordino, I., Donato, D., Gionis, A., Leonardi, S.: Mining Large Networks with Subgraph Counting. In: ICDM 2008, pp. 737–742 (2008)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic Clustering of the Web. Computer Networks 29(8-13), 1157–1166 (1997)
Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: PODS 2006, pp. 253–262 (2006)
Buriol, L.S., Frahling, G., Leonardi, S., Sohler, C.: Estimating Clustering Indexes in Data Streams. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 618–632. Springer, Heidelberg (2007)
Campagna, A., Kutzkov, K., Pagh, R.: On Parallelizing Matrix Multiplication by the Column-Row Method. In: ALENEX 2013, pp. 122–132 (2013)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theor. Comput. Sci. 312(1), 3–15 (2004)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial Hash Functions Are Reliable (Extended Abstract). In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 235–246. Springer, Heidelberg (1992)
Dinur, I., Dunkelman, O., Keller, N., Shamir, A.: Efficient Dissection of Composite Problems, with Applications to Cryptanalysis, Knapsacks, and Combinatorial Search Problems. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 719–740. Springer, Heidelberg (2012)
Geerts, F., Goethals, B., Van den Bussche, J.: Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst. 30(2), 333–363 (2005)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)
Han, Y., Thorup, M.: Integer Sorting in \(O(n \sqrt{\log \log n})\) Expected Time and Linear Space. In: FOCS 2002, pp. 135–144 (2002)
Impagliazzo, R., Paturi, R., Zane, F.: Which Problems Have Strongly Exponential Complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001)
Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. J. Algorithms 38(1), 84–90 (2001)
Jin, R., McCallen, S., Breitbart, Y., Fuhry, D., Wang, D.: Estimating the number of frequent itemsets in a large database. In: EDBT 2009, pp. 505–516 (2009)
Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting Arbitrary Subgraphs in Data Streams. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012, Part II. LNCS, vol. 7392, pp. 598–609. Springer, Heidelberg (2012)
Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS 2010, pp. 41–52 (2010)
Pǎtraşcu, M., Williams, R.: On the Possibility of Faster SAT Algorithms. In: SODA 2010, pp. 1065–1075 (2010)
Schroeppel, R., Shamir, A.: A \(T = O(2^{n/2})\), \(S = O(2^{n/4})\) Algorithm for Certain NP-Complete Problems. SIAM J. Comput. 10(3), 456–464 (1981)
Willard, D.E.: Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)
Woeginger, G.J.: Space and Time Complexity of Exact Algorithms: Some Open Problems (Invited Talk). In: Downey, R.G., Fellows, M.R., Dehne, F. (eds.) IWPEC 2004. LNCS, vol. 3162, pp. 281–290. Springer, Heidelberg (2004)
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kutzkov, K., Pagh, R. (2014). Consistent Subset Sampling. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_26
DOI: https://doi.org/10.1007/978-3-319-08404-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08403-9
Online ISBN: 978-3-319-08404-6
eBook Packages: Computer Science, Computer Science (R0)