tutorial

Lp Samplers and Their Applications: A Survey

Published: 13 February 2019

Abstract

The notion of Lp sampling, and the corresponding algorithms known as Lp samplers, has found a wide range of applications in the design of data stream algorithms and beyond. In this survey, we present some of the core algorithms that achieve this sampling distribution, based on ideas from hashing, sampling, and sketching. We give results for the special cases of insertion-only inputs, lower bounds for the sampling problems, and ways to efficiently sample multiple elements. We describe a range of applications of Lp sampling, drawing on problems from across computer science, from matrix and graph computations to geometric and vector streaming problems.
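
To make the sampling distribution concrete, the sketch below assumes the standard definition, which the abstract itself does not spell out: given a frequency vector x, an ideal Lp sampler returns index i with probability |x_i|^p / sum_j |x_j|^p, and for p = 0 it returns a uniformly random nonzero index. The function names `lp_distribution` and `reservoir_l1_sample` are hypothetical and chosen for illustration only. The second function is a simple weighted reservoir technique for the insertion-only case with p = 1. It is not taken from the survey, and it does not handle deletions, which is where the hashing and sketching ideas become necessary.

```python
import random


def lp_distribution(x, p):
    """Ideal Lp sampling distribution for a frequency vector x (dict: index -> value).

    Assumed definition: for p > 0, index i has probability
    |x[i]|**p / sum_j |x[j]|**p; for p = 0, the distribution is
    uniform over the nonzero entries.
    """
    support = {i: v for i, v in x.items() if v != 0}
    if not support:
        raise ValueError("the zero vector has no Lp sampling distribution")
    if p == 0:
        return {i: 1.0 / len(support) for i in support}
    weights = {i: abs(v) ** p for i, v in support.items()}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def reservoir_l1_sample(stream, rng):
    """One-pass L1 sample from an insertion-only stream of (item, weight) pairs.

    Each arriving update replaces the current sample with probability
    weight / (total weight seen so far), so the final sample equals
    item i with probability x[i] / sum_j x[j].
    """
    total = 0.0
    chosen = None
    for item, weight in stream:
        if weight <= 0:
            raise ValueError("insertion-only streams need positive weights")
        total += weight
        if rng.random() < weight / total:
            chosen = item
    return chosen


if __name__ == "__main__":
    x = {"a": 3, "b": -4, "c": 0}
    for p in (0, 1, 2):
        print(p, lp_distribution(x, p))
    rng = random.Random(0)
    print(reservoir_l1_sample([("a", 1), ("b", 3), ("a", 2)], rng))
```

The reservoir keeps only a running total and one candidate item, which is why the insertion-only case is treated as a special case: once negative updates are allowed, a previously chosen item may have been deleted, and this approach no longer applies.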



Published In

ACM Computing Surveys, Volume 52, Issue 1
January 2020
758 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3309872
Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 February 2019
Accepted: 01 November 2018
Revised: 01 November 2018
Received: 01 September 2017
Published in CSUR Volume 52, Issue 1


Author Tags

  1. Lp sampling
  2. data stream algorithms

Qualifiers

  • Tutorial
  • Research
  • Refereed

Funding Sources

  • Royal Society Wolfson Research Merit Award
  • Alan Turing Institute under EPSRC
  • European Research Council


Article Metrics

  • Downloads (Last 12 months): 46
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 30 Aug 2024


Cited By

  • (2024) Streaming Graph Algorithms in the Massively Parallel Computation Model. Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, 496-507. DOI: 10.1145/3662158.3662770. Online publication date: 17-Jun-2024
  • (2023) Towards Optimal Moment Estimation in Streaming and Distributed Models. ACM Transactions on Algorithms 19:3, 1-35. DOI: 10.1145/3596494. Online publication date: 24-Jun-2023
  • (2023) Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming. 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), 1515-1550. DOI: 10.1109/FOCS57990.2023.00093. Online publication date: 6-Nov-2023
  • (2023) Global triangle estimation based on first edge sampling in large graph streams. The Journal of Supercomputing 79:13, 14079-14116. DOI: 10.1007/s11227-023-05205-3. Online publication date: 3-Apr-2023
  • (2022) Truly Perfect Samplers for Data Streams and Sliding Windows. Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 29-40. DOI: 10.1145/3517804.3524139. Online publication date: 12-Jun-2022
  • (2022) Secure Sampling with Sublinear Communication. Theory of Cryptography, 348-377. DOI: 10.1007/978-3-031-22365-5_13. Online publication date: 7-Nov-2022
  • (2020) Fast and Accurate Traffic Measurement With Hierarchical Filtering. IEEE Transactions on Parallel and Distributed Systems 31:10, 2360-2374. DOI: 10.1109/TPDS.2020.2991007. Online publication date: 1-Oct-2020
