Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1989284.1989289acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

Tight bounds for Lp samplers, finding duplicates in streams, and related problems

Published: 13 June 2011 Publication History

Abstract

In this paper, we present near-optimal space bounds for Lp-samplers. Given a stream of updates (additions and subtraction) to the coordinates of an underlying vector x in Rn, a perfect Lp sampler outputs the i-th coordinate with probability xipxpp. In SODA 2010, Monemizadeh and Woodruff showed polylog space upper bounds for approximate Lp-samplers and demonstrated various applications of them. Very recently, Andoni, Krauthgamer and Onak improved the upper bounds and gave a O-plog3n) space ε relative error and constant failure rate Lp-sampler for p є [1,2]. In this work, we give another such algorithm requiring only O(ε-plog2n) space for p є (1,2). For p є (0,1), our space bound is O(ε-1log2n), while for the p=1 case we have an O(log(1/ε)ε-log2n) space algorithm. We also give a O(log2n) bits zero relative error L0-sampler, improving the O(log3n) bits algorithm due to Frahling, Indyk and Sohler.
As an application of our samplers, we give better upper bounds for the problem of finding duplicates in data streams. In case the length of the stream is longer than the alphabet size, L1 sampling gives us an O(log2n) space algorithm, thus improving the previous O(log3n) bound due to Gopalan and Radhakrishnan.
In the second part of our work, we prove an Ω (log2n) lower bound for sampling from 0, ± 1 vectors (in this special case, the parameter p is not relevant for Lp sampling). This matches the space of our sampling algorithms for constant ε>0. We also prove tight space lower bounds for the finding duplicates and heavy hitters problems. We obtain these lower bounds using reductions from the communication complexity problem augmented indexing.

References

[1]
Alex Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. Manuscript, 2010.
[2]
Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In SODA, pages 1190--1197, 2010.
[3]
Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In SODA, pages 633--634, 2002.
[4]
Radu Berinde, Graham Cormode, Piotr Indyk, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. In PODS, pages 157--166, 2009.
[5]
Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. In PODS, pages 147--156, 2009.
[6]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004.
[7]
Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Stream sampling for variance-optimal estimation of subset sums. In SODA, pages 1255--1264, 2009.
[8]
Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005.
[9]
Graham Cormode, S. Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, pages 25--36, 2005.
[10]
Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. Optimal sampling from distributed streams. In PODS, pages 77--86, 2010.
[11]
Nick G. Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54(6), 2007.
[12]
Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual symposium on Computational geometry, SCG '05, pages 142--149, New York, NY, USA, 2005. ACM.
[13]
Anna Gilbert and Piotr Indyk. Sparse recovery using sparse matrices. In Proceeding of IEEE, 2010.
[14]
Parikshit Gopalan and Jaikumar Radhakrishnan. Finding duplicates in a data stream. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 402--411, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics.
[15]
T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In FOCS, pages 765--774, 2009.
[16]
Daniel M. Kane, Jelani Nelson, Ely Porat, and Woodruff David P. Fast moment estimation in data streams in optimal space. Manuscript, 2010.
[17]
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In SODA, pages 1161--1178, 2010.
[18]
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data, PODS '10, pages 41--52, New York, NY, USA, 2010. ACM.
[19]
Mauricio Karchmer. A New Approach to Circuit Depth. PhD thesis, MIT, 1989.
[20]
Mauricio Karchmer and Avi Wigderson. Monotone circuits for connectivity require super-logarithmic depth. In Proceedings of the twentieth annual ACM symposium on Theory of computing, STOC '88, pages 539--550, New York, NY, USA, 1988. ACM.
[21]
Donald E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms. Addison-Wesley, 1969.
[22]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Duplicate detection in click streams. In WWW, pages 12--21, 2005.
[23]
Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, STOC '95, pages 103--111, New York, NY, USA, 1995. ACM.
[24]
Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In SODA, pages 1143--1160, 2010.
[25]
S. Muthukrishnan. Data Streams: Algorithms and Applications.
[26]
N. Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, STOC '90, pages 204--212, New York, NY, USA, 1990. ACM.
[27]
Gabor Tardos and Uri Zwick. The communication complexity of the universal relation. In Proceedings of the 12th Annual IEEE Conference on Computational Complexity, pages 247--, Washington, DC, USA, 1997. IEEE Computer Society.
[28]
Jun Tarui. Finding a duplicate and a missing item in a stream. In Jin-Yi Cai, S. Cooper, and Hong Zhu, editors, Theory and Applications of Models of Computation, volume 4484 of Lecture Notes in Computer Science, pages 128--135. Springer Berlin / Heidelberg, 2007.
[29]
David Woodruff and T. S. Jayram. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with low error. In SODA, 2011.

Cited By

View all
  • (2024)Streaming Algorithms with Few State ChangesProceedings of the ACM on Management of Data10.1145/36511452:2(1-28)Online publication date: 14-May-2024
  • (2024)Optimal Multi-pass Lower Bounds for MST in Dynamic StreamsProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649755(835-846)Online publication date: 10-Jun-2024
  • (2023)Model Counting Meets F0 EstimationACM Transactions on Database Systems10.1145/360349648:3(1-28)Online publication date: 9-Aug-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2011
332 pages
ISBN:9781450306607
DOI:10.1145/1989284
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2011

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. finding duplicates
  2. sampling
  3. streaming algorithms

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '11
Sponsor:

Acceptance Rates

PODS '11 Paper Acceptance Rate 25 of 113 submissions, 22%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)32
  • Downloads (Last 6 weeks)3
Reflects downloads up to 11 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Streaming Algorithms with Few State ChangesProceedings of the ACM on Management of Data10.1145/36511452:2(1-28)Online publication date: 14-May-2024
  • (2024)Optimal Multi-pass Lower Bounds for MST in Dynamic StreamsProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649755(835-846)Online publication date: 10-Jun-2024
  • (2023)Model Counting Meets F0 EstimationACM Transactions on Database Systems10.1145/360349648:3(1-28)Online publication date: 9-Aug-2023
  • (2023)Applications of Sketching and Pathways to ImpactProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3589937(5-10)Online publication date: 18-Jun-2023
  • (2023)Streaming Euclidean Max-Cut: Dimension vs Data ReductionProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585170(170-182)Online publication date: 2-Jun-2023
  • (2023)Streaming Euclidean MST to a Constant FactorProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585168(156-169)Online publication date: 2-Jun-2023
  • (2023)KVSAgg: Secure Aggregation of Distributed Key-Value Sets2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00139(1775-1789)Online publication date: Apr-2023
  • (2023)Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS57990.2023.00093(1515-1550)Online publication date: 6-Nov-2023
  • (2022)SpaceSaving±Proceedings of the VLDB Endowment10.14778/3514061.351406815:6(1215-1227)Online publication date: 1-Feb-2022
  • (2022)New streaming algorithms for high dimensional EMD and MSTProceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing10.1145/3519935.3519979(222-233)Online publication date: 9-Jun-2022
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media