research-article

Parallel Weighted Random Sampling

Published: 10 September 2022

Abstract

Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps. We give efficient, fast, and practicable parallel and distributed algorithms for building data structures that support sampling single items (alias tables, compressed data structures). This also yields a simplified and more space-efficient sequential algorithm for alias table construction. Our approaches to sampling k out of n items with/without replacement and to subset (Poisson) sampling are output-sensitive, i.e., the sampling algorithms use work linear in the number of different samples. This is also interesting in the sequential case. Weighted random permutation can be done by sorting appropriate random deviates. We show that this is possible with linear work. Finally, we give a communication-efficient, highly scalable approach to (weighted and unweighted) reservoir sampling. This algorithm is based on a fully distributed model of streaming algorithms that might be of independent interest. Experiments for alias tables and sampling with replacement show near linear speedups using up to 158 threads of shared-memory machines. An experimental evaluation of distributed weighted reservoir sampling on up to 5,120 cores also shows good speedups.
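The abstract's remark that a weighted random permutation can be obtained by sorting appropriate random deviates can be illustrated with a small sequential sketch. It assumes exponential deviates with rate equal to each item's weight, a standard choice that the abstract itself does not spell out. The function name `weighted_permutation` and its parameters are hypothetical. The comparison sort used here takes O(n log n) time, so this shows only the idea and not the linear-work method the article claims.

```python
import math
import random


def weighted_permutation(weights, rng=None):
    """Return item indices in weighted random order (illustrative sketch).

    Assumption: index i gets the key -ln(U_i) / w_i, an exponential deviate
    with rate w_i. Sorting by ascending key is equivalent to repeatedly
    drawing the next item with probability proportional to its weight among
    the items not yet drawn.
    """
    rng = rng or random.Random()
    keys = []
    for i, w in enumerate(weights):
        if w <= 0:
            raise ValueError("weights must be positive")
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        keys.append((-math.log(u) / w, i))
    keys.sort()  # comparison sort: O(n log n), not linear work
    return [i for _, i in keys]
```

Taking only the k smallest keys instead of sorting all of them would give a weighted sample of k items without replacement, which is the usual bridge from this idea to reservoir-style sampling.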



Published In

ACM Transactions on Mathematical Software  Volume 48, Issue 3
September 2022
357 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3551652

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 September 2022
Online AM: 22 July 2022
Accepted: 09 May 2022
Revised: 05 May 2022
Received: 03 June 2020
Published in TOMS Volume 48, Issue 3

Author Tags

  1. Categorical distribution
  2. multinoulli distribution
  3. parallel algorithm
  4. alias method
  5. PRAM
  6. communication efficient algorithm
  7. Poisson sampling
  8. reservoir sampling

Qualifiers

  • Research-article
  • Refereed
