DOI: 10.1145/3580305.3599458

Optimal Dynamic Subset Sampling: Theory and Applications

Published: 04 August 2023

Abstract

We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of $n$ distinct events $S = \{x_1, \ldots, x_n\}$, in which each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. However, an ideal solution is a data structure that allows drawing a subset sample in time proportional to the expected output size $\mu = \sum_{i=1}^{n} p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade.
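To make the gap between O(n) and O(1 + μ) concrete, the following Python sketch shows the naive coin-flipping sampler next to a sampler for the special case in which every event has the same probability p. The second function uses the standard geometric-jump technique; it is an illustration only, not the ODSS algorithm of the paper, and all function names here are hypothetical.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Flip one biased coin per event: O(n) time regardless of output size."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def expected_size(probs):
    """Expected output size mu, the sum of the inclusion probabilities."""
    return sum(probs)


def uniform_subset_sample(n, p, rng):
    """Special case p(x_i) = p for all i (an assumption of this sketch).

    The number of skipped events before the next included one is geometric
    with parameter p, so we jump directly between included indices. The
    expected running time is O(1 + n * p) instead of O(n).
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log1p(-p)  # log(1 - p), accurate for small p
    out = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is defined
        i += 1 + int(math.log(u) / log_q)  # geometric number of skips
        if i >= n:
            return out
        out.append(i)
```

With n = 1,000,000 and p = 0.00001, the naive sampler always performs a million coin flips, while the jump-based sampler performs about μ = 10 iterations on average. Handling arbitrary, changing probabilities within the same bound is the harder problem the paper addresses.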
However, the majority of existing subset sampling methods are designed for a static setting, where the events in set $S$ and their associated probabilities remain unchanged over time. In a dynamic setting, these algorithms incur either large query time or large update time, even though time-evolving events with varying probabilities are ubiquitous in real life. Designing efficient dynamic subset sampling algorithms is therefore a pressing need, yet it remains an open problem.
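The trade-off described above can be seen in a minimal dynamic baseline, sketched here in Python. It stores probabilities in a dictionary, so an update costs O(1), but every query still scans all n events. The class and method names are hypothetical and are not taken from the paper; this is a strawman for comparison, not ODSS or any of the prior methods.

```python
import random


class NaiveDynamicSubsetSampler:
    """Hypothetical baseline: O(1) expected update time, O(n) query time."""

    def __init__(self, seed=None):
        self._p = {}  # event -> inclusion probability
        self._rng = random.Random(seed)

    def update(self, event, p):
        """Insert an event or change its probability; p == 0 deletes it."""
        if not 0.0 <= p <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        if p == 0.0:
            self._p.pop(event, None)
        else:
            self._p[event] = p

    def expected_size(self):
        """Expected output size mu of a query on the current set."""
        return sum(self._p.values())

    def query(self):
        """Draw one subset sample by flipping a coin for every event."""
        return {x for x, p in self._p.items() if self._rng.random() < p}
```

A static structure with fast queries sits at the other extreme: it answers queries quickly but may need to be rebuilt after each change, which makes updates expensive.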
In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: Influence Maximization. We empirically show that ODSS can improve the complexities of existing Influence Maximization algorithms on large real-world evolving social networks.

Supplementary Material

MP4 File (0443-2min-promo.mp4)
Presentation video (short version), including an overview of the subset sampling problem and the contributions of the paper "Optimal Dynamic Subset Sampling: Theory and Applications"
MP4 File (0443-20min-video.mp4)
Presentation Video (long version) for "Optimal Dynamic Subset Sampling: Theory and Applications"


Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN:9798400701030
DOI:10.1145/3580305
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic probabilities
  2. optimal time cost
  3. subset sampling

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • Beijing Outstanding Young Scientist Program
  • Alibaba Group through Alibaba Innovative Research Program
  • National Natural Science Foundation of China
  • the major key project of PCL
  • Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
  • Intelligent Social Governance Interdisciplinary Platform, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Public Policy and Decision-making Research Lab, Public Computing Cloud, Renmin University of China
  • Beijing Natural Science Foundation
  • Huawei-Renmin University joint program on Information Retrieval

Conference

KDD '23
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
