research-article

Efficient Sampling Approaches to Shapley Value Approximation

Published: 30 May 2023

Abstract

The Shapley value provides a unique way to fairly assess each player's contribution in a coalition and has found many applications. However, computing the Shapley value exactly is #P-hard because of its combinatorial nature. Many existing applications rely on Monte Carlo approximation, which requires a large number of samples and utility evaluations on many coalitions to reach a high-quality approximation, and is thus still far from efficient. Can we approximate the Shapley value efficiently by choosing samples more carefully? In this paper, we treat the sampling approach to Shapley value approximation as a stratified sampling problem. Our main technical contributions are a novel stratification design and two sample allocation methods, based on Neyman allocation and the empirical Bernstein bound, respectively. Experimental results on several real and synthetic data sets demonstrate the effectiveness and efficiency of our stratification design and sampling approaches.
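To make the idea of stratified sampling with Neyman allocation concrete, here is a minimal Python sketch. It is not the paper's stratification design, which the abstract does not describe. It assumes the classic stratification by coalition size, where each of the n strata has weight 1/n, so Neyman allocation reduces to drawing samples in proportion to each stratum's estimated standard deviation. The function names, the pilot-sample size, and the toy utility are illustrative assumptions.

```python
import random
import statistics


def marginal(utility, player, others, size, rng):
    """Marginal contribution of `player` to a random coalition of `size` other players."""
    coalition = frozenset(rng.sample(others, size))
    return utility(coalition | {player}) - utility(coalition)


def stratified_shapley(utility, n, player, budget, pilot=10, seed=0):
    """Estimate one player's Shapley value, stratifying by coalition size.

    Assumption: strata are coalition sizes 0..n-1 (a textbook choice),
    not the stratification proposed in the paper.
    """
    rng = random.Random(seed)
    others = [p for p in range(n) if p != player]

    # Pilot phase: a few samples per stratum to estimate its spread.
    samples = [
        [marginal(utility, player, others, k, rng) for _ in range(pilot)]
        for k in range(n)
    ]
    sigmas = [statistics.pstdev(s) for s in samples]

    # Neyman allocation: with equal stratum weights (1/n), the extra
    # samples for stratum k are proportional to its standard deviation.
    remaining = max(budget - pilot * n, 0)
    total = sum(sigmas)
    for k in range(n):
        share = sigmas[k] / total if total > 0 else 1.0 / n
        extra = int(remaining * share)
        samples[k].extend(
            marginal(utility, player, others, k, rng) for _ in range(extra)
        )

    # Stratified estimator: average of the per-stratum sample means.
    return sum(statistics.fmean(s) for s in samples) / n


if __name__ == "__main__":
    weights = [1.0, 2.0, 3.0, 4.0]

    def squared_sum(coalition):
        # Toy utility: square of the total weight in the coalition.
        return sum(weights[p] for p in coalition) ** 2

    for i in range(len(weights)):
        print(i, stratified_shapley(squared_sum, len(weights), i, budget=4000))
```

For this toy utility the exact Shapley value of player i is its weight times the total weight, so the printed estimates can be compared against 10, 20, 30, and 40. Strata whose marginal contributions never vary (here, sizes 0 and n-1) receive no samples beyond the pilot phase, which is the saving that variance-aware allocation aims for.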

Supplemental Material

MP4 File
Presentation video for SIGMOD 2023



    Published In

    Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1
    May 2023
    2807 pages
    EISSN: 2836-6573
    DOI: 10.1145/3603164
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Shapley value
    2. data market
    3. sampling

    Qualifiers

    • Research-article


    Cited By

    • P-Shapley: Shapley Values on Probabilistic Classifiers. Proceedings of the VLDB Endowment 17:7 (1737-1750), 2024. DOI: 10.14778/3654621.3654638. Online publication date: 30-May-2024
    • Optimizing Data Acquisition to Enhance Machine Learning Performance. Proceedings of the VLDB Endowment 17:6 (1310-1323), 2024. DOI: 10.14778/3648160.3648172. Online publication date: 3-May-2024
    • ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic Graphs. Proceedings of the VLDB Endowment 17:5 (1060-1072), 2024. DOI: 10.14778/3641204.3641215. Online publication date: 2-May-2024
    • Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality. Proceedings of the ACM on Management of Data 2:4 (1-31), 2024. DOI: 10.1145/3677134. Online publication date: 30-Sep-2024
    • CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning. Proceedings of the 15th Asia-Pacific Symposium on Internetware (269-278), 2024. DOI: 10.1145/3671016.3674809. Online publication date: 24-Jul-2024
    • SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement. Proceedings of the ACM on Management of Data 2:3 (1-25), 2024. DOI: 10.1145/3654977. Online publication date: 30-May-2024
    • Applications and Computation of the Shapley Value in Databases and Machine Learning. Companion of the 2024 International Conference on Management of Data (630-635), 2024. DOI: 10.1145/3626246.3654680. Online publication date: 9-Jun-2024
    • When Data Pricing Meets Non-Cooperative Game Theory. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5548-5559), 2024. DOI: 10.1109/ICDE60146.2024.00443. Online publication date: 13-May-2024
    • Parallelization of butterfly counting on hierarchical memory. The VLDB Journal — The International Journal on Very Large Data Bases 33:5 (1453-1484), 2024. DOI: 10.1007/s00778-024-00856-x. Online publication date: 7-Jun-2024
    • Lero: applying learning-to-rank in query optimizer. The VLDB Journal — The International Journal on Very Large Data Bases 33:5 (1307-1331), 2024. DOI: 10.1007/s00778-024-00850-3. Online publication date: 25-Apr-2024
