research-article

Efficient Sampling Approaches to Shapley Value Approximation

Authors:

Kui Ren

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 48, Pages 1 - 24

https://doi.org/10.1145/3588728

Published: 30 May 2023

Abstract

The Shapley value provides a unique way to fairly assess each player's contribution in a coalition and has enjoyed many applications. However, exact computation of the Shapley value is #P-hard because of its combinatorial nature. Many existing applications rely on Monte Carlo approximation, which requires a large number of samples, and utility evaluations on many coalitions, to reach a high-quality approximation, and is therefore still far from efficient. Can we approximate the Shapley value efficiently by choosing samples smartly? In this paper, we treat the sampling approach to Shapley value approximation as a stratified sampling problem. Our main technical contributions are a novel stratification design and two sample allocation methods, based on Neyman allocation and the empirical Bernstein bound, respectively. Experimental results on several real and synthetic data sets demonstrate the effectiveness and efficiency of our stratification design and sampling approaches.
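To make the baseline concrete, the following is a minimal sketch of the standard permutation-based Monte Carlo estimator that the abstract contrasts against, together with the textbook Neyman allocation rule (samples per stratum proportional to stratum size times stratum standard deviation). It is not the paper's stratification design or its allocation algorithms, which the abstract does not spell out. The function names, the `seed` parameter, and the three-player glove game used as the utility are all illustrative assumptions.

```python
import random
from itertools import permutations


def marginal_contributions(order, utility, totals):
    """Add each player's marginal contribution along one ordering to `totals`."""
    coalition = set()
    prev = utility(frozenset(coalition))
    for player in order:
        coalition.add(player)
        cur = utility(frozenset(coalition))
        totals[player] += cur - prev
        prev = cur


def exact_shapley(n, utility):
    """Exact Shapley values by enumerating all n! orderings (tiny n only)."""
    totals = [0.0] * n
    count = 0
    for order in permutations(range(n)):
        marginal_contributions(order, utility, totals)
        count += 1
    return [t / count for t in totals]


def monte_carlo_shapley(n, utility, num_permutations, seed=0):
    """Estimate Shapley values by averaging over random orderings."""
    rng = random.Random(seed)
    players = list(range(n))
    totals = [0.0] * n
    for _ in range(num_permutations):
        rng.shuffle(players)
        marginal_contributions(players, utility, totals)
    return [t / num_permutations for t in totals]


def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Textbook Neyman allocation: n_h proportional to N_h * sigma_h.

    Rounding means the result may not sum exactly to `budget`.
    """
    weights = [size * std for size, std in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # No variance information: fall back to an even split.
        return [budget // len(weights)] * len(weights)
    return [round(budget * w / total) for w in weights]


def glove_game(coalition):
    """Hypothetical toy utility: players 0 and 1 hold left gloves, player 2 a right glove."""
    left = len(coalition & {0, 1})
    right = len(coalition & {2})
    return min(left, right)


if __name__ == "__main__":
    print(exact_shapley(3, glove_game))
    print(monte_carlo_shapley(3, glove_game, 2000))
    print(neyman_allocation([100, 100], [1.0, 3.0], 40))
```

In this toy game the exact values are 1/6, 1/6, and 2/3. The permutation estimator needs one utility evaluation per player per sampled ordering, which is the cost a stratified design with a variance-aware allocation aims to reduce.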

Supplemental Material

MP4 File

Presentation video for SIGMOD 2023




Index Terms

Efficient Sampling Approaches to Shapley Value Approximation
1. Information systems
  1. Data management systems
    1. Information integration
      1. Mediators and data integration


Information & Contributors

Information

Published In


Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN:2836-6573

DOI:10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States


Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 721
Downloads (Last 12 months): 557
Downloads (Last 6 weeks): 53

Reflects downloads up to 03 Oct 2024

Citations

Cited By

Xia H, Li X, Pang J, Liu J, Ren K, Xiong L. (2024). P-Shapley: Shapley Values on Probabilistic Classifiers. Proceedings of the VLDB Endowment 17:7 (1737-1750). Online publication date: 30-May-2024. https://dl.acm.org/doi/10.14778/3654621.3654638

Wang T, Huang S, Bao Z, Culpepper J, Dedeoglu V, Arablouei R. (2024). Optimizing Data Acquisition to Enhance Machine Learning Performance. Proceedings of the VLDB Endowment 17:6 (1310-1323). Online publication date: 3-May-2024. https://dl.acm.org/doi/10.14778/3648160.3648172

Gao S, Li Y, Shen Y, Shao Y, Chen L. (2024). ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic Graphs. Proceedings of the VLDB Endowment 17:5 (1060-1072). Online publication date: 2-May-2024. https://dl.acm.org/doi/10.14778/3641204.3641215

Tang X, Zhang F, Zhang S, Liu Y, He B, He B, Du X, Du X. (2024). Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality. Proceedings of the ACM on Management of Data 2:4 (1-31). Online publication date: 30-Sep-2024. https://dl.acm.org/doi/10.1145/3677134

Mai G, He Z, Yu G, Chen Z, Chen P. (2024). CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement Learning. Proceedings of the 15th Asia-Pacific Symposium on Internetware (269-278). Online publication date: 24-Jul-2024. https://dl.acm.org/doi/10.1145/3671016.3674809

Gao S, Li Y, Zhang X, Shen Y, Shao Y, Chen L. (2024). SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement. Proceedings of the ACM on Management of Data 2:3 (1-25). Online publication date: 30-May-2024. https://dl.acm.org/doi/10.1145/3654977

Luo X, Pei J, Barcelo P, Sanchez-Pi N, Meliou A, Sudarshan S. (2024). Applications and Computation of the Shapley Value in Databases and Machine Learning. Companion of the 2024 International Conference on Management of Data (630-635). Online publication date: 9-Jun-2024. https://dl.acm.org/doi/10.1145/3626246.3654680

Bi Y, Wu Y, Liu J, Ren K, Xiong L. (2024). When Data Pricing Meets Non-Cooperative Game Theory. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5548-5559). Online publication date: 13-May-2024. https://doi.org/10.1109/ICDE60146.2024.00443

Wang Z, Lai L, Liu Y, Shui B, Tian C, Zhong S. (2024). Parallelization of butterfly counting on hierarchical memory. The VLDB Journal — The International Journal on Very Large Data Bases 33:5 (1453-1484). Online publication date: 7-Jun-2024. https://dl.acm.org/doi/10.1007/s00778-024-00856-x

Chen X, Zhu R, Ding B, Wang S, Zhou J. (2024). Lero: applying learning-to-rank in query optimizer. The VLDB Journal — The International Journal on Very Large Data Bases 33:5 (1307-1331). Online publication date: 25-Apr-2024. https://dl.acm.org/doi/10.1007/s00778-024-00850-3
