research-article

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia

Published: 20 June 2023

Abstract

Organizations that would mutually benefit from pooling their data are nonetheless wary of sharing. This is because sharing data is costly, in both time and effort, and, at the same time, the benefits of sharing are not clear. Without a clear cost-benefit analysis, participants default to not sharing. As a consequence, many opportunities to create valuable data-sharing consortia never materialize, and the value of data remains locked.
We introduce a new sharing model, market protocol, and algorithms to incentivize the creation of data-sharing markets. The combined contributions of this paper, which we call DSC, incentivize the creation of data-sharing markets that unleash the value of data for their participants. The sharing model introduces two incentives: one that guarantees that participating is better than not doing so, and another that compensates participants according to how valuable their data is. Because operating a consortium is costly, we are also concerned with ensuring that its operation is sustainable: we design a protocol that ensures that a valuable data-sharing consortium forms when it is sustainable.
We introduce algorithms to elicit the value of data from the participants, which is used first to cover the costs of operating the consortium and second to compensate for data contributions. For the latter, we challenge the use of the Shapley value to allocate revenue: we offer analytical and empirical evidence against it and introduce an alternative method that compensates participants better and leads to the formation of data-sharing consortia.
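For readers unfamiliar with the Shapley value mentioned above, the following is a minimal Python sketch of the standard definition: each participant's share is its marginal contribution averaged over all orderings of the participants. It illustrates only the baseline the abstract challenges, not the alternative method the paper introduces. The participants `A`, `B`, `C` and the revenue table in `toy_value` are hypothetical numbers chosen for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal revenue p adds when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical revenue table (assumption): A and B hold duplicate data,
    C's data is only valuable when combined with A's or B's."""
    table = {
        frozenset(): 0.0,
        frozenset({"A"}): 10.0,
        frozenset({"B"}): 10.0,
        frozenset({"C"}): 0.0,
        frozenset({"A", "B"}): 10.0,
        frozenset({"A", "C"}): 30.0,
        frozenset({"B", "C"}): 30.0,
        frozenset({"A", "B", "C"}): 30.0,
    }
    return table[frozenset(coalition)]


shares = shapley_values(["A", "B", "C"], toy_value)
for player, share in shares.items():
    print(player, round(share, 2))
```

In this toy game the shares sum to the total revenue of the full coalition, and the two participants with duplicate data receive equal shares. Exact enumeration costs n! orderings, so it is only practical for a handful of participants.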

      Published In

      Proceedings of the ACM on Management of Data  Volume 1, Issue 2
      PACMMOD
      June 2023
      2310 pages
      EISSN:2836-6573
      DOI:10.1145/3605748

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data markets
      2. data sharing
      3. incentives
      4. machine learning sharing

      Qualifiers

      • Research-article

      Funding Sources

      • NSF
