Linear Regression from Strategic Data Sources

Published: 13 May 2020

Abstract

Linear regression is a fundamental building block of statistical data analysis. It amounts to estimating the parameters of a linear model that maps input features to corresponding outputs. In the classical setting where the precision of each data point is fixed, the famous Aitken/Gauss-Markov theorem in statistics states that generalized least squares (GLS) is a so-called “Best Linear Unbiased Estimator” (BLUE). In modern data science, however, one often faces strategic data sources; namely, individuals who incur a cost for providing high-precision data. For instance, this is the case for personal data, whose revelation may affect an individual’s privacy—which can be modeled as a cost—or in applications such as recommender systems, where producing an accurate estimate entails effort.
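As a minimal illustration of the classical fixed-precision setting (a sketch, not code from the article), GLS with independent noise of known precision reduces to weighted least squares in which each data point is weighted by its precision. The function name `gls_estimate`, the variable names, and the toy data are assumptions made for this example.

```python
import numpy as np

def gls_estimate(X, y, precisions):
    """GLS for y_i = x_i^T beta + eps_i with independent noise.

    precisions[i] = 1 / Var(eps_i). Returns the estimate
    (X^T L X)^{-1} X^T L y and its covariance (X^T L X)^{-1},
    where L = diag(precisions).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam = np.asarray(precisions, dtype=float)
    XtL = X.T * lam              # scales column i of X^T by lam[i]
    info = XtL @ X               # information matrix X^T L X
    beta = np.linalg.solve(info, XtL @ y)
    cov = np.linalg.inv(info)
    return beta, cov

# Toy data (hypothetical): four points, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
rng = np.random.default_rng(0)
lam = np.array([1.0, 4.0, 0.5, 2.0])
y = X @ np.array([2.0, -1.0]) + rng.normal(size=4) / np.sqrt(lam)
beta_hat, cov_hat = gls_estimate(X, y, lam)
```

When all precisions are equal, this coincides with ordinary least squares.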
In this article, we study a setting in which features are public but individuals choose the precision of the outputs they reveal to an analyst. We assume that the analyst performs linear regression on this dataset, and individuals benefit from the outcome of this estimation. We model this scenario as a game where individuals minimize a cost composed of two components: (a) an (agent-specific) disclosure cost for providing high-precision data; and (b) a (global) estimation cost representing the inaccuracy in the linear model estimate. In this game, the linear model estimate is a public good that benefits all individuals. We establish that this game has a unique non-trivial Nash equilibrium. We study the efficiency of this equilibrium and we prove tight bounds on the price of stability for a large class of disclosure and estimation costs. Finally, we study the estimator accuracy achieved at equilibrium. We show that, in general, Aitken’s theorem does not hold under strategic data sources, though it does hold if individuals have identical disclosure costs (up to a multiplicative factor). When individuals have non-identical costs, we derive a bound on the improvement of the equilibrium estimation cost that can be achieved by deviating from GLS, under mild assumptions on the disclosure cost functions.
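The two-component cost described in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not fix the cost functions, so the quadratic disclosure cost, the trace of the GLS covariance as estimation cost, and all names (`estimation_cost`, `disclosure_cost`, `individual_cost`, `potential`) are assumptions chosen for the example. Under these assumptions, the sum of all disclosure costs plus the shared estimation cost acts as a potential function: a unilateral change of precision changes an individual's cost and the potential by the same amount.

```python
import numpy as np

def estimation_cost(X, lam):
    """Assumed global cost: trace of the GLS covariance (X^T diag(lam) X)^{-1}."""
    info = (X.T * lam) @ X
    return float(np.trace(np.linalg.inv(info)))

def disclosure_cost(lam_i, c_i):
    """Assumed agent-specific cost, increasing in the chosen precision."""
    return c_i * lam_i ** 2

def individual_cost(i, X, lam, c):
    """Cost of individual i: own disclosure cost plus the shared estimation cost."""
    return disclosure_cost(lam[i], c[i]) + estimation_cost(X, lam)

def potential(X, lam, c):
    """Candidate potential: all disclosure costs plus the shared estimation cost."""
    return float(np.sum(c * lam ** 2)) + estimation_cost(X, lam)

# Hypothetical instance: public features X, cost coefficients c, precisions lam.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
c = np.array([0.5, 1.0, 2.0, 1.5])
lam = np.ones(4)

# Individual 2 unilaterally lowers the precision of the output they reveal.
lam_dev = lam.copy()
lam_dev[2] = 0.4
delta_cost = individual_cost(2, X, lam_dev, c) - individual_cost(2, X, lam, c)
delta_potential = potential(X, lam_dev, c) - potential(X, lam, c)
```

In this sketch the deviation lowers individual 2's disclosure cost but raises the estimation cost borne by everyone, which is the public-good tension the abstract describes.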

Published In

ACM Transactions on Economics and Computation, Volume 8, Issue 2
May 2020
173 pages
ISSN: 2167-8375
EISSN: 2167-8383
DOI: 10.1145/3397966

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2020
Online AM: 07 May 2020
Accepted: 01 February 2020
Revised: 01 December 2019
Received: 01 July 2019
Published in TEAC Volume 8, Issue 2

Author Tags

  1. Aitken theorem
  2. Gauss-Markov theorem
  3. Linear regression
  4. potential game
  5. price of stability
  6. strategic data sources

Qualifiers

  • Research-article
  • Research
  • Refereed
