Lessons from the AdKDD’21 Privacy-Preserving ML Challenge

Authors: Eustache Diemert, Alexandre Gilotte, Basile Leparmentier, Hui Yang

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 2026 - 2035

https://doi.org/10.1145/3485447.3512076

Published: 25 April 2022

Abstract

Designing data sharing mechanisms that provide both performance and strong privacy guarantees is a hot topic for the online advertising industry. In particular, a prominent proposal discussed under the Improving Web Advertising Business Group at W3C only allows sharing advertising signals through aggregated, differentially private reports of past displays. To study this proposal extensively, an open Privacy-Preserving Machine Learning Challenge took place at AdKDD’21, a premier workshop on Advertising Science, with data provided by the advertising company Criteo. In this paper, we describe the challenge tasks and the structure of the available datasets, report the challenge results, and enable their full reproducibility. A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap. We also run additional experiments to observe the sensitivity of the winning methods to different parameters, such as the privacy budget or the quantity of available privileged side information. We conclude that the industry needs either alternative designs for private data sharing or a breakthrough in learning with aggregated data only to keep ad relevance at a reasonable level.
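As a rough illustration of what an aggregated, noisy report of past displays could look like, the sketch below counts displays and clicks per value of one feature and perturbs each count with Gaussian noise. The function name `noisy_aggregate_report`, the record layout, the choice of Gaussian noise and the `sigma` parameter are all assumptions made for this example; they are not taken from the challenge or from the W3C proposal, and the noise scale is not calibrated to any privacy budget, so the sketch carries no privacy guarantee by itself.

```python
import random
from collections import Counter


def noisy_aggregate_report(displays, feature, sigma, rng):
    """Return {feature_value: (noisy_display_count, noisy_click_count)}.

    Hypothetical helper: `displays` is a list of dicts that each hold the
    chosen feature and a 0/1 "click" label. `sigma` is an illustrative
    noise scale, not one derived from an (epsilon, delta) budget.
    """
    display_counts = Counter()
    click_counts = Counter()
    for row in displays:
        key = row[feature]
        display_counts[key] += 1
        click_counts[key] += row["click"]
    # Only the perturbed aggregates are returned, never the raw rows.
    return {
        key: (
            display_counts[key] + rng.gauss(0.0, sigma),
            click_counts[key] + rng.gauss(0.0, sigma),
        )
        for key in display_counts
    }


# Toy data, invented for this example.
displays = [
    {"campaign": "a", "click": 1},
    {"campaign": "a", "click": 0},
    {"campaign": "b", "click": 0},
]
report = noisy_aggregate_report(displays, "campaign", sigma=1.0, rng=random.Random(0))
```

A learner in this setting would only see `report`, not `displays`, which is what makes fitting a click model from such data harder than ordinary supervised learning.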

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN:9781450390965

DOI:10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
