Decision Tree Classification with Differential Privacy: A Survey

Published: 30 August 2019

Abstract

Data mining information about people is becoming increasingly important in the data-driven society of the 21st century. Unfortunately, real-world considerations sometimes conflict with the goals of data mining; in particular, the privacy of the people being data mined needs to be considered. This necessitates that the output of data mining algorithms be modified to preserve privacy while simultaneously not ruining the predictive power of the output model. Differential privacy is a strong, enforceable definition of privacy that can be used in data mining algorithms, guaranteeing that nothing will be learned about the people in the data that could not already be discovered without their participation. In this survey, we focus on one particular data mining algorithm—decision trees—and how differential privacy interacts with each of the components that constitute decision tree algorithms. We analyze both greedy and random decision trees, and the conflicts that arise when trying to balance privacy requirements with the accuracy of the model.
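For context, differential privacy guarantees that for any two datasets D and D′ differing in one record, and any set of outputs S, a randomized algorithm M satisfies Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. In differentially private decision trees this is commonly achieved by perturbing the counts the tree releases. The sketch below is a minimal, illustrative example of the Laplace mechanism applied to the class counts at a single leaf; the function name, parameters, and budget value are assumptions for illustration, not the specific algorithm of any particular work covered by the survey.

    import numpy as np

    def laplace_noisy_counts(class_counts, epsilon, rng=None):
        # Per-class counts at a leaf form a histogram with L1 sensitivity 1:
        # adding or removing one record changes a single count by at most 1.
        # Laplace noise with scale 1/epsilon therefore yields epsilon-DP
        # for the released counts.
        rng = rng or np.random.default_rng()
        counts = np.asarray(class_counts, dtype=float)
        noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
        return counts + noise

    # Example: a leaf with 30 records of class 0 and 5 of class 1,
    # released with a hypothetical per-leaf budget of epsilon = 0.5.
    noisy = laplace_noisy_counts([30, 5], epsilon=0.5)
    predicted_class = int(np.argmax(noisy))  # majority label from noisy counts

Greedy trees typically spend additional budget on choosing split attributes (for example, via the exponential mechanism), whereas random trees can reserve the entire budget for noisy counts such as these; this difference is one source of the privacy/accuracy trade-offs the survey examines.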



Published In

ACM Computing Surveys, Volume 52, Issue 4
July 2020
769 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3359984
Editor: Sartaj Sahni

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 August 2019
Accepted: 01 May 2019
Revised: 01 February 2019
Received: 01 November 2016
Published in CSUR Volume 52, Issue 4

Author Tags

  1. Differential privacy
  2. comparisons
  3. decision forest
  4. decision tree
  5. implementations

Qualifiers

  • Survey
  • Research
  • Refereed
