Abstract
This paper proposes a novel approach for privacy-preserving distributed model-based classifier training, as a step towards supporting customizable privacy modeling and protection. The approach consists of three major steps. First, each data site independently learns a weak concept model (i.e., a local classifier) for a given data pattern or concept from its own training samples. An adaptive EM algorithm is proposed to select the model structure and estimate the model parameters simultaneously. The second step trains a combined classifier by integrating the weak concept models shared by multiple data sites. To reduce data transmission costs and potential privacy breaches, only the weak concept models are sent to the central site, where synthetic samples are generated directly from these shared models. Both the shared weak concept models and the synthetic samples are then used to learn a reliable and complete global concept model. A computational approach is developed to automatically achieve a good trade-off among the privacy disclosure risk, the sharing benefit, and the data utility. The third step validates the combined classifier by distributing the global concept model to all data sites in the collaboration network while limiting potential privacy breaches. The approach has been validated through extensive experiments on four UCI machine learning data sets and two image data sets.
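The three steps can be illustrated with a minimal sketch. It is not the paper's implementation: a single Gaussian per class stands in for the weak concept models that the adaptive EM algorithm would produce, the trade-off computation among disclosure risk, sharing benefit, and data utility is omitted, and every function name and the toy two-class data are assumptions made for illustration. What it does show is the data flow: sites share only model parameters, the central site generates synthetic samples from them, and the resulting global model is validated back at each site.

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Summarize each class by (mean, covariance, count).
    Stand-in for a weak concept model; the paper uses adaptive EM instead."""
    model = {}
    for label in np.unique(y):
        Xc = X[y == label]
        # A small ridge keeps the covariance invertible for small samples.
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        model[int(label)] = (Xc.mean(axis=0), cov, len(Xc))
    return model

def synthesize(models, rng):
    """Central site: draw synthetic samples from the shared models only."""
    Xs, ys = [], []
    for model in models:
        for label, (mean, cov, n) in model.items():
            Xs.append(rng.multivariate_normal(mean, cov, size=n))
            ys.append(np.full(n, label))
    return np.vstack(Xs), np.concatenate(ys)

def log_density(X, mean, cov):
    """Log of the multivariate normal density at each row of X."""
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return -0.5 * (maha + logdet + X.shape[1] * np.log(2 * np.pi))

def predict(model, X):
    """Assign each row to the class with the highest prior-weighted density."""
    labels = sorted(model)
    scores = np.column_stack([
        log_density(X, model[l][0], model[l][1]) + np.log(model[l][2])
        for l in labels
    ])
    return np.array(labels)[scores.argmax(axis=1)]

rng = np.random.default_rng(0)

def make_site(n):
    """Hypothetical toy data: two well-separated 2-D classes per site."""
    X0 = rng.normal([0, 0], 1.0, size=(n, 2))
    X1 = rng.normal([4, 4], 1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n)

sites = [make_site(50) for _ in range(3)]

# Step 1: each site fits its local model on its own samples.
local_models = [fit_class_gaussians(X, y) for X, y in sites]

# Step 2: only the models reach the central site, which samples from them
# and fits a global model on the synthetic data.
X_syn, y_syn = synthesize(local_models, rng)
global_model = fit_class_gaussians(X_syn, y_syn)

# Step 3: the global model is sent back and validated on each site's data.
accuracies = [float((predict(global_model, X) == y).mean()) for X, y in sites]
```

In this sketch the raw samples never leave `sites`; the central site sees only means, covariances, and counts. Shared parameters can still leak information, which is the risk the paper's trade-off computation is meant to control.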
Additional information
This project is supported by the National Science Foundation under 0208539-IIS and 0601542-IIS, grants from the AO Foundation and CERIAS, the Shanghai Pujiang Program under 08PJ1404600, the National Natural Science Foundation of China under 60496325, and the National Hi-tech R&D Program of China under 2006AA010111.
Cite this article
Luo, H., Fan, J., Lin, X. et al. A distributed approach to enabling privacy-preserving model-based classifier training. Knowl Inf Syst 20, 157–185 (2009). https://doi.org/10.1007/s10115-008-0167-x
DOI: https://doi.org/10.1007/s10115-008-0167-x