Abstract
Collaborative data mining may expose sensitive patterns present in the data, which can be undesirable to the data owner. Sensitive Pattern Hiding (SPH) is a subfield of data mining that addresses this problem. However, most existing approaches for hiding sensitive patterns cause significant side-effects on non-sensitive patterns, which in turn reduces the utility of the sanitized dataset. Furthermore, most of them are sequential in nature, cannot cope with massive amounts of data, and often incur high execution times. To address these challenges of limited utility and poor scalability, two parallelized approaches, named PGVIR and PHCR, are proposed on the Spark parallel computing framework; they modify the data such that no sensitive pattern can be extracted while maintaining the utility of the sanitized dataset. Experiments performed on a benchmark dataset show that PGVIR scales better and PHCR causes fewer side-effects to the data compared to existing techniques.
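To make the setting concrete, the sketch below shows a minimal, single-machine item-deletion heuristic for hiding a sensitive itemset. It is a generic illustration of SPH, not the proposed PGVIR or PHCR algorithms; the victim-selection rule and the toy data are assumptions made purely for the example.

```python
# Minimal sketch of support-based sensitive pattern hiding: a generic
# item-deletion heuristic, NOT the authors' PGVIR or PHCR algorithms.
# The victim-selection rule and the toy data are illustrative assumptions.

def support(dataset, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def sanitize(dataset, sensitive_patterns, min_support):
    dataset = [set(t) for t in dataset]
    for pattern in map(set, sensitive_patterns):
        # Delete items from supporting transactions until the pattern's
        # support falls below the mining threshold, so it can no longer
        # be extracted as a frequent pattern.
        while support(dataset, pattern) >= min_support:
            victims = [t for t in dataset if pattern <= t]
            # Simple rule: remove the pattern item that occurs most often in
            # the whole dataset from the first supporting transaction. Real
            # schemes choose items and transactions so as to minimize the
            # side-effects on non-sensitive patterns.
            item = max(pattern, key=lambda i: sum(1 for t in dataset if i in t))
            victims[0].discard(item)
    return dataset

# Toy usage: hide the pattern {a, b} against a 50% minimum support threshold.
data = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
print(sanitize(data, [{"a", "b"}], min_support=0.5))
```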
References
Amiri A (2007) Dare to share: protecting sensitive knowledge with data sanitization. Decis Support Syst 43(1):181–191
Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios V (1999) Disclosure limitation of sensitive rules. In: Proceedings of the 1999 workshop on knowledge and data engineering exchange (KDEX'99). IEEE, pp 45–52
Dasseni E, Verykios VS, Elmagarmid AK, Bertino E (2001) Hiding association rules by using confidence and support. In: International workshop on information hiding. Springer, pp 369–383
Geurts K, Wets G, Brijs T, Vanhoof K (2003) Profiling of high-frequency accident locations by use of association rules. Transp Res Record: J Transp Res Board 1840:123–130
Gkoulalas-Divanis A, Verykios VS (2006) An integer programming approach for frequent itemset hiding. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM, pp 748–757
Lee G, Chang C-Y, Chen ALP (2004) Hiding sensitive patterns in association rules mining. In: Proceedings of the 28th annual international computer software and applications conference (COMPSAC 2004). IEEE, pp 424–429
Menon S, Sarkar S, Mukherjee S (2005) Maximizing accuracy of shared databases when concealing sensitive patterns. Inf Syst Res 16(3):256–270
Moustakides GV, Verykios VS (2008) A maxmin approach for hiding frequent itemsets. Data Knowl Eng 65(1):75–89
Oliveira SRM, Zaiane OR (2002) Privacy preserving frequent itemset mining. In: Proceedings of the IEEE international conference on privacy, security and data mining, vol 14. Australian Computer Society Inc., pp 43–54
Oliveira SRM, Zaiane OR (2003) Protecting sensitive knowledge by data sanitization. In: Third IEEE international conference on data mining (ICDM 2003). IEEE, pp 613–616
Sharma S, Toshniwal D (2018) MR-I MaxMin-scalable two-phase border-based knowledge hiding technique using MapReduce. Future Generation Computer Systems
Liu F, Shu X, Yao D, Butt AR (2015) Privacy-preserving scanning of big content for sensitive data exposure with MapReduce. In: Proceedings of the 5th ACM conference on data and application security and privacy. ACM, New York, pp 195–206
Sun X, Yu PS (2005) A border-based approach for hiding sensitive frequent itemsets. In: Fifth IEEE international conference on data mining (ICDM'05). IEEE, 8 pp
Shivani S, Toshniwal D (2017) Scalable two-phase co-occurring sensitive pattern hiding using MapReduce. J Big Data 4(1):4
Sharma S, Toshniwal D (2015) Parallelization of association rule mining: survey. In: 2015 International conference on computing, communication and security (ICCCS), Pamplemousses, pp 1–6
Zhang X, et al. (2014) A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans Parallel Distrib Syst 25(2):363–373
Zhang Y, Cao T, Li S, Tian X, Yuan L, Jia H, Vasilakos AV (2016) Parallel processing systems for big data: a survey. Proc IEEE 104(11):2114–2136
Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering (ICDE’05). IEEE, New York
Han Z, Zhang Y (2015) Spark: a big data processing platform based on memory computing. In: 2015 Seventh international symposium on parallel architectures, algorithms and programming (PAAP), Nanjing, pp 172–176
Hong TP, Lin CW, Yang KT, Wang SL (2011) A heuristic data-sanitization approach based on TF-IDF. In: International conference on industrial engineering and other applications of applied intelligent systems, pp 156–164
Cheng P, Roddick JF, Chu SC, Lin CW (2016) Privacy preservation through a greedy, distortion-based rule-hiding method. Appl Intell 44(2):295–306
Telikani A, Shahbahrami A, Tavoli R (2015) Data sanitization in association rule mining based on impact factor. J AI Data Min 3(2):131–140
https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/
Appendix: Comparison with existing parallel SPH techniques
Another set of experiments has been performed to compare PGVIR and PHCR against the parallel versions of the MaxFIA and SWA schemes proposed in [10, 11]. The first experiment varies the dataset size. Figure 12a plots the execution time of the sanitization process for varying data sizes, with the MST set to 20% and 50 sensitive patterns to be masked. It can be clearly observed that, owing to the two-way parallelization achieved in PGVIR and PHCR, i.e. data parallelization together with computation parallelization, the proposed schemes outperform the existing state of the art. Furthermore, the proposed schemes are implemented on the Spark platform, whereas parallel MaxFIA and SWA are implemented on Hadoop MapReduce, which also contributes to the clear difference: the Spark [26] platform is faster than Hadoop MapReduce for several reasons, such as in-memory computation and data frame creation, and the initialization time of Hadoop is much higher than that of Spark.
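As an illustration of why Spark's in-memory execution matters in this setting, the sketch below shows the kind of distributed support-counting step that a sanitization loop has to repeat. It is not the published PGVIR/PHCR code; the input path, the sensitive patterns and the MST value are placeholder assumptions. Because the parsed transactions are cached, later iterations of the hiding loop avoid re-reading and re-parsing the data, which a Hadoop MapReduce job would do on every pass.

```python
# Illustrative PySpark sketch of a distributed support-counting step, NOT the
# published PGVIR/PHCR code. The input path, the sensitive patterns and the
# 20% MST are placeholder values mirroring the experimental setup above.
from pyspark import SparkContext

sc = SparkContext(appName="sph-support-count")

# Each line of the (hypothetical) input file is one transaction, e.g. "a b c".
transactions = (sc.textFile("hdfs:///data/transactions.txt")
                .map(lambda line: frozenset(line.split()))
                .cache())            # kept in memory across repeated passes
total = transactions.count()

sensitive = [frozenset({"a", "b"}), frozenset({"c", "d"})]
min_support = 0.20                   # MST of 20%, as in the experiment above

# One distributed pass per pattern computes its current support.
supports = {p: transactions.filter(lambda t, p=p: p <= t).count() / total
            for p in sensitive}
patterns_to_hide = [p for p, s in supports.items() if s >= min_support]
print(patterns_to_hide)
```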
A second set of experiments has been performed to analyze the running time for varying minimum support threshold values. Figure 12b plots running time against MST. It can be observed that, for different minimum support threshold values, the execution time of PGVIR and PHCR is considerably lower than that of parallel MaxFIA and SWA. Therefore, it can be stated that the two-way parallelization and the use of the Spark platform make the proposed PGVIR and PHCR a better choice for preserving the privacy of sensitive data.
Cite this article
Sharma, U., Toshniwal, D. & Sharma, S. A sanitization approach for big data with improved data utility. Appl Intell 50, 2025–2039 (2020). https://doi.org/10.1007/s10489-020-01640-4