Statistical limitations of sensitive itemset hiding methods

Applied Intelligence

Abstract

Frequent itemset hiding has long been studied in privacy-preserving data mining. The goal is to alter a dataset so that it can be released without revealing particular sensitive aggregates (e.g., frequent itemsets or association rules). The typical approach removes items from transactions to push the support of each sensitive itemset below a threshold while minimizing the impact on other frequent itemsets. In this paper, we ask whether such hiding can be discovered: do hiding methods leave anomalies suggesting that a sensitive itemset likely existed in the dataset and has been hidden? We show that a suppressed sensitive itemset may behave like an outlier among its neighboring itemsets, indicating that the dataset has likely been altered. To detect this anomalous behavior, we use KL-divergence and \(\chi^2\)-divergence to measure the difference between the expected and actual probability distributions of itemsets. Experimental results on four datasets show that suppressed sensitive itemsets often stand out as the most significant outlier, irrespective of the victim item selection method. We propose two defensive approaches that counter this attack.
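As a rough illustration of the divergence measures named in the abstract, the following sketch compares an itemset's observed support with the support expected if its items were independent. The independence baseline, the two-outcome (present/absent) distribution, the toy transactions, and all function names are assumptions made for this example; the abstract does not specify how the paper models the expected distribution.

```python
import math

# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b"},
    {"c"},
    {"a", "b", "c"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_divergence(p, q):
    """Pearson chi-squared divergence: sum of (p - q)^2 / q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def divergence_scores(transactions, itemset):
    """Return (KL, chi^2) between observed and expected itemset presence.

    Assumption: the expected support is the product of the single-item
    supports (independence). The sketch also assumes 0 < expected < 1,
    otherwise the divisions above are undefined.
    """
    observed = support(transactions, itemset)
    expected = 1.0
    for item in itemset:
        expected *= support(transactions, [item])
    p = (observed, 1.0 - observed)   # actual: itemset present / absent
    q = (expected, 1.0 - expected)   # expected under independence
    return kl_divergence(p, q), chi2_divergence(p, q)

kl, chi2 = divergence_scores(transactions, ["a", "b"])
print(f"KL = {kl:.6f}, chi^2 = {chi2:.6f}")
```

Under this reading, an analyst would compute such scores for an itemset and its neighbors in the released dataset and look for an itemset whose score is unusually large relative to the rest; how neighbors are chosen and how outliers are ranked is not stated in the abstract.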

Data availability and access

The datasets used in the current study are available on GitHub at https://github.com/ShaliniJangra/Attack_Defense and in the SPMF online repository at https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.

Funding

This work is funded by the Science and Engineering Research Board (SERB), a statutory body under the Department of Science and Technology (DST), Government of India.

Author information

Contributions

  • Shalini Jangra: Conception and design of study, Acquisition of data, Coding and Implementation, Analysis and interpretation of Results, Writing - Original draft preparation, Review & Editing, Funding acquisition
  • Durga Toshniwal: Review & Editing, Supervision
  • Chris Clifton: Conceptualization, Methodology, Analysis and interpretation of Results, Review & Editing, Supervision

Corresponding author

Correspondence to Shalini Jangra.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Ethical and informed consent for data used

The work uses publicly available and synthetically generated datasets which do not have any identifiable information. No ethical approval was needed.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jangra, S., Toshniwal, D. & Clifton, C. Statistical limitations of sensitive itemset hiding methods. Appl Intell 53, 24275–24292 (2023). https://doi.org/10.1007/s10489-023-04781-4

  • DOI: https://doi.org/10.1007/s10489-023-04781-4

Keywords