Abstract
Preserving privacy in big data is a critical challenge for data mining and data analysis. Existing methods that anonymize big data streams with k-anonymity algorithms can cause high data loss, low data quality, and identity disclosure. In this paper, we propose a novel model for anonymizing big data streams using in-memory processing. The model uses the Spark framework to parallelize the anonymization process and a one-time clustering algorithm that avoids multiple iterations and allocates the data to optimal clusters. We evaluate the performance and effectiveness of the model on a real-world dataset and compare it with three popular k-anonymity algorithms: CRUE, Mean-Shift, and DBSCAN. Across the data sizes and k-values tested, the model has the lowest data loss and the highest data quality. The model is scalable, robust, adaptable, and flexible, and it provides better data for data mining and data analysis while protecting data privacy and preventing data disclosure.
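To make the idea of clustering-based k-anonymity concrete, the following is a minimal sketch and not the model proposed in this paper. It assumes purely numeric quasi-identifiers, uses a lexicographic sort as a crude stand-in for a real similarity measure, and forms groups of at least k records in one pass without iterative refinement. The function names (`one_pass_clusters`, `generalize`, `information_loss`) and the loss measure (average normalized range width) are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

# Assumption: each record holds only numeric quasi-identifiers.
Record = Tuple[float, ...]


def one_pass_clusters(records: List[Record], k: int) -> List[List[Record]]:
    """Form groups of at least k records in a single sorted pass."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if len(records) < k:
        raise ValueError("need at least k records")
    # Sorting is an illustrative locality heuristic, not the paper's method.
    ordered = sorted(records)
    clusters = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Fold an undersized tail into the previous group so every group has >= k.
    if len(clusters) > 1 and len(clusters[-1]) < k:
        tail = clusters.pop()
        clusters[-1].extend(tail)
    return clusters


def generalize(cluster: List[Record]) -> List[Tuple[Tuple[float, float], ...]]:
    """Replace every attribute value with the group's (min, max) range."""
    columns = list(zip(*cluster))
    ranges = tuple((min(col), max(col)) for col in columns)
    return [ranges for _ in cluster]


def information_loss(clusters: List[List[Record]], records: List[Record]) -> float:
    """Average normalized range width, weighted by group size (0 = no loss)."""
    spans = [max(col) - min(col) for col in zip(*records)]
    total = 0.0
    for cluster in clusters:
        widths = [
            (max(col) - min(col)) / span if span else 0.0
            for col, span in zip(zip(*cluster), spans)
        ]
        total += len(cluster) * sum(widths) / len(widths)
    return total / len(records)


if __name__ == "__main__":
    # Toy (age, hours-per-week) pairs; values are made up for illustration.
    data = [(25, 40), (27, 38), (52, 45), (50, 50), (31, 20)]
    groups = one_pass_clusters(data, k=2)
    for group in groups:
        print(generalize(group))
    print("information loss:", information_loss(groups, data))
```

In a Spark setting, a function like `one_pass_clusters` could plausibly be applied to each partition independently (for example through `RDD.mapPartitions`), which is one way the anonymization work might be parallelized. That mapping is an assumption of this sketch, not a description of the authors' implementation.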
Data Availability
The data generated and analyzed during the current study are available in the UCI Machine Learning Repository under the CC BY-SA 4.0 license and can be freely downloaded and used for research purposes: https://archive.ics.uci.edu/ml/datasets/adult.
Acknowledgements
The authors thank the anonymous reviewers and the editor for their useful comments and suggestions.
Funding
This research received no external funding.
Author information
Contributions
E.SH., T.B., and M.M.P. conceived and designed the study, collected and analyzed the data, and drafted the manuscript. A.M.R. critically revised the manuscript and gave final approval of the version to be published. All authors reviewed the manuscript.
Ethics declarations
Ethical Approval
This study did not involve human or animal subjects, and therefore did not require ethical approval.
Consent to Participate
Not applicable.
Consent to Publish
Not applicable.
Competing Interests
The authors declare that they have no competing interests.
About this article
Cite this article
Shamsinejad, E., Banirostam, T., Pedram, M.M. et al. Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering. J Sign Process Syst 96, 333–356 (2024). https://doi.org/10.1007/s11265-024-01920-z
DOI: https://doi.org/10.1007/s11265-024-01920-z