Efficient data persistence and data division for distributed computing in cloud data center networks

Wang, Xi; Hu, Xinzhi; Fan, Weibei; Wang, Ruchuan

doi:10.1007/s11227-023-05276-2

Published: 26 April 2023

Volume 79, pages 16300–16327 (2023)

The Journal of Supercomputing

Xi Wang^1,3,
Xinzhi Hu^2,
Weibei Fan^2,3 &
…
Ruchuan Wang^2

Abstract

Container-based Hadoop Distributed File System (HDFS) storage is widely used in cloud data center networks, but traditional HDFS has a single point of failure that can make the whole cluster unavailable. In this paper, we study the storage reliability of Docker container-based HDFS clusters in the presence of this single point of failure. First, we investigate a data volume-based persistence solution for Hadoop under the single-point-failure and single-backup strategy of the HDFS cluster. Second, we propose an HDFS-based replica placement algorithm for data storage that takes into account the performance of the host and container nodes. Third, we design the KADC-KNN data segmentation algorithm to store the persistent data of Docker containers effectively. Extensive experimental results show that the proposed approach ensures stable storage and fast migration of cluster data. Compared with the state-of-the-art algorithm, the proposed data volume persistence algorithm, DVPS, improves data reliability by 19.8%. The data partitioning algorithm KADC-KNN improves partitioning accuracy by 20.2% and has lower time overhead.

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgements

We thank the editors and the anonymous reviewers for their useful feedback that improved this paper.

Funding

This work is supported by the Natural Science Foundation of China (No. 62172291, 62102196, 62102195), the Natural Science Foundation of Jiangsu Province (No. BK20200753), the Jiangsu Postdoctoral Science Foundation Funded Project (No. 2021K096A), the Future Network Scientific Research Fund Project (FNSRFP-2021-YB-60), the Natural Science Fund for Colleges and Universities in Jiangsu Province (21KJB520026), the Fundamental Research Funds for the Central Universities JL (No. 93K172020K25, 93K172021K03), the Innovative Research Team Project of Suzhou Institute of Industrial Technology (2021KYTD003), and the Qing Lan Project of Jiangsu Province.

Author information

Authors and Affiliations

Suzhou Institute of Industrial Technology, Suzhou, 215006, People’s Republic of China
Xi Wang
College of Computer, Nanjing University of Posts and Telecommunications, Nanjing, 210003, People’s Republic of China
Xinzhi Hu, Weibei Fan & Ruchuan Wang
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, People’s Republic of China
Xi Wang & Weibei Fan

Contributions

XW and WF wrote the main manuscript text and XH and RW prepared experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Weibei Fan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Hu, X., Fan, W. et al. Efficient data persistence and data division for distributed computing in cloud data center networks. J Supercomput 79, 16300–16327 (2023). https://doi.org/10.1007/s11227-023-05276-2

Accepted: 06 April 2023
Published: 26 April 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s11227-023-05276-2
