Abstract
Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing works mainly focus on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously, making the scheduling of multiple jobs crucial for client resource utilization and job efficiency. In addition, existing works assume that clients remain available throughout an FL job, which is rarely the case in practice, as clients may become unavailable for various reasons. To address these challenges, this paper introduces a novel fault-tolerant multi-job scheduling strategy that optimizes job efficiency and resource utilization. At its core is a redundancy-based fault-tolerance mechanism that strategically selects clients for redundant model training, ensuring the robustness of FL jobs even when available clients are insufficient. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without prolonged waits for additional clients to become available. Extensive experiments demonstrate that the proposed method significantly outperforms baseline methods.
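To make the scheduling idea concrete, below is a minimal Python sketch, not the paper's actual algorithm, of how redundancy-based client selection can be combined with urgency-first scheduling. All names here (FLJob, pick_redundancy, schedule_round), the uniform client-dropout probability, and the deadline-based notion of urgency are illustrative assumptions.

```python
import math
import random

# A minimal sketch of redundancy-based, urgency-first multi-job FL scheduling.
# The redundancy rule and the urgency metric below are simplifying assumptions,
# not the paper's exact design.

class FLJob:
    def __init__(self, job_id, clients_needed, deadline):
        self.job_id = job_id
        self.clients_needed = clients_needed  # clients required per round
        self.deadline = deadline              # earlier deadline = more urgent

def pick_redundancy(clients_needed, dropout_prob):
    """Extra clients so the expected number of survivors still covers the job."""
    # E[survivors] = (n + r) * (1 - p) >= n  =>  r >= n * p / (1 - p)
    return math.ceil(clients_needed * dropout_prob / (1.0 - dropout_prob))

def schedule_round(jobs, available_clients, dropout_prob=0.2):
    """Assign clients to jobs for one round, most urgent job first."""
    assignment = {}
    pool = list(available_clients)
    random.shuffle(pool)
    for job in sorted(jobs, key=lambda j: j.deadline):  # urgent jobs first
        wanted = job.clients_needed + pick_redundancy(job.clients_needed, dropout_prob)
        if len(pool) < job.clients_needed:
            continue  # not even the bare minimum is free; job waits this round
        assignment[job.job_id] = [pool.pop() for _ in range(min(wanted, len(pool)))]
    return assignment

# Example: two jobs compete for 12 clients; job "B" has the earlier deadline,
# so it receives its (redundant) client set before job "A".
jobs = [FLJob("A", clients_needed=5, deadline=10),
        FLJob("B", clients_needed=4, deadline=3)]
print(schedule_round(jobs, list(range(12))))
```

In a real system, the dropout probability would typically be estimated per client from historical availability rather than fixed globally, but the expected-survivor reasoning stays the same.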
Data Availability
No datasets were generated or analysed during the current study.
Funding
This research is supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant No. 21H03424, the Japan Science and Technology Agency (JST) PRESTO Grant No. 23828673, and a Grant-in-Aid for JSPS Fellows, Grant No. 23KJ1786.
Author information
Contributions
All authors contributed to the idea and the technical design. The draft of the manuscript was written by Boqian Fu, and all authors commented on it. Fahao Chen prepared Figures 7 and 8. All authors read and approved the final manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Ethical Standard
Not applicable.
Consent to Publish
All authors agree with the content and all give explicit consent to submit.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection: Track on Machine Learning
Guest Editor: Jiannong Cao
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fu, B., Chen, F., Pan, S. et al. Efficient multi-job federated learning scheduling with fault tolerance. Peer-to-Peer Netw. Appl. 18, 71 (2025). https://doi.org/10.1007/s12083-024-01847-z