Abstract
Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing works mainly focus on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously, making the scheduling of multiple jobs crucial for client resource utilization and job efficiency. In addition, existing works assume that clients remain available throughout an FL job, which is rarely the case in practice, as clients may become unavailable for various reasons. To address these challenges, this paper introduces a novel fault-tolerant multi-job scheduling strategy that optimizes job efficiency and resource utilization. At its core is a redundancy-based fault-tolerance mechanism that strategically selects clients for redundant model training, ensuring the robustness of FL jobs even when available clients are insufficient. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without prolonged waits for additional clients to become available. Extensive experiments demonstrate that the proposed method significantly outperforms baseline methods.
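To make the scheduling idea concrete, below is a minimal Python sketch, not the paper's actual algorithm, of how redundancy-based client selection can be combined with urgency-first scheduling. All names here (FLJob, pick_redundancy, schedule_round), the uniform client-dropout probability, and the deadline-based notion of urgency are illustrative assumptions.

```python
import math
import random

# A minimal sketch of redundancy-based, urgency-first multi-job FL scheduling.
# The redundancy rule and the urgency metric below are simplifying assumptions,
# not the paper's exact design.

class FLJob:
    def __init__(self, job_id, clients_needed, deadline):
        self.job_id = job_id
        self.clients_needed = clients_needed  # clients required per round
        self.deadline = deadline              # earlier deadline = more urgent

def pick_redundancy(clients_needed, dropout_prob):
    """Extra clients so the expected number of survivors still covers the job."""
    # E[survivors] = (n + r) * (1 - p) >= n  =>  r >= n * p / (1 - p)
    return math.ceil(clients_needed * dropout_prob / (1.0 - dropout_prob))

def schedule_round(jobs, available_clients, dropout_prob=0.2):
    """Assign clients to jobs for one round, most urgent job first."""
    assignment = {}
    pool = list(available_clients)
    random.shuffle(pool)
    for job in sorted(jobs, key=lambda j: j.deadline):  # urgent jobs first
        wanted = job.clients_needed + pick_redundancy(job.clients_needed, dropout_prob)
        if len(pool) < job.clients_needed:
            continue  # not even the bare minimum is free; job waits this round
        assignment[job.job_id] = [pool.pop() for _ in range(min(wanted, len(pool)))]
    return assignment

# Example: two jobs compete for 12 clients; job "B" has the earlier deadline,
# so it receives its (redundant) client set before job "A".
jobs = [FLJob("A", clients_needed=5, deadline=10),
        FLJob("B", clients_needed=4, deadline=3)]
print(schedule_round(jobs, list(range(12))))
```

In a real system, the dropout probability would typically be estimated per client from historical availability rather than fixed globally, but the expected-survivor reasoning stays the same.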
Data Availability
No datasets were generated or analysed during the current study.
Funding
This research is supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant No. 21H03424, the Japan Science and Technology Agency (JST) PRESTO Grant No. 23828673, and a Grant-in-Aid for JSPS Fellows, Grant No. 23KJ1786.
Author information
Contributions
All authors contributed to the idea and the technical design. The draft of the manuscript was written by Boqian Fu, and all authors commented on it. Fahao Chen prepared Figures 7 and 8. All authors read and approved the final manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Ethical Standard
Not applicable.
Consent to Publish
All authors agree with the content and all give explicit consent to submit.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the Topical Collection: Track on Machine Learning
Guest Editor: Jiannong Cao
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fu, B., Chen, F., Pan, S. et al. Efficient multi-job federated learning scheduling with fault tolerance. Peer-to-Peer Netw. Appl. 18, 71 (2025). https://doi.org/10.1007/s12083-024-01847-z