Efficient multi-job federated learning scheduling with fault tolerance


Abstract

Federated Learning (FL) has emerged as a promising approach for learning from data distributed across edge devices. However, existing works mainly focus on single-job FL systems, whereas in practice multiple FL jobs are often submitted simultaneously. How these jobs are scheduled is crucial for both client resource utilization and job efficiency. In addition, existing works assume that clients remain available throughout an FL job, which rarely holds in reality, since clients can become unavailable for various reasons. To address these challenges, this paper introduces a novel fault-tolerant multi-job scheduling strategy that optimizes job efficiency and resource utilization. The core of our approach is a redundancy-based fault-tolerance mechanism that keeps FL jobs robust even when available clients are insufficient, by strategically selecting clients for redundant model training. Building on this mechanism, the scheduling algorithm prioritizes urgent FL jobs so that they can complete without prolonged waits for additional clients to become available. Extensive experiments demonstrate that the proposed method significantly outperforms baseline methods.
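The full mechanism is described in the paywalled article, but the abstract's two ideas, urgency-driven job prioritization and redundant client assignment, can be illustrated with a minimal Python sketch. Everything here (the FLJob class, schedule_round, the deadline-as-urgency proxy, and the fixed per-job redundancy count) is an illustrative assumption, not the paper's actual algorithm.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class FLJob:
    deadline: float                       # earlier deadline = more urgent
    name: str = field(compare=False)
    clients_needed: int = field(compare=False, default=3)

def schedule_round(jobs, available_clients, redundancy=1):
    """Assign clients to FL jobs for one training round.

    Jobs are served in order of urgency (earliest deadline first), and
    each job is granted `redundancy` extra clients so the round can
    still finish if some assigned clients drop out mid-training.
    """
    pool = list(available_clients)
    heap = list(jobs)
    heapq.heapify(heap)                   # min-heap keyed on deadline
    assignment = {}
    while heap and pool:
        job = heapq.heappop(heap)
        wanted = job.clients_needed + redundancy
        granted = min(wanted, len(pool))
        if granted < job.clients_needed:  # not even the bare minimum left
            break                         # defer this job and less urgent ones
        assignment[job.name] = [pool.pop() for _ in range(granted)]
    return assignment

# Ten clients, three jobs: the two most urgent jobs receive their full
# redundant quota; the least urgent job is deferred to a later round.
jobs = [FLJob(10.0, "jobA"), FLJob(5.0, "jobB"), FLJob(20.0, "jobC")]
print(schedule_round(jobs, [f"client{i}" for i in range(10)], redundancy=1))
```

In this toy version, redundancy trades extra client work for robustness: a job's round succeeds as long as at least clients_needed of its granted clients survive, which is the abstract's stated rationale for redundant model training.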




Data Availability

No datasets were generated or analysed during the current study.


Funding

This research is supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant No. 21H03424, the Japan Science and Technology Agency (JST) PRESTO Grant No. 23828673, and Grant-in-Aid for JSPS Fellows No. 23KJ1786.

Author information


Contributions

All authors contributed to the idea and the technical design. Boqian Fu wrote the draft of the manuscript, and all authors commented on it. Fahao Chen prepared Figures 7 and 8. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Peng Li.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Ethical standard

Not applicable.

Consent to Publish

All authors agree with the content and give explicit consent to submit.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection: 5 - Track on Machine Learning

Guest Editor: Jiannong Cao

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fu, B., Chen, F., Pan, S. et al. Efficient multi-job federated learning scheduling with fault tolerance. Peer-to-Peer Netw. Appl. 18, 71 (2025). https://doi.org/10.1007/s12083-024-01847-z

