Liao Y, Xu Y, Xu H, Wang L, Qian C and Qiao C. Decentralized Federated Learning With Adaptive Configuration for Heterogeneous Participants. IEEE Transactions on Mobile Computing. 10.1109/TMC.2023.3335403. 23:6. (7453-7469).
Sun Z, Cao H, Wang Y, Feng G, Chen S, Wang H and Chen W. AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. (86-100).
Mo Z, Xu H and Xu C. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (499-513).
Zhang S, Diao L, Wu C, Cao Z, Wang S and Lin W. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis. Proceedings of the Nineteenth European Conference on Computer Systems. (524-541).
Liu J, Liu J, Xu H, Liao Y, Wang Z and Ma Q. YOGA: Adaptive Layer-Wise Model Aggregation for Decentralized Federated Learning. IEEE/ACM Transactions on Networking. 10.1109/TNET.2023.3329005. 32:2. (1768-1780).
Zhu Z, Tian Y, Huang Y, Xu J and He S. R-FAST: Robust Fully-Asynchronous Stochastic Gradient Tracking Over General Topology. IEEE Transactions on Signal and Information Processing over Networks. 10.1109/TSIPN.2024.3444484. 10. (665-678).
Geng J, Cao J, Jia H, Zhu Z, Fang H, Gao C, Ji C, Jia G, Han G and Zhou X. Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems. IEEE Transactions on Intelligent Transportation Systems. 10.1109/TITS.2023.3286400. 25:1. (959-972).
Zeng F, Gan W, Wang Y and Yu P. (2023). Distributed Training of Large Language Models. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS60453.2023.00126. 979-8-3503-3071-7. (840-847).
Navaz A, Kassabi H, Serhani M and Barka E. (2023). Resource-Aware Federated Hybrid Profiling for Edge Node Selection in Federated Patient Similarity Network. Applied Sciences. 10.3390/app132413114. 13:24. (13114).
Kim T, Park C, Mukimbekov M, Hong H, Kim M, Jin Z, Kim C, Shin J and Jeon M. (2023). FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation. Proceedings of the VLDB Endowment. 17:4. (863-876). Online publication date: 1-Dec-2023.
Kim H, Yu H and Kim S. (2023). Dynamic Worker Classification Scheme for Addressing Straggler Problem in Distributed Deep Learning Environments. The Journal of Korean Institute of Information Technology. 10.14801/jkiit.2023.21.10.1. 21:10. (1-9). Online publication date: 31-Oct-2023.
Wu X, Liu C, Magnússon S and Johansson M. Delay-agnostic asynchronous coordinate update algorithm. Proceedings of the 40th International Conference on Machine Learning. (37582-37606).
Liao Y, Xu Y, Xu H, Wang L and Qian C. (2023). Adaptive Configuration for Heterogeneous Participants in Decentralized Federated Learning. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications. 10.1109/INFOCOM53939.2023.10228945. 979-8-3503-3414-2. (1-10).
Šajina R, Tanković N and Ipšić I. (2023). Peer-to-peer deep learning with non-IID data. Expert Systems with Applications: An International Journal. 214:C. Online publication date: 15-Mar-2023.
Gu D, Zhao Y, Zhong Y, Xiong Y, Han Z, Cheng P, Yang F, Huang G, Jin X and Liu X. ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (266-280).
Hu Q, Zhang M, Sun P, Wen Y and Zhang T. Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (457-472).
Kim H, Song C, Lee H and Yu H. (2023). Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters. 2023 IEEE International Conference on Consumer Electronics (ICCE). 10.1109/ICCE56470.2023.10043527. 978-1-6654-9130-3. (1-6).
Luo S, Fan P, Li K, Xing H, Luo L and Yu H. (2022). Fast Parameter Synchronization for Distributed Learning with Selective Multicast. ICC 2022 - IEEE International Conference on Communications. 10.1109/ICC45855.2022.9838266. 978-1-5386-8347-7. (4775-4780).
He J, Zhai J, Antunes T, Wang H, Luo F, Shi S and Li Q. FasterMoE. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. (120-134).
Wang Z, Sim J, Lim E and Zhao J. (2022). Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA53966.2022.00018. 978-1-6654-2027-3. (126-140).
Roy R, Patel T and Tiwari D. IceBreaker: warming serverless functions better with heterogeneity. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. (753-767).
Kim K, Lee H, Oh S and Seo E. Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud. IEEE Access. 10.1109/ACCESS.2022.3184692. 10. (68468-68481).
Krasanakis E, Papadopoulos S and Kompatsiaris I. p2pGNN: A Decentralized Graph Neural Network for Node Classification in Peer-to-Peer Networks. IEEE Access. 10.1109/ACCESS.2022.3159688. 10. (34755-34765).
Zeng Z, Liu C, Tang Z, Chang W and Li K. (2021). Training Acceleration for Deep Neural Networks: A Hybrid Parallelization Strategy. 2021 58th ACM/IEEE Design Automation Conference (DAC). 10.1109/DAC18074.2021.9586300. 978-1-6654-3274-0. (1165-1170).
Xu W, Pattnaik A, Yuan G, Wang Y, Zhang Y and Tang X. (2021). ScaleDNN: Data Movement Aware DNN Training on Multi-GPU. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 10.1109/ICCAD51958.2021.9643503. 978-1-6654-4507-8. (1-9).
Yuan K, Chen Y, Huang X, Zhang Y, Pan P, Xu Y and Yin W. (2021). DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 10.1109/ICCV48922.2021.00302. 978-1-6654-2812-5. (3009-3019).
Cao J, Zhu Z and Zhou X. (2021). SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters. 2021 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/Cluster48925.2021.00023. 978-1-7281-9666-4. (94-102).
He X, Liu J, Xie Z, Chen H, Chen G, Zhang W and Li D. Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators. Proceedings of the 35th ACM International Conference on Supercomputing. (227-241).
Zhang Z, Wu C and Li Z. (2021). Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 10.1109/INFOCOM42981.2021.9488678. 978-1-6654-0325-2. (1-10).
Zhou P, Lin Q, Loghin D, Ooi B, Wu Y and Yu H. (2021). Communication-efficient Decentralized Machine Learning over Heterogeneous Networks. 2021 IEEE 37th International Conference on Data Engineering (ICDE). 10.1109/ICDE51399.2021.00040. 978-1-7281-9184-3. (384-395).
Oguni H and Shudo K. (2021). Addressing the Heterogeneity of A Wide Area Network for DNNs. 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC). 10.1109/CCNC49032.2021.9369585. 978-1-7281-9794-4. (1-6).
Zhou P, Sun G, Yu H and Chang V. (2021). Network-Aware Distributed Machine Learning Over Wide Area Network. Modern Industrial IoT, Big Data and Supply Chain. 10.1007/978-981-33-6141-6_6. (55-62).
Yang D, Rang W and Cheng D. Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters. Proceedings of the 21st International Middleware Conference. (386-399).
Yi X, Zhang S, Luo Z, Long G, Diao L, Wu C, Zheng Z, Yang J and Lin W. Optimizing distributed training deployment in heterogeneous GPU clusters. Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies. (93-107).
Oh S, Kim K and Seo E. (2020). A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters. 2020 IEEE International Conference on Smart Cloud (SmartCloud). 10.1109/SmartCloud49737.2020.00039. 978-1-7281-6547-9. (165-168).
Narra K, Lin Z, Ananthanarayanan G, Avestimehr S and Annavaram M. (2020). Collage Inference: Using Coded Redundancy for Lowering Latency Variation in Distributed Image Classification Systems. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 10.1109/ICDCS47774.2020.00024. 978-1-7281-7002-2. (453-463).
Han R, Li S, Wang X, Liu C, Xin G and Chen L. Accelerating Gossip-based Deep Learning in Heterogeneous Edge Computing Platforms. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3046440. (1-1).