Liao Y, Xu Y, Xu H, Wang L, Qian C and Qiao C. Decentralized Federated Learning With Adaptive Configuration for Heterogeneous Participants. IEEE Transactions on Mobile Computing. 10.1109/TMC.2023.3335403. 23:6. (7453-7469).
Sun Z, Cao H, Wang Y, Feng G, Chen S, Wang H and Chen W. AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. (86-100).
Mo Z, Xu H and Xu C. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (499-513).
Zhang S, Diao L, Wu C, Cao Z, Wang S and Lin W. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis. Proceedings of the Nineteenth European Conference on Computer Systems. (524-541).
Liu J, Liu J, Xu H, Liao Y, Wang Z and Ma Q. YOGA: Adaptive Layer-Wise Model Aggregation for Decentralized Federated Learning. IEEE/ACM Transactions on Networking. 10.1109/TNET.2023.3329005. 32:2. (1768-1780).
Zhu Z, Tian Y, Huang Y, Xu J and He S. R-FAST: Robust Fully-Asynchronous Stochastic Gradient Tracking Over General Topology. IEEE Transactions on Signal and Information Processing over Networks. 10.1109/TSIPN.2024.3444484. 10. (665-678).
Geng J, Cao J, Jia H, Zhu Z, Fang H, Gao C, Ji C, Jia G, Han G and Zhou X. Heter-Train: A Distributed Training Framework Based on Semi-Asynchronous Parallel Mechanism for Heterogeneous Intelligent Transportation Systems. IEEE Transactions on Intelligent Transportation Systems. 10.1109/TITS.2023.3286400. 25:1. (959-972).
Zeng F, Gan W, Wang Y and Yu P. (2023). Distributed Training of Large Language Models. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS). 10.1109/ICPADS60453.2023.00126. 979-8-3503-3071-7. (840-847).
Navaz A, Kassabi H, Serhani M and Barka E. (2023). Resource-Aware Federated Hybrid Profiling for Edge Node Selection in Federated Patient Similarity Network. Applied Sciences. 10.3390/app132413114. 13:24. (13114).
Kim T, Park C, Mukimbekov M, Hong H, Kim M, Jin Z, Kim C, Shin J and Jeon M. (2023). FusionFlow: Accelerating Data Preprocessing for Machine Learning with CPU-GPU Cooperation. Proceedings of the VLDB Endowment. 17:4. (863-876). Online publication date: 1-Dec-2023.
Kim H, Yu H and Kim S. (2023). Dynamic Worker Classification Scheme for Addressing Straggler Problem in Distributed Deep Learning Environments. The Journal of Korean Institute of Information Technology. 10.14801/jkiit.2023.21.10.1. 21:10. (1-9). Online publication date: 31-Oct-2023.
Wu X, Liu C, Magnússon S and Johansson M. Delay-agnostic asynchronous coordinate update algorithm. Proceedings of the 40th International Conference on Machine Learning. (37582-37606).
Liao Y, Xu Y, Xu H, Wang L and Qian C. (2023). Adaptive Configuration for Heterogeneous Participants in Decentralized Federated Learning. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications. 10.1109/INFOCOM53939.2023.10228945. 979-8-3503-3414-2. (1-10).
Šajina R, Tanković N and Ipšić I. (2023). Peer-to-peer deep learning with non-IID data. Expert Systems with Applications: An International Journal. 214:C. Online publication date: 15-Mar-2023.
Gu D, Zhao Y, Zhong Y, Xiong Y, Han Z, Cheng P, Yang F, Huang G, Jin X and Liu X. ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (266-280).
Hu Q, Zhang M, Sun P, Wen Y and Zhang T. Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. (457-472).
Kim H, Song C, Lee H and Yu H. (2023). Addressing Straggler Problem Through Dynamic Partial All-Reduce for Distributed Deep Learning in Heterogeneous GPU Clusters. 2023 IEEE International Conference on Consumer Electronics (ICCE). 10.1109/ICCE56470.2023.10043527. 978-1-6654-9130-3. (1-6).
Luo S, Fan P, Li K, Xing H, Luo L and Yu H. (2022). Fast Parameter Synchronization for Distributed Learning with Selective Multicast. ICC 2022 - IEEE International Conference on Communications. 10.1109/ICC45855.2022.9838266. 978-1-5386-8347-7. (4775-4780).
He J, Zhai J, Antunes T, Wang H, Luo F, Shi S and Li Q. FasterMoE. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. (120-134).
Wang Z, Sim J, Lim E and Zhao J. (2022). Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 10.1109/HPCA53966.2022.00018. 978-1-6654-2027-3. (126-140).
Roy R, Patel T and Tiwari D. IceBreaker: warming serverless functions better with heterogeneity. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. (753-767).
Kim K, Lee H, Oh S and Seo E. Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud. IEEE Access. 10.1109/ACCESS.2022.3184692. 10. (68468-68481).
Krasanakis E, Papadopoulos S and Kompatsiaris I. p2pGNN: A Decentralized Graph Neural Network for Node Classification in Peer-to-Peer Networks. IEEE Access. 10.1109/ACCESS.2022.3159688. 10. (34755-34765).
Zeng Z, Liu C, Tang Z, Chang W and Li K. (2021). Training Acceleration for Deep Neural Networks: A Hybrid Parallelization Strategy. 2021 58th ACM/IEEE Design Automation Conference (DAC). 10.1109/DAC18074.2021.9586300. 978-1-6654-3274-0. (1165-1170).
Xu W, Pattnaik A, Yuan G, Wang Y, Zhang Y and Tang X. (2021). ScaleDNN: Data Movement Aware DNN Training on Multi-GPU. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 10.1109/ICCAD51958.2021.9643503. 978-1-6654-4507-8. (1-9).
Yuan K, Chen Y, Huang X, Zhang Y, Pan P, Xu Y and Yin W. (2021). DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 10.1109/ICCV48922.2021.00302. 978-1-6654-2812-5. (3009-3019).
Cao J, Zhu Z and Zhou X. (2021). SAP-SGD: Accelerating Distributed Parallel Training with High Communication Efficiency on Heterogeneous Clusters. 2021 IEEE International Conference on Cluster Computing (CLUSTER). 10.1109/Cluster48925.2021.00023. 978-1-7281-9666-4. (94-102).
He X, Liu J, Xie Z, Chen H, Chen G, Zhang W and Li D. Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators. Proceedings of the 35th ACM International Conference on Supercomputing. (227-241).
Zhang Z, Wu C and Li Z. (2021). Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training. IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 10.1109/INFOCOM42981.2021.9488678. 978-1-6654-0325-2. (1-10).
Zhou P, Lin Q, Loghin D, Ooi B, Wu Y and Yu H. (2021). Communication-efficient Decentralized Machine Learning over Heterogeneous Networks. 2021 IEEE 37th International Conference on Data Engineering (ICDE). 10.1109/ICDE51399.2021.00040. 978-1-7281-9184-3. (384-395).
Oguni H and Shudo K. (2021). Addressing the Heterogeneity of A Wide Area Network for DNNs. 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC). 10.1109/CCNC49032.2021.9369585. 978-1-7281-9794-4. (1-6).
Zhou P, Sun G, Yu H and Chang V. (2021). Network-Aware Distributed Machine Learning Over Wide Area Network. Modern Industrial IoT, Big Data and Supply Chain. 10.1007/978-981-33-6141-6_6. (55-62).
Yang D, Rang W and Cheng D. Mitigating Stragglers in the Decentralized Training on Heterogeneous Clusters. Proceedings of the 21st International Middleware Conference. (386-399).
Yi X, Zhang S, Luo Z, Long G, Diao L, Wu C, Zheng Z, Yang J and Lin W. Optimizing distributed training deployment in heterogeneous GPU clusters. Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies. (93-107).
Oh S, Kim K and Seo E. (2020). A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters. 2020 IEEE International Conference on Smart Cloud (SmartCloud). 10.1109/SmartCloud49737.2020.00039. 978-1-7281-6547-9. (165-168).
Narra K, Lin Z, Ananthanarayanan G, Avestimehr S and Annavaram M. (2020). Collage Inference: Using Coded Redundancy for Lowering Latency Variation in Distributed Image Classification Systems. 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 10.1109/ICDCS47774.2020.00024. 978-1-7281-7002-2. (453-463).
Han R, Li S, Wang X, Liu C, Xin G and Chen L. Accelerating Gossip-based Deep Learning in Heterogeneous Edge Computing Platforms. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2020.3046440. (1-1).