Robust Searching-Based Gradient Collaborative Management in Intelligent Transportation System

Published: 27 September 2023
    Abstract

    With the rapid development of big data and the Internet of Things (IoT), traffic data from an Intelligent Transportation System (ITS) is becoming increasingly accessible. Multimedia Cognitive Computing (MCC) is an efficient and practical approach to understanding and simulating traffic patterns from such data. Distributed Machine Learning (DML) has become the standard way to supply the computing resources and efficiency that MCC tasks need to handle massive data and complex models. DML speeds up computation with these resources but introduces communication overhead. Gradient collaborative management, or gradient aggregation, is therefore a critical task in DML for MCC. An efficient algorithm for managing the communication schedules of gradient aggregation in an ITS can improve the performance of MCC tasks. However, existing communication schedules typically rely on specific physical connection matrices, leaving them with low robustness when a malfunction occurs. In this article, we propose Robust Searching-based Gradient Collaborative Management (RSGCM), a practical ring-based algorithm that manages communication schedules across devices in an ITS and copes with malfunctions. RSGCM finds communication schedules for various kinds of connection matrices while keeping training time acceptable. Our experimental results show that RSGCM handles more varieties of connection matrices than existing state-of-the-art communication schedules. RSGCM also increases the robustness of an ITS, since it restores the system's functionality in an acceptable time when a device or connection breaks down.
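    As context for the ring-based gradient aggregation the abstract builds on, the classic ring all-reduce schedule can be sketched as a small simulation. This is an illustrative sketch only (the function name and structure are ours, not the paper's RSGCM algorithm): each of n workers splits its gradient into n chunks, a reduce-scatter phase passes chunks around the ring and accumulates them, and an all-gather phase circulates the fully reduced chunks so every worker ends up with the full sum.

    ```python
    def ring_allreduce(grads):
        """Simulate ring all-reduce over n workers.

        grads: list of n equal-length gradient vectors (one per worker),
               with length divisible by n.
        Returns n vectors, each equal to the elementwise sum of all inputs.
        """
        n = len(grads)
        chunk = len(grads[0]) // n
        bufs = [list(g) for g in grads]  # each worker's local buffer

        # Phase 1: reduce-scatter. At step s, worker i sends chunk
        # (i - s) mod n to worker (i + 1) mod n, which accumulates it.
        # After n-1 steps, worker i owns the fully reduced chunk (i+1) mod n.
        for s in range(n - 1):
            incoming = [None] * n
            for i in range(n):  # snapshot all sends before any write
                c = (i - s) % n
                incoming[(i + 1) % n] = (c, bufs[i][c * chunk:(c + 1) * chunk])
            for j in range(n):
                c, data = incoming[j]
                for k in range(chunk):
                    bufs[j][c * chunk + k] += data[k]

        # Phase 2: all-gather. At step s, worker i forwards the reduced
        # chunk (i + 1 - s) mod n; the receiver overwrites its copy.
        for s in range(n - 1):
            incoming = [None] * n
            for i in range(n):
                c = (i + 1 - s) % n
                incoming[(i + 1) % n] = (c, bufs[i][c * chunk:(c + 1) * chunk])
            for j in range(n):
                c, data = incoming[j]
                bufs[j][c * chunk:(c + 1) * chunk] = data

        return bufs
    ```

    Each of the two phases takes n-1 steps in which every worker sends and receives exactly one chunk, which is why ring all-reduce is bandwidth-optimal on a ring; schedules like the paper's must find such a ring inside whatever connection matrix the ITS hardware actually provides.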




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
    February 2024, 548 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3613570
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 September 2023
    Online AM: 21 July 2022
    Accepted: 13 July 2022
    Revised: 04 July 2022
    Received: 14 February 2022
    Published in TOMM Volume 20, Issue 2


    Author Tags

    1. All-reduce
    2. communication scheduling
    3. gradient aggregation
    4. robustness
    5. collaborative management

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Shanghai Key Laboratory of Scalable Computing and Systems
    • Innovative Research Foundation of Ship General Performance
    • SJTU Library-Jiangsu Jiatu Future Library Smart Service Joint R&D Center
    • Key Laboratory of PK System Technologies Research of Hainan



    Cited By

    • (2024) cFedDT: Cross-Domain Federated Learning in Digital Twins for Metaverse Consumer Electronic Products. IEEE Transactions on Consumer Electronics 70, 1 (Feb. 2024), 3167–3182. DOI: 10.1109/TCE.2023.3327010
    • (2024) pFedEff: An Efficient and Personalized Federated Cognitive Learning Framework in Multiagent Systems. IEEE Transactions on Cognitive and Developmental Systems 16, 1 (Feb. 2024), 31–45. DOI: 10.1109/TCDS.2023.3288985
    • (2023) A Federated Network Intrusion Detection System with Multi-Branch Network and Vertical Blocking Aggregation. Electronics 12, 19 (27 Sep. 2023), 4049. DOI: 10.3390/electronics12194049
    • (2023) A Combined Multi-Classification Network Intrusion Detection System Based on Feature Selection and Neural Network Improvement. Applied Sciences 13, 14 (18 Jul. 2023), 8307. DOI: 10.3390/app13148307
    • (2023) GUARDIAN: A Hardware-Assisted Distributed Framework to Enhance Deep Learning Security. IEEE Transactions on Computational Social Systems 10, 6 (Dec. 2023), 3012–3020. DOI: 10.1109/TCSS.2023.3262289
    • (2023) Automatic Pipeline Parallelism: A Parallel Inference Framework for Deep Learning Applications in 6G Mobile Communication Systems. IEEE Journal on Selected Areas in Communications 41, 7 (1 Jul. 2023), 2041–2056. DOI: 10.1109/JSAC.2023.3280970
