research-article

ABSS: An Adaptive Batch-Stream Scheduling Module for Dynamic Task Parallelism on Chiplet-based Multi-Chip Systems

Published: 11 March 2024

Abstract

Thanks to the recognition and promotion of chiplet-based High-Performance Computing (HPC) system design technology by semiconductor industry and market leaders, chiplet-based multi-chip systems have gradually become mainstream. Unfortunately, programming such systems for efficient computing is a challenge, especially when considering dynamic task parallelism. This paper presents an Adaptive Batch-Stream Scheduling (ABSS) module for dynamic task parallelism on chiplet-based multi-chip systems. To this end, we propose an adaptive batch-stream scheduling method that uses a Graph Convolutional Network (GCN) classifier to select the appropriate scheduling scheme. We further design a chiplet-based core-cluster binding mechanism, which establishes the affinity between threads and core-clusters on the CPU-compute die. Moreover, to achieve dynamic workload balance, we propose a chiplet-based nearest task stealing method. We implement our ABSS module on the HiSilicon Kunpeng-920 chiplet-based multi-chip system. Experiments show that it outperforms state-of-the-art parallelism solutions, such as Intel Threading Building Blocks.
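
The abstract does not spell out how the nearest task stealing method picks a victim, so the following is only a minimal sketch of the general idea of nearest-first stealing, not the ABSS implementation. It assumes a hypothetical topology of 2 dies, each with 2 core-clusters of 2 cores, and illustrative distance values; the names `locate`, `distance`, `victim_order`, and `steal` are invented for this example and do not come from the paper.

```python
from collections import deque

# Hypothetical layout (an assumption, not the Kunpeng-920 topology):
# 2 dies x 2 core-clusters x 2 cores = 8 cores, numbered 0..7.
CORES_PER_CLUSTER = 2
CLUSTERS_PER_DIE = 2


def locate(core):
    """Return (die, cluster) indices for a core id under the assumed layout."""
    cluster = core // CORES_PER_CLUSTER
    die = cluster // CLUSTERS_PER_DIE
    return die, cluster


def distance(a, b):
    """Illustrative cost: 0 = same cluster, 1 = same die, 2 = different die."""
    die_a, cluster_a = locate(a)
    die_b, cluster_b = locate(b)
    if cluster_a == cluster_b:
        return 0
    if die_a == die_b:
        return 1
    return 2


def victim_order(thief, n_cores):
    """All other cores, sorted nearest-first; ties are broken by core id."""
    others = [c for c in range(n_cores) if c != thief]
    return sorted(others, key=lambda c: (distance(thief, c), c))


def steal(thief, queues):
    """Take one task from the nearest non-empty queue, or return None."""
    for victim in victim_order(thief, len(queues)):
        if queues[victim]:
            # Steal the oldest task, leaving the newest for the owner.
            return victim, queues[victim].popleft()
    return None


if __name__ == "__main__":
    queues = [deque() for _ in range(8)]
    queues[3].append("task-a")  # on the other die, as seen from core 5
    queues[6].append("task-b")  # same die as core 5, neighbouring cluster
    print(victim_order(5, 8))   # [4, 6, 7, 0, 1, 2, 3]
    print(steal(5, queues))     # (6, 'task-b')
```

In this sketch an idle core probes its own cluster first, then the rest of its die, and only then crosses to another die, which is one plausible way to keep stolen work close to the data it touches. A real runtime would use concurrent deques and the measured topology of the machine instead of the fixed constants assumed here.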

Cited By

  • (2024) MAHR: A Multi-Application Hybrid Reconfigurable Mechanism for Energy-Efficient Chiplet Interconnection Network. Journal of Circuits, Systems and Computers. DOI: 10.1142/S0218126625500379. Online publication date: 28-Sep-2024
  • (2024) COALA: A Compiler-Assisted Adaptive Library Routines Allocation Framework for Heterogeneous Systems. IEEE Transactions on Computers 73:7 (1724-1737). DOI: 10.1109/TC.2024.3385269. Online publication date: Jul-2024
  • (2024) Enhancing security and scalability by AI/ML workload optimization in the cloud. Cluster Computing 27:10 (13455-13469). DOI: 10.1007/s10586-024-04641-x. Online publication date: 1-Dec-2024

      Published In

      ACM Transactions on Parallel Computing  Volume 11, Issue 1
      March 2024
      188 pages
      EISSN:2329-4957
      DOI:10.1145/3613487

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 March 2024
      Online AM: 29 January 2024
      Accepted: 23 January 2024
      Revised: 28 November 2023
      Received: 25 July 2023
      Published in TOPC Volume 11, Issue 1

      Author Tags

      1. Adaptive batch-stream scheduling
      2. chiplet-based core-cluster binding
      3. chiplet-based multi-chip system
      4. chiplet-based nearest task stealing
      5. task parallelism

      Qualifiers

      • Research-article

      Funding Sources

      • Key-Area R&D Program of Guangdong Province
      • Programs of National Natural Science Foundation of China
      • Major Projects of Xiangjiang Laboratory
      • Key R&D Program of Hunan Province
      • Shenzhen Science and Technology Program

      Article Metrics

      • Downloads (Last 12 months): 434
      • Downloads (Last 6 weeks): 71
      Reflects downloads up to 12 Nov 2024
