research-article

ABSS: An Adaptive Batch-Stream Scheduling Module for Dynamic Task Parallelism on Chiplet-based Multi-Chip Systems

Authors:

Kenli Li

ACM Transactions on Parallel Computing, Volume 11, Issue 1

Article No.: 6, Pages 1 - 24

https://doi.org/10.1145/3643597

Published: 11 March 2024

Abstract

Chiplet-based design for High-Performance Computing (HPC) systems has been recognized and promoted by leading semiconductor vendors, and chiplet-based multi-chip systems are gradually becoming mainstream. Programming such systems efficiently remains a challenge, however, especially for dynamic task parallelism. This paper presents an Adaptive Batch-Stream Scheduling (ABSS) module for dynamic task parallelism on chiplet-based multi-chip systems. We propose an adaptive batch-stream scheduling method that uses a Graph Convolutional Network (GCN) classifier to select the appropriate scheduling scheme. We further design a chiplet-based core-cluster binding mechanism, which establishes affinity between threads and core clusters on the CPU compute die. Moreover, to balance the workload dynamically, we propose a chiplet-based nearest task stealing method. We implement the ABSS module on the HiSilicon Kunpeng-920 chiplet-based multi-chip system. Experiments show that it outperforms state-of-the-art parallelism solutions such as Intel Threading Building Blocks.
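
The idea of nearest task stealing can be illustrated with a small sketch. It is not the ABSS implementation: the topology (2 dies, 2 core clusters per die, 4 cores per cluster), the distance levels, and the names `distance`, `victim_order`, and `steal` are all assumptions made for illustration. An idle worker probes the queues of other cores in order of topological distance, so that tasks are taken from the same core cluster first, then from the same die, and only then from another die.

```python
from collections import deque

# Hypothetical topology for illustration only (not the Kunpeng-920 layout).
CORES_PER_CLUSTER = 4
CLUSTERS_PER_DIE = 2
NUM_CORES = 16


def distance(a, b):
    """Assumed distance levels: 0 same core, 1 same cluster, 2 same die, 3 other die."""
    if a == b:
        return 0
    if a // CORES_PER_CLUSTER == b // CORES_PER_CLUSTER:
        return 1
    cores_per_die = CORES_PER_CLUSTER * CLUSTERS_PER_DIE
    if a // cores_per_die == b // cores_per_die:
        return 2
    return 3


def victim_order(thief):
    """Return all other cores, nearest first; ties are broken by core id."""
    others = [c for c in range(NUM_CORES) if c != thief]
    return sorted(others, key=lambda c: (distance(thief, c), c))


def steal(thief, queues):
    """Take the oldest task from the nearest non-empty queue, or return None."""
    for victim in victim_order(thief):
        if queues[victim]:
            return victim, queues[victim].popleft()
    return None


if __name__ == "__main__":
    queues = [deque() for _ in range(NUM_CORES)]
    queues[9].append("task on the other die")
    queues[5].append("task on the same die")
    print(steal(0, queues))  # core 5 is probed before core 9
```

In this sketch the thief takes from the head of the victim's queue, the usual convention when the owner works from the tail; a real runtime would also need a concurrent deque and a measured distance table rather than the fixed levels assumed here.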


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Theory of computation
  1. Design and analysis of algorithms
    1. Approximation algorithms analysis
      1. Scheduling algorithms

Published In

ACM Transactions on Parallel Computing Volume 11, Issue 1

March 2024

188 pages

EISSN: 2329-4957

DOI: 10.1145/3613487

Editor:
David A. Bader
New Jersey Institute of Technology, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 March 2024

Online AM: 29 January 2024

Accepted: 23 January 2024

Revised: 28 November 2023

Received: 25 July 2023

Funding Sources

Key-Area R&D Program of Guangdong Province
Programs of National Natural Science Foundation of China
Major Projects of Xiangjiang Laboratory
Key R&D Program of Hunan Province
Shenzhen Science and Technology Program
