research-article

Open access

Enabling SIMT Execution Model on Homogeneous Multi-Core System

Authors:

Kuan-Chung Chen,

Chung-Ho Chen

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 1

Article No.: 6, Pages 1 - 26

https://doi.org/10.1145/3177960

Published: 22 March 2018

Abstract

Single-instruction multiple-thread (SIMT) machines have emerged as primary computing devices in high-performance computing because the SIMT execution paradigm exploits data-level parallelism effectively. This article explores the potential of SIMT execution on homogeneous multi-core processors, which generally run in multiple-instruction multiple-data (MIMD) mode when utilizing the multi-core resources. We address three architectural issues in enabling the SIMT execution model on a multi-core processor: the multithreading execution model, kernel thread context placement, and thread divergence. For the SIMT execution model, we propose a fine-grained multithreading mechanism on an ARM-based multi-core system. Each processor core stores the kernel thread contexts in its L1 data cache to meet the per-cycle thread-switching requirement. For divergence-intensive kernels, an Inner Conditional Statement First (ICS-First) mechanism enables early re-convergence and significantly improves performance. The experimental results show that effectiveness in data-parallel processing reduces dynamic instructions by 36% on average and lets SIMT execution achieve an average speedup of 1.52× and up to 5× over the MIMD counterpart on OpenCL benchmarks with single-issue in-order processor cores. With explicit vectorization of the kernels, the SIMT model gains further benefit from the SIMD extension and achieves a 1.71× speedup over the MIMD approach. The SIMT model using in-order superscalar processor cores outperforms the MIMD model using out-of-order superscalar processor cores by 40%. These results show that enabling the SIMT model on homogeneous multi-core processors is important for exploiting data-level parallelism.
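
To make the notion of thread divergence concrete, the following minimal Python sketch models one group of threads executing a single `if/else` in lockstep under an active mask, and compares the result with independent per-thread (MIMD-style) execution. It is a generic illustration only: the function names `run_mimd` and `run_simt`, the toy kernel, and the issue-slot counter are assumptions made for this example, and the sketch does not reproduce the ICS-First mechanism or the fine-grained multithreading hardware, whose details the abstract does not give.

```python
from typing import List, Tuple


def run_mimd(data: List[int]) -> List[int]:
    """Reference: each thread runs the toy kernel independently."""
    out = []
    for x in data:
        if x % 2 == 0:
            out.append(x * 10)
        else:
            out.append(x + 1)
    return out


def run_simt(data: List[int]) -> Tuple[List[int], int]:
    """Lockstep execution of one thread group with an active mask.

    Returns (results, issue_slots). issue_slots counts group-wide
    instruction issues in this toy model (hypothetical metric).
    """
    n = len(data)
    out = [0] * n
    issued = 0

    # All lanes evaluate the branch condition together.
    cond = [x % 2 == 0 for x in data]
    issued += 1

    # "Then" path: issued once if at least one lane takes it;
    # lanes whose mask bit is clear are idle.
    if any(cond):
        for lane in range(n):
            if cond[lane]:
                out[lane] = data[lane] * 10
        issued += 1

    # "Else" path: issued once if at least one lane takes it.
    if not all(cond):
        for lane in range(n):
            if not cond[lane]:
                out[lane] = data[lane] + 1
        issued += 1

    # Re-convergence point: all lanes are active again from here on.
    return out, issued
```

In this model a divergent group such as `run_simt([1, 2, 3, 4])` needs three issue slots (condition, then-path, else-path), while a uniform group such as `run_simt([2, 4, 6, 8])` needs only two. That difference is why reaching re-convergence early matters for divergence-intensive kernels.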

Index Terms

Enabling SIMT Execution Model on Homogeneous Multi-Core System
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
      2. Single instruction, multiple data

Published In

ACM Transactions on Architecture and Code Optimization Volume 15, Issue 1

March 2018

401 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3199680

Editor:
Koen De Bosschere
Ghent University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 March 2018

Accepted: 01 December 2017

Revised: 01 October 2017

Received: 01 June 2017

Published in TACO Volume 15, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Ministry of Science and Technology, Taiwan

Bibliometrics

Total Citations: 3
Total Downloads: 564
Downloads (Last 12 months): 127
Downloads (Last 6 weeks): 25

Reflects downloads up to 10 Aug 2024
Cited By

Feng Z, Yang L, Zhang Y, Guo Q (2024). Hardware accelerator based on SIMT programmable rasterization. 2024 5th International Conference on Computer Engineering and Application (ICCEA), 306-310. Online publication date: 12-Apr-2024.
https://doi.org/10.1109/ICCEA62105.2024.10603666
Han R, Lee J, Sim J, Kim H (2022). COX: Exposing CUDA Warp-level Functions to CPUs. ACM Transactions on Architecture and Code Optimization 19:4, 1-25. Online publication date: 16-Sep-2022.
https://dl.acm.org/doi/10.1145/3554736
Montagna F, Tagliavini G, Rossi D, Garofalo A, Benini L (2021). Streamlining the OpenMP Programming Model on Ultra-Low-Power Multi-core MCUs. Architecture of Computing Systems, 167-182. Online publication date: 7-Jun-2021.
https://dl.acm.org/doi/10.1007/978-3-030-81682-7_11
Thuerck D (2020). Supporting Irregularity in Throughput-Oriented Computing by SIMT-SIMD Integration. 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 31-35. Online publication date: Nov-2020.
https://doi.org/10.1109/IA351965.2020.00010
