research-article
Open access

Enabling SIMT Execution Model on Homogeneous Multi-Core System

Published: 22 March 2018
  • Abstract

    The single-instruction multiple-thread (SIMT) machine has emerged as a primary computing device in high-performance computing, since the SIMT execution paradigm can exploit data-level parallelism effectively. This article explores the potential of SIMT execution on homogeneous multi-core processors, which generally run in multiple-instruction multiple-data (MIMD) mode when utilizing the multi-core resources. We address three architectural issues in enabling the SIMT execution model on a multi-core processor: the multithreading execution model, kernel thread context placement, and thread divergence. For the SIMT execution model, we propose a fine-grained multithreading mechanism on an ARM-based multi-core system. Each processor core stores the kernel thread contexts in its L1 data cache to meet the per-cycle thread-switching requirement. For divergence-intensive kernels, an Inner Conditional Statement First (ICS-First) mechanism helps re-convergence occur early and significantly improves performance. The experimental results show that effective data-parallel processing reduces dynamic instructions by 36% on average and boosts SIMT execution to average speedups of 1.52×, and up to 5×, over the MIMD counterpart for OpenCL benchmarks on single-issue in-order processor cores. With explicit vectorization of the kernels, the SIMT model gains further benefit from the SIMD extension and achieves a 1.71× speedup over the MIMD approach. The SIMT model using in-order superscalar processor cores outperforms the MIMD model using out-of-order superscalar processor cores by 40%. The results show that, to exploit data-level parallelism, enabling the SIMT model on homogeneous multi-core processors is important.
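    The thread divergence issue named in the abstract can be illustrated with a minimal sketch. It is a generic picture of mask-based lockstep execution, not the paper's ICS-First mechanism or its hardware design. The function name `simt_if_else` and the toy kernel (double even inputs, increment odd ones) are hypothetical and chosen only for illustration.

```python
def simt_if_else(values):
    """Run a toy kernel (x * 2 if x is even, else x + 1) for a group of
    threads in lockstep and count the instructions issued for the group.

    Hypothetical illustration only: a generic active-mask model, not the
    mechanism proposed in the article.
    """
    n = len(values)
    results = [None] * n
    issued = 0  # instructions issued once for the whole thread group

    # The compare is issued once and evaluated by every thread.
    taken = [v % 2 == 0 for v in values]
    issued += 1

    # "Then" path: issued once if at least one thread takes it;
    # threads whose mask bit is off do nothing.
    if any(taken):
        for i in range(n):
            if taken[i]:
                results[i] = values[i] * 2
        issued += 1

    # "Else" path: issued once for the complementary mask.
    if not all(taken):
        for i in range(n):
            if not taken[i]:
                results[i] = values[i] + 1
        issued += 1

    # Re-convergence point: all threads are active again from here on.
    return results, issued


uniform, uniform_issued = simt_if_else([2, 4, 6, 8])      # no divergence
divergent, divergent_issued = simt_if_else([1, 2, 3, 4])  # both paths run
```

    In this toy model a group whose threads all agree on the branch issues two instructions, while a divergent group issues three because the two paths are serialized. This is why reaching the re-convergence point early matters for divergence-intensive kernels.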

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 1
    March 2018
    401 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3199680
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2018
    Accepted: 01 December 2017
    Revised: 01 October 2017
    Received: 01 June 2017
    Published in TACO Volume 15, Issue 1

    Author Tags

    1. Control divergence
    2. SIMD processors
    3. data-level parallelism
    4. openCL
    5. spatiotemporal SIMT

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology, Taiwan

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 127
    • Downloads (Last 6 weeks): 25
    Reflects downloads up to 10 Aug 2024

    Citations

    Cited By

    • (2024) Hardware accelerator based on SIMT programmable rasterization. 2024 5th International Conference on Computer Engineering and Application (ICCEA), pp. 306-310. DOI: 10.1109/ICCEA62105.2024.10603666. Online publication date: 12-Apr-2024
    • (2022) COX: Exposing CUDA Warp-level Functions to CPUs. ACM Transactions on Architecture and Code Optimization 19:4, pp. 1-25. DOI: 10.1145/3554736. Online publication date: 16-Sep-2022
    • (2021) Streamlining the OpenMP Programming Model on Ultra-Low-Power Multi-core MCUs. Architecture of Computing Systems, pp. 167-182. DOI: 10.1007/978-3-030-81682-7_11. Online publication date: 7-Jun-2021
    • (2020) Supporting Irregularity in Throughput-Oriented Computing by SIMT-SIMD Integration. 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pp. 31-35. DOI: 10.1109/IA351965.2020.00010. Online publication date: Nov-2020
