Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

Authors:

Rimas Avizienis,

Derek Lockhart,

Christopher Batten,

Krste Asanović

ACM Transactions on Computer Systems (TOCS), Volume 31, Issue 3

Article No.: 6, Pages 1 - 38

https://doi.org/10.1145/2491464

Published: 01 August 2013

Abstract

We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We introduce Maven, a new VT microarchitecture based on the traditional vector-SIMD microarchitecture, that is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the varying tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of compiled microbenchmarks and application kernels. We find the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.

Index Terms

Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

ACM Transactions on Computer Systems Volume 31, Issue 3

August 2013

94 pages

ISSN:0734-2071

EISSN:1557-7333

DOI:10.1145/2518037

Copyright © 2013 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2013

Accepted: 01 March 2013

Revised: 01 March 2013

Received: 01 March 2013

Published in TOCS Volume 31, Issue 3

Qualifiers

Research-article
Research
Refereed
