research-article

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Authors:

Ethan Schuchman,

Jamison D. Collins,

Gautham Chinya,

Ankur Khandelwal Groen,

Hong Wang

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 52 - 61

https://doi.org/10.1145/1454115.1454125

Published: 25 October 2008

Abstract

Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically "fuses" existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated to 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrates speedups of up to 8.8x.

Cited By

Lee H, Sanchez D (2024). Terminus: A Programmable Accelerator for Read and Update Operations on Sparse Data Structures. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1233-1246. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00092
Benoit N, Louise S (2023). Runtime support for automatic placement of workloads on heterogeneous processors. 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pages 210-217. Online publication date: 18-Dec-2023.
https://doi.org/10.1109/MCSoC60832.2023.00039
Khairy M, Wassal A, Zahran M (2019). A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. Online publication date: Jan-2019.
https://doi.org/10.1016/j.jpdc.2018.11.012

Index Terms

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN: 9781605582825

DOI: 10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '08

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 36
Total Downloads: 792
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Castillo E, Alvarez L, Moreto M, Casas M, Vallejo E, Bosque J, Beivide R, Valero M (2018). Architectural Support for Task Dependence Management with Flexible Software Scheduling. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 283-295. Online publication date: Feb-2018.
https://doi.org/10.1109/HPCA.2018.00033
Zhu Q, Wu B, Shen X, Shen K, Shen L, Wang Z (2018). Resolving the GPU responsiveness dilemma through program transformations. Frontiers of Computer Science: Selected Publications from Chinese Universities, 12(3), pages 545-559. Online publication date: 1-Jun-2018.
https://dl.acm.org/doi/10.1007/s11704-016-6206-y
Zhang G, Chiu V, Sanchez D, Hsu W, Yang C, Lipasti M, Lee H (2016). Exploiting semantic commutativity in hardware speculation. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-12. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195679
Zhang G, Chiu V, Sanchez D (2016). Exploiting semantic commutativity in hardware speculation. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-12. Online publication date: Oct-2016.
https://doi.org/10.1109/MICRO.2016.7783737
Kornaros G, Pratikakis M (2016). VWQS: A dispatching mechanism of variable-size tasks in heterogeneous systems. 2016 International Conference on High Performance Computing & Simulation (HPCS), pages 196-203. Online publication date: Jul-2016.
https://doi.org/10.1109/HPCSim.2016.7568335
Licheng Y, Yulong P, Tianzhou C, Xueqing L, Minghui W, Tiefei Z (2016). LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems. 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 260-267. Online publication date: Dec-2016.
https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0046
Zhang G, Horn W, Sanchez D, Prvulovic M (2015). Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. Proceedings of the 48th International Symposium on Microarchitecture, pages 13-25. Online publication date: 5-Dec-2015.
https://dl.acm.org/doi/10.1145/2830772.2830774
