DOI: 10.1145/1454115.1454125
research-article

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Published: 25 October 2008

Abstract

Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically "fuses" existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated to 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared-memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully functional synthesizable RTL based on the production-quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware occupies the area of 9 GPU cores and consumes the total power of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrates speedups of up to 8.8x.
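
To see why the spawn latency quoted in the abstract matters for fine-grain multithreading, the following is a minimal back-of-the-envelope sketch and not a model taken from the paper. It assumes a task that costs `work_cycles` on the CPU, runs `raw_speedup` times faster once it is executing on the GPU cores, and pays a fixed `spawn_latency` before it starts. The function names, the 8x raw speedup, and the figure of 3000 cycles standing in for "thousands of cycles" are all illustrative assumptions; only the 30-cycle bound comes from the abstract.

```python
def effective_speedup(work_cycles: float, raw_speedup: float,
                      spawn_latency: float) -> float:
    """Speedup of offloading one task once the spawn latency is paid.

    Simple assumed model: offloaded time = spawn latency + work / raw speedup.
    """
    offloaded_cycles = spawn_latency + work_cycles / raw_speedup
    return work_cycles / offloaded_cycles


def break_even_work(raw_speedup: float, spawn_latency: float) -> float:
    """Smallest task size (in CPU cycles) for which offloading is not a loss.

    Solves W = L + W / s for W, giving W = L * s / (s - 1).
    """
    if raw_speedup <= 1.0:
        raise ValueError("offloading never pays off when raw_speedup <= 1")
    return spawn_latency * raw_speedup / (raw_speedup - 1.0)


if __name__ == "__main__":
    RAW_SPEEDUP = 8.0  # hypothetical, not a measured value
    # 3000 is an assumed stand-in for "thousands of cycles";
    # 30 is the upper bound stated in the abstract.
    for latency in (3000.0, 30.0):
        print(f"spawn latency {latency:.0f} cycles:")
        print(f"  break-even task size: "
              f"{break_even_work(RAW_SPEEDUP, latency):.1f} cycles")
        for work in (1_000.0, 10_000.0, 100_000.0):
            s = effective_speedup(work, RAW_SPEEDUP, latency)
            print(f"  task of {work:.0f} cycles -> {s:.2f}x")
```

Under these assumptions, a task of about a thousand cycles loses time when the spawn costs thousands of cycles but retains most of the raw speedup when the spawn costs 30 cycles, which is the sense in which a low-latency spawn enables fine-grain collaboration.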



    Published In

    PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
    October 2008
    328 pages
    ISBN:9781605582825
    DOI:10.1145/1454115
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. chip multiprocessor
    2. heterogeneous
    3. ia32
    4. on-chip integration

    Qualifiers

    • Research-article

    Conference

    PACT '08

    Acceptance Rates

    Overall Acceptance Rate 121 of 471 submissions, 26%

    Cited By

    • (2024) Terminus: A Programmable Accelerator for Read and Update Operations on Sparse Data Structures. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1233-1246. DOI: 10.1109/MICRO61859.2024.00092. Online publication date: 2-Nov-2024
    • (2023) Runtime support for automatic placement of workloads on heterogeneous processors. 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pages 210-217. DOI: 10.1109/MCSoC60832.2023.00039. Online publication date: 18-Dec-2023
    • (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012. Online publication date: Jan-2019
    • (2018) Architectural Support for Task Dependence Management with Flexible Software Scheduling. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 283-295. DOI: 10.1109/HPCA.2018.00033. Online publication date: Feb-2018
    • (2018) Resolving the GPU responsiveness dilemma through program transformations. Frontiers of Computer Science: Selected Publications from Chinese Universities, 12(3):545-559. DOI: 10.1007/s11704-016-6206-y. Online publication date: 1-Jun-2018
    • (2016) Exploiting semantic commutativity in hardware speculation. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-12. DOI: 10.5555/3195638.3195679. Online publication date: 15-Oct-2016
    • (2016) Exploiting semantic commutativity in hardware speculation. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-12. DOI: 10.1109/MICRO.2016.7783737. Online publication date: Oct-2016
    • (2016) VWQS: A dispatching mechanism of variable-size tasks in heterogeneous systems. 2016 International Conference on High Performance Computing & Simulation (HPCS), pages 196-203. DOI: 10.1109/HPCSim.2016.7568335. Online publication date: Jul-2016
    • (2016) LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems. 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 260-267. DOI: 10.1109/HPCC-SmartCity-DSS.2016.0046. Online publication date: Dec-2016
    • (2015) Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. Proceedings of the 48th International Symposium on Microarchitecture, pages 13-25. DOI: 10.1145/2830772.2830774. Online publication date: 5-Dec-2015
