Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Published: 01 January 2006

Abstract

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.

Published In

IBM Systems Journal, Volume 45, Issue 1
January 2006
194 pages

Publisher

IBM Corp.

United States

