Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Authors:

A. E. Eichenberger,

J. C. Shepherd,

M. K. Gschwind,

R. Archambault,

R. Koo

IBM Systems Journal, Volume 45, Issue 1

Pages 59 - 84

https://doi.org/10.1147/sj.451.0059

Published: 01 January 2006

Abstract

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.

Published In

IBM Systems Journal Volume 45, Issue 1

January 2006

194 pages

ISSN:0018-8670

Publisher

IBM Corp.

United States
