Scientific Computing Kernels on the Cell Processor

Authors:

Samuel Williams,

Parry Husbands,

Katherine Yelick

International Journal of Parallel Programming, Volume 35, Issue 3

Pages 263 - 298

Published: 01 June 2007

Abstract

In this work, we examine the potential of using the recently-released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing results against published hardware data, as well as our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
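
To make one of the kernels named in the abstract concrete, here is a minimal reference sketch of sparse matrix vector multiply. It assumes compressed sparse row (CSR) storage, which the abstract does not itself specify; the function name `spmv_csr` and the small example matrix are hypothetical and purely illustrative. This is a plain scalar loop, not the authors' Cell implementation and not their performance model.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout).

    row_ptr[i]:row_ptr[i + 1] is the slice of col_idx/values holding row i.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is gathered through col_idx, so accesses to x are irregular.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

print(spmv_csr(row_ptr, col_idx, values, x))  # [5.0, 6.0, 19.0]
```

The inner loop does two floating-point operations per stored nonzero while reading a value, a column index, and an indirectly addressed element of `x`, which is why this kernel tends to stress memory access rather than arithmetic.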

Published In

International Journal of Parallel Programming, Volume 35, Issue 3

June 2007, 188 pages

ISSN: 0885-7458

Publisher: Kluwer Academic Publishers, United States
