Scientific Computing Kernels on the Cell Processor

Published: 01 June 2007

    Abstract

    In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing results against published hardware data, as well as our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
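
    As a point of reference for one of the kernels named above, here is a minimal, platform-neutral sketch of sparse matrix-vector multiply. It is not the authors' Cell implementation: the compressed sparse row (CSR) layout, the function name `spmv_csr`, and the small example matrix are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def spmv_csr(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    values: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form (illustrative only).

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i; x is accessed indirectly
        # through col_idx, which is what makes the kernel irregular.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example matrix:
#   [[4, 0, 1],
#    [0, 2, 0],
#    [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [7.0, 4.0, 18.0]
```

    Each stored nonzero contributes one multiply and one add but requires loading a value, a column index, and an indirectly addressed element of `x`, so the kernel tends to be limited by memory traffic rather than arithmetic.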

    Published In

    International Journal of Parallel Programming, Volume 35, Issue 3
    Jun 2007
    188 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. FFT
    2. GEMM
    3. SpMV
    4. cell processor
    5. sparse matrix
    6. stencil
    7. three level memory

    Qualifiers

    • Article

    Cited By

    • (2017) Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25:1 (100-113). DOI: 10.1109/TVLSI.2016.2580142. Online publication date: 1-Jan-2017
    • (2016) A data layout transformation (DLT) accelerator. Proceedings of the 2016 Conference on Design, Automation & Test in Europe (1489-1492). DOI: 10.5555/2971808.2972155. Online publication date: 14-Mar-2016
    • (2012) Accelerator-Based implementation of the harris algorithm. Proceedings of the 5th international conference on Image and Signal Processing (485-492). DOI: 10.1007/978-3-642-31254-0_55. Online publication date: 28-Jun-2012
    • (2011) A performance evaluation on monte carlo simulation for radiation dosimetry using cell processor. Journal of Computational Methods in Sciences and Engineering, 11:1,2 (1-12). DOI: 10.5555/2010385.2010391. Online publication date: 1-Apr-2011
    • (2011) Parallelization schemes for memory optimization on the cell processor. Transactions on high-performance embedded architectures and compilers III (177-200). DOI: 10.5555/1980776.1980789. Online publication date: 1-Jan-2011
    • (2011) Hardware/software co-design for energy-efficient seismic modeling. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/2063384.2063482. Online publication date: 12-Nov-2011
    • (2011) Mint. Proceedings of the international conference on Supercomputing (214-224). DOI: 10.1145/1995896.1995932. Online publication date: 31-May-2011
    • (2011) An efficient CELL library for lattice quantum chromodynamics. ACM SIGARCH Computer Architecture News, 38:4 (60-65). DOI: 10.1145/1926367.1926378. Online publication date: 14-Jan-2011
    • (2011) Parallelization Schemes for Memory Optimization on the Cell Processor. Proceedings of the 2011 conference on Transactions on High-Performance Embedded Architectures and Compilers III - Volume 6590 (177-200). DOI: 10.1007/978-3-642-19448-1_10. Online publication date: 1-Jan-2011
    • (2010) HiFlow3. Proceedings of the 9th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing (1-6). DOI: 10.1145/2039312.2039316. Online publication date: 17-Oct-2010
