Article

Sequoia: programming the memory hierarchy

Authors: Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, William J. Dally, Pat Hanrahan

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Pages 83 - es

https://doi.org/10.1145/1188455.1188543

Published: 11 November 2006

Abstract

We present Sequoia, a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines with different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed-memory clusters, and we demonstrate efficient performance running Sequoia programs on both platforms.
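
As a rough illustration of the idea in the abstract, the sketch below expresses a blocked matrix multiplication as a task that recursively hands sub-blocks down a machine description, so the same algorithm can be retargeted by changing only the description. It is written in Python, not Sequoia, and does not reproduce Sequoia syntax or its runtime; the names `Level` and `matmul_task`, the tile sizes, and the level names are all hypothetical. List slicing stands in for the explicit copies a real system would make between memory levels.

```python
from dataclasses import dataclass
from typing import List, Optional

Matrix = List[List[float]]


@dataclass
class Level:
    """One level of a hypothetical machine description (not Sequoia syntax)."""
    name: str
    tile: int                        # edge length of a square block that fits at this level
    child: Optional["Level"] = None  # next level closer to the compute units, if any


def matmul_task(a: Matrix, b: Matrix, c: Matrix, level: Level) -> None:
    """Accumulate a @ b into c, splitting work until the leaf level is reached."""
    n, k, m = len(a), len(b), len(b[0])
    if level.child is None:
        # Leaf variant: compute directly on data resident at this level.
        for i in range(n):
            for j in range(m):
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(k))
        return
    t = level.child.tile
    for i0 in range(0, n, t):
        for j0 in range(0, m, t):
            # Copy the output block down one level (slicing models the transfer).
            c_blk = [row[j0:j0 + t] for row in c[i0:i0 + t]]
            for p0 in range(0, k, t):
                a_blk = [row[p0:p0 + t] for row in a[i0:i0 + t]]
                b_blk = [row[j0:j0 + t] for row in b[p0:p0 + t]]
                matmul_task(a_blk, b_blk, c_blk, level.child)
            # Copy the finished block back up to the parent level.
            for di, row in enumerate(c_blk):
                c[i0 + di][j0:j0 + len(row)] = row


# Hypothetical three-level machine; only the child tile sizes drive the splitting.
machine = Level("node", 4, Level("cache", 2, Level("registers", 1)))
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [[0.0, 0.0], [0.0, 0.0]]
matmul_task(a, b, c, machine)
print(c)  # [[4.0, 5.0], [10.0, 11.0]]
```

Passing a different `Level` chain, for example a single leaf level, changes how the work is blocked without touching `matmul_task`, which is the kind of portability across hierarchy configurations the abstract describes.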

Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

November 2006

746 pages

ISBN:0769527000

DOI:10.1145/1188455

Conference Chair:
Barbara Horner-Miller
Arctic Region Supercomputing Center

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '06

SC '06: International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 17, 2006

Tampa, Florida

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 214
Total Downloads: 1,743
Downloads (Last 12 months): 89
Downloads (Last 6 weeks): 10

Reflects downloads up to 25 Jan 2025

Cited By

Köpcke B, Gorlatch S, Steuwer M (2024). Descend: A Safe GPU Systems Programming Language. Proceedings of the ACM on Programming Languages 8:PLDI, 841-864. Online publication date: 20-Jun-2024.
https://dl.acm.org/doi/10.1145/3656411
Rasch A, Schulze R, Shabalin D, Elster A, Gorlatch S, Hall M, Verbrugge C, Lhoták O, Shen X (2023). (De/Re)-Compositions Expressed Systematically via MDH-Based Schedules. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 61-72. Online publication date: 17-Feb-2023.
https://dl.acm.org/doi/10.1145/3578360.3580269
Ikarashi Y, Bernstein G, Reinking A, Genc H, Ragan-Kelley J, Jhala R, Dillig I (2022). Exocompilation for productive programming of hardware accelerators. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 703-718. Online publication date: 9-Jun-2022.
https://dl.acm.org/doi/10.1145/3519939.3523446
Yadav R, Aiken A, Kjolstad F, Jhala R, Dillig I (2022). DISTAL: the distributed tensor algebra compiler. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 286-300. Online publication date: 9-Jun-2022.
https://dl.acm.org/doi/10.1145/3519939.3523437
Liu A, Bernstein G, Chlipala A, Ragan-Kelley J (2022). Verified tensor-program optimization via high-level scheduling rewrites. Proceedings of the ACM on Programming Languages 6:POPL, 1-28. Online publication date: 12-Jan-2022.
https://dl.acm.org/doi/10.1145/3498717
Kandemir M, Tang X, Zhao H, Ryoo J, Karakoy M, Freund S, Yahav E (2021). Distance-in-time versus distance-in-space. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 665-680. Online publication date: 19-Jun-2021.
https://dl.acm.org/doi/10.1145/3453483.3454069
Tripathy D, Abdolrashidi A, Bhuyan L, Zhou L, Wong D (2021). PAVER. ACM Transactions on Architecture and Code Optimization 18:3, 1-26. Online publication date: 8-Jun-2021.
https://dl.acm.org/doi/10.1145/3451164
Bauer M, Lee W, Slaughter E, Jia Z, Di Renzo M, Papadakis M, Shipman G, McCormick P, Garland M, Aiken A, Lee J, Petrank E (2021). Scaling implicit parallelism via dynamic control replication. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 105-118. Online publication date: 17-Feb-2021.
https://dl.acm.org/doi/10.1145/3437801.3441587
Arora J, Westrick S, Acar U (2021). Provably space-efficient parallel functional programming. Proceedings of the ACM on Programming Languages 5:POPL, 1-33. Online publication date: 4-Jan-2021.
https://dl.acm.org/doi/10.1145/3434299
Tripathy D, Abdolrashidi A, Fan Q, Wong D, Satpathy M (2021). LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs. 2021 IEEE International Conference on Networking, Architecture and Storage (NAS), 1-8. Online publication date: Oct-2021.
https://doi.org/10.1109/NAS51552.2021.9605411
