Article

Temporal Streaming of Shared Memory

Authors:

Thomas F. Wenisch,

Stephen Somogyi,

Nikolaos Hardavellas,

Anastassia Ailamaki,

Babak Falsafi

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture

Pages 222 - 233

https://doi.org/10.1109/ISCA.2005.50

Published: 01 May 2005

Abstract

Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose temporal streaming, a technique that eliminates coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation, in which groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality, in which recently accessed address streams are likely to recur. We present a practical design for temporal streaming and evaluate it using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
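To make the two phenomena named in the abstract concrete, the following is a minimal toy sketch in Python. It is not the paper's hardware design: the class name `TemporalStreamer`, the `depth` parameter, the single global access log, and the unbounded buffer are all illustrative assumptions. The model records the order of accessed addresses and, whenever an address recurs, streams the addresses that followed it last time into a buffer ahead of their use.

```python
class TemporalStreamer:
    """Toy model of temporal streaming (illustrative assumption, not the paper's design)."""

    def __init__(self, depth=2):
        self.depth = depth    # hypothetical: how many following addresses to stream
        self.log = []         # ordered history of accessed addresses
        self.last_pos = {}    # address -> index of its most recent entry in the log
        self.buffer = set()   # addresses streamed ahead of use

    def access(self, addr):
        """Return True if addr had already been streamed (the miss is covered)."""
        covered = addr in self.buffer
        self.buffer.discard(addr)
        prev = self.last_pos.get(addr)
        if prev is not None:
            # Temporal address correlation: the addresses that followed addr
            # last time are likely to follow it again, so stream them now.
            self.buffer.update(self.log[prev + 1 : prev + 1 + self.depth])
        self.last_pos[addr] = len(self.log)
        self.log.append(addr)
        return covered


def run(trace, depth=2):
    """Replay a trace of addresses; return one covered/uncovered flag per access."""
    streamer = TemporalStreamer(depth)
    return [streamer.access(addr) for addr in trace]


if __name__ == "__main__":
    # The stream 1, 2, 3, 4 recurs (temporal stream locality).
    print(run([1, 2, 3, 4] * 2))
```

In this toy, the first pass over a stream is never covered, and the head of a recurring stream is covered only once an earlier stream has streamed it. A real design must also bound the log and buffer and locate streams recorded on other nodes, none of which this sketch attempts.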



Published In

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture

June 2005

541 pages

ISBN:076952270X

ACM SIGARCH Computer Architecture News Volume 33, Issue 2
ISCA 2005
May 2005
531 pages
ISSN:0163-5964
DOI:10.1145/1080695

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Computer Society

United States


Qualifiers

Article

Conference

ISCA05

Sponsor:

SIGARCH

ISCA05: The 32nd Annual International Symposium on Computer Architecture 2005

June 4 - 8, 2005

Acceptance Rates

ISCA '05 Paper Acceptance Rate 45 of 194 submissions, 23%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%



