DOI: 10.1109/ISCA.2005.50

Temporal Streaming of Shared Memory

Published: 01 May 2005

Abstract

Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose temporal streaming, which eliminates coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation: groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality: recently accessed address streams are likely to recur. We present a practical design for temporal streaming. We evaluate our design using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
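The two phenomena named in the abstract can be illustrated with a small software model. The sketch below is a toy under stated assumptions, not the hardware design the paper presents: the class name `ToyTemporalStreamer`, the `depth` parameter, and the single global history list are all hypothetical choices made for illustration. It logs accessed addresses in order and, on each access, looks up the most recent earlier occurrence of that address and marks the addresses that followed it as streamed.

```python
class ToyTemporalStreamer:
    """Toy model of temporal streaming (illustrative only, not the paper's design)."""

    def __init__(self, depth=2):
        self.depth = depth        # hypothetical: how many following addresses to stream
        self.history = []         # ordered log of accessed shared addresses
        self.last_index = {}      # address -> position of its most recent occurrence
        self.streamed = set()     # addresses delivered ahead of their access

    def access(self, addr):
        """Record one access; return True if the address had already been streamed."""
        covered = addr in self.streamed
        self.streamed.discard(addr)

        # Temporal stream locality: reuse the most recent sequence that began
        # at this address. Temporal address correlation: assume the same
        # addresses follow it again, in the same order.
        prev = self.last_index.get(addr)
        if prev is not None:
            self.streamed.update(self.history[prev + 1 : prev + 1 + self.depth])

        self.last_index[addr] = len(self.history)
        self.history.append(addr)
        return covered


def run(trace, depth=2):
    """Replay a trace and report, per access, whether it was covered."""
    streamer = ToyTemporalStreamer(depth)
    return [streamer.access(addr) for addr in trace]


if __name__ == "__main__":
    # The same four-address sequence recurs; only the head of the second
    # pass is uncovered in this toy model.
    print(run(["A", "B", "C", "D"] * 2))
```

In this model the first access of a recurring sequence is never covered, because the lookup needs one address to locate the earlier stream. A trace with no repeated addresses yields no coverage at all.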


Published In

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture
June 2005
541 pages
ISBN: 076952270X

ACM SIGARCH Computer Architecture News, Volume 33, Issue 2 (ISCA 2005)
May 2005
531 pages
ISSN: 0163-5964
DOI: 10.1145/1080695

Publisher

IEEE Computer Society

United States


Qualifiers

  • Article

Conference

ISCA05

Acceptance Rates

ISCA '05 Paper Acceptance Rate 45 of 194 submissions, 23%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%
