Load squared: adding logic close to memory to reduce the latency of indirect loads with high miss ratios

Authors:

Jean-Francois Collard,

Olivier Temam

ACM SIGARCH Computer Architecture News, Volume 33, Issue 3

Pages 17 - 24

https://doi.org/10.1145/1101868.1101873

Published: 29 September 2004

Abstract

Indirect memory accesses, where a load is fed by another load, are ubiquitous because of rich data structures and sophisticated software conventions, such as the use of linkage tables and position independent code. Unfortunately, they can be costly: if both loads miss, two round trips to memory are required even though the role of the first load is often limited to fetching the address of the second load. To reduce the total latency of such indirect accesses, a new instruction called load squared is introduced. A load squared does two fetches, the first fetch reading the target address of the second. (An offset is optionally added to the result of the first fetch.) The load squared operation is performed by memory-side logic (typically, the memory controller if it isn't located on the main processor chip). In this study, load squared is not an architecturally visible instruction: the micro-architecture transparently decides which loads should be replaced by loads squared. We show that performance is sometimes improved significantly, and never degraded.
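
The semantics described in the abstract can be sketched in a few lines. This is only an illustrative model, not the paper's implementation: memory is assumed to be a flat Python dict, and the names `load`, `load_squared`, `cpu_side_latency`, and `memory_side_latency`, as well as the simple latency model (a one-way link delay plus a memory access time), are hypothetical and chosen for this example.

```python
# Illustrative model only. Memory is a dict mapping addresses to values.
# All names and the latency model are assumptions, not taken from the paper.

def load(memory, address):
    """Ordinary load: a single fetch."""
    return memory[address]


def load_squared(memory, address, offset=0):
    """Two dependent fetches performed as one operation.

    The first fetch reads a pointer. The optional offset is added to that
    pointer to form the address of the second fetch.
    """
    target = memory[address] + offset
    return memory[target]


def cpu_side_latency(link, access):
    """Both loads miss and are issued by the processor.

    Each miss pays a full round trip: request over the link, memory access,
    reply over the link.
    """
    return 2 * (link + access + link)


def memory_side_latency(link, access):
    """Memory-side logic chains the two fetches.

    Only one request and one reply cross the link; both memory accesses
    happen on the memory side.
    """
    return link + 2 * access + link


# A pointer stored at 0x10 refers to a record at 0x40; the field of
# interest sits 8 bytes into that record.
memory = {0x10: 0x40, 0x48: 7}
value = load_squared(memory, 0x10, offset=8)
```

In this model a load squared returns the same value as `load(memory, load(memory, 0x10) + 8)`, and the saving relative to two processor-issued misses is exactly one link round trip. Real systems would also have to account for caches, address translation, and coherence, which this sketch ignores.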

Index Terms

Load squared: adding logic close to memory to reduce the latency of indirect loads with high miss ratios
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Very long instruction word
    2. Serial architectures
      1. Complex instruction set computing
      2. Reduced instruction set computing
2. Hardware

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 33, Issue 3

Special issue: MEDEA 2004 workshop

June 2005

74 pages

ISSN: 0163-5964

DOI: 10.1145/1101868

MEDEA '04: Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications, systems and architecture
September 2004
62 pages
ISBN: 9781450378192
DOI: 10.1145/1152922
Conference Chairs:
Sandro Bartolini (University of Siena, Italy),
Pierfrancesco Foglia (University of Pisa, Italy),
Roberto Giorgi (University of Siena, Italy),
Cosimo Antonio Prete (University of Pisa, Italy)

Copyright © 2004 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 September 2004

Published in SIGARCH Volume 33, Issue 3

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 186
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Malkowski K, Link G, Raghavan P, Irwin M (2007). Load Miss Prediction - Exploiting Power Performance Trade-offs. 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1-8. DOI: 10.1109/IPDPS.2007.370536. Online publication date: Mar-2007
https://doi.org/10.1109/IPDPS.2007.370536
Hashemi M, Khubaib, Ebrahimi E, Mutlu O, Patt Y (2016). Accelerating dependent cache misses with an enhanced memory controller. ACM SIGARCH Computer Architecture News 44:3, pp. 444-455. DOI: 10.1145/3007787.3001184. Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1145/3007787.3001184
Hashemi M, Khubaib, Ebrahimi E, Mutlu O, Patt Y, Min S, Loh G (2016). Accelerating dependent cache misses with an enhanced memory controller. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 444-455. DOI: 10.1109/ISCA.2016.46. Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1109/ISCA.2016.46
Li W, Rezaei M, Kavi K, Naz A, Sweany P (2007). Feasibility of decoupling memory management from the execution pipeline. Journal of Systems Architecture: the EUROMICRO Journal 53:12, pp. 927-936. DOI: 10.1016/j.sysarc.2007.03.003. Online publication date: 1-Dec-2007
https://dl.acm.org/doi/10.1016/j.sysarc.2007.03.003
