Load squared: adding logic close to memory to reduce the latency of indirect loads with high miss ratios

Published: 29 September 2004

Abstract

Indirect memory accesses, where a load is fed by another load, are ubiquitous because of rich data structures and sophisticated software conventions, such as the use of linkage tables and position independent code. Unfortunately, they can be costly: if both loads miss, two round trips to memory are required even though the role of the first load is often limited to fetching the address of the second load. To reduce the total latency of such indirect accesses, a new instruction called load squared is introduced. A load squared does two fetches, the first fetch reading the target address of the second. (An offset is optionally added to the result of the first fetch.) The load squared operation is performed by memory-side logic (typically, the memory controller if it isn't located on the main processor chip). In this study, load squared is not an architecturally visible instruction: the micro-architecture transparently decides which loads should be replaced by loads squared. We show that performance is sometimes improved significantly, and never degraded.
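The latency argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's mechanism or its simulator; the function names, the dictionary-as-memory representation, and both latency constants are assumptions chosen only to make the arithmetic visible. It compares a conventional indirect access, where both loads miss and each pays a full round trip, with a load squared, where memory-side logic performs both fetches and only one round trip is paid.

```python
# Toy model: memory is a dict from address to word. All latencies are
# hypothetical illustrative numbers, not figures from the paper.
ROUND_TRIP = 200   # assumed processor <-> memory round trip, in cycles
DRAM_ACCESS = 50   # assumed memory-side access time, in cycles


def indirect_load(memory, addr, offset=0):
    """Two dependent loads issued by the processor; both are assumed to miss."""
    pointer = memory[addr]              # first load: fetch the target address
    value = memory[pointer + offset]    # second load: fetch the data
    latency = 2 * (ROUND_TRIP + DRAM_ACCESS)   # two full trips to memory
    return value, latency


def load_squared(memory, addr, offset=0):
    """Same result, but memory-side logic performs both fetches."""
    pointer = memory[addr]              # first fetch, done next to memory
    value = memory[pointer + offset]    # second fetch, optional offset added
    latency = ROUND_TRIP + 2 * DRAM_ACCESS     # one trip, two local accesses
    return value, latency


memory = {0x100: 0x200, 0x208: 42}
print(indirect_load(memory, 0x100, offset=8))  # (42, 500)
print(load_squared(memory, 0x100, offset=8))   # (42, 300)
```

Under these assumed numbers the saving is exactly one round trip. The model ignores caches entirely, so it says nothing about when the micro-architecture should choose to issue a load squared.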

Published In

ACM SIGARCH Computer Architecture News  Volume 33, Issue 3
Special issue: MEDEA 2004 workshop
June 2005
74 pages
ISSN:0163-5964
DOI:10.1145/1101868
  • MEDEA '04: Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications, systems and architecture
    September 2004
    62 pages
    ISBN:9781450378192
    DOI:10.1145/1152922

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 September 2004
Published in SIGARCH Volume 33, Issue 3
