research-article

Open access

Early Address Prediction: Efficient Pipeline Prefetch and Reuse

Authors:

Stefanos Kaxiras, and

David Black-Schaffer

ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 3

Article No.: 39, Pages 1 - 22

https://doi.org/10.1145/3458883

Published: 08 June 2021

Abstract

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead.

In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead.

Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
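The register-file tagging idea in the abstract can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the authors' hardware design: the class `TaggedRegisterFile`, its round-robin replacement, the dictionary standing in for the L1 cache, and the choice to drop a tag on a matching store are all hypothetical simplifications. The abstract itself says validation reuses the existing memory order violation detection hardware, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PhysReg:
    """One physical register plus a hypothetical address tag."""
    value: int = 0
    tag: Optional[int] = None  # address whose data this register holds, if any


class TaggedRegisterFile:
    """Toy model: loads that hit a tagged register are forwarded, not re-fetched."""

    def __init__(self, num_regs: int, memory: Dict[int, int]) -> None:
        self.regs: List[PhysReg] = [PhysReg() for _ in range(num_regs)]
        self.memory = memory      # stand-in for the L1 cache (assumption)
        self.next_victim = 0      # round-robin replacement (assumption)
        self.l1_accesses = 0
        self.forwards = 0

    def _lookup(self, addr: int) -> Optional[int]:
        # Associative search over the tags, as a tag-match in hardware would do.
        for idx, reg in enumerate(self.regs):
            if reg.tag == addr:
                return idx
        return None

    def load(self, addr: int) -> int:
        """Return the index of the register that holds the data for addr."""
        hit = self._lookup(addr)
        if hit is not None:
            self.forwards += 1    # reuse: no L1 access and no data movement
            return hit
        self.l1_accesses += 1     # miss: fetch from the L1 stand-in and tag the register
        idx = self.next_victim
        self.next_victim = (self.next_victim + 1) % len(self.regs)
        self.regs[idx] = PhysReg(value=self.memory.get(addr, 0), tag=addr)
        return idx

    def store(self, addr: int, value: int) -> None:
        """Write memory and drop any matching tag so stale data is never forwarded."""
        self.memory[addr] = value
        hit = self._lookup(addr)
        if hit is not None:
            self.regs[hit].tag = None


rf = TaggedRegisterFile(num_regs=4, memory={0x100: 7})
first = rf.load(0x100)   # fills and tags a register
second = rf.load(0x100)  # forwarded from the same register
```

In this model the second load to the same address returns the same register index and leaves the L1 access count unchanged, which is the temporal reuse the abstract describes. Coalescing of prefetch requests for spatial reuse is left out to keep the sketch short.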

Index Terms

Early Address Prediction: Efficient Pipeline Prefetch and Reuse
1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing
      2. Superscalar architectures

Published In

ACM Transactions on Architecture and Code Optimization Volume 18, Issue 3

September 2021

370 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3460978

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 June 2021

Accepted: 01 March 2021

Revised: 01 March 2021

Received: 01 December 2020

Published in TACO Volume 18, Issue 3

Qualifiers

Research-article
Research
Refereed
