research-article
Open access

Early Address Prediction: Efficient Pipeline Prefetch and Reuse

Published: 08 June 2021
    Abstract

    Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead.
    In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead.
    Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
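
The register-file tagging idea in the abstract can be illustrated with a toy software model. This is only a minimal sketch under stated assumptions, not the paper's hardware design: the names `PhysReg`, `TaggedRegisterFile`, `load`, and `store` are hypothetical, addresses are matched exactly (no spatial coalescing), and tag invalidation on a store is a simplification standing in for the memory order violation detection hardware the abstract refers to.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PhysReg:
    """One physical register plus a hypothetical address tag."""
    value: int = 0
    addr_tag: Optional[int] = None  # address the value was loaded from, if any


class TaggedRegisterFile:
    """Toy model: a load whose address matches a register's tag reuses that
    register instead of accessing the L1 cache."""

    def __init__(self, num_regs: int) -> None:
        self.regs: List[PhysReg] = [PhysReg() for _ in range(num_regs)]
        self.l1_accesses = 0  # loads that had to go to the (modelled) L1
        self.forwards = 0     # loads served from the register file

    def lookup(self, addr: int) -> Optional[int]:
        """Return the index of a register tagged with addr, or None."""
        for idx, reg in enumerate(self.regs):
            if reg.addr_tag == addr:
                return idx
        return None

    def load(self, dest: int, addr: int, memory: Dict[int, int]) -> int:
        """Perform a load and return the register that holds the data."""
        hit = self.lookup(addr)
        if hit is not None:
            # Reuse: the data already lives in a register, so no data moves.
            self.forwards += 1
            return hit
        # Miss: fetch from memory and remember where the value came from.
        self.l1_accesses += 1
        self.regs[dest] = PhysReg(value=memory[addr], addr_tag=addr)
        return dest

    def store(self, addr: int, value: int, memory: Dict[int, int]) -> None:
        """Write memory and drop stale tags (simplifying assumption)."""
        memory[addr] = value
        for reg in self.regs:
            if reg.addr_tag == addr:
                reg.addr_tag = None
```

In this model, two back-to-back loads from the same address cost a single L1 access, and the second load is pointed at the register filled by the first. A store to that address clears the tag, so the next load goes back to memory.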

    Cited By

    • (2024) A prefetching indexing scheme for in-memory database systems. Future Generation Computer Systems 156 (179-190). DOI: 10.1016/j.future.2024.03.012. Online publication date: Jul-2024

        Published In

        ACM Transactions on Architecture and Code Optimization  Volume 18, Issue 3
        September 2021
        370 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3460978
        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 08 June 2021
        Accepted: 01 March 2021
        Revised: 01 March 2021
        Received: 01 December 2020
        Published in TACO Volume 18, Issue 3

        Author Tags

        1. Pipeline prefetching
        2. address prediction
        3. energy efficient computing
        4. first level cache
        5. register sharing

        Qualifiers

        • Research-article
        • Research
        • Refereed
