Decoupled Vector Runahead

Authors: Ajeya Naithani, Jaime Roelandts, Timothy M. Jones, Lieven Eeckhout

MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 17 - 31

https://doi.org/10.1145/3613424.3614255

Published: 08 December 2023

Abstract

We present Decoupled Vector Runahead (DVR), an in-core prefetching technique that executes separately from the main application thread and exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run-time, recognizes striding loads, and vectorizes the subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-of-order core has not yet stalled, bringing their data into the L1 cache and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run-time, vectorize the same chain of indirect memory accesses across multiple invocations of an inner loop, and efficiently handle branch divergence along the vectorized chain. DVR runs as an on-demand, speculative, in-order, lightweight hardware subthread alongside the main thread within the core and incurs a minimal hardware overhead of only 1139 bytes. Relative to a large superscalar 5-wide out-of-order baseline and Vector Runahead (a recent microarchitectural technique to accelerate indirect memory accesses on out-of-order processors), DVR delivers 2.4× and 2× higher performance, respectively, for a set of graph analytics, database, and HPC workloads.
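To make the idea of a striding load feeding an indirect chain more concrete, the following is a minimal software sketch, not the DVR hardware. It assumes a loop of the form `b[a[i]]`, where the load of `a[i]` strides and the load of `b[...]` depends on it. The names `StrideDetector`, `observe`, `vectorized_indirect_targets`, the confidence threshold, and the lane count are all hypothetical and chosen only for illustration. The abstract says DVR infers loop bounds dynamically, whereas this sketch simply clips at the end of the array.

```python
from typing import Dict, List, Optional, Tuple


class StrideDetector:
    """Toy per-PC stride detector (hypothetical; not the DVR hardware table)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        # pc -> (last_address, last_stride, confidence)
        self.table: Dict[int, Tuple[int, int, int]] = {}

    def observe(self, pc: int, addr: int) -> Optional[int]:
        """Record one load; return the stride once it has repeated enough."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, 0)
            return None
        last_addr, last_stride, conf = self.table[pc]
        stride = addr - last_addr
        # Confidence grows only while the same non-zero stride repeats.
        conf = conf + 1 if (stride == last_stride and stride != 0) else 0
        self.table[pc] = (addr, stride, conf)
        return stride if conf >= self.threshold else None


def vectorized_indirect_targets(a: List[int], i: int, lanes: int) -> List[int]:
    """Indices into b that iterations i+1 .. i+lanes of `b[a[i]]` would touch.

    Each lane corresponds to one future loop iteration. Slicing clips at the
    end of `a`, which stands in for a known loop bound in this sketch.
    """
    return list(a[i + 1 : i + 1 + lanes])
```

With `threshold=2`, a load at one PC that touches addresses 100, 104, 108, 112 is reported as striding (stride 4) on the fourth observation. From that point, `vectorized_indirect_targets(a, i, lanes)` lists the dependent indices that could be fetched together instead of one iteration at a time.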

References

[1]

Sam Ainsworth and Timothy M. Jones. 2016. Graph Prefetching Using Data Structure Knowledge. In Proceedings of the 2016 International Conference on Supercomputing (Istanbul, Turkey) (ICS ’16). Association for Computing Machinery, New York, NY, USA, Article 39, 11 pages. https://doi.org/10.1145/2925426.2926254

Digital Library

[2]

Sam Ainsworth and Timothy M. Jones. 2018. An Event-Triggered Programmable Prefetcher for Irregular Workloads. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (Williamsburg, VA, USA) (ASPLOS ’18). Association for Computing Machinery, New York, NY, USA, 578–592. https://doi.org/10.1145/3173162.3173189

Digital Library

[3]

Sam Ainsworth and Timothy M. Jones. 2019. Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective. ACM Transactions on Computer Systems 36, 3, Article 8 (jun 2019), 34 pages. https://doi.org/10.1145/3319393

Digital Library

[4]

Sam Ainsworth and Timothy M. Jones. 2020. Prefetching in Functional Languages. In Proceedings of the 2020 ACM SIGPLAN International Symposium on Memory Management (London, UK) (ISMM ’20). Association for Computing Machinery, New York, NY, USA, 16–29. https://doi.org/10.1145/3381898.3397209

Digital Library

[5]

Hassan Al-Sukhni, Ian Bratt, and Daniel A. Connors. 2003. Compiler-Directed Content-Aware Prefetching for Dynamic Data Structures. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques(PACT ’03). IEEE Computer Society, Los Alamitos, CA, USA, 91. https://doi.org/10.1109/PACT.2003.1238005

[6]

James Alfred Ang, Brian W. Barrett, Kyle Bruce Wheeler, and Richard C. Murphy. 2010. Introducing the graph 500.Cray User’s Group (CUG) 19 (5 2010), 45–74. https://www.osti.gov/biblio/1014641

[7]

Murali Annavaram, Jignesh M. Patel, and Edward S. Davidson. 2001. Data Prefetching by Dependence Graph Precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Göteborg, Sweden) (ISCA ’01). Association for Computing Machinery, New York, NY, USA, 52–61. https://doi.org/10.1145/379240.379251

Digital Library

[8]

Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis, and Nectarios Koziris. 2008. Exploring the Performance Limits of Simultaneous Multithreading for Memory Intensive Applications. Journal of Supercomputing 44, 1 (apr 2008), 64–97. https://doi.org/10.1007/s11227-007-0149-x

Digital Library

[9]

Grant Ayers, Heiner Litz, Christos Kozyrakis, and Parthasarathy Ranganathan. 2020. Classifying Memory Access Patterns for Prefetching. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 513–526. https://doi.org/10.1145/3373376.3378498

Digital Library

[10]

Sara S. Baghsorkhi, Nalini Vasudevan, and Youfeng Wu. 2016. FlexVec: Auto-Vectorization for Irregular Loops. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (Santa Barbara, CA, USA) (PLDI ’16). Association for Computing Machinery, New York, NY, USA, 697–710. https://doi.org/10.1145/2908080.2908111

Digital Library

[11]

Mohammad Bakhshalipour, Mehran Shakerinava, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. 2019. Bingo Spatial Data Prefetcher. In 2019 IEEE International Symposium on High Performance Computer Architecture(HPCA ’19). IEEE Computer Society, Los Alamitos, CA, USA, 399–411. https://doi.org/10.1109/HPCA.2019.00053

[12]

Scott Beamer, Krste Asanović, and David Patterson. 2017. The GAP Benchmark Suite. arxiv:1508.03619 [cs.DC]

[13]

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadat, and Onur Mutlu. 2022. Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction. In 2022 55th IEEE/ACM International Symposium on Microarchitecture(MICRO-55). IEEE Computer Society, Los Alamitos, CA, USA, 1–18. https://doi.org/10.1109/MICRO56248.2022.00015

Digital Library

[14]

Rahul Bera, Konstantinos Kanellopoulos, Anant Nori, Taha Shahroodi, Sreenivas Subramoney, and Onur Mutlu. 2021. Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 1121–1137. https://doi.org/10.1145/3466752.3480114

Digital Library

[15]

Rahul Bera, Anant V. Nori, Onur Mutlu, and Sreenivas Subramoney. 2019. DSPatch: Dual Spatial Pattern Prefetcher. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 531–544. https://doi.org/10.1145/3352460.3358325

Digital Library

[16]

Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology 25, 2 (2001), 163–177. https://doi.org/10.1080/0022250X.2001.9990249

[17]

David Callahan, Ken Kennedy, and Allan Porterfield. 1991. Software Prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, California, USA) (ASPLOS IV). Association for Computing Machinery, New York, NY, USA, 40–52. https://doi.org/10.1145/106972.106979

Digital Library

[18]

Trevor E. Carlson, Wim Heirman, Stijn Eyerman, Ibrahim Hur, and Lieven Eeckhout. 2014. An evaluation of high-level mechanistic core models. ACM Transactions on Architecture and Code Optimization 11, 3, Article 28 (aug 2014), 25 pages. https://doi.org/10.1145/2629677

Digital Library

[19]

Mustafa Cavus, Resit Sendag, and Joshua J. Yi. 2020. Informed Prefetching for Indirect Memory Accesses. ACM Transactions on Architecture and Code Optimization 17, 1, Article 4 (mar 2020), 29 pages. https://doi.org/10.1145/3374216

Digital Library

[20]

Robert S. Chappell, Jared Stark, Sangwook P. Kim, Steven K. Reinhardt, and Yale N. Patt. 1999. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture (Atlanta, Georgia, USA) (ISCA ’99). IEEE Computer Society, Los Alamitos, CA, USA, 186–195. https://doi.org/10.1145/300979.300995

Digital Library

[21]

Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. 2007. Improving Hash Join Performance through Prefetching. ACM Transactions on Database Systems 32, 3 (aug 2007), 17–es. https://doi.org/10.1145/1272743.1272747

Digital Library

[22]

Tien-Fu Chen and Jean-Loup Baer. 1992. Reducing Memory Latency via Non-blocking and Prefetching Caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, Massachusetts, USA) (ASPLOS V). Association for Computing Machinery, New York, NY, USA, 51–61. https://doi.org/10.1145/143365.143486

Digital Library

[23]

Tien-Fu Chen and Jean-Loup Baer. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44, 5 (May 1995), 609–623. https://doi.org/10.1109/12.381947

Digital Library

[24]

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, and Donald Yeung. 2004. A General Framework for Prefetch Scheduling in Linked Data Structures and Its Application to Multi-chain Prefetching. ACM Transactions on Computer Systems 22, 2 (may 2004), 214–280. https://doi.org/10.1145/986533.986536

Digital Library

[25]

Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John P. Shen. 2001. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture (Göteborg, Sweden) (ISCA ’01). Association for Computing Machinery, New York, NY, USA, 14–25. https://doi.org/10.1145/379240.379248

Digital Library

[26]

Robert Cooksey, Stephan Jourdan, and Dirk Grunwald. 2002. A Stateless, Content-directed Data Prefetching Mechanism. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California) (ASPLOS X). Association for Computing Machinery, New York, NY, USA, 279–290. https://doi.org/10.1145/605397.605427

Digital Library

[27]

Kenzo Van Craeynest, Stijn Eyerman, and Lieven Eeckhout. 2009. MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor. In High Performance Embedded Architectures and Compilers, Fourth International Conference, HiPEAC 2009, Paphos, Cyprus, January 25-28, 2009. Proceedings(Lecture Notes in Computer Science, Vol. 5409). Springer Berlin Heidelberg, Berlin, Heidelberg, 110–124. https://doi.org/10.1007/978-3-540-92990-1_10

Digital Library

[28]

Dr. Ian Cutress. 2018. Intel’s Architecture Day 2018: The future of core, Intel gpus, 10nm, and hybrid x86. AnandTech. https://www.anandtech.com/show/13699/intel-architecture-day-2018-core-future-hybrid-x86

[29]

James Dundas and Trevor Mudge. 1997. Improving Data Cache Performance by Pre-Executing Instructions under a Cache Miss. In Proceedings of the 11th International Conference on Supercomputing (Vienna, Austria) (ICS ’97). Association for Computing Machinery, New York, NY, USA, 68–75. https://doi.org/10.1145/263580.263597

Digital Library

[30]

Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2009. Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. In 2009 IEEE 15th International Symposium on High Performance Computer Architecture. IEEE Computer Society, Los Alamitos, CA, USA, 7–17. https://doi.org/10.1109/HPCA.2009.4798232

[31]

Jack Edmonds and Richard M. Karp. 1972. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM 19, 2 (April 1972), 248–264. https://doi.org/10.1145/321694.321699

Digital Library

[32]

Babak Falsafi and Thomas F. Wenisch. 2014. A Primer on Hardware Prefetching. Springer Cham, Cham, Switzerland. https://doi.org/10.1007/978-3-031-01743-8

[33]

Andrei Frumusanu. 2020. Apple Announces The Apple Silicon M1: Ditching x86 - What to Expect, Based on A14. Anandtech. https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2

[34]

Andrei Frumusanu. 2021. The Snapdragon 888 vs The Exynos 2100: Cortex-X1 & 5nm - Who Does It Better? AnandTech. https://www.anandtech.com/show/16463/snapdragon-888-vs-exynos-2100-galaxy-s21-ultra/3

[35]

Ilya Ganusov and Martin Burtscher. 2006. Efficient Emulation of Hardware Prefetchers via Event-Driven Helper Threading. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (Seattle, Washington, USA) (PACT ’06). Association for Computing Machinery, New York, NY, USA, 144–153. https://doi.org/10.1145/1152154.1152178

Digital Library

[36]

Saurabh Gupta, Niranjan Soundararajan, Ragavendra Natarajan, and Sreenivas Subramoney. 2020. Opportunistic Early Pipeline Re-Steering for Data-Dependent Branches. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (Virtual Event, GA, USA) (PACT ’20). Association for Computing Machinery, New York, NY, USA, 305–316. https://doi.org/10.1145/3410463.3414628

Digital Library

[37]

Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. 2015. DeSC: Decoupled Supply-compute Communication Management for Heterogeneous Architectures. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 191–203. https://doi.org/10.1145/2830772.2830800

Digital Library

[38]

Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2016. Accelerating Dependent Cache Misses with an Enhanced Memory Controller. In Proceedings of the 43rd International Symposium on Computer Architecture (Seoul, Republic of Korea) (ISCA ’16). IEEE Computer Society, Los Alamitos, CA, USA, 444–455. https://doi.org/10.1109/ISCA.2016.46

Digital Library

[39]

Milad Hashemi, Onur Mutlu, and Yale N. Patt. 2016. Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (MICRO-49). IEEE Computer Society, Los Alamitos, CA, USA, Article 61, 12 pages. https://doi.org/10.1109/MICRO.2016.7783764

[40]

Milad Hashemi and Yale N. Patt. 2015. Filtered Runahead Execution with a Runahead Buffer. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 358–369. https://doi.org/10.1145/2830772.2830812

Digital Library

[41]

Chen-Han Ho, Sung Jin Kim, and Karthikeyan Sankaralingam. 2015. Efficient Execution of Memory Access Phases Using Dataflow Specialization. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (Portland, Oregon) (ISCA ’15). Association for Computing Machinery, New York, NY, USA, 118–130. https://doi.org/10.1145/2749469.2750390

Digital Library

[42]

Akanksha Jain and Calvin Lin. 2013. Linearizing Irregular Memory Accesses for Improved Correlated Prefetching. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (Davis, California) (MICRO-46). Association for Computing Machinery, New York, NY, USA, 247–259. https://doi.org/10.1145/2540708.2540730

Digital Library

[43]

Doug Joseph and Dirk Grunwald. 1997. Prefetching Using Markov Predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture (Denver, Colorado, USA) (ISCA ’97). Association for Computing Machinery, New York, NY, USA, 252–263. https://doi.org/10.1145/264107.264207

Digital Library

[44]

Changhee Jung, Daeseob Lim, Jaejin Lee, and Yan Solihin. 2006. Helper Thread Prefetching for Loosely-Coupled Multiprocessor Systems. In Proceedings of the 20th International Conference on Parallel and Distributed Processing (Rhodes Island, Greece) (IPDPS’06). IEEE Computer Society, Los Alamitos, CA, USA, 10 pp.–. https://doi.org/10.1109/IPDPS.2006.1639375

[45]

Dongkeun Kim and Donald Yeung. 2002. Design and Evaluation of Compiler Algorithms for Pre-execution. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California) (ASPLOS X). Association for Computing Machinery, New York, NY, USA, 159–170. https://doi.org/10.1145/605397.605415

Digital Library

[46]

Jinchun Kim, Seth H. Pugsley, Paul V. Gratz, A. L. Narasimha Reddy, Chris Wilkerson, and Zeshan Chishti. 2016. Path Confidence Based Lookahead Prefetching. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (MICRO-49). IEEE Computer Society, Los Alamitos, CA, USA, Article 60, 12 pages. https://doi.org/10.1109/MICRO.2016.7783763

[47]

Onur Kocberber, Babak Falsafi, and Boris Grot. 2015. Asynchronous Memory Access Chaining. Proc. VLDB Endow. 9, 4 (dec 2015), 252–263. https://doi.org/10.14778/2856318.2856321

Digital Library

[48]

Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (Davis, California) (MICRO-46). Association for Computing Machinery, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748

Digital Library

[49]

Nicholas Kohout, Seungryul Choi, Dongkeun Kim, and Donald Yeung. 2001. Multi-Chain Prefetching: Effective Exploitation of Inter-Chain Memory Parallelism for Pointer-Chasing Codes. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques(PACT ’01). IEEE Computer Society, Los Alamitos, CA, USA, 268–279. https://doi.org/10.1109/PACT.2001.953307

[50]

Snehasish Kumar, Arrvindh Shriraman, Vijayalakshmi Srinivasan, Dan Lin, and Jordon Phillips. 2014. SQRL: Hardware Accelerator for Collecting Software Data Structures. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (Edmonton, AB, Canada) (PACT ’14). Association for Computing Machinery, New York, NY, USA, 475–476. https://doi.org/10.1145/2628071.2628118

Digital Library

[51]

Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (Newport Beach, California, USA) (ICS ’15). Association for Computing Machinery, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231

Digital Library

[52]

Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (Vancouver, British Columbia, Canada) (PLDI ’00). Association for Computing Machinery, New York, NY, USA, 145–156. https://doi.org/10.1145/349299.349320

Digital Library

[53]

Eric Lau, Jason E. Miller, Inseok Choi, Donald Yeung, Saman Amarasinghe, and Anant Agarwal. 2011. Multicore Performance Optimization Using Partner Cores. In 3rd USENIX Workshop on Hot Topics in Parallelism(HotPar 11). USENIX Association, Berkeley, CA, 1–6. https://www.usenix.org/conference/hotpar11/multicore-performance-optimization-using-partner-cores

[54]

Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28, 2 (March 2008), 39–55. https://doi.org/10.1109/MM.2008.31

Digital Library

[55]

Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut Kandemir. 2012. A Compiler Framework for Extracting Superword Level Parallelism. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (Beijing, China) (PLDI ’12). Association for Computing Machinery, New York, NY, USA, 347–358. https://doi.org/10.1145/2254064.2254106

Digital Library

[56]

Elliot Lockerman, Axel Feldmann, Mohammad Bakhshalipour, Alexandru Stanescu, Shashwat Gupta, Daniel Sanchez, and Nathan Beckmann. 2020. Livia: Data-Centric Computing Throughout the Memory Hierarchy. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 417–433. https://doi.org/10.1145/3373376.3378497

Digital Library

[57]

Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, and David A. Padua. 2011. An Evaluation of Vectorizing Compilers. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques(PACT ’11). IEEE Computer Society, Los Alamitos, CA, USA, 372–382. https://doi.org/10.1109/PACT.2011.68

Digital Library

[58]

Pierre Michaud. 2016. Best-offset hardware prefetching. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 469–480. https://doi.org/10.1109/HPCA.2016.7446087

[59]

Sparsh Mittal. 2016. A Survey of Recent Prefetching Techniques for Processor Caches. ACM Comput. Surv. 49, 2, Article 35 (aug 2016), 35 pages. https://doi.org/10.1145/2907071

Digital Library

[60]

Andreas Moshovos, Dionisios N. Pnevmatikatos, and Amirali Baniasadi. 2001. Slice-Processors: An Implementation of Operation-Based Prediction. In Proceedings of the 15th International Conference on Supercomputing (Sorrento, Italy) (ICS ’01). Association for Computing Machinery, New York, NY, USA, 321–334. https://doi.org/10.1145/377792.377856

Digital Library

[61]

Todd Carl Mowry. 1995. Tolerating Latency through Software-Controlled Data Prefetching. Ph. D. Dissertation. Stanford University, Computer Systems Laboratory, Stanford, CA, USA. UMI Order No. GAX94-29983.

[62]

Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2005. Techniques for Efficient Processing in Runahead Execution Engines. In Proceedings of the 32nd Annual International Symposium on Computer Architecture(ISCA ’05). IEEE Computer Society, Los Alamitos, CA, USA, 370–381. https://doi.org/10.1109/ISCA.2005.49

Digital Library

[63]

Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2006. Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses. IEEE Trans. Comput. 55, 12 (Dec 2006), 1491–1508. https://doi.org/10.1109/TC.2006.191

Digital Library

[64]

Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2006. Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. IEEE Micro 26, 1 (Jan 2006), 10–20. https://doi.org/10.1109/MM.2006.10

[65]

Onur Mutlu, Hyesoon Kim, Jared Stark, and Yale N. Patt. 2005. On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor. IEEE Computer Architecture Letters 4, 1 (Jan 2005), 2–2. https://doi.org/10.1109/L-CA.2005.1

Digital Library

[66]

Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. 2003. Runahead execution: an alternative to very large instruction windows for out-of-order processors. In The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.IEEE Computer Society, Los Alamitos, CA, USA, 129–140. https://doi.org/10.1109/HPCA.2003.1183532

[67]

Ajeya Naithani, Sam Ainsworth, Timothy M. Jones, and Lieven Eeckhout. 2021. Vector Runahead. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA ’21). IEEE Computer Society, Los Alamitos, CA, USA, 195–208. https://doi.org/10.1109/ISCA52012.2021.00024

Digital Library

[68]

Ajeya Naithani, Sam Ainsworth, Timothy M. Jones, and Lieven Eeckhout. 2022. Vector Runahead for Indirect Memory Accesses. IEEE Micro 42, 4 (jul 2022), 116–123. https://doi.org/10.1109/MM.2022.3163132

Digital Library

[69]

Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout. 2019. Precise Runahead Execution. IEEE Computer Architecture Letters 18, 1 (Jan 2019), 71–74. https://doi.org/10.1109/LCA.2019.2910518

Digital Library

[70]

Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout. 2020. Precise Runahead Execution. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 397–410. https://doi.org/10.1109/HPCA47549.2020.00040

[71]

Agustín Navarro-Torres, Biswabandan Panda, Jesús Alastruey-Benedé, Pablo Ibáñez, Víctor Viñals-Yúfera, and Alberto Ros. 2022. Berti: an Accurate Local-Delta Data Prefetcher. In 2022 55th IEEE/ACM International Symposium on Microarchitecture(MICRO-55). IEEE Computer Society, Los Alamitos, CA, USA, 975–991. https://doi.org/10.1109/MICRO56248.2022.00072

Digital Library

[72]

Kyle J. Nesbit and James E. Smith. 2004. Data Cache Prefetching Using a Global History Buffer. In Proceedings of the 10th International Symposium on High Performance Computer Architecture(HPCA ’04). IEEE Computer Society, Los Alamitos, CA, USA, 96. https://doi.org/10.1109/HPCA.2004.10030

Digital Library

[73]

Quan M. Nguyen and Daniel Sanchez. 2020. Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 596–608. https://doi.org/10.1109/MICRO50266.2020.00056

[74]

Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-Vectorization of Interleaved Data for SIMD. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (Ottawa, Ontario, Canada) (PLDI ’06). Association for Computing Machinery, New York, NY, USA, 132–143. https://doi.org/10.1145/1133981.1133997

Digital Library

[75]

Samuel Pakalapati and Biswabandan Panda. 2020. Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE Computer Society, Los Alamitos, CA, USA, 118–131. https://doi.org/10.1109/ISCA45697.2020.00021

Digital Library

[76]

Vasileios Porpodas and Timothy M. Jones. 2015. Throttling Automatic Vectorization: When Less is More. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT)(PACT ’15). IEEE Computer Society, Los Alamitos, CA, USA, 432–444. https://doi.org/10.1109/PACT.2015.32

Digital Library

[77]

Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP Automatic Vectorization. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (San Francisco, California) (CGO ’15). IEEE Computer Society, Los Alamitos, CA, USA, 190–201. https://doi.org/10.1109/CGO.2015.7054199

[78]

Stephen Pruett and Yale Patt. 2021. Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 804–815. https://doi.org/10.1145/3466752.3480053

Digital Library

[79]

Tanausú Ramírez, Alex Pajuelo, Oliverio Jesus Santana, Onur Mutlu, and Mateo Valero. 2010. Efficient Runahead Threads. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (Vienna, Austria) (PACT ’10). Association for Computing Machinery, New York, NY, USA, 443–452. https://doi.org/10.1145/1854273.1854328

Digital Library

[80]

Tanausú Ramírez, Alex Pajuelo, Oliverio Jesus Santana, and Mateo Valero. 2008. Runahead Threads to improve SMT performance. In 2008 IEEE 14th International Symposium on High Performance Computer Architecture. IEEE Computer Society, Los Alamitos, CA, USA, 149–158. https://doi.org/10.1109/HPCA.2008.4658635

[81]

Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. 2004. Decoupled Software Pipelining with the Synchronization Array. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques(PACT ’04). IEEE Computer Society, Los Alamitos, CA, USA, 177–188. https://doi.org/10.1109/PACT.2004.1342552

[82]

Amir Roth, Andreas Moshovos, and Gurindar S. Sohi. 1998. Dependence Based Prefetching for Linked Data Structures. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California, USA) (ASPLOS VIII). Association for Computing Machinery, New York, NY, USA, 115–126. https://doi.org/10.1145/291069.291034

Digital Library

[83]

André Seznec. 2016. TAGE-SC-L Branch Predictors Again. In 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5) (Seoul, South Korea). INRIA HAL, rennes France, 1–4. https://inria.hal.science/hal-01354253

[84]

Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian, Chris Wilkerson, Seth H. Pugsley, and Zeshan Chishti. 2015. Efficiently Prefetching Complex Address Patterns. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 141–152. https://doi.org/10.1145/2830772.2830793

Digital Library

[85]

Zhan Shi, Akanksha Jain, Kevin Swersky, Milad Hashemi, Parthasarathy Ranganathan, and Calvin Lin. 2021. A Hierarchical Neural Model of Data Prefetching. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS ’21). Association for Computing Machinery, New York, NY, USA, 861–873. https://doi.org/10.1145/3445814.3446752

Digital Library

[86]

Peng Sun, Giacomo Gabrielli, and Timothy M. Jones. 2021. Speculative Vectorisation with Selective Replay. In Proceedings of the 48th Annual International Symposium on Computer Architecture (Virtual Event, Spain) (ISCA ’21). IEEE Computer Society, Los Alamitos, CA, USA, 223–236. https://doi.org/10.1109/ISCA52012.2021.00026

Digital Library

[87]

Hikaru Takayashiki, Masayuki Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi. 2019. A Hardware Prefetching Mechanism for Vector Gather Instructions. In 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE Computer Society, Los Alamitos, CA, USA, 59–66. https://doi.org/10.1109/IA349570.2019.00015

[88]

Nishil Talati, Kyle May, Armand Behroozi, Yichen Yang, Kuba Kaszyk, Christos Vasiladiotis, Tarunesh Verma, Lu Li, Brandon Nguyen, Jiawen Sun, John Magnus Morton, Agreen Ahmadi, Todd Austin, Michael O’Boyle, Scott Mahlke, Trevor Mudge, and Ronald Dreslinski. 2021. Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design, In 2021 IEEE International Symposium on High-Performance Computer Architecture. Proceedings - International Symposium on High-Performance Computer Architecture 2021-February, 654–667. https://doi.org/10.1109/HPCA51647.2021.00061

[89]

Sam Ainsworth Timothy and M. Jones. 2017. Software prefetching for indirect memory accesses. In CGO 2017 - Proceedings of the 2017 International Symposium on Code Generation and Optimization. IEEE Computer Society, Los Alamitos, CA, USA, 305–317. https://doi.org/10.1109/CGO.2017.7863749

[90]

Kim-Anh Tran, Trevor E. Carlson, Konstantinos Koukos, Magnus Själander, Vasileios Spiliopoulos, Stefanos Kaxiras, and Alexandra Jimborean. 2017. Clairvoyance: Look-Ahead Compile-Time Scheduling. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (Austin, USA) (CGO ’17). IEEE Computer Society, Los Alamitos, CA, USA, 171–184. https://doi.org/10.1109/CGO.2017.7863738

[91]

Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. 1995. Simultaneous Multithreading: Maximizing on-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (S. Margherita Ligure, Italy) (ISCA ’95). Association for Computing Machinery, New York, NY, USA, 392–403. https://doi.org/10.1145/223982.224449

[92]

Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August. 2007. Speculative Decoupled Software Pipelining. In 2007 16th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, Los Alamitos, CA, USA, 49–59. https://doi.org/10.1109/PACT.2007.66

[93]

Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, Terry Sych, Stephen F. Moore, and John P. Shen. 2004. Helper Threads via Virtual Multithreading. IEEE Micro 24, 6 (nov 2004), 74–82. https://doi.org/10.1109/MM.2004.75

[94]

Zhenlin Wang, Doug Burger, Kathryn S. McKinley, Steven K. Reinhardt, and Charles C. Weems. 2003. Guided Region Prefetching: A Cooperative Hardware/Software Approach. In Proceedings of the 30th Annual International Symposium on Computer Architecture (San Diego, California) (ISCA ’03). Association for Computing Machinery, New York, NY, USA, 388–398. https://doi.org/10.1145/859618.859663

[95]

Hao Wu, Krishnendra Nathella, Joseph Pusdesris, Dam Sunwoo, Akanksha Jain, and Calvin Lin. 2019. Temporal Prefetching Without the Off-Chip Metadata. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 996–1008. https://doi.org/10.1145/3352460.3358300

[96]

Hao Wu, Krishnendra Nathella, Dam Sunwoo, Akanksha Jain, and Calvin Lin. 2019. Efficient Metadata Management for Irregular Data Prefetching. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA ’19). Association for Computing Machinery, New York, NY, USA, 449–461. https://doi.org/10.1145/3307650.3322225

[97]

Chia-Lin Yang and Alvin R. Lebeck. 2002. A Programmable Memory Hierarchy for Prefetching Linked Data Structures. In Proceedings of the 4th International Symposium on High Performance Computing (ISHPC ’02). Springer-Verlag, Berlin, Heidelberg, 160–174. https://doi.org/10.1007/3-540-47847-7_15

[98]

Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect Memory Prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 178–190. https://doi.org/10.1145/2830772.2830807

[99]

Chao Zhang, Yuan Zeng, John Shalf, and Xiaochen Guo. 2020. RnR: A Software-Assisted Record-and-Replay Hardware Prefetcher. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 609–621. https://doi.org/10.1109/MICRO50266.2020.00057

[100]

Dan Zhang, Xiaoyu Ma, Michael Thomson, and Derek Chiou. 2018. Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (Williamsburg, VA, USA) (ASPLOS ’18). Association for Computing Machinery, New York, NY, USA, 593–607. https://doi.org/10.1145/3173162.3173197

Cited By

Xue F, Han C, Li X, Wu J, Zhang T, Liu T, Hao Y, Du Z, Guo Q, Zhang F (2024). Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses. ACM Transactions on Architecture and Code Optimization 21:2 (1-26). Online publication date: 23-Mar-2024. https://dl.acm.org/doi/10.1145/3641853
Bera R, Ranganathan A, Rakshit J, Mahto S, Nori A, Gaur J, Olgun A, Kanellopoulos K, Sadrosadati M, Subramoney S, Mutlu O (2024). Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (88-102). Online publication date: 29-Jun-2024. https://doi.org/10.1109/ISCA59077.2024.00017

Index Terms

Decoupled Vector Runahead
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
    2. Serial architectures
      1. Superscalar architectures

Published In

MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture

October 2023

1528 pages

ISBN:9798400703294

DOI:10.1145/3613424

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2023

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

European Research Council

Conference

MICRO '23

Sponsor:

SIGMICRO

MICRO '23: 56th Annual IEEE/ACM International Symposium on Microarchitecture

October 28 - November 1, 2023

Toronto, ON, Canada

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
