survey

Evaluation of Hardware Data Prefetchers on Server Processors

Authors:

Mohammad Bakhshalipour,

Seyedali Tabaeiaghdaei,

Pejman Lotfi-Kamran,

Hamid Sarbazi-Azad

ACM Computing Surveys (CSUR), Volume 52, Issue 3

Article No.: 52, Pages 1 - 29

https://doi.org/10.1145/3312740

Published: 18 June 2019

Abstract

Data prefetching, i.e., the act of predicting an application’s future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The value of data prefetching is evident to both industry and academia: almost every high-performance processor today incorporates several data prefetchers to capture the various access patterns of applications, and the research literature contains a myriad of data prefetching proposals, each of which enhances the efficiency of prefetching in a specific way.

In this survey, we evaluate the effectiveness of data prefetching in the context of server applications and shed light on its design trade-offs. To do so, we choose a target architecture based on a contemporary server processor and stack various state-of-the-art data prefetchers on top of it. We analyze the prefetchers in terms of their ability to predict memory accesses and enhance overall system performance, as well as their imposed overheads. Finally, by comparing the state-of-the-art prefetchers with impractical ideal prefetchers, we motivate further work on improving data prefetching techniques.
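As a concrete illustration of what "predicting future memory accesses and fetching those not in the cache" can mean, the following is a minimal sketch and not the evaluation setup of the survey. It assumes a toy fully associative LRU cache, a 64-byte block size, and a simple per-instruction stride predictor; the names `LRUCache`, `simulate`, and `BLOCK` are hypothetical. The two derived ratios (fraction of baseline misses removed, and fraction of issued prefetches that were later used) are the quantities commonly called coverage and accuracy; that naming is an assumption here, since the abstract does not define specific metrics.

```python
from collections import OrderedDict

BLOCK = 64  # assumed cache-block size in bytes (illustrative)


class LRUCache:
    """Toy fully associative cache with LRU replacement (hypothetical model)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        # block number -> True while the block is an unused prefetch
        self.blocks = OrderedDict()

    def access(self, block):
        """Demand access. Returns (hit, hit_on_unused_prefetch)."""
        if block in self.blocks:
            was_prefetched = self.blocks[block]
            self.blocks[block] = False
            self.blocks.move_to_end(block)
            return True, was_prefetched
        self._insert(block, False)
        return False, False

    def prefetch(self, block):
        """Bring a block in ahead of use. Returns True if a fetch was issued."""
        if block in self.blocks:
            return False  # already cached, nothing to fetch
        self._insert(block, True)
        return True

    def _insert(self, block, prefetched):
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[block] = prefetched


def simulate(trace, use_prefetcher, capacity_blocks=32):
    """Replay (pc, address) pairs; optionally run a per-PC stride predictor."""
    cache = LRUCache(capacity_blocks)
    table = {}  # pc -> (last_address, last_stride)
    misses = issued = useful = 0
    for pc, addr in trace:
        hit, was_prefetched = cache.access(addr // BLOCK)
        if not hit:
            misses += 1
        elif was_prefetched:
            useful += 1
        if use_prefetcher:
            last_addr, last_stride = table.get(pc, (None, None))
            stride = addr - last_addr if last_addr is not None else None
            # Predict only after the same non-zero stride is seen twice in a row.
            if stride is not None and stride != 0 and stride == last_stride:
                target = addr + stride
                if target >= 0 and cache.prefetch(target // BLOCK):
                    issued += 1
            table[pc] = (addr, stride)
    return {"misses": misses, "issued": issued, "useful": useful}


if __name__ == "__main__":
    # One load instruction walking an array, one block per access.
    trace = [(0x400, i * BLOCK) for i in range(100)]
    base = simulate(trace, use_prefetcher=False)
    pf = simulate(trace, use_prefetcher=True)
    coverage = (base["misses"] - pf["misses"]) / base["misses"]
    accuracy = pf["useful"] / pf["issued"]
    print(f"baseline misses={base['misses']} with prefetcher={pf['misses']}")
    print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

On this deliberately regular trace the predictor removes all but the first three misses, which is exactly why such a toy says little about server workloads, whose access patterns are far less regular.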

Index Terms

Evaluation of Hardware Data Prefetchers on Server Processors
1. Computer systems organization
  1. Architectures
2. General and reference
  1. Cross-computing tools and techniques
    1. Evaluation
  2. Document types
    1. Surveys and overviews

Published In

ACM Computing Surveys Volume 52, Issue 3

May 2020

734 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3341324

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2019

Accepted: 01 February 2019

Revised: 01 December 2018

Received: 01 August 2017

Published in CSUR Volume 52, Issue 3

Qualifiers

Survey
Research
Refereed

Funding Sources

Iran National Science Foundation (INSF)

Bibliometrics & Citations

Article Metrics

Total Citations: 12
Total Downloads: 1,030
Downloads (Last 12 months): 180
Downloads (Last 6 weeks): 6

Reflects downloads up to 12 Sep 2024

Citations

Cited By

Wu Q, Ekanayake A, Li R, Beard J, John L. (2022). SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing (1-12). Online publication date: 29-Aug-2022.
https://dl.acm.org/doi/10.1145/3545008.3545044
Zhang P, Kannan R, Srivastava A, Nori A, Prasanna V. (2022). ReSemble: Reinforced Ensemble Framework for Data Prefetching. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00086
Sadrosadati M, Mirhosseini A, Hajiabadi A, Ehsani S, Falahati H, Sarbazi-Azad H, Drumond M, Falsafi B, Ausavarungnirun R, Mutlu O. (2021). Highly Concurrent Latency-tolerant Register Files for GPUs. ACM Transactions on Computer Systems, 37:1-4 (1-36). Online publication date: 4-Jan-2021.
https://dl.acm.org/doi/10.1145/3419973
Li Q. (2021). Computer Aided Framework and Data Evaluation for Ceramic Product Design. 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) (846-850). Online publication date: 4-Feb-2021.
https://doi.org/10.1109/ICICV50876.2021.9388640
Golestani H, Wenisch T. (2021). HyperData: A Data Transfer Accelerator for Software Data Planes Based on Targeted Prefetching. 2021 IEEE 39th International Conference on Computer Design (ICCD) (326-334). Online publication date: Oct-2021.
https://doi.org/10.1109/ICCD53106.2021.00059
Esfeden H, Abdolrashidi A, Rahman S, Wong D, Abu-Ghazaleh N. (2020). BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (996-1008). Online publication date: Oct-2020.
https://doi.org/10.1109/MICRO50266.2020.00084
Golshan F, Bakhshalipour M, Shakerinava M, Ansari A, Lotfi-Kamran P, Sarbazi-Azad H. (2020). Harnessing Pairwise-Correlating Data Prefetching With Runahead Metadata. IEEE Computer Architecture Letters, 19:2 (130-133). Online publication date: 1-Jul-2020.
https://doi.org/10.1109/LCA.2020.3019343