research-article
Open access

Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Published: 28 March 2016

Abstract

Most processors employ hardware data prefetching to hide memory access latency. On a multicore processor, however, prefetch requests from one thread can interfere severely with the prefetch and demand requests of other threads, and the resulting contention for shared resources can significantly degrade performance on shared-memory multicore systems. This article proposes a thread-aware data prefetching mechanism that uses low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching mode and arguments of each thread; (2) a filtering mechanism that informs the hardware which prefetch requests could cause shared-data invalidation and should be discarded; and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, that is, the thread with the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline with a two-level cache organization and a conventional directory-based MESI (modified, exclusive, shared, invalid) coherence protocol. Compared with the feedback directed prefetching technique, our approach provides a 9% performance improvement on multicore systems while also reducing memory bandwidth consumption.
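To make components (1) and (3) more concrete, the following is a minimal sketch of per-thread feedback tuning combined with limiter-thread selection. It is not the authors' design: the abstract gives no algorithmic detail, so the aggressiveness levels, the accuracy thresholds, the counter names in `ThreadStats`, and the policy of never throttling the limiter thread are all assumptions made for illustration. The filtering mechanism (2) is not modeled.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical aggressiveness levels as (prefetch distance, prefetch degree),
# ordered from conservative to aggressive.
LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4)]

# Hypothetical accuracy thresholds; the abstract does not give any values.
ACC_HIGH = 0.75
ACC_LOW = 0.40


@dataclass
class ThreadStats:
    """Per-thread counters sampled over one interval (assumed names)."""
    issued: int      # prefetches issued in the interval
    useful: int      # prefetched lines later hit by a demand access
    progress: float  # fraction of the parallel region completed, in (0, 1]
    elapsed: float   # cycles spent so far in the parallel region


def next_level(level: int, s: ThreadStats) -> int:
    """Raise aggressiveness when prefetches are accurate, lower it when not."""
    if s.issued == 0:
        return level  # no feedback this interval, keep the current level
    acc = s.useful / s.issued
    if acc >= ACC_HIGH:
        return min(level + 1, len(LEVELS) - 1)
    if acc < ACC_LOW:
        return max(level - 1, 0)
    return level


def limiter_thread(stats: Dict[int, ThreadStats]) -> int:
    """Return the thread with the longest estimated completion time.

    Completion time is extrapolated linearly as elapsed / progress,
    which is an assumed estimator.
    """
    return max(stats, key=lambda t: stats[t].elapsed / stats[t].progress)


def tune(levels: Dict[int, int], stats: Dict[int, ThreadStats]) -> Dict[int, int]:
    """One tuning step: per-thread feedback, then protect the limiter thread."""
    new = {t: next_level(levels[t], stats[t]) for t in levels}
    crit = limiter_thread(stats)
    # Assumed policy: the limiter thread is never throttled below its current level.
    new[crit] = max(new[crit], levels[crit])
    return new


stats = {
    0: ThreadStats(issued=100, useful=90, progress=0.50, elapsed=1000.0),
    1: ThreadStats(issued=100, useful=10, progress=0.25, elapsed=1000.0),
}
levels = {0: 1, 1: 2}
print(tune(levels, stats))
```

In this example thread 0 has accurate prefetches and moves up one level. Thread 1 has inaccurate prefetches and would normally be throttled, but it is also the estimated limiter thread, so under the assumed policy its level is held. A hardware implementation would use saturating counters and fixed sampling intervals rather than floating-point ratios.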

      Published In

      ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1
      April 2016
      347 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2899032
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 March 2016
      Accepted: 01 January 2016
      Revised: 01 January 2016
      Received: 01 April 2015
      Published in TACO Volume 13, Issue 1

      Author Tags

      1. Data prefetcher
      2. multicore
      3. multithreaded
      4. self-tuning
      5. thread-aware

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Article Metrics

      • Downloads (Last 12 months): 155
      • Downloads (Last 6 weeks): 18
      Reflects downloads up to 18 Jan 2025

      Cited By

      • (2022) Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy. ACM Transactions on Architecture and Code Optimization 20:1 (1-25). DOI: 10.1145/3570304. Online publication date: 16-Dec-2022
      • (2020) SB-Fetch. Proceedings of the 34th ACM International Conference on Supercomputing (1-12). DOI: 10.1145/3392717.3392735. Online publication date: 29-Jun-2020
      • (2018) The locality descriptor. Proceedings of the 45th Annual International Symposium on Computer Architecture (829-842). DOI: 10.1109/ISCA.2018.00074. Online publication date: 2-Jun-2018
      • (2018) Accelerating BFS via Data Structure-Aware Prefetching on GPU. IEEE Access (1-1). DOI: 10.1109/ACCESS.2018.2876201. Online publication date: 2018
      • (2017) Providing Predictable Performance via a Slowdown Estimation Model. ACM Transactions on Architecture and Code Optimization 14:3 (1-26). DOI: 10.1145/3124451. Online publication date: 22-Aug-2017
