research-article

Open access

Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Authors:

Michael C. Huang

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 1

Article No.: 13, Pages 1 - 25

https://doi.org/10.1145/2890505

Published: 28 March 2016

Abstract

Most processors employ hardware data prefetching to hide memory access latency. On a multicore processor, however, prefetch requests from one thread can interfere severely with the prefetch and demand requests of other threads. On shared-memory multicore systems, this contention for shared resources can turn data prefetching into a source of significant performance degradation. This article proposes a thread-aware data prefetching mechanism that uses low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching mode and parameters of each thread, (2) a filtering mechanism that informs the hardware which prefetch requests can cause shared data invalidation and should be discarded, and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, that is, the thread with the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline system with a two-level cache organization and a conventional directory coherence protocol based on the modified, exclusive, shared, and invalid (MESI) states. Compared with the feedback directed prefetching technique, our approach provides a 9% performance improvement on multicore systems while also reducing memory bandwidth consumption.
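To make components (1) and (3) more concrete, the following is a minimal software sketch of a per-interval tuning loop. It is not the article's hardware design: the aggressiveness levels, the accuracy thresholds, the `ThreadStats` fields, and the use of "least progress" as a stand-in for "longest completion time" are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical aggressiveness levels as (prefetch degree, prefetch distance).
# The abstract does not specify the actual modes or parameters.
LEVELS = [(1, 4), (2, 8), (4, 16), (4, 32)]

# Hypothetical accuracy thresholds, chosen only for this sketch.
ACCURACY_HIGH = 0.75
ACCURACY_LOW = 0.40


@dataclass
class ThreadStats:
    issued: int      # prefetches issued during the last interval
    useful: int      # prefetched lines later hit by a demand access
    progress: float  # assumed estimate of parallel-region progress, 0..1


def find_limiter(stats):
    """Return the thread id with the least progress.

    Assumption: the slowest-progressing thread approximates the thread
    with the longest completion time in the parallel region.
    """
    return min(stats, key=lambda tid: stats[tid].progress)


def next_level(level, s, is_limiter):
    """Pick the next aggressiveness level index for one thread."""
    if s.issued == 0:
        return level  # no feedback this interval, keep the current level
    acc = s.useful / s.issued
    if acc >= ACCURACY_HIGH or (is_limiter and acc >= ACCURACY_LOW):
        delta = 1   # accurate, or the limiter thread: prefetch more
    elif acc < ACCURACY_LOW:
        delta = -1  # inaccurate: back off to reduce contention
    else:
        delta = 0
    return max(0, min(len(LEVELS) - 1, level + delta))


def tune_interval(levels, stats):
    """Compute new per-thread level indices from one interval's feedback."""
    limiter = find_limiter(stats)
    return {
        tid: next_level(levels[tid], stats[tid], tid == limiter)
        for tid in stats
    }


if __name__ == "__main__":
    stats = {
        0: ThreadStats(issued=100, useful=90, progress=0.8),
        1: ThreadStats(issued=100, useful=50, progress=0.3),
        2: ThreadStats(issued=100, useful=20, progress=0.9),
    }
    print(tune_interval({0: 1, 1: 1, 2: 1}, stats))
```

In this sketch the limiter thread is allowed to step up at a lower accuracy than the others, which is one simple way to bias shared resources toward the critical thread. The filtering mechanism of component (2) is deliberately left out because the abstract gives too little detail to sketch it without guessing.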

Cited By

Eris F, Louis M, Eris K, Abellán J, Joshi A (2022). Puppeteer: A Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy. ACM Transactions on Architecture and Code Optimization 20:1 (1-25). Online publication date: 16-Dec-2022.
https://dl.acm.org/doi/10.1145/3570304
AlBarakat L, Gratz P, Jiménez D, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). SB-Fetch. Proceedings of the 34th ACM International Conference on Supercomputing (1-12). Online publication date: 29-Jun-2020.
https://dl.acm.org/doi/10.1145/3392717.3392735
Vijaykumar N, Ebrahimi E, Hsieh K, Gibbons P, Mutlu O (2018). The locality descriptor. Proceedings of the 45th Annual International Symposium on Computer Architecture (829-842). Online publication date: 2-Jun-2018.
https://dl.acm.org/doi/10.1109/ISCA.2018.00074

Index Terms

Computer systems organization > Architectures > Parallel architectures > Multiple instruction, multiple data
Hardware > Integrated circuits > Semiconductor memory

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 1

April 2016

347 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2899032

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2016

Accepted: 01 January 2016

Revised: 01 January 2016

Received: 01 April 2015

Published in TACO Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
National Natural Science Foundation of China
National High Technology Research and Development Program of China
the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing
Huawei Technologies Co., Ltd

Bibliometrics

Total Citations: 5
Total Downloads: 865
Downloads (Last 12 months): 155
Downloads (Last 6 weeks): 18

Reflects downloads up to 18 Jan 2025

Citations

Cited By

Guo H, Huang L, Lu Y, Ma J, Qian C, Ma S, Wang Z (2018). Accelerating BFS via Data Structure-Aware Prefetching on GPU. IEEE Access (1-1). Online publication date: 2018.
https://doi.org/10.1109/ACCESS.2018.2876201
Xiong D, Huang K, Jiang X, Yan X (2017). Providing Predictable Performance via a Slowdown Estimation Model. ACM Transactions on Architecture and Code Optimization 14:3 (1-26). Online publication date: 22-Aug-2017.
https://dl.acm.org/doi/10.1145/3124451
