Runtime-driven shared last-level cache management for task-parallel programs

Authors:

Vijay S. Pai

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 11, Pages 1 - 12

https://doi.org/10.1145/2807591.2807625

Published: 15 November 2015

Abstract

Task-parallel programming models with input annotation-based concurrency extraction at runtime present a promising paradigm for programming multicore processors. Through management of dependencies, task assignments, and orchestration, these models markedly simplify the programming effort for parallelization while exposing higher levels of concurrency.

In this paper, we show that for multicores with a shared last-level cache (LLC), the concurrency extraction framework can also be used to improve shared LLC performance. Based on the input annotations of future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse and to evict blocks with no future reuse. These instructions allow the hardware to preserve all the blocks for at least some of the future tasks while evicting dead blocks. This leads to a considerable improvement in cache efficiency over hardware-only replacement policies, which can replace blocks belonging to every future task and thus yield poor hit rates for all of them. For a set of input-annotated task-parallel programs using the OmpSs programming model implemented on the NANOS++ runtime, the proposed hardware-software technique achieves a mean performance improvement of 18% and a mean miss reduction of 26% over a shared LLC managed by the Least Recently Used (LRU) replacement policy. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the baseline.
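The intuition in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's hardware mechanism or the NANOS++ interface; it assumes a small fully associative cache, tasks described only by the list of blocks they read, and a runtime that knows the full task queue. The names `run_lru`, `run_hinted`, and `next_use` are hypothetical. The hinted policy evicts the block whose next use, measured in task order, is farthest away, so blocks with no future use (dead blocks) are evicted first.

```python
from collections import OrderedDict


def run_lru(tasks, capacity):
    """Baseline: fully associative cache with LRU replacement. Returns the miss count."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for blocks in tasks:
        for b in blocks:
            if b in cache:
                cache.move_to_end(b)  # refresh recency on a hit
            else:
                misses += 1
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict the least recently used block
                cache[b] = True
    return misses


def next_use(block, tasks, t, i):
    """Index of the next task that touches `block`, or infinity if it is dead.

    `t` is the running task and `i` the position of the current access in it.
    """
    if block in tasks[t][i + 1:]:
        return t  # still needed by the running task
    for u in range(t + 1, len(tasks)):
        if block in tasks[u]:
            return u
    return float("inf")  # no future task reads this block


def run_hinted(tasks, capacity):
    """Assumed runtime-hinted policy: evict the block needed farthest in the future."""
    cache = set()
    misses = 0
    for t, blocks in enumerate(tasks):
        for i, b in enumerate(blocks):
            if b in cache:
                continue
            misses += 1
            if len(cache) >= capacity:
                # Ties are broken by block id so the result is deterministic.
                victim = max(cache, key=lambda c: (next_use(c, tasks, t, i), c))
                cache.remove(victim)
            cache.add(b)
    return misses


if __name__ == "__main__":
    # Two tasks with disjoint inputs, each executed twice; the cache holds 6 of the 8 blocks.
    tasks = [[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], [4, 5, 6, 7]]
    print(run_lru(tasks, 6), run_hinted(tasks, 6))  # prints: 16 10
```

In this example LRU thrashes and misses on every access, whereas the hinted policy keeps the whole input of the third task resident and sacrifices part of the fourth task's input instead, which mirrors the "preserve all the blocks for at least some of the future tasks" behavior described above.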


Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2015

985 pages

ISBN:9781450337236

DOI:10.1145/2807591

General Chair: Jackie Kern, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC15

Sponsors: SIGHPC, SIGARCH, IEEE-CS

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 15 - 20, 2015

Austin, Texas

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

Seyri A, Pan A, Vamanan B, Lippautz M, Chisnall D (2022). MemSweeper: virtualizing cluster memory management for high memory utilization and isolation. Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management, pages 15-28. Online publication date: 14-Jun-2022.
https://dl.acm.org/doi/10.1145/3520263.3534651

Vijaykumar N, Olgun A, Kanellopoulos K, Bostanci F, Hassan H, Lotfi M, Gibbons P, Mutlu O (2022). MetaSys: A Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations. ACM Transactions on Architecture and Code Optimization, 19:2, pages 1-29. Online publication date: 24-Mar-2022.
https://dl.acm.org/doi/10.1145/3505250

Monil M, Belviranli M, Lee S, Vetter J, Malony A, Sarkar V, Kim H (2020). MEPHESTO. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pages 413-425. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414671

Zafari A, Larsson E, Tillenius M (2019). DuctTeip. Parallel Computing, 90:C. Online publication date: 1-Dec-2019.
https://dl.acm.org/doi/10.1016/j.parco.2019.102582

Wu K, Ren J, Li D (2018). Runtime data management on non-volatile memory-based heterogeneous memory for task-parallel programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-13. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291698

Manivannan M, Pericás M, Papaefstathiou V, Stenström P (2018). Global Dead-Block Management for Task-Parallel Programs. ACM Transactions on Architecture and Code Optimization, 15:3, pages 1-25. Online publication date: 4-Sep-2018.
https://dl.acm.org/doi/10.1145/3234337

Alvarez L, Casas M, Labarta J, Ayguade E, Valero M, Moreto M (2018). Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs. Proceedings of the 2018 International Conference on Supercomputing, pages 218-228. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3205289.3205312

Wu K, Ren J, Li D (2018). Runtime data management on non-volatile memory-based heterogeneous memory for task-parallel programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-13. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.1109/SC.2018.00034

Vijaykumar N, Jain A, Majumdar D, Hsieh K, Pekhimenko G, Ebrahimi E, Hajinazar N, Gibbons P, Mutlu O (2018). A case for richer cross-layer abstractions. Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 207-220. Online publication date: 2-Jun-2018.
https://dl.acm.org/doi/10.1109/ISCA.2018.00027

Castillo E, Alvarez L, Moreto M, Casas M, Vallejo E, Bosque J, Beivide R, Valero M (2018). Architectural Support for Task Dependence Management with Flexible Software Scheduling. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 283-295. Online publication date: Feb-2018.
https://doi.org/10.1109/HPCA.2018.00033
