DOI: 10.1145/2807591.2807625

Runtime-driven shared last-level cache management for task-parallel programs

Published: 15 November 2015

Abstract

Task-parallel programming models with input annotation-based concurrency extraction at runtime present a promising paradigm for programming multicore processors. Through management of dependencies, task assignments, and orchestration, these models markedly simplify the programming effort for parallelization while exposing higher levels of concurrency.
In this paper, we show that for multicores with a shared last-level cache (LLC), the concurrency extraction framework can also be used to improve shared LLC performance. Based on the input annotations for future tasks, the runtime instructs the hardware to prioritize data blocks with future reuse while evicting blocks with no future reuse. These instructions allow the hardware to preserve all the blocks for at least some of the future tasks and to evict dead blocks. This leads to a considerable improvement in cache efficiency over hardware-only replacement policies, which can replace blocks belonging to all future tasks, resulting in poor hit rates for all of them. For a set of input-annotated task-parallel programs using the OmpSs programming model implemented on the NANOS++ runtime, the proposed hardware-software technique leads to a mean improvement of 18% in application performance and a mean reduction of 26% in misses over a shared LLC managed by the Least Recently Used (LRU) replacement policy. In contrast, the state-of-the-art thread-based partitioning scheme suffers an average performance loss of 2% and an average increase of 15% in misses over the baseline.
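
The contrast between hint-guided eviction and plain LRU can be illustrated with a toy simulation. This is only a sketch under strong assumptions and is not the mechanism evaluated in the paper: the cache is fully associative, each task is reduced to the list of block identifiers it reads (a stand-in for its input annotations), and the names `count_misses`, `DEAD`, and `tasks` are hypothetical. With hints enabled, the victim is a dead block if one exists, otherwise a block whose next reader is the farthest-future task.

```python
from collections import OrderedDict

DEAD = float("inf")  # rank of a block that no pending or future task reads


def count_misses(tasks, capacity, use_hints):
    """Count misses for a fully associative cache holding `capacity` blocks.

    tasks: list of tasks in execution order; each task is a list of block ids.
    use_hints: False gives plain LRU; True evicts dead blocks first, then
    blocks whose next reader lies farthest in the future (assumed policy).
    """
    cache = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for i, blocks in enumerate(tasks):
        # Hypothetical runtime hint: earliest future task that reads each block.
        future = {}
        for j in range(len(tasks) - 1, i, -1):
            for b in tasks[j]:
                future[b] = j
        pending = set(blocks)  # blocks the current task has not touched yet
        for b in blocks:
            pending.discard(b)
            if b in cache:
                cache.move_to_end(b)
                continue
            misses += 1
            if len(cache) >= capacity:
                victim = next(iter(cache))  # LRU choice
                if use_hints:
                    worst = -1
                    for x in cache:  # scan in LRU order so ties fall back to LRU
                        rank = i if x in pending else future.get(x, DEAD)
                        if rank > worst:
                            victim, worst = x, rank
                del cache[victim]
            cache[b] = None
    return misses


# Task 0 and task 2 share blocks a and b; task 1 streams through four
# blocks that are never read again.
tasks = [["a", "b"], ["s1", "s2", "s3", "s4"], ["a", "b"]]
print(count_misses(tasks, 4, use_hints=False))  # 8
print(count_misses(tasks, 4, use_hints=True))   # 6
```

In this trace LRU lets the streaming task push out `a` and `b`, so the last task misses on both. The hinted policy instead evicts the streaming blocks as soon as they are dead, and the last task hits on both of its inputs.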



Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. multicore
  2. reuse distance
  3. shared cache partitioning
  4. task-based programming

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
