research-article

A data layout optimization framework for NUCA-based multicores

Authors:

Mahmut Kandemir,

Ohyoung Jang

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 489 - 500

https://doi.org/10.1145/2155620.2155677

Published: 03 December 2011

Abstract

Future multicore architectures are likely to include a large number of cores connected by an on-chip network with Non-Uniform Cache Access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make a significant difference. To exploit this NUCA property, prior research has explored both architectural enhancements and compiler-based code optimization strategies. In this work, we take an alternative view and explore data layout optimizations that improve the locality of data accesses in a NUCA-based system. Our proposed approach consists of three steps: array tiling, computation-to-core mapping, and layout customization. The first step tries to identify the affinity between data and computation, taking parallelization information into account, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing the average distance-to-data. The third step further customizes the memory layout, taking into account the data placement policy adopted by the underlying architecture. We evaluated how well this three-step approach enhances on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that, on average, the proposed approach improves data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and by 18.1% and 12.7%, respectively, in the case of dynamic NUCA.
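To make the idea of layout customization concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical 2x2 mesh with one core and one L2 bank per node, a static NUCA policy that interleaves cache lines round-robin across banks, and thread t pinned to core t. All names (`home_bank`, `hops`, `customized_layout`, and the constants) are illustrative assumptions that do not come from the paper.

```python
# Illustrative sketch only: every parameter and policy below is an assumption.
MESH_W, MESH_H = 2, 2          # hypothetical 2x2 mesh, one core + one L2 bank per node
NUM_NODES = MESH_W * MESH_H
LINES_PER_THREAD = 4           # assumed: each thread touches 4 consecutive logical lines


def home_bank(phys_line):
    """Assumed static NUCA policy: lines are interleaved round-robin over banks."""
    return phys_line % NUM_NODES


def hops(a, b):
    """Manhattan distance between mesh nodes a and b (row-major node numbering)."""
    ax, ay = a % MESH_W, a // MESH_W
    bx, by = b % MESH_W, b // MESH_W
    return abs(ax - bx) + abs(ay - by)


def default_layout(logical_line):
    """Baseline: logical line i is stored at physical line i."""
    return logical_line


def customized_layout(logical_line):
    """Place the j-th line of thread t at physical line j * NUM_NODES + t,
    so that under the assumed interleaving its home bank is bank t."""
    thread, local = divmod(logical_line, LINES_PER_THREAD)
    return local * NUM_NODES + thread


def avg_distance(layout):
    """Average hop count from each thread's core to the home banks of its lines.
    Thread t is assumed to run on core t."""
    total = 0
    count = 0
    for thread in range(NUM_NODES):
        for local in range(LINES_PER_THREAD):
            logical = thread * LINES_PER_THREAD + local
            total += hops(thread, home_bank(layout(logical)))
            count += 1
    return total / count


if __name__ == "__main__":
    print(avg_distance(default_layout))     # 1.0
    print(avg_distance(customized_layout))  # 0.0
```

Under these assumptions, the baseline layout spreads each thread's lines over all four banks (average distance 1.0 hops), whereas the remapped layout keeps every line in the bank local to the accessing core (0.0 hops). A dynamic NUCA policy would migrate lines at run time and is not modeled here.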

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 2011

519 pages

ISBN:9781450310536

DOI:10.1145/2155620

Conference Chair: Carlo Galuzzi (Technische Universiteit Delft, The Netherlands)
General Chair: Luigi Carro (Universidade Federal do Rio Grande do Sul, Brazil)
Program Chairs: Andreas Moshovos (University of Toronto, Canada), Milos Prvulovic (Georgia Institute of Technology, United States)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE
ACM: Association for Computing Machinery
UFRGS: Universidade Federal do Rio Grande do Sul
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MICRO-44

MICRO-44: The 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 3 - 7, 2011

Porto Alegre, Brazil

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

Total Citations: 17
Total Downloads: 459
Downloads (last 12 months): 8
Downloads (last 6 weeks): 0

Reflects downloads up to 12 Jan 2025
