DOI: 10.1145/2155620.2155677
research-article

A data layout optimization framework for NUCA-based multicores

Published: 03 December 2011

Abstract

Future multicore architectures are likely to include a large number of cores connected by an on-chip network with Non-uniform Cache Access (NUCA). In such architectures, whether a data request is satisfied from a local cache or a remote cache can make an important difference. To exploit this NUCA property, prior research explored both architectural enhancements and compiler-based code optimization strategies. In this work, we take an alternate view and explore data layout optimizations to improve the locality of data accesses in a NUCA-based system. Our proposed approach includes three steps: array tiling, computation-to-core mapping, and layout customization. The first step tries to identify the affinity between data and computation, taking into account parallelization information, with the goal of minimizing remote accesses. The second step maps computations (and their associated data) to cores with the goal of minimizing average distance-to-data, and the last step further customizes the memory layout, taking into account the data placement policy adopted by the underlying architecture. We evaluated the success of this three-step approach in enhancing on-chip cache behavior using all application programs from the SPECOMP suite on a full-system simulator. Our results show that, on average, the proposed approach improves data access latency and execution time by 24.7% and 18.4%, respectively, in the case of static NUCA, and by 18.1% and 12.7%, respectively, in the case of dynamic NUCA.
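
To make the notion of "distance-to-data" and the effect of layout customization more concrete, the following is a minimal toy sketch, not the algorithm from the paper. Everything in it is an assumption made for illustration: a 4x4 mesh with one core and one L2 bank per node, a static-NUCA policy that interleaves cache lines across banks (line index modulo bank count), Manhattan hop distance, and a block-parallelized loop in which each core touches one contiguous chunk of an array. The function names (hops, home_bank, avg_distance, original, customized) are hypothetical.

```python
"""Toy model of distance-to-data in a static-NUCA mesh (illustrative only)."""

MESH = 4                  # assumed 4x4 mesh: one core and one L2 bank per node
NODES = MESH * MESH
N = 64                    # assumed array size, measured in cache lines
CHUNK = N // NODES        # lines each core touches under block parallelization


def hops(a: int, b: int) -> int:
    """Manhattan distance between mesh nodes a and b."""
    ar, ac = divmod(a, MESH)
    br, bc = divmod(b, MESH)
    return abs(ar - br) + abs(ac - bc)


def home_bank(line: int) -> int:
    """Assumed static-NUCA policy: cache lines are interleaved across banks."""
    return line % NODES


def avg_distance(layout) -> float:
    """Average hops from each core to the home banks of the lines it accesses.

    layout(core, k) returns the physical line index of the k-th line
    accessed by `core`.
    """
    total = sum(
        hops(core, home_bank(layout(core, k)))
        for core in range(NODES)
        for k in range(CHUNK)
    )
    return total / N


def original(core: int, k: int) -> int:
    # Default layout: each core's chunk is contiguous in memory, so the
    # interleaving policy spreads it over several (mostly remote) banks.
    return core * CHUNK + k


def customized(core: int, k: int) -> int:
    # Customized layout: permute lines so that, under the interleaving
    # policy above, every line a core accesses is homed at that core's node.
    return k * NODES + core


if __name__ == "__main__":
    print("original layout  :", avg_distance(original))
    print("customized layout:", avg_distance(customized))
```

In this toy setting the customized layout is a permutation of the original line indices whose home bank always coincides with the accessing core, so its average distance is zero, while the contiguous layout incurs remote accesses. Real programs share data between cores and real placement policies are more involved, which is why the abstract describes separate tiling and mapping steps before layout customization.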

    Published In

    MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2011
    519 pages
    ISBN:9781450310536
    DOI:10.1145/2155620
    Conference Chair: Carlo Galuzzi
    General Chair: Luigi Carro
    Program Chairs: Andreas Moshovos, Milos Prvulovic
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. NUCA
    2. NoC
    3. data layout optimization
    4. multicore

    Qualifiers

    • Research-article

    Conference

    MICRO-44
    Acceptance Rates

    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Article Metrics

    • Downloads (last 12 months): 8
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 12 Jan 2025

    Citations

    Cited By

    • (2024) Automatically reasoning about how systems code uses the CPU cache. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, pp. 581-598. DOI: 10.5555/3691938.3691969. Online publication date: 10-Jul-2024
    • (2023) NUBA: Non-Uniform Bandwidth GPUs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 544-559. DOI: 10.1145/3575693.3575745. Online publication date: 27-Jan-2023
    • (2022) TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00085. Online publication date: Nov-2022
    • (2020) D-SOAP: Dynamic Spatial Orientation Affinity Prediction for Caching in Multi-Orientation Memory Systems. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 581-595. DOI: 10.1109/MICRO50266.2020.00055. Online publication date: Oct-2020
    • (2019) Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture, pp. 411-423. DOI: 10.1145/3307650.3322235. Online publication date: 22-Jun-2019
    • (2019) Make the Most out of Last Level Cache in Intel Processors. Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-17. DOI: 10.1145/3302424.3303977. Online publication date: 25-Mar-2019
    • (2018) MDACache. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 841-854. DOI: 10.1109/MICRO.2018.00073. Online publication date: 20-Oct-2018
    • (2017) Cache Partitioning + Loop Tiling: A Methodology for Effective Shared Cache Management. 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 477-482. DOI: 10.1109/ISVLSI.2017.89. Online publication date: Jul-2017
    • (2016) An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor. 2016 IEEE 34th International Conference on Computer Design (ICCD), pp. 9-16. DOI: 10.1109/ICCD.2016.7753255. Online publication date: Oct-2016
    • (2014) CApRI. ACM SIGMETRICS Performance Evaluation Review 42:1, pp. 477-489. DOI: 10.1145/2637364.2591992. Online publication date: 16-Jun-2014
