Array Interleaving—An Energy-Efficient Data Layout Transformation

Authors:

Preeti Ranjan Panda,

Francky Catthoor,

Praveen Raghavan,

Tom Vander Aa

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 20, Issue 3

Article No.: 44, Pages 1 - 26

https://doi.org/10.1145/2747875

Published: 24 June 2015

Abstract

Optimizations related to memory accesses and data storage make a significant difference to the performance and energy of a wide range of data-intensive applications. These techniques need to evolve with modern architectures supporting wide memory accesses. We investigate array interleaving, a data layout transformation technique that achieves energy efficiency by combining the storage of data elements from multiple arrays in contiguous locations, in an attempt to exploit spatial locality. The transformation reduces the number of memory accesses by loading the right set of data into vector registers, thereby minimizing redundant memory fetches. We perform a global analysis of array accesses, and account for possibly different array behavior in different loop nests that might ultimately lead to changes in data layout decisions for the same array across program regions. Our technique relies on detailed estimates of the savings due to interleaving, and also the cost of performing the actual data layout modifications. We also account for the vector register widths and the possibility of choosing the appropriate granularity for interleaving. Experiments on several benchmarks show a 6% to 34% reduction in memory energy due to the strategy.
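
To make the idea concrete, the following is a minimal sketch of why interleaving two arrays can reduce the number of wide memory fetches. It is a toy model and not the analysis or cost model used in the article: the names `interleave` and `count_wide_fetches`, the array size, the stride, the fetch width of two words, and the assumption that fetched data is not reused across loop iterations are all hypothetical choices made for illustration.

```python
def interleave(a, b):
    """Store a[i] and b[i] in adjacent locations: [a0, b0, a1, b1, ...]."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    merged = []
    for x, y in zip(a, b):
        merged.extend((x, y))
    return merged


def count_wide_fetches(addresses_per_iteration, width):
    """Count aligned fetches of `width` words.

    Assumption (hypothetical model): each loop iteration fetches every
    distinct aligned line it touches, with no reuse across iterations.
    """
    total = 0
    for addrs in addresses_per_iteration:
        total += len({addr // width for addr in addrs})
    return total


N = 16       # elements per array (illustrative)
STRIDE = 4   # the loop reads a[i] and b[i] for i = 0, 4, 8, 12
WIDTH = 2    # one wide fetch returns two contiguous words

# Separate layout: array a occupies words 0..N-1, array b occupies N..2N-1.
separate = [(i, N + i) for i in range(0, N, STRIDE)]

# Interleaved layout: a[i] is at word 2*i and b[i] is at word 2*i + 1.
interleaved = [(2 * i, 2 * i + 1) for i in range(0, N, STRIDE)]

separate_fetches = count_wide_fetches(separate, WIDTH)
interleaved_fetches = count_wide_fetches(interleaved, WIDTH)

print("separate layout fetches:   ", separate_fetches)
print("interleaved layout fetches:", interleaved_fetches)
```

In this model the separate layout needs two fetches per iteration, each returning one unused word, whereas the interleaved layout places both operands in the same aligned line. Whether interleaving pays off in a real program also depends on the cost of rearranging the data and on how each loop nest accesses the arrays, which is what the abstract's global analysis addresses.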

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 20, Issue 3

June 2015

345 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/2796316

Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 24 June 2015

Accepted: 01 March 2015

Revised: 01 January 2015

Received: 01 July 2014

Published in TODAES Volume 20, Issue 3
