Array Interleaving—An Energy-Efficient Data Layout Transformation

Published: 24 June 2015

Abstract

Optimizations related to memory accesses and data storage make a significant difference to the performance and energy of a wide range of data-intensive applications. These techniques need to evolve with modern architectures supporting wide memory accesses. We investigate array interleaving, a data layout transformation technique that achieves energy efficiency by combining the storage of data elements from multiple arrays in contiguous locations, in an attempt to exploit spatial locality. The transformation reduces the number of memory accesses by loading the right set of data into vector registers, thereby minimizing redundant memory fetches. We perform a global analysis of array accesses, and account for possibly different array behavior in different loop nests that might ultimately lead to changes in data layout decisions for the same array across program regions. Our technique relies on detailed estimates of the savings due to interleaving, and also the cost of performing the actual data layout modifications. We also account for the vector register widths and the possibility of choosing the appropriate granularity for interleaving. Experiments on several benchmarks show a 6% to 34% reduction in memory energy due to the strategy.
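
The following is a minimal illustrative sketch of the idea described in the abstract, not the authors' algorithm or cost model. It assumes a toy memory in which every wide access fetches one aligned line of `width` elements, and it counts only how many distinct lines a loop touches; energy, layout-conversion cost, and the global analysis across loop nests are not modeled. All function names (`interleave`, `interleaved_index`, `lines_touched`, `compare_layouts`) are hypothetical.

```python
def interleave(arrays, granularity=1):
    """Merge equal-length arrays into one list, taking `granularity`
    consecutive elements from each array in turn."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "arrays must have equal length"
    assert n % granularity == 0, "length must be a multiple of granularity"
    out = []
    for start in range(0, n, granularity):
        for a in arrays:
            out.extend(a[start:start + granularity])
    return out


def interleaved_index(k, i, num_arrays, granularity=1):
    """Position of element i of array k inside the interleaved layout."""
    block, offset = divmod(i, granularity)
    return (block * num_arrays + k) * granularity + offset


def lines_touched(addresses, width):
    """Number of distinct aligned lines of `width` elements that cover
    the given element addresses (toy stand-in for wide memory fetches)."""
    return len({addr // width for addr in addresses})


def compare_layouts(n, width, stride):
    """Count lines touched by a loop that reads a[i] and b[i] for
    i = 0, stride, 2*stride, ... under two layouts.

    Separate layout (assumption): a occupies addresses 0..n-1 and
    b occupies addresses n..2n-1.
    Interleaved layout: a[0], b[0], a[1], b[1], ...
    """
    iters = range(0, n, stride)
    separate = [i for i in iters] + [n + i for i in iters]
    merged = [interleaved_index(k, i, 2) for i in iters for k in (0, 1)]
    return lines_touched(separate, width), lines_touched(merged, width)


if __name__ == "__main__":
    a = [0, 1, 2, 3]
    b = [10, 11, 12, 13]
    print(interleave([a, b]))        # element-wise interleaving
    print(interleave([a, b], 2))     # coarser granularity of 2
    print(compare_layouts(16, 4, 4)) # strided loop: (separate, interleaved)
    print(compare_layouts(16, 4, 1)) # unit-stride loop
```

In this toy model, the strided loop touches 8 lines with separate arrays and 4 lines with the interleaved layout, because each fetched line then carries both operands of an iteration. The unit-stride loop touches 8 lines either way, which hints at why the decision depends on the access pattern of each loop nest and why an estimate of savings against conversion cost is needed.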



Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 20, Issue 3
June 2015
345 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/2796316
Editor: Naehyuck Chang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2015
Accepted: 01 March 2015
Revised: 01 January 2015
Received: 01 July 2014
Published in TODAES Volume 20, Issue 3


Author Tags

  1. Data layout
  2. SIMD architecture
  3. Memory energy optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 01 Feb 2025

Cited By

  • (2024) R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA. ACM Transactions on Reconfigurable Technology and Systems 17:2 (1-34). DOI: 10.1145/3656642. Online publication date: 8-Apr-2024
  • (2019) A methodology correlating code optimizations with data memory accesses, execution time and energy consumption. The Journal of Supercomputing 75:10 (6710-6745). DOI: 10.1007/s11227-019-02880-z. Online publication date: 1-Oct-2019
  • (2017) Automated Dynamic Data Redistribution. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (1208-1215). DOI: 10.1109/IPDPSW.2017.17. Online publication date: May-2017
  • (2017) Optimizations of the Whole Function Vectorization Based on SIMD Characteristics. Parallel Architecture, Algorithm and Programming (152-171). DOI: 10.1007/978-981-10-6442-5_14. Online publication date: 6-Oct-2017
  • (2016) Integrated Exploration Methodology for Data Interleaving and Data-to-Memory Mapping on SIMD Architectures. ACM Transactions on Embedded Computing Systems 15:3 (1-23). DOI: 10.1145/2894754. Online publication date: 23-May-2016
  • (2016) Data Flow Transformation for Energy-Efficient Implementation of Givens Rotation-Based QRD. ACM Transactions on Embedded Computing Systems (TECS) 15:1 (1-23). DOI: 10.1145/2837025. Online publication date: 13-Jan-2016
  • (2015) Energy efficient FFT implementation through stage skipping and merging. Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis (153-162). DOI: 10.5555/2830840.2830857. Online publication date: 4-Oct-2015
  • (2015) Energy efficient FFT implementation through stage skipping and merging. 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) (153-162). DOI: 10.1109/CODESISSS.2015.7331378. Online publication date: Oct-2015
