Article

Optimizing the memory bandwidth with loop fusion

Authors:

José Ignacio Gómez,

Francky Catthoor

CODES+ISSS '04: Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis

Pages 188 - 193

https://doi.org/10.1145/1016720.1016767

Published: 08 September 2004

Abstract

The memory bandwidth largely determines the performance and energy cost of embedded systems. At the compiler level, several techniques improve the memory bandwidth within the scope of a basic block, but they often fail to exploit all of the available bandwidth. We propose a technique to optimize the memory bandwidth across the boundaries of a basic block. Our technique incrementally fuses loops to make better use of the available bandwidth. The resulting performance depends on how the data is assigned to the memories of the memory layer. At the same time, the assignment also strongly influences the energy cost. Therefore, our approach combines the fusion and assignment decisions. Designers can use our output to trade off the energy cost against the system's performance.
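The interplay between fusion and assignment can be illustrated with a toy cost model. The sketch below is not the authors' algorithm. It assumes single-port memories, one access per memory per cycle, full overlap of accesses to different memories, and two independent loops that are legal to fuse. The names `body_cycles`, `total_cycles`, `M0`, and `M1` are hypothetical.

```python
from collections import Counter

def body_cycles(accesses, assignment):
    """Memory cycles for one loop body under the toy model:
    accesses to the same single-port memory serialize,
    accesses to different memories overlap completely."""
    per_memory = Counter(assignment[array] for array in accesses)
    return max(per_memory.values())

def total_cycles(loops, assignment, n):
    """Total memory cycles for a sequence of loops of n iterations each."""
    return sum(n * body_cycles(accesses, assignment) for accesses in loops)

N = 1000
unfused = [["a"], ["b"]]  # loop 1 reads a[i]; loop 2 reads b[i]
fused = [["a", "b"]]      # a single loop reads both a[i] and b[i]

split = {"a": "M0", "b": "M1"}   # arrays assigned to different memories
shared = {"a": "M0", "b": "M0"}  # both arrays assigned to the same memory

cycles_unfused = total_cycles(unfused, split, N)
cycles_fused_split = total_cycles(fused, split, N)
cycles_fused_shared = total_cycles(fused, shared, N)
```

Under these assumptions, fusion halves the memory cycles only when the two arrays sit in different memories. With both arrays in one memory the fused loop is no faster than the unfused pair, which is why the fusion and assignment decisions are coupled.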

Index Terms

Optimizing the memory bandwidth with loop fusion
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information & Contributors

Information

Published In

CODES+ISSS '04: Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis

September 2004

266 pages

ISBN: 1581139373

DOI:10.1145/1016720

General Chairs:
Alex Orailoglu, University of California, San Diego, CA
Pai H. Chou, University of California, Irvine, CA

Program Chairs:
Petru Eles, Linköping University, Sweden
Axel Jantsch, Royal Institute of Technology, Sweden

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CODES/ISSS04

Sponsor:

CODES/ISSS04: Second IEEE/ACM/IFIP International CODES/ ISSS 2004 Merged Conference

September 8 - 10, 2004

Stockholm, Sweden

Acceptance Rates

Overall Acceptance Rate 280 of 864 submissions, 32%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 18
Total Downloads: 399
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Jan 2025

Citations

Cited By

Ali Khan M. (2020). THREE LEVELS EFFECTIVE MEMORY ACCESS OPTIMIZATION ADDRESSING HIGH LATENCY ISSUES IN MODERN MEMORY DEPENDENT SYSTEMS. JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES, 15:8. Online publication date: 18-Aug-2020.
https://doi.org/10.26782/jmcms.2020.08.00051
Tolubaeva M., Yan Y., Chapman B. (2014). Compile Time Modeling of Off-Chip Memory Bandwidth for Parallel Loops. Languages and Compilers for Parallel Computing, pages 292-306. Online publication date: 1-Oct-2014.
https://doi.org/10.1007/978-3-319-09967-5_17
Khan S. (2012). Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management. International Journal of Computer Theory and Engineering, pages 897-901. Online publication date: 2012.
https://doi.org/10.7763/IJCTE.2012.V4.602
Ozturk O. (2011). Data locality and parallelism optimization using a constraint-based approach. Journal of Parallel and Distributed Computing, 71:2, pages 280-287. Online publication date: 1-Feb-2011.
https://dl.acm.org/doi/10.1016/j.jpdc.2010.08.005
Unnikrishnan P., Chen G., Kandemir M., Karakoy M., Kolcu I. (2009). Reducing memory requirements of resource-constrained applications. ACM Transactions on Embedded Computing Systems, 8:3, pages 1-37. Online publication date: 22-Apr-2009.
https://dl.acm.org/doi/10.1145/1509288.1509289
Khan S., Shin H. (2009). Effective memory access optimization by memory delay modeling, memory allocation, and buffer allocation. 2009 International SoC Design Conference (ISOCC), pages 153-156. Online publication date: Nov-2009.
https://doi.org/10.1109/SOCDC.2009.5423893
Girodias B., Bouchebaba Y., Nicolescu G., Aboulhamid E., Paulin P., Lavigueur B. (2009). Multiprocessor, Multithreading and Memory Optimization for On-Chip Multimedia Applications. Journal of Signal Processing Systems, 57:2, pages 263-283. Online publication date: 1-Nov-2009.
https://dl.acm.org/doi/10.1007/s11265-008-0293-4
Bouchebaba Y., Girodias B., Nicolescu G., Aboulhamid E., Lavigueur B., Paulin P. (2007). MPSoC memory optimization using program transformation. ACM Transactions on Design Automation of Electronic Systems, 12:4, pages 43-es. Online publication date: 1-Sep-2007.
https://dl.acm.org/doi/10.1145/1278349.1278356
Naci S. (2007). Optimizing Inter-Nest Data Locality Using Loop Splitting and Reordering. 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1-8. Online publication date: Mar-2007.
https://doi.org/10.1109/IPDPS.2007.370399
Bouchebaba Y., Bensoudane E., Lavigueur B., Paulin P., Nicolescu G. (2007). Two-level tiling for MPSoC architecture. 2007 IEEE International Conf. on Application-specific Systems, Architectures and Processors (ASAP), pages 314-319. Online publication date: Jul-2007.
https://doi.org/10.1109/ASAP.2007.4429999