Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors

Authors: Murali Jayapala, Francisco Barat, Francky Catthoor, Henk Corporaal, Geert Deconinck

IEEE Transactions on Computers, Volume 54, Issue 6

Pages 672 - 683

https://doi.org/10.1109/TC.2005.92

Published: 01 June 2005

Abstract

Current loop buffer organizations for very long instruction word (VLIW) processors are essentially centralized. As a consequence, they are energy inefficient and their scalability is limited. To alleviate this problem, we propose a clustered loop buffer organization, where the loop buffers are partitioned and functional units are logically grouped to form clusters, along with two schemes for buffer control which regulate the activity in each cluster. Furthermore, we propose a design-time scheme to generate clusters by analyzing an application profile and grouping closely related functional units. The simulation results indicate that the energy consumed in the clustered loop buffers is, on average, 63 percent lower than the energy consumed in an uncompressed centralized loop buffer scheme, 35 percent lower than a centralized compressed loop buffer scheme, and 22 percent lower than a randomly clustered loop buffer scheme.
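
As a rough illustration of the profile-driven grouping described in the abstract, the sketch below greedily merges the functional units that are most often active in the same cycles. The profile format, the co-activation measure, the unit names, and the functions `coactivation` and `greedy_clusters` are all assumptions made for illustration; the abstract does not specify the authors' actual clustering algorithm or energy model.

```python
from itertools import combinations


def coactivation(profile):
    """Count, for each pair of units, the cycles in which both are active.

    `profile` is a hypothetical trace: one set of active unit names per cycle.
    """
    counts = {}
    for active in profile:
        for a, b in combinations(sorted(active), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def greedy_clusters(units, profile, num_clusters):
    """Merge the two most co-active clusters until `num_clusters` remain."""
    counts = coactivation(profile)
    clusters = [frozenset([u]) for u in units]

    def affinity(c1, c2):
        # Total co-activation over all cross-cluster unit pairs.
        return sum(counts.get(tuple(sorted((a, b))), 0) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: affinity(*p))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


# Hypothetical four-unit profile: ALU0/ALU1 and MUL/LDST tend to fire together.
profile = [
    {"ALU0", "ALU1"},
    {"ALU0", "ALU1"},
    {"MUL", "LDST"},
    {"MUL", "LDST"},
    {"ALU0", "MUL"},
]
clusters = greedy_clusters(["ALU0", "ALU1", "MUL", "LDST"], profile, 2)
print(clusters)
```

Grouping units that are usually active together means a cluster's loop buffer partition can stay idle whenever none of its units has work, which is the kind of per-cluster activity regulation the abstract attributes to the buffer control schemes. A randomly chosen grouping, the baseline the abstract compares against, would not have this property by construction.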

Index Terms

Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors
1. Computer systems organization
2. Hardware
  1. Hardware validation
    1. Functional verification
      1. Simulation and emulation
  2. Integrated circuits
    1. Semiconductor memory

Published In

IEEE Transactions on Computers, Volume 54, Issue 6 (June 2005), 143 pages

ISSN: 0018-9340

Copyright © 2005.

Publisher

IEEE Computer Society, United States
