research-article

Compiling for an indirect vector register architecture

Authors: Mircea Namolaru, Jeff H. Derby

CF '08: Proceedings of the 5th conference on Computing frontiers

Pages 199 - 208

https://doi.org/10.1145/1366230.1366266

Published: 05 May 2008

Abstract

The iVMX architecture contains a novel vector register file of up to 4096 vector registers accessed indirectly via a mapping mechanism, providing compatibility with the VMX architecture and the potential for dramatic performance benefits [7]. Using the large number of vector registers and the unique indirection mechanism efficiently poses compilation challenges: the indirection mechanism makes the spatial locality of registers, and the interaction among destination and source operands, central concerns during register allocation, and the many vector registers call for aggressive automatic vectorization.
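To make "accessed indirectly via a mapping mechanism" concrete, here is a toy model. It is only a sketch under stated assumptions: a small set of register names (32, as in VMX) is translated through a map table into a physical file of 4096 registers. The class name `IndirectRegisterFile`, the single shared map table, the identity map at reset, and the update counter are all hypothetical and are not taken from the iVMX design described in [7].

```python
# Toy model only. The map layout and cost accounting are assumptions,
# not the actual iVMX mechanism.
PHYSICAL_REGS = 4096  # upper bound quoted in the abstract
ARCH_NAMES = 32       # VMX exposes 32 vector register names


class IndirectRegisterFile:
    def __init__(self):
        self.physical = [None] * PHYSICAL_REGS
        # Assumed reset state: name i refers to physical register i, so code
        # that never touches the map sees a plain 32-register file.
        self.map_table = list(range(ARCH_NAMES))
        self.map_updates = 0  # stands in for map-management instructions

    def remap(self, name, physical_index):
        """Point an architected name at a different physical register."""
        if not 0 <= name < ARCH_NAMES:
            raise IndexError("no such register name")
        if not 0 <= physical_index < PHYSICAL_REGS:
            raise IndexError("no such physical register")
        if self.map_table[name] != physical_index:
            self.map_table[name] = physical_index
            self.map_updates += 1

    def write(self, name, value):
        self.physical[self.map_table[name]] = value

    def read(self, name):
        return self.physical[self.map_table[name]]
```

In this model every `remap` that changes an entry counts as overhead. This hints at why an allocator would want values that are used together to sit in physical registers reachable without frequent map changes.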

This work is a first step in addressing the compilability of iVMX, following the presentation and validation of its architectural aspects [7]. In this paper we present several compilation approaches for dealing with the mapping mechanism, and an outer-loop vectorization transformation developed to promote the use of many vector registers. We modified an existing register allocator to target all available registers and added a post-pass that renames live ranges, taking into account spatial locality and the interaction among operand types. An FIR filter is used to demonstrate the effectiveness of these techniques against a version hand-optimized for iVMX. Initial results show that we can reduce the overhead of map management to 29% of the total instruction count, compared with 22% for the hand-optimized version and 49% for a naive scheme, while outperforming an equivalent VMX implementation by a factor of 2.
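The following sketch shows the general idea of outer-loop vectorization on an FIR filter. It is not the paper's transformation or its generated code. Plain Python lists stand in for vector registers, and the vector length `VL = 4`, the correlation-style indexing `x[n + k]`, and both function names are assumptions made for illustration.

```python
VL = 4  # assumed number of elements per vector register


def fir_scalar(x, h):
    """Reference FIR: the inner loop over taps is a reduction."""
    n_out = len(x) - len(h) + 1
    y = []
    for n in range(n_out):            # outer loop over outputs
        acc = 0
        for k in range(len(h)):       # inner loop over taps
            acc += h[k] * x[n + k]
        y.append(acc)
    return y


def fir_outer_vectorized(x, h, vl=VL):
    """Vectorize across outputs: each lane accumulates one y[n]."""
    n_out = len(x) - len(h) + 1
    y = [0] * max(n_out, 0)
    n = 0
    while n + vl <= n_out:
        acc = [0] * vl                        # one accumulator "register"
        for k in range(len(h)):
            window = x[n + k:n + k + vl]      # vl consecutive inputs
            # Broadcast h[k] to all lanes, then multiply-accumulate.
            acc = [a + h[k] * w for a, w in zip(acc, window)]
        y[n:n + vl] = acc
        n += vl
    for m in range(n, n_out):                 # scalar epilogue for leftovers
        y[m] = sum(h[k] * x[m + k] for k in range(len(h)))
    return y
```

Vectorizing across outputs keeps each accumulator in a register for the whole tap loop and avoids a cross-lane reduction. Processing several output blocks at once would need more live accumulators, which is one way a large register file could be put to use.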

References

[1]

R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Trans. on Programming Languages and Systems, 9(4):491--542, October 1987.

[2]

R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.

[3]

Aart Bik. The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance. Intel Press, 2004.

[4]

D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI, pages 53--65, June 1990.

[5]

D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI, pages 53--65, June 1990.

[6]

D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI, pages 53--65, June 1990.

[7]

J. H. Derby, R. K. Montoye, and J. Moreira. Victoria - a VMX indirect compute technology oriented towards in-line acceleration. In Computing Frontiers, May 2006.

[8]

J. H. Derby and J. H. Moreno. A high-performance embedded DSP core with novel SIMD features. In ICASSP, 2003.

[9]

Free Software Foundation. gcc.gnu.org/projects/tree-ssa/vectorization.html.

[10]

Freescale Semiconductor, http://www.freescale.com. AltiVec real FIR, October 2002.

[11]

H. C. Hunter and J. H. Moreno. A new look at exploiting data parallelism in embedded systems. In CASES, pages 159--169, October 2003.

[12]

S. Kim and S. Moon. Rotating register allocation for enhanced pipeline scheduling. In PACT, 2007.

[13]

J. H. Moreno, V. Zyuban, U. Shvadron, F. Neeser, J. Derby, M. Ware, K. Kailas, A. Zaks, A. Geva, S. Ben-David, S. Asaad, T. Fox, M. Biberstein, D. Naishlos, and H. Hunter. An innovative low-power high-performance programmable signal processor for digital communications. IBM J. of R&D, March 2003.

[14]

S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[15]

D. Naishlos. Autovectorization in GCC. In GCC Developers' Summit, pages 105--118, June 2004.

[16]

D. Naishlos, M. Biberstein, S. Ben-David, and A. Zaks. Vectorizing for a SIMdD DSP architecture. In CASES, pages 2--11, October 2003.

[17]

M. Namolaru. Register allocation techniques for iVMX architecture. In Int'l Workshop on GCC for Research in Embedded and Parallel Systems, September 2007.

[18]

D. Nuzman and A. Zaks. Autovectorization in GCC - two years later. In GCC Developers' Summit, June 2006.

[19]

P. R. Panda, N. D. Dutt, and A. Nicolau. Efficient utilization of scratch-pad memory in embedded processor applications. In European Design and Test Conf., March 1997.

[20]

M. Postiff. Compiler and Microarchitecture Mechanisms for Exploiting Registers to Improve Memory Performance. PhD thesis, U. of Michigan, 2001.

[21]

R. G. Scarborough and H. G. Kolsky. A vectorizing Fortran compiler. IBM J. of R&D, 30(2):163--171, March 1986.

[22]

J. Shin, J. Chame, and M. W. Hall. Compiler-controlled caching in superword register files for multimedia extension architectures. In PACT, September 2002.

[23]

N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. Int'l J. of Parallel Programming, 28(4):363--400, August 2000.

[24]

C. Tenllado, L. Piñuel, M. Prieto, and F. Catthoor. Pack transposition: Enhancing superword level parallelism exploitation. In Parallel Computing, 2005.

[25]

C. Tenllado, L. Piñuel, M. Prieto, F. Tirado, and F. Catthoor. Improving superword level parallelism support in modern compilers. In CODES+ISSS, 2005.

[26]

M. Wolfe. High Performance Compilers for Parallel Computing. Addison Wesley, 1996.

[27]

P. Wu, A. E. Eichenberger, A. Wang, and P. Zhao. An integrated simdization framework using virtual vectors. In ICS, June 2005.


Index Terms

Compiling for an indirect vector register architecture
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Very long instruction word
    2. Serial architectures
      1. Complex instruction set computing
      2. Reduced instruction set computing
2. Software and its engineering
  1. Software notations and tools
    1. Compilers


Published In

CF '08: Proceedings of the 5th conference on Computing frontiers

May 2008

334 pages

ISBN:9781605580777

DOI:10.1145/1366230

General Chair: Alex Ramirez (UPC, Spain)

Program Chairs: Gianfranco Biliardi (University of Padova, Italy), Michael Gschwind (IBM TJ Watson Research Center, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF '08: Computing Frontiers Conference, May 5 - 7, 2008, Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics

Article Metrics

Total Citations: 5
Total Downloads: 339
Downloads (last 12 months): 7
Downloads (last 6 weeks): 1

Reflects downloads up to 15 Oct 2024

Citations

Cited By

Raghavan P, Catthoor F (2012). Storage Allocation for Streaming-Based Register File. Energy-Aware Memory Management for Embedded Multimedia Systems, pp. 151-194. Online publication date: 4-Jan-2012.
https://doi.org/10.1201/b11418-6

Jaeger J, Barthou D (2012). Automatic efficient data layout for multithreaded stencil codes on CPUs and GPUs. 2012 19th International Conference on High Performance Computing, pp. 1-10. Online publication date: Dec-2012.
https://doi.org/10.1109/HiPC.2012.6507504

Munk H, Ayguadé E, Bastoul C, Carpenter P, Chamski Z, Cohen A, Cornero M, Dumont P, Duranton M, Fellahi M, Ferrer R, Ladelsky R, Lindwer M, Martorell X, Miranda C, Nuzman D, Ornstein A, Pop A, Pop S, Pouchet L, Ramírez A, Ródenas D, Rohou E, Rosen I, Shvadron U, Trifunović K, Zaks A (2010). ACOTES Project: Advanced Compiler Technologies for Embedded Streaming. International Journal of Parallel Programming, 39:3, pp. 397-450. Online publication date: 20-Apr-2010.
https://doi.org/10.1007/s10766-010-0132-7

Raghavan P, Catthoor F, Rosenstiel W, Wakabayashi K (2009). SARA. Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, pp. 41-50. Online publication date: 11-Oct-2009.
https://dl.acm.org/doi/10.1145/1629435.1629442

Nuzman D, Zaks A, Moshovos A, Tarditi D, Olukotun K (2008). Outer-loop vectorization. Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 2-11. Online publication date: 25-Oct-2008.
https://dl.acm.org/doi/10.1145/1454115.1454119
