Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware

Authors:

Wonyong Sung

CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems

Pages 167 - 176

https://doi.org/10.1145/1450095.1450121

Published: 19 October 2008

Abstract

Automatic vectorization of programs for partitioned-ALU SIMD (Single Instruction Multiple Data) processors has been difficult because of data dependency issues as well as non-aligned and irregular data access problems. A non-aligned or irregular data access operation incurs many overhead cycles for data alignment, which complicates efficient code generation and hinders automatic vectorization. In this paper, we employ two special memory access hardware units to improve the performance of SIMD processors: the split line buffer and the packing buffer. The former solves the non-aligned memory access problem, while the latter simplifies irregular and strided data access. Adding these hardware units requires only very small changes to the instruction set architecture, yet contributes significantly to performance by enabling more loops to be vectorized and by reducing overhead cycles. We have also developed an auto-vectorization compiler that utilizes these hardware units. Experiments comparing the proposed method with the conventional one show a 50% increase in the number of vectorized loops and a 77% increase in the total performance of an MPEG2 encoder program.
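The access patterns named in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the hardware described in the paper: the line width `LINE`, the names `aligned_load`, `unaligned_load_sw`, `SplitLineBufferModel`, and `pack_strided`, and the two-line retention policy of the buffer are all hypothetical. It shows why a non-aligned vector load costs extra aligned fetches in a conventional design, how retaining the previously fetched line could reduce that cost for sequential accesses, and what a packed strided access returns.

```python
# Toy model only: LINE, the function names, and the buffer policy are
# assumptions made for illustration, not details taken from the paper.
LINE = 4  # elements per aligned memory line (hypothetical SIMD width)


def aligned_load(mem, line_idx):
    """Fetch one whole aligned line, the only access a plain SIMD port offers."""
    start = line_idx * LINE
    return mem[start:start + LINE]


def unaligned_load_sw(mem, addr):
    """Conventional path: up to two aligned loads, then shift and merge.

    Returns the vector and the number of aligned loads it cost.
    """
    line_idx, offset = divmod(addr, LINE)
    lo = aligned_load(mem, line_idx)
    if offset == 0:
        return lo, 1
    hi = aligned_load(mem, line_idx + 1)
    return (lo + hi)[offset:offset + LINE], 2


class SplitLineBufferModel:
    """Hypothetical buffer that retains the lines touched by the last access."""

    def __init__(self, mem):
        self.mem = mem
        self.lines = {}   # line index -> cached line data
        self.fetches = 0  # aligned memory fetches actually issued

    def _line(self, idx):
        if idx not in self.lines:
            self.lines[idx] = aligned_load(self.mem, idx)
            self.fetches += 1
        return self.lines[idx]

    def load(self, addr):
        line_idx, offset = divmod(addr, LINE)
        # Keep only the (at most two) lines this access can use.
        wanted = (line_idx, line_idx + 1)
        self.lines = {k: v for k, v in self.lines.items() if k in wanted}
        lo = self._line(line_idx)
        if offset == 0:
            return list(lo)
        hi = self._line(line_idx + 1)
        return (lo + hi)[offset:offset + LINE]


def pack_strided(mem, base, stride):
    """Collect LINE elements spaced `stride` apart into one packed vector."""
    return [mem[base + i * stride] for i in range(LINE)]
```

In this model, two consecutive non-aligned loads at addresses 1 and 5 cost four aligned fetches on the conventional path but three through the buffer, because the line shared by both accesses is fetched only once.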

Index Terms

Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware
1. Hardware
  1. Hardware validation
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems

October 2008

274 pages

ISBN:9781605584690

DOI:10.1145/1450095

Program Chair: Erik Altman (IBM)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESWEEK 08: Fourth Embedded Systems Week

October 19 - 24, 2008

Atlanta, GA, USA

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Bibliometrics

Total Citations: 18
Total Downloads: 637
Downloads (last 12 months): 8
Downloads (last 6 weeks): 0

Reflects downloads up to 02 Feb 2025

Cited By

Seiculescu C (2021). Efficient Unaligned Memory Access of Tightly Packed Weights for Deep Neural Network Inference on Edge Devices. 2021 IEEE 27th International Symposium for Design and Technology in Electronic Packaging (SIITME), pp. 242-245. Online publication date: 27-Oct-2021. https://doi.org/10.1109/SIITME53254.2021.9663723
Gajaria D, Adegbija T (2021). Exploring Domain-Specific Architectures for Energy-Efficient Wearable Computing. Journal of Signal Processing Systems 94:6, pp. 559-577. Online publication date: 24-Jul-2021. https://doi.org/10.1007/s11265-021-01682-y
van den Brink A, Bekooij M (2020). Conflict-Free Vectorized In-order In-place Radix-r Belief Propagation Polar Code Decoder Algorithm. Proceedings of the 2020 8th International Conference on Communications and Broadband Networking, pp. 18-23. Online publication date: 15-Apr-2020. https://dl.acm.org/doi/10.1145/3390525.3390539
Vincze D (2018). Parallelization by Vectorization in Fuzzy Rule Interpolation Adapted to FRIQ-Learning. 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), pp. 131-136. Online publication date: Aug-2018. https://doi.org/10.1109/DISA.2018.8490614
Ye H, Gu N, Zhang X, Lin C (2017). An efficient conflict-free memory-addressing unit for SIMD VLIW DSP. 2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), pp. 1-7. Online publication date: Jul-2017. https://doi.org/10.23919/SPECTS.2017.8046778
Gao W, Han L, Zhao R, Li Y, Liu J (2017). Insufficient Vectorization: A New Method to Exploit Superword Level Parallelism. IEICE Transactions on Information and Systems E100.D:1, pp. 91-106. Online publication date: 2017. https://doi.org/10.1587/transinf.2016EDP7236
Ho N, Wong W (2017). Exploiting half precision arithmetic in Nvidia GPUs. 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. Online publication date: Sep-2017. https://doi.org/10.1109/HPEC.2017.8091072
Li Y, Gao Y, Wang D, Li Y, Xu J (2017). Optimizations of the Whole Function Vectorization Based on SIMD Characteristics. Parallel Architecture, Algorithm and Programming, pp. 152-171. Online publication date: 6-Oct-2017. https://doi.org/10.1007/978-981-10-6442-5_14
Li P, Zhao R, Zhang Q, Han L (2016). An SIMD Code Generation Technology for Indirect Array. International Journal of Computer Theory and Engineering 8:3, pp. 218-222. Online publication date: Jun-2016. https://doi.org/10.7763/IJCTE.2016.V8.1047
Sun H, Zhao R, Gao W, Gong Y, Li G (2015). Exploiting Pure Superword Level Parallelism for Array Indirections. Proceedings of the 2015 Seventh International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pp. 13-19. Online publication date: 12-Dec-2015. https://dl.acm.org/doi/10.1109/PAAP.2015.14
