
Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware

Published: 19 October 2008

Abstract

Automatic vectorization of programs for partitioned-ALU SIMD (Single Instruction, Multiple Data) processors has been difficult not only because of data dependencies but also because of non-aligned and irregular data accesses. A non-aligned or irregular data access incurs many overhead cycles for data alignment, which complicates efficient code generation and hinders automatic vectorization. In this paper, we employ two special memory access hardware units to improve the performance of SIMD processors: the split line buffer and the packing buffer. The former solves the non-aligned memory access problem, while the latter simplifies irregular and strided data access. Adding these hardware units requires only very small changes to the instruction set architecture, yet it yields a significant performance improvement by allowing more loops to be vectorized and by reducing overhead cycles. We have also developed an auto-vectorizing compiler that uses these hardware units. Experiments comparing the proposed method with the conventional one show a 50% increase in the number of vectorized loops and a 77% increase in the overall performance of an MPEG2 encoder program.
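
The alignment overhead described in the abstract can be illustrated with a small software model. The sketch below is not the paper's hardware or compiler; it only emulates, under assumed names (`VECTOR_WIDTH`, `aligned_load`, `unaligned_load`, `strided_load`) and an assumed vector width of 8 elements, what a conventional SIMD machine has to do when only aligned vector loads are available: a non-aligned load becomes two aligned loads plus a shift and merge, and a strided access becomes an element-by-element gather into one packed vector.

```python
VECTOR_WIDTH = 8  # elements per SIMD word; an assumed value for illustration


def aligned_load(memory, index):
    """Load one full vector that starts on a vector boundary."""
    assert index % VECTOR_WIDTH == 0, "aligned loads must start on a vector boundary"
    return memory[index:index + VECTOR_WIDTH]


def unaligned_load(memory, index):
    """Emulate a non-aligned load using two aligned loads and a shift/merge."""
    base = index - index % VECTOR_WIDTH   # aligned address at or below index
    offset = index - base                 # misalignment in elements
    low = aligned_load(memory, base)
    if offset == 0:
        return low                        # already aligned, one load is enough
    high = aligned_load(memory, base + VECTOR_WIDTH)
    # Keep the tail of the first vector and the head of the second.
    return low[offset:] + high[:offset]


def strided_load(memory, start, stride):
    """Gather VECTOR_WIDTH elements spaced `stride` apart into one packed vector."""
    return [memory[start + i * stride] for i in range(VECTOR_WIDTH)]


memory = list(range(64))
print(unaligned_load(memory, 5))   # elements 5 through 12
print(strided_load(memory, 1, 3))  # every third element starting at 1
```

In this model every misaligned access costs a second load and a merge step, and every strided access costs one scalar access per element. These are the kinds of overhead cycles that the split line buffer and the packing buffer are meant to reduce.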



      Published In

      CASES '08: Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
      October 2008
      274 pages
      ISBN:9781605584690
      DOI:10.1145/1450095

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. SIMD
      2. compiler
      3. irregular access
      4. non-aligned access
      5. packing buffer
      6. split line buffer
      7. vectorization

      Qualifiers

      • Research-article

      Conference

      ESWEEK 08: Fourth Embedded Systems Week
      October 19 - 24, 2008
      Atlanta, GA, USA

      Acceptance Rates

      Overall Acceptance Rate 52 of 230 submissions, 23%
