Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements

Authors:

Lizy Kurian John,

Doug Burger

IEEE Transactions on Computers, Volume 52, Issue 8

Pages 1015 - 1031

https://doi.org/10.1109/TC.2003.1223637

Published: 01 August 2003

Abstract

Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match the access patterns and loop structures of media programs very well. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of the SIMD execution units (only 1 to 12 percent of their peak throughput is achieved). Rather than focusing on exploiting more data-level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at the cost of a 10 percent increase in the area required by the MMX and SSE extensions (a 0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
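
To make the overhead figures in the abstract concrete, here is a minimal Python sketch. It assumes a hypothetical instruction mix for one iteration of a packed multiply-accumulate loop; the instruction names, the counts in `LOOP_BODY`, and both helper functions are illustrative assumptions, not data or code from the paper. The second helper only restates the abstract's area figures: if a 10 percent growth in SIMD area equals a 0.3 percent growth in chip area, the SIMD extensions occupy roughly 3 percent of the chip.

```python
from collections import Counter

# Hypothetical instruction mix for ONE iteration of a packed
# multiply-accumulate loop. Names and counts are assumptions chosen
# for illustration; they are not measurements from the paper.
LOOP_BODY = [
    ("load_a",      "support"),  # load packed operands from array a
    ("load_b",      "support"),  # load packed operands from array b
    ("unpack_a",    "support"),  # data reorganization before the multiply
    ("unpack_b",    "support"),
    ("simd_mul",    "useful"),   # true SIMD computation
    ("simd_add",    "useful"),   # true SIMD computation (accumulate)
    ("addr_inc_a",  "support"),  # address generation
    ("addr_inc_b",  "support"),
    ("loop_cmp",    "support"),  # loop control: compare
    ("loop_branch", "support"),  # loop control: branch
]


def overhead_fraction(body):
    """Fraction of instructions in `body` tagged as supporting work."""
    counts = Counter(kind for _, kind in body)
    return counts["support"] / sum(counts.values())


def simd_share_of_chip(chip_increase, simd_increase):
    """SIMD area as a fraction of chip area, given the same absolute
    area growth expressed relative to the chip and to the SIMD units."""
    return chip_increase / simd_increase


if __name__ == "__main__":
    print(f"supporting fraction: {overhead_fraction(LOOP_BODY):.2f}")
    print(f"SIMD share of chip:  {simd_share_of_chip(0.003, 0.10):.3f}")
```

With this assumed mix, 8 of the 10 instructions are supporting work, which sits inside the 75 to 85 percent range the abstract reports. Address generation, loop control, and data reorganization are the three categories the abstract names as targets for MediaBreeze hardware support.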

Published In

IEEE Transactions on Computers Volume 52, Issue 8

August 2003

112 pages

ISSN:0018-9340

Copyright © 2003 IEEE. All Rights Reserved.

Publisher

IEEE Computer Society

United States

Cited By

Davies M, McDougall I, Anandaraj S, Machchhar D, Jain R, Sankaralingam K, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 20-36. doi:10.1145/3620665.3640367. Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620665.3640367

Ghodrati S, Kinzer S, Xu H, Mahapatra R, Kim Y, Ahn B, Wang D, Karthikeyan L, Yazdanbakhsh A, Park J, Kim N, Esmaeilzadeh H, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 1165-1182. doi:10.1145/3620665.3640365. Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620665.3640365

Amiri H, Shahbahrami A (2020). SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing, 135:C, pp. 83-100. doi:10.1016/j.jpdc.2019.09.012. Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.1016/j.jpdc.2019.09.012

Fu S, Hong D, Liu Y, Wu J, Hsu W (2019). Optimizing data permutations in structured loads/stores translation and SIMD register mapping for a cross-ISA dynamic binary translator. Journal of Systems Architecture: the EUROMICRO Journal, 98:C, pp. 173-190. doi:10.1016/j.sysarc.2019.07.008. Online publication date: 1-Sep-2019
https://dl.acm.org/doi/10.1016/j.sysarc.2019.07.008

Aleen F, Zakharin V, Krishnaiyer R, Gupta G, Kreitzer D, Lin C (2018). Automated Compiler Optimization of Multiple Vector Loads/Stores. International Journal of Parallel Programming, 46:2, pp. 471-503. doi:10.1007/s10766-016-0485-7. Online publication date: 1-Apr-2018
https://dl.acm.org/doi/10.1007/s10766-016-0485-7

Aleen F, Zakharin V, Krishaniyer R, Gupta G, Kreitzer D, Lin C, Palermo G, Feo J, Tumeo A, Franke H (2016). Automated compiler optimization of multiple vector loads/stores. Proceedings of the ACM International Conference on Computing Frontiers, pp. 82-91. doi:10.1145/2903150.2903169. Online publication date: 16-May-2016
https://dl.acm.org/doi/10.1145/2903150.2903169

Gebrewahid E, Arslan M, Karlsson A, Ul-Abdin Z (2016). Support for data parallelism in the CAL actor language. Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing, pp. 1-8. doi:10.1145/2870650.2870656. Online publication date: 13-Mar-2016
https://dl.acm.org/doi/10.1145/2870650.2870656

Ren H, Zhang Z, Wu J (2016). SWIFT. Mobile Networks and Applications, 21:6, pp. 974-982. doi:10.1007/s11036-016-0717-5. Online publication date: 1-Dec-2016
https://dl.acm.org/doi/10.1007/s11036-016-0717-5

Anderson A, Malik A, Gregg D (2015). Automatic Vectorization of Interleaved Data Revisited. ACM Transactions on Architecture and Code Optimization, 12:4, pp. 1-25. doi:10.1145/2838735. Online publication date: 8-Dec-2015
https://dl.acm.org/doi/10.1145/2838735

Totoni E, Dikmen M, Garzarán M (2013). Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures. ACM Transactions on Architecture and Code Optimization, 10:4, pp. 1-25. doi:10.1145/2541228.2555302. Online publication date: 1-Dec-2013
https://dl.acm.org/doi/10.1145/2541228.2555302
