research-article

Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements

Published: 01 August 2003

Abstract

Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that current general-purpose processors enhanced with SIMD extensions do not match the access patterns and loop structures of media programs well. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions needed to feed the SIMD execution units rather than true/useful computations, which leaves the SIMD execution units underutilized (only 1 to 12 percent of their peak throughput is achieved). Rather than exploiting more data-level parallelism (DLP), in this paper we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze outperforms a 16-way processor with current SIMD extensions. On application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration costs a 10 percent increase in the area required by MMX and SSE extensions (a 0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
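To make the notion of "supporting" versus "useful" instructions concrete, here is a minimal C sketch of a subword-parallel loop. It emulates a 64-bit register holding four 16-bit lanes and keeps a rough tally of the work done per iteration. The names `op_counts`, `packed_add16`, and `add_arrays`, and the choice to count eight supporting operations per iteration, are assumptions made for illustration only. They are not taken from the paper and do not reproduce its measurement methodology or its reported percentages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tally used only for this illustration. */
typedef struct {
    unsigned long useful;      /* packed arithmetic on the data itself */
    unsigned long supporting;  /* address updates, loads/stores, loop control */
} op_counts;

enum { LANES = 4 };  /* four 16-bit subwords per 64-bit word */

/* Add four 16-bit lanes at once, wrapping around within each lane. */
static uint64_t packed_add16(uint64_t x, uint64_t y)
{
    const uint64_t high = 0x8000800080008000ULL;  /* top bit of each lane */
    /* Sum the low 15 bits of every lane, then restore each lane's top bit
       with XOR so that no carry crosses a lane boundary. */
    return ((x & ~high) + (y & ~high)) ^ ((x ^ y) & high);
}

/* out[i] = a[i] + b[i] for n elements; n must be a multiple of LANES. */
static void add_arrays(const uint16_t *a, const uint16_t *b, uint16_t *out,
                       size_t n, op_counts *c)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {   /* increment + compare/branch */
        uint64_t va, vb, vr;
        memcpy(&va, a + i, sizeof va);        /* address calculation + load */
        memcpy(&vb, b + i, sizeof vb);        /* address calculation + load */
        vr = packed_add16(va, vb);            /* the only useful packed op */
        memcpy(out + i, &vr, sizeof vr);      /* address calculation + store */

        c->useful += 1;
        /* Assumed breakdown: 3 address calculations, 2 loads, 1 store,
           1 index increment, 1 compare-and-branch. */
        c->supporting += 8;
    }
}

/* Fraction of tallied operations that were supporting work. */
static double supporting_fraction(const op_counts *c)
{
    unsigned long total = c->useful + c->supporting;
    return total ? (double)c->supporting / (double)total : 0.0;
}
```

Calling `add_arrays` on two arrays with a zero-initialized `op_counts` and then `supporting_fraction` gives the share of overhead under this assumed breakdown. Address generation, looping, and data movement are the kinds of work that the abstract says MediaBreeze moves into dedicated hardware.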

Published In

IEEE Transactions on Computers, Volume 52, Issue 8
August 2003
112 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. Media processing
  2. bottlenecks in SIMD extensions
  3. data reorganization
  4. hardware address generation
  5. low-overhead looping
  6. performance evaluation
  7. subword parallelism
  8. superscalar general-purpose processors
  9. workload characterization

Qualifiers

  • Research-article

Cited By

  • (2024) A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 20-36. DOI: 10.1145/3620665.3640367. Online publication date: 27-Apr-2024
  • (2024) Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 1165-1182. DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024
  • (2020) SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing, 135:C, pp. 83-100. DOI: 10.1016/j.jpdc.2019.09.012. Online publication date: 1-Jan-2020
  • (2019) Optimizing data permutations in structured loads/stores translation and SIMD register mapping for a cross-ISA dynamic binary translator. Journal of Systems Architecture: the EUROMICRO Journal, 98:C, pp. 173-190. DOI: 10.1016/j.sysarc.2019.07.008. Online publication date: 1-Sep-2019
  • (2018) Automated Compiler Optimization of Multiple Vector Loads/Stores. International Journal of Parallel Programming, 46:2, pp. 471-503. DOI: 10.1007/s10766-016-0485-7. Online publication date: 1-Apr-2018
  • (2016) Automated compiler optimization of multiple vector loads/stores. Proceedings of the ACM International Conference on Computing Frontiers, pp. 82-91. DOI: 10.1145/2903150.2903169. Online publication date: 16-May-2016
  • (2016) Support for data parallelism in the CAL actor language. Proceedings of the 3rd Workshop on Programming Models for SIMD/Vector Processing, pp. 1-8. DOI: 10.1145/2870650.2870656. Online publication date: 13-Mar-2016
  • (2016) SWIFT. Mobile Networks and Applications, 21:6, pp. 974-982. DOI: 10.1007/s11036-016-0717-5. Online publication date: 1-Dec-2016
  • (2015) Automatic Vectorization of Interleaved Data Revisited. ACM Transactions on Architecture and Code Optimization, 12:4, pp. 1-25. DOI: 10.1145/2838735. Online publication date: 8-Dec-2015
  • (2013) Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures. ACM Transactions on Architecture and Code Optimization, 10:4, pp. 1-25. DOI: 10.1145/2541228.2555302. Online publication date: 1-Dec-2013
