Article

Efficient orchestration of sub-word parallelism in media processors

Authors:

Venkatesh Akella,

Frederic Chong

SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures

Pages 225 - 234

https://doi.org/10.1145/1007912.1007946

Published: 27 June 2004

Abstract

Communication and multimedia applications with increased data rates and enhanced functionality continuously raise the bar for the computational requirements of future microprocessors. To meet these computational demands, it is necessary to exploit sub-word parallelism efficiently. We propose to make sub-word data movement a first-class operation in microprocessor architectures by introducing a Sub-word Permutation Unit (SPU) in the execution pipeline. The SPU is evaluated in the context of the MMX media co-processor for the Intel Pentium architectures, but our results can be extended to any processor that supports sub-word parallelism. We find that the SPU allows us to orchestrate sub-word data placement prior to computation, thus allowing the MMX functional units to concentrate on performing calculations. Furthermore, we introduce a decoupled SPU control mechanism at the basic block level, which allows static optimization to eliminate data-movement overhead in tight loops, where most media and signal processing occurs. We demonstrate that improvements of 4% to 20% can be obtained on key media and signal processing kernels with as little as a 1% increase in hardware resources.
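
As an illustration of the idea in the abstract, the following is a minimal Python sketch of sub-word data placement followed by a packed computation. It is not the authors' SPU design: the lane layout (four 16-bit sub-words in a 64-bit word, as with MMX packed words), the `pattern` encoding, and all function names are assumptions made for this example.

```python
# Assumption: a 64-bit register holding four 16-bit sub-words (lanes),
# with lane 0 in the least significant bits.
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def unpack(word):
    """Split a 64-bit integer into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def pack(lanes):
    """Combine four 16-bit lanes back into one 64-bit integer."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & LANE_MASK) << (i * LANE_BITS)
    return word


def permute(word, pattern):
    """Hypothetical permutation step: output lane i takes input lane pattern[i]."""
    lanes = unpack(word)
    return pack([lanes[src] for src in pattern])


def packed_add(a, b):
    """Lane-wise add with wrap-around, in the spirit of a packed 16-bit add."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
# Data placement first (reverse b's lanes), then the arithmetic itself.
result = unpack(packed_add(a, permute(b, [3, 2, 1, 0])))
print(result)
```

In this sketch the permutation is a separate step from the arithmetic, which mirrors the abstract's point that data placement can be handled before the functional units perform the calculation.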

Cited By

Shahbahrami A, Juurlink B, Borodin D, Vassiliadis S (2018). Avoiding conversion and rearrangement overhead in SIMD architectures. International Journal of Parallel Programming 34:3 (237-260). DOI: 10.1007/s10766-006-0015-0. Online publication date: 27-Dec-2018.
https://dl.acm.org/doi/10.1007/s10766-006-0015-0

Patronik P, Piestrak S (2017). Hardware/Software Approach to Designing Low-Power RNS-Enhanced Arithmetic Units. IEEE Transactions on Circuits and Systems I: Regular Papers 64:5 (1031-1039). DOI: 10.1109/TCSI.2017.2669108. Online publication date: May-2017.
https://doi.org/10.1109/TCSI.2017.2669108

Shahbahrami A, Juurlink B, Vassiliadis S, Bagherzadeh N, Valero M, Ramirez A (2005). Matrix register file and extended subwords. Proceedings of the 2nd conference on Computing frontiers (171-179). DOI: 10.1145/1062261.1062291. Online publication date: 4-May-2005.
https://dl.acm.org/doi/10.1145/1062261.1062291

Index Terms

Efficient orchestration of sub-word parallelism in media processors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Information & Contributors

Information

Published In

SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures

June 2004

332 pages

ISBN:1581138407

DOI:10.1145/1007912

General Chair: Phil Gibbons, Intel Research

Program Chair: Micah Adler, University of Massachusetts

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SPAA04

Sponsor:

SPAA04: 16th ACM Symposium on Parallelism in Algorithms and Architectures 2004

June 27 - 30, 2004

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
