Mighty-morphing power-SIMD

Authors:

Trevor Mudge, and

Scott Mahlke

CASES '10: Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems

October 2010

Pages 67 - 76

https://doi.org/10.1145/1878921.1878934

Published: 24 October 2010

Abstract

In modern wireless devices, two broad classes of compute-intensive applications are common: those with abundant data-level parallelism, such as the signal processing used in wireless baseband applications, and those with little data-level parallelism, such as encryption. Wide single-instruction multiple-data (SIMD) processors have become popular as high-performance yet power-efficient data engines for applications with abundant data parallelism. However, non-data-parallel applications are relegated to a low-performance scalar datapath on these data engines while the SIMD resources sit idle. To accelerate both types of applications, we propose a more flexible SIMD datapath called SIMD-Morph. In SIMD-Morph, code with data-level parallelism executes across the lanes in the traditional manner, but the lanes can also be morphed into a feed-forward subgraph accelerator that executes scalar applications more efficiently. The morphed SIMD lanes form an accelerator that exploits both instruction-level parallelism and operation chaining to improve the performance of scalar code using the resources already available in the SIMD lanes. Experimental results show a 2.6X performance improvement for purely non-SIMD applications and a 1.4X improvement for the non-SIMD-ized portions of applications with data parallelism.
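
The benefit of combining instruction-level parallelism with operation chaining can be illustrated with a toy cycle-count model. The sketch below is not the paper's accelerator, compiler, or evaluation method. It assumes a hypothetical machine in which up to `lanes` scalar operations issue per cycle and a result may be forwarded through at most `chain_depth` dependent operations within one cycle. The dataflow graph, the function names, and both parameters are illustrative assumptions, and the cycle counts it produces are unrelated to the 2.6X and 1.4X figures reported above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def scalar_cycles(dag):
    """Baseline assumption: a scalar datapath retires one operation per cycle."""
    return len(dag)


def morphed_cycles(dag, lanes, chain_depth):
    """Greedy cycle count for a hypothetical feed-forward lane accelerator.

    dag maps each operation to the list of operations it depends on.
    Per cycle, at most `lanes` operations are placed, and a dependence
    chain inside one cycle may be at most `chain_depth` operations long.
    """
    # Producers come before consumers, mirroring feed-forward lane order.
    order = list(TopologicalSorter(dag).static_order())
    done = set()
    cycles = 0
    while len(done) < len(order):
        placed = {}  # operation -> chain depth within the current cycle
        for op in order:
            if len(placed) == lanes:
                break  # every lane is occupied in this cycle
            if op in done:
                continue
            preds = dag[op]
            if any(p not in done and p not in placed for p in preds):
                continue  # an input is not available yet
            # Inputs from earlier cycles cost nothing; chained inputs add depth.
            depth = 1 + max((placed[p] for p in preds if p in placed), default=0)
            if depth <= chain_depth:
                placed[op] = depth
        done.update(placed)
        cycles += 1
    return cycles


# Illustrative scalar kernel (an assumption, loosely shaped like a cipher round):
#   t1 = a ^ k; t2 = b ^ k; t3 = t1 + t2; t4 = t3 << 3; t5 = t3 >> 29; t6 = t4 | t5
kernel = {
    "t1": [],
    "t2": [],
    "t3": ["t1", "t2"],
    "t4": ["t3"],
    "t5": ["t3"],
    "t6": ["t4", "t5"],
}

if __name__ == "__main__":
    print("scalar cycles: ", scalar_cycles(kernel))
    print("morphed cycles:", morphed_cycles(kernel, lanes=4, chain_depth=2))
```

In this model, setting `chain_depth=1` isolates the gain from instruction-level parallelism alone, while raising it shows the additional gain from chaining dependent operations within a cycle.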

Index Terms

Mighty-morphing power-SIMD
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data


Published In

CASES '10: Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems

October 2010

276 pages

ISBN:9781605589039

DOI:10.1145/1878921

Program Chairs: Vinod Kathail (USA), Reid Tatge (Texas Instruments, USA), Rajeev Barua (University of Maryland, College Park, USA)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

CEDA
IEEE CAS
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESWeek '10: Sixth Embedded Systems Week

October 24 - 29, 2010

Scottsdale, Arizona, USA

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%

Article Metrics

Total Citations: 10
Total Downloads: 231
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Cited By

Zhang Z, Ou Y, Liu Y, Wang C, Zhou Y, Wang X, Zhang Y, Ouyang Y, Shan J, Wang Y, Xue J, Cui H, Feng X, Aamodt T, Jerger N, Swift M (2023). Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (483-497). Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3582016.3582046
Son Y, Kang S, Um H, Lee S, Ham J, Kim D, Park Y (2021). A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics 10:23 (2960). Online publication date: 28-Nov-2021
https://doi.org/10.3390/electronics10232960
Wang C, Li X, Zhang H, Wang A, Zhou X (2017). Hot spots profiling and dataflow analysis in custom dataflow computing SoftProcessors. Journal of Systems and Software 125:C (427-438). Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1016/j.jss.2016.07.025
Diken E, Jordans R, Corvino R, Jóźwiak L, Corporaal H, Chies F (2014). Construction and exploitation of VLIW ASIPs with heterogeneous vector-widths. Microprocessors & Microsystems 38:8 (947-959). Online publication date: 1-Nov-2014
https://dl.acm.org/doi/10.5555/2948290.2948365
Wang Y, Chen S, Chen H, Wan J, Zhang K, Liu S (2013). Dual-Core Framework: Eliminating the Bottleneck Effect of Scalar Kernels on SIMD Architectures. IEICE Transactions on Information and Systems E96.D:2 (365-369). Online publication date: 2013
https://doi.org/10.1587/transinf.E96.D.365
Han K, Ahn J, Choi K (2013). Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA. ACM Transactions on Architecture and Code Optimization 10:2 (1-25). Online publication date: 1-May-2013
https://dl.acm.org/doi/10.1145/2459316.2459319
Yang H, Chen S, Wu T, Liu S (2012). Control-enhanced power-SIMD. IEICE Electronics Express 9:14 (1147-1152). Online publication date: 2012
https://doi.org/10.1587/elex.9.1147
Park Y, Park J, Park H, Mahlke S (2012). Libra. Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (84-95). Online publication date: 1-Dec-2012
https://dl.acm.org/doi/10.1109/MICRO.2012.17
Wang Y, Zhang K, Wan J, Liu S, Ning X, Chen S (2012). Architectural Implications for SIMD Processors in the Wireless Communication Domain. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (1199-1204). Online publication date: 25-Jun-2012
https://dl.acm.org/doi/10.1109/HPCC.2012.176
Kim Y, Lee J, Lee J, Mai T, Heo I, Paek Y (2012). Exploiting both pipelining and data parallelism with SIMD reconfigurable architecture. Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications (40-52). Online publication date: 19-Mar-2012
https://dl.acm.org/doi/10.1007/978-3-642-28365-9_4
