DOI: 10.1145/1878921.1878934
research-article

Mighty-morphing power-SIMD

Published: 24 October 2010

    Abstract

    Modern wireless devices commonly run two broad classes of compute-intensive applications: those with abundant data-level parallelism, such as the signal processing used in wireless baseband, and those with little data-level parallelism, such as encryption. Wide single-instruction multiple-data (SIMD) processors have become popular as high-performance yet power-efficient data engines for the first class. On these engines, however, non-data-parallel applications are relegated to a low-performance scalar datapath while the SIMD resources sit idle. To accelerate both types of applications, we propose a more flexible SIMD datapath called SIMD-Morph. In SIMD-Morph, code with data-level parallelism executes across the lanes in the traditional manner, but the lanes can also be morphed into a feed-forward subgraph accelerator that executes scalar applications more efficiently. The morphed lanes exploit both instruction-level parallelism and operation chaining, improving scalar performance with resources already available in the SIMD lanes. Experimental results show a 2.6X performance improvement for purely non-SIMD applications and a 1.4X improvement for the non-SIMD-ized portions of applications with data parallelism.
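
The two execution modes described in the abstract can be contrasted with a small toy model. This is only an illustrative sketch, not the paper's hardware design: the function names `simd_execute` and `chained_execute` are hypothetical, and the model assumes that in morphed mode each lane simply forwards its result to the next lane, so that a chain of dependent scalar operations flows through the lanes without intermediate register-file round trips.

```python
import operator
from typing import Callable, List, Sequence

BinaryOp = Callable[[int, int], int]


def simd_execute(op: BinaryOp, a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Data-parallel mode: every lane applies the same op to its own pair of elements."""
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


def chained_execute(ops: Sequence[BinaryOp], inputs: Sequence[int]) -> int:
    """Morphed mode (toy assumption): lane i combines the result forwarded
    from lane i-1 with one fresh scalar operand, forming a feed-forward chain."""
    if len(inputs) != len(ops) + 1:
        raise ValueError("a chain of n ops needs n + 1 scalar inputs")
    acc = inputs[0]
    for op, operand in zip(ops, inputs[1:]):
        acc = op(acc, operand)  # result is forwarded directly to the next lane
    return acc


# Traditional SIMD: four independent additions, one per lane.
vector_sum = simd_execute(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])

# Morphed lanes: the dependent scalar chain ((5 + 3) ^ 6) * 2.
chained = chained_execute([operator.add, operator.xor, operator.mul], [5, 3, 6, 2])
```

In this sketch the chain of three dependent operations is evaluated as a single pass through three lanes, which is the intuition behind operation chaining. The model says nothing about timing, lane interconnect, or how subgraphs are selected.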


    Published In

    CASES '10: Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
    October 2010
    276 pages
    ISBN:9781605589039
    DOI:10.1145/1878921

    In-Cooperation

    • CEDA
    • IEEE CAS
    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data-level parallelism
    2. instruction-level parallelism
    3. operation chaining
    4. SIMD processing

    Qualifiers

    • Research-article

    Conference

    ESWeek '10: Sixth Embedded Systems Week
    October 24 - 29, 2010
    Scottsdale, Arizona, USA

    Acceptance Rates

    Overall Acceptance Rate 52 of 230 submissions, 23%

    Cited By

    • (2023) Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 483-497. DOI: 10.1145/3582016.3582046. Online publication date: 25-Mar-2023
    • (2021) A Collaborative CPU Vector Offloader: Putting Idle Vector Resources to Work on Commodity Processors. Electronics 10:23 (2960). DOI: 10.3390/electronics10232960. Online publication date: 28-Nov-2021
    • (2017) Hot spots profiling and dataflow analysis in custom dataflow computing SoftProcessors. Journal of Systems and Software 125:C (427-438). DOI: 10.1016/j.jss.2016.07.025. Online publication date: 1-Mar-2017
    • (2014) Construction and exploitation of VLIW ASIPs with heterogeneous vector-widths. Microprocessors & Microsystems 38:8 (947-959). DOI: 10.5555/2948290.2948365. Online publication date: 1-Nov-2014
    • (2013) Dual-Core Framework: Eliminating the Bottleneck Effect of Scalar Kernels on SIMD Architectures. IEICE Transactions on Information and Systems E96.D:2 (365-369). DOI: 10.1587/transinf.E96.D.365. Online publication date: 2013
    • (2013) Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA. ACM Transactions on Architecture and Code Optimization 10:2 (1-25). DOI: 10.1145/2459316.2459319. Online publication date: 1-May-2013
    • (2012) Control-enhanced power-SIMD. IEICE Electronics Express 9:14 (1147-1152). DOI: 10.1587/elex.9.1147. Online publication date: 2012
    • (2012) Libra. Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 84-95. DOI: 10.1109/MICRO.2012.17. Online publication date: 1-Dec-2012
    • (2012) Architectural Implications for SIMD Processors in the Wireless Communication Domain. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, pages 1199-1204. DOI: 10.1109/HPCC.2012.176. Online publication date: 25-Jun-2012
    • (2012) Exploiting both pipelining and data parallelism with SIMD reconfigurable architecture. Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications, pages 40-52. DOI: 10.1007/978-3-642-28365-9_4. Online publication date: 19-Mar-2012
