Speculative hardware/software co-designed floating-point multiply-add fusion

Published: 24 February 2014

Abstract

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add instruction, some algorithms require significant revalidation and rewriting effort to work as expected when compiled to use FMA--a cost that developers may not be willing to pay. As a result, many legacy applications are unable to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased FMA utilization by adding the option to obtain the same numerical result as separate FP multiply and FP add pairs. In particular, we extend the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit is slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator shows that CMA improves SPECfp performance by 6.3% and reduces executed instructions by 4.7%.


    Published In

ACM SIGPLAN Notices, Volume 49, Issue 4 (ASPLOS '14)
April 2014, 729 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2644865
ASPLOS '14: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems
February 2014, 780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. binary translator
    2. combined multiply-add
    3. fma
    4. hw/sw co-designed processors
    5. instruction fusion

    Qualifiers

    • Research-article
