Speculative hardware/software co-designed floating-point multiply-add fusion

Published: 24 February 2014

Abstract

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add instruction, some algorithms require significant revalidation and rewriting effort to work as expected when compiled to use FMA--a cost that developers may not be willing to pay. As a result, many legacy applications are unable to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased FMA utilization by adding the option to obtain the same numerical result as separate FP multiply and FP add pairs. In particular, we extend the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit is slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator shows that CMA improves SPECfp performance by 6.3% and reduces executed instructions by 4.7%.


    Published In

ACM SIGPLAN Notices, Volume 49, Issue 4 (ASPLOS '14)
April 2014, 729 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2644865
ASPLOS '14: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems
February 2014, 780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. binary translator
    2. combined multiply-add
    3. fma
    4. hw/sw co-designed processors
    5. instruction fusion

    Qualifiers

    • Research-article
