
Institutionen för systemteknik

Department of Electrical Engineering

Examensarbete (Master's thesis)

Implementation of Elementary Functions for a


Fixed Point SIMD DSP Coprocessor

Master's thesis in Computer Engineering
(Examensarbete utfört i datorteknik)
at the Institute of Technology, Linköpings universitet
by

Orri Tómasson

LiTH-ISY-EX--10/4399--SE
Linköping 2010

Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Supervisor: Andreas Ehliar, isy, Linköpings universitet
            Dake Liu, isy, Linköpings universitet
            Olof Kraigher, isy, Linköpings universitet
Examiner:   Andreas Ehliar, isy, Linköpings universitet

Linköping, 10 December 2010


Division, Department: Division of Computer Engineering,
Department of Electrical Engineering, Linköpings universitet,
SE-581 83 Linköping, Sweden
Date: 2010-12-10
Language: English
Report category: Examensarbete (Master's thesis)
ISBN: —
ISRN: LiTH-ISY-EX--10/4399--SE
Series title and numbering / ISSN: —
URL for electronic version:
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-63576
http://www.ep.liu.se
Title: Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor
Author: Orri Tómasson
Keywords: SIMD, DSP, mathematical functions, elementary functions, polynomial approximation, fixed-point arithmetic
Abstract

This thesis is about implementing the functions 1/x, 1/√x, √x and log(x) on a DSP platform.

A multi-core DSP platform that consists of one master processor core and several
SIMD coprocessor cores is currently being designed by a team at the Computer
Engineering Department of Linköping University.

The SIMD coprocessors' arithmetic logic unit (ALU) has 16 multipliers to support vector multiplication instructions. By efficiently using the 16 multipliers, it is possible to evaluate polynomials very fast. The ALU does not have (hardware) support for floating point arithmetic, so the challenge is to get good precision by using fixed point arithmetic.

Precise and fast solutions to implement the mathematical functions are found by converting the fixed point input to a soft floating point format before polynomial approximation, choosing a polynomial based on an error analysis of the polynomial approximation, and using Newton-Raphson or Goldschmidt iterations to improve the precision of the polynomial approximations.

Finally, suggestions are made of changes and additions to the instruction set architecture, in order to make the implementations faster by efficiently using the currently existing hardware.

Contents

1 Introduction 1
1.1 Fixed point representation . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The scope of this work . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Report outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 A Brief overview of the ePUMA architecture 5


2.1 ePUMA’s SIMD core . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Instruction set and assembly language . . . . . . . . . . . . 6

3 Polynomial Approximations 9
3.1 Two types of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 The fixed point polynomial error . . . . . . . . . . . . . . . 10
3.2 Choosing a polynomial . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Taylor polynomials . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.2 Interpolation polynomials . . . . . . . . . . . . . . . . . . . 12
3.2.3 Min-max polynomials . . . . . . . . . . . . . . . . . . . . . 12
3.2.4 Other polynomials . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.5 Using a polynomial for f (x + a) to evaluate f (x) . . . . . . 13
3.3 Using Soft floating point format . . . . . . . . . . . . . . . . . . . . 16

4 Other methods 17
4.1 Newton-Raphson and Goldschmidt . . . . . . . . . . . . . . . . . . 17
4.1.1 Algorithm for reciprocal . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Opposite sign of product trick . . . . . . . . . . . . . . . . . 17
4.1.3 Algorithm for inverse square root . . . . . . . . . . . . . . . 19
4.1.4 Goldschmidt iteration . . . . . . . . . . . . . . . . . . . . . 19
4.1.5 Our usage of Goldschmidt iterations . . . . . . . . . . . . . 20
4.1.6 Error after iteration . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Lookup tables and interpolation . . . . . . . . . . . . . . . . . . . . 21
4.2.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Piecewise polynomial interpolation . . . . . . . . . . . . . . . . . . 22
4.3.1 Usage on ePUMA . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


4.4.1 Usage on ePUMA . . . . . . . . . . . . . . . . . . . . . . . 23


4.5 Product of parabolas . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5.1 Usage on ePUMA . . . . . . . . . . . . . . . . . . . . . . . 23

5 Implementations 25
5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Estimation of Cycle cost . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.1 Cycle cost for multiple inputs . . . . . . . . . . . . . . . . . 26
5.4 Kernel code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5 Invalid input handling . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5.1 Zero input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5.2 Memory and register usage . . . . . . . . . . . . . . . . . . 28
5.6 Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6.1 Choosing a Polynomial . . . . . . . . . . . . . . . . . . . . . 29
5.6.2 Pre-analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6.3 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.7 Inverse square root . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.7.2 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.8 Square root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.1 Choosing a polynomial . . . . . . . . . . . . . . . . . . . . . 45
5.8.2 Soft floating point usage . . . . . . . . . . . . . . . . . . . . 45
5.8.3 32 bit version . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.4 Zero input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.5 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.9 Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.9.1 Soft floating point usage . . . . . . . . . . . . . . . . . . . . 53
5.9.2 Inputs in other fixed point formats . . . . . . . . . . . . . . 53
5.9.3 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6 Instruction proposals 59
6.1 POWERSW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.2 Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1.3 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 POWERSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 POLYW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 TMACDO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 SSUMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.5.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Soft floating point instructions . . . . . . . . . . . . . . . . . . . . 71
6.6.1 STOFLOATW . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.6.2 STOFLOATD . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.6.3 TOFLOATADD . . . . . . . . . . . . . . . . . . . . . . . . 73
6.6.4 Converting vectors to soft floating point . . . . . . . . . . . 73
6.6.5 Conversion from soft floating point format . . . . . . . . . . 74
6.7 Powers with integer exponent . . . . . . . . . . . . . . . . . . . . . 75
6.8 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.8.1 Opposite signed products after multiplication instructions . 76
6.8.2 Special multiplication for inverse square root iterations . . . 77
6.8.3 Scale flag to shift instructions . . . . . . . . . . . . . . . . . 77
6.8.4 Scale flag to add, sub, and other trivial arithmetic instructions 77
6.8.5 Long datapath version of short datapath instructions . . . . 78

7 Results 79
7.1 Function kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Proposed features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8 Conclusions 89

Bibliography 91

A Kernel source codes 93


A.1 Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.1.1 Kernel 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.1.2 Kernel 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.1.3 Kernel 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.1.4 Kernel 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.1.5 Kernel 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.1.6 Kernel 5b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
A.1.7 Kernel 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A.2 Inverse square root . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.2.1 Kernel 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.2.2 Kernel 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.2.3 Kernel 2b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
A.2.4 Kernel 3a . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
A.2.5 Kernel 3b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.2.6 Kernel 3c . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A.2.7 Kernel 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.3 Square root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.3.1 Kernel 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.3.2 Kernel 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
A.3.3 Kernel 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A.3.4 Kernel 4a . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.3.5 Kernel 4b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.3.6 Kernel 4c . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.3.7 Kernel 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A.4 Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.4.1 Kernel 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

A.4.2 Kernel 1b . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


A.4.3 Kernel 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.4.4 Kernel 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.4.5 Kernel 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4.6 Kernel 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

B Simulator instruction set 117


Abbreviations
ALU Arithmetic logic unit
DM Data memory
DMA Direct memory access
DSP Digital signal processing
LSB Least significant bit
LVM Local vector memory
MAC Multiply and accumulate
MSB Most significant bit
NOP No operation
PM Program memory
SIMD Single instruction, multiple data
TMAC Triangular multiply and accumulate
ULP Unit of least precision

Chapter 1

Introduction

The goal of this thesis work is to implement some elementary mathematical func-
tions as fast as possible and as precise as possible on a new DSP platform called
ePUMA, which is being designed and researched at the Computer Engineering
Department of Linköping University.

The main theme in processor design for the last few years has been to increase the number of processing cores, favouring increased parallelism over increased clock frequency. ePUMA's approach to this is to have one master control processor and eight SIMD coprocessors. Its architecture is described in more detail in [11], [1] and [2].

One of the trade-offs when designing a multi-core system is the number of cores versus the size of each core. A larger core has more hardware and hence hardware support for more features, while more cores can perform more tasks in parallel. In order to make each coprocessor smaller, the SIMD coprocessor cores have support for neither division nor floating point arithmetic, which is not uncommon for DSP architectures.
The SIMD cores do, however, have hardware support for vector multiplication: each SIMD core has sixteen 16 bit multipliers.
Because of this large number of multipliers, we can use them to evaluate polynomials. By using the multipliers efficiently, we can evaluate up to one polynomial of degree eight (or lower) per cycle.

By using polynomial approximation to implement 1/x we do not need dedicated division hardware, which usually consumes much chip area. The cycle cost is also considerably lower than that of common software solutions.
It is challenging to get precise results by using fixed point arithmetic, compared to using floating point arithmetic, and one of our goals is to find out how we can achieve results that are as precise as possible.


1.1 Fixed point representation


The SIMD core does not have any hardware support for floating point operations. Instead, fixed point number representation is used for fractional numbers. It is similar to regular binary representation of integers, except that there is a radix point (sometimes called a binary point, by analogy to the decimal point). For example, if the bit string 1111 has two integer bits and two fractional bits, it represents the decimal value 3.75 = 2^1 + 2^0 + 2^{-1} + 2^{-2}. Further reading on fixed point numbers is found, for instance, in chapter 2 of [6].
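As a plain C illustration (our own sketch, not from the thesis), the value of a fixed point bit pattern with f fractional bits is simply the raw integer scaled by 2^{-f}; for signed formats the pattern is read as two's complement first. The helper names are ours.

#include <stdint.h>
#include <stdio.h>

/* Value of a raw pattern with f fractional bits: pattern * 2^-f. */
static double unsigned_q(uint16_t raw, int f) { return raw / (double)(1 << f); }
static double signed_q(int16_t raw, int f)    { return raw / (double)(1 << f); }

int main(void)
{
    /* Bit string 1111 with two integer and two fractional bits:
     * 2^1 + 2^0 + 2^-1 + 2^-2 = 3.75                             */
    printf("%g\n", unsigned_q(0xF, 2));             /* prints 3.75 */
    /* 0x8000 as a signed 16 bit value with 15 fractional bits is
     * the most negative representable value, -1.0                */
    printf("%g\n", signed_q((int16_t)0x8000, 15));  /* prints -1   */
    /* the same pattern read as unsigned is +1.0 */
    printf("%g\n", unsigned_q(0x8000, 15));         /* prints 1    */
    return 0;
}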

1.1.1 Notation
In this thesis we use the notation Qi.f to indicate a fixed point format that has i integer bits and f fractional bits. This is equivalent to what would be called Q_2(i, f) in [6] (where the subscript 2 indicates that a base 2 number system is being used).
Both signed (two's complement) and unsigned fixed point formats are used. Signed is our default format: we always state explicitly when a format is unsigned, so if nothing is stated, the format is signed (and in some cases it is irrelevant).
A signed Qi.f fixed point format has i bits before the radix point. When two's complement is used, the leftmost bit is a sign bit and it is counted among the i integer bits. The integer part of unsigned Qi.f takes values in the range [0, 2^i − 1] and the integer part of signed Qi.f takes values in the range [−2^{i−1}, 2^{i−1} − 1]. Another way to view this is that the leftmost bit of signed Qi.f has the weight −2^{i−1}, while the leftmost bit of unsigned Qi.f has the weight 2^{i−1}.

1.2 The scope of this work


When this thesis work began, much work had already been done on the ePUMA
project. The outline of the architecture had been decided as well as the main
features of the SIMD core. A preliminary version of the Sleipnir instruction set
(described in 2.1.2) had been made as well as an assembler and a pipeline accurate
simulator to test functional kernels. For this work many kernels were written in
the Sleipnir assembly language. Several instructions or instruction features (such
as flags) were added to the simulator and then these instructions and features
were tested, and many of them are discussed and suggested to be added to the
architecture in chapters 6 and 7.

1.3 Report outline


Chapter 2 contains a brief overview of the ePUMA architecture.
Chapter 3 is about polynomial approximations. It contains an analysis of errors introduced when polynomials are evaluated using fixed point arithmetic.

Chapter 4 discusses other methods to approximate functions, both methods that are used in the implementations in this thesis and other common methods that were not used.

Chapter 5 lists the function kernels that were implemented and explains them in some detail.

Chapter 6 contains the instructions that we suggest be added to the instruction set, as well as some other additions and modifications to the architecture. The intention of the proposals is to use the hardware currently in the datapath efficiently, rather than to add more hardware.

Chapter 7 lists the results. The results consist of a summary of the kernels that were implemented and a list of instructions and features we suggest should be added, or considered for addition, to the architecture.

Chapter 8 lists the conclusions and summarizes this work.


Chapter 2

A Brief overview of the


ePUMA architecture

The following description of the architecture is taken from [11]:


“The ePUMA master-multi-SIMD architecture is illustrated in Figure 1. It consists
of one master controller, eight SIMD coprocessors, and a memory subsystem for
the on-chip communication. The master processor executes the sequential task in
an application algorithm, while the SIMD cores run the parallelizable portion of
the algorithm. . . . ”

The SIMD cores have both program memory (PM) and data memory (DM)¹, which is a vector memory. The SIMD cores can exchange data through a central DMA controller and an interconnection network, depicted in figure 2.1. [11]

Figure 2.1: The ePUMA master-multi-SIMD architecture: eight SIMD cores (each with a processing core, PM and DM) connected through network nodes N1-N8 to the master processor and DMA. The figure is inspired by the figure referred to as Figure 1 in the quoted text from [11].

¹ Also referred to as local vector memory (LVM)


2.1 ePUMA’s SIMD core


The work in this thesis is focused on implementations that will be run on one
SIMD core.

The SIMD core has eight general purpose 128 bit vector registers (called vr0-vr7). The 128 bit vector can be a vector word, which is eight 16 bit words; a vector double word, with four 32 bit double words; or a complex vector, which has four complex numbers, each with a 16 bit real part and a 16 bit imaginary part. It also has two scratchpad memories.
Each SIMD core also has two local vector memories (LVM). Both of them can be accessed simultaneously, which means that we can make two memory accesses (read or write) per cycle, one to each memory.
Further description of the ePUMA architecture is found in [11], [5] and [1].

2.1.1 Datapath

Figure 2.2: The ePUMA datapath: a multiplier stage (MUL) followed by two ALU stages, each stage with a lane switch, and a vector accumulator. (The figure is taken with permission from [1].)

The datapath of a SIMD core has 3 stages: a multiplication stage followed by two ALU stages (with adders, shifters, bitwise logic and other common ALU hardware). It is shown in figure 2.2.
The datapath is intended to support, for example, vector multiplication, multiply and accumulate (MAC), and butterflies (used to calculate the DCT² and FFT³) [1].

2.1.2 Instruction set and assembly language


The SIMD cores use an instruction set and assembly language called Sleipnir. It
is documented in [2]. The assembly syntax is:
instruction - flags - destination operand - source operand 0 - source operand 1

A summary of a large part of the instruction set is found in appendix B. It is also

documented in [2].

² discrete cosine transform
³ fast Fourier transform

Example 2.1
This example shows assembly code for the ePUMA SIMD core, followed by explanations of some parts of it. The code is Kernel 3 from section 5.6.3.

.main
stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton Raphson iteration
smulww<scale=15, uu, negres, rnd> vr0.0 vr1.2 vr2.0
5*nop
smulww<scale=15, uu, sat, rnd> vr0.1 vr0.0 vr2.0
5*nop
stop

.m0
0x9800
.m1
0
.cm
0x13B1 0xE7BC 0x1DE8 0xDE01 0x2CD2 0x8E6B 0 0 0

• The .main keyword marks the beginning of the main function of the kernel.
• The .m0, .m1 and .cm keywords indicate data that should be written to the local vector memories (.m0 and .m1) and the constant memory (.cm) before execution of the kernel.
• X*nop issues X NOPs (they are usually needed because of data dependencies).
• A double slash (//) starts a comment.
• stofloatw, saddw, polyw and smulww are instruction mnemonics.
• Flags are given to instructions in angle brackets (<>) between the instruction mnemonic and its operands.
• vrX refers to the entire vector register number X, vrX.Y refers to scalar word number Y in vector register number X, and vrX.Yd refers to scalar double word number Y.
• m0[X] refers to memory word number X. The suffixes .sw, .sd, .vw and .vd are used to refer to a scalar word, scalar double word, vector word or vector double word.
• The prefix 0x before a constant indicates hexadecimal number format.
Chapter 3

Polynomial Approximations

Polynomial approximations are used as a part of all the methods discussed in this
thesis to approximate elementary mathematical functions. Polynomial approxi-
mations can be described with the formula:

f(ξ) ≈ p(x) ≡ Σ_{i=0}^{n} p_i · x^i        (3.1)

where x is some trivial function of ξ. In our implementations we use x = ξ − a, where a is some constant.
A familiar example is Taylor polynomials, where also x = ξ − a; Maclaurin polynomials are the special case of Taylor polynomials with a = 0.
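As an illustration (our own sketch, not code from the thesis), the following C function evaluates equation 3.1 for the case a = 0 by forming the powers of x one term at a time. The example coefficients are the degree 3 Maclaurin polynomial of e^x; they are not constants used anywhere in the thesis.

#include <stdio.h>

/* Evaluate p(x) = sum_{i=0..n} p[i] * x^i by accumulating the powers
 * of x term by term. On the SIMD core the powers would instead be
 * formed in parallel with the 16 multipliers (see chapter 6).       */
static double poly_eval(const double *p, int n, double x)
{
    double pow_x = 1.0;            /* x^0 */
    double sum = 0.0;
    for (int i = 0; i <= n; i++) {
        sum += p[i] * pow_x;
        pow_x *= x;                /* next power of x */
    }
    return sum;
}

int main(void)
{
    /* Maclaurin polynomial of e^x up to degree 3: 1 + x + x^2/2 + x^3/6 */
    const double p[] = { 1.0, 1.0, 0.5, 1.0 / 6.0 };
    printf("%f\n", poly_eval(p, 3, 0.1));   /* ~ e^0.1 = 1.105171 */
    return 0;
}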

3.1 Two types of errors


If we have a polynomial p(x) that we use to approximate f(x), such that p(x − a) ≈ f(x), where a is constant, we have an error

e_r(x) = f(x) − p(x − a)        (3.2)

This error will from here on be referred to as the real polynomial error. It is the error of our result if we use real numbers to evaluate the polynomial.

If p(x − a) is calculated using a fixed point system, our result will differ from the value of p(x − a), because all arithmetic operations are performed with a finite number of bits. This difference between our result and p(x − a) will be referred to as the fixed point polynomial error, defined as

e_f(x) = p(x − a) − r(x)        (3.3)

where r(x) is the result of calculating the polynomial value using a fixed point system. We can then see that e_r(x) + e_f(x) = f(x) − r(x).


3.1.1 The fixed point polynomial error


When we want to calculate the polynomial p(x) on a fixed point system, we need to store the polynomial constants p_i and the values of x^i with limited precision, with a finite number of bits in memory or registers. We call these limited-precision values p̂_i and x̂_i, where:

p̂_i = p_i + e_{p_i}        (3.4)
x̂_i = x^i + e_{x_i}        (3.5)
|e_{p_i}| ≤ LSB_p / 2       (3.6)
|e_{x_i}| ≤ LSB_x / 2       (3.7)

LSB_p and LSB_x are the weights of the LSB (least significant bit) of the data word sizes that store p̂_i and x̂_i. As we can see, e_{p_i} and e_{x_i} are the errors caused by the limited number of bits.
If we want to evaluate the polynomial we then calculate

Σ_{i=0}^{n} p̂_i · x̂_i = Σ_{i=0}^{n} (p_i · x^i + e_i)        (3.8)

where e_i is the error of one term. We can calculate e_i:

p̂_i · x̂_i = (p_i + e_{p_i}) · (x^i + e_{x_i})        (3.9)
          = p_i · x^i + x^i · e_{p_i} + p_i · e_{x_i} + e_{p_i} · e_{x_i}        (3.10)
⇒ e_i = x^i · e_{p_i} + p_i · e_{x_i} + e_{p_i} · e_{x_i}        (3.11)

In addition, each term has a rounding error, which is between −LSB/2 and LSB/2.

If we look at equation 3.11, we see that the term x^i · e_{p_i} depends on the size of x^i and on e_{p_i}, which is the rounding error of the polynomial constant when it is converted to the fixed point format. The term p_i · e_{x_i} depends on the size of the polynomial constant and on the error of the calculation of x^i (which is discussed in the next section). This term tells us that the error of our polynomial approximation depends heavily on the magnitude of the polynomial constants. The term e_{p_i} · e_{x_i} is in most cases insignificant.

Example 3.1
We store the powers of x in 16 bit data words in the Q1.15 fixed point format. That means that the weight of the LSB is 2^{-15}, so LSB_x/2 = 2^{-16}. If we want to evaluate a polynomial with large polynomial constants, for example p_i = 1000, then the error caused by the p_i · e_{x_i} term can be 1000 · 2^{-16} = 0.015 ≈ 2^{-6}.

The errors of powers of x

When we calculate the powers of the argument of the polynomial, using the same fixed point system, the error e_{x_i} is actually not always LSB/2. The first power of x does not have any error (there can be an error caused by previous calculations, but when we implement a way to evaluate functions there is nothing that can be done about that), meaning that x̂_1 = x^1. The second power x̂_2 has only a rounding error, which we call e_rnd and which is between −LSB/2 and LSB/2, which means that

e_{x_2} = e_rnd,    −LSB/2 ≤ e_{x_2} ≤ LSB/2

Then we can calculate the error of x̂_3:

x̂_3 = x^3 + e_{x_3}
    = x · (x^2 + e_{x_2}) = x^3 + x · e_{x_2} + e_rnd
⇒ e_{x_3} = x · e_{x_2} + e_rnd

and then the error of x̂_4:

x̂_4 = (x^2 + e_{x_2}) · (x^2 + e_{x_2})
    = x^4 + 2 · x^2 · e_{x_2} + e_{x_2}^2 + e_rnd
⇒ e_{x_4} = 2 · x^2 · e_{x_2} + e_{x_2}^2 + e_rnd

and, more generally, for the subsequent powers:

x̂_c = (x^a + e_{x_a}) · (x^b + e_{x_b})
    = x^{a+b} + x^a · e_{x_b} + x^b · e_{x_a} + e_{x_a} · e_{x_b}
⇒ e_{x_c} = x^a · e_{x_b} + x^b · e_{x_a} + e_{x_a} · e_{x_b} + e_rnd

where a + b = c (the final multiplication contributes a rounding error e_rnd as well).

The general result is that if x ≤ 1, the worst case error is half an LSB for power 2 and increases by half an LSB for each further power of x. Simulations have been done that confirm that the errors of powers of x are smaller than this worst case estimate.

When we calculate some power of x, the error of the result is a function of the rounding errors of the calculations of the lower powers. If we assume that the rounding error of each multiplication is uniformly random in the range [−0.5, 0.5] LSB (it is in fact not uniformly random), the probability that we get the worst case error when we calculate some power of x decreases as the exponent increases. This is because the rounding errors of all the previous results, for the calculations of the lower powers, must all be worst case rounding errors with the same sign to produce a result with worst case error. Table 3.1 shows the maximum and minimum error we get when we calculate powers for all values of x ∈ [0.5, 1] in Q1.15.
power   max(|e_x|)   worst case
  1        0            0
  2        0.5          0.5
  3        0.96         1
  4        1.42         1.5
  5        1.74         2
  6        2.13         2.5
  7        2.26         3
  8        3.02         3.5

Table 3.1: Errors of powers of x, in units of the LSB. The second column shows the maximum error of each power when the powers have been calculated for all possible Q1.15 values in the simulator. The third column shows an estimate of the worst case error that could be expected after the simulation.
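This simulation can be sketched in C as below (our reconstruction, to be read next to Table 3.1). The round-to-nearest Q1.15 multiply and the pairing x^c = x^(c/2) · x^(c−c/2) are assumptions; the exact multiplication order used by the thesis simulator is not stated in this section.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Signed Q1.15 multiply with round-to-nearest: (a*b + 2^14) >> 15.
 * (Assumed semantics; the simulator's rounding may differ.)        */
static int16_t qmul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
}

int main(void)
{
    double max_err[9] = { 0.0 };    /* max |error| per power, in ULPs */

    for (int32_t raw = 0x4000; raw <= 0x7FFF; raw++) {  /* x in [0.5, 1) */
        double x = raw / 32768.0;
        int16_t pw[9];
        pw[1] = (int16_t)raw;
        /* x^c formed as x^(c/2) * x^(c-c/2); one plausible pairing */
        for (int c = 2; c <= 8; c++)
            pw[c] = qmul(pw[c / 2], pw[c - c / 2]);
        for (int c = 2; c <= 8; c++) {
            double e = fabs(pw[c] / 32768.0 - pow(x, c)) * 32768.0;
            if (e > max_err[c])
                max_err[c] = e;
        }
    }
    for (int c = 2; c <= 8; c++)    /* compare against Table 3.1 */
        printf("power %d: max |error| = %.2f ULP\n", c, max_err[c]);
    return 0;
}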

3.2 Choosing a polynomial


There are several types of polynomials that can be used to approximate functions.
Our goal is to have as small an error as possible in our final result.

3.2.1 Taylor polynomials


Taylor polynomials do not suit our needs well. Taylor polynomials have the property of being mathematically beautiful, and their error approaches zero as the degree of the polynomial approaches infinity. However, we can only evaluate polynomials of limited degree, and Taylor polynomials of limited degree give less precision than the other two types of polynomials discussed here.
Taylor polynomials do, though, have the property that their coefficients are relatively easy to calculate. A programmer might therefore want to use Taylor polynomials in some cases, but they are not used to implement any functions in this thesis.

3.2.2 Interpolation polynomials


Interpolation polynomials are found by fitting a set of data points to a polynomial, by forming a Vandermonde matrix and solving the least squares problem. The Matlab function polyfit does exactly that. By giving data points that are on the curve of some mathematical function, we can use this method. This method is fairly easy for a programmer to use to find a polynomial for any mathematical function.

3.2.3 Min-max polynomials


Min-max polynomials (also called minimax polynomials) are optimised to have the smallest worst case error possible in a given range. Polynomial constants for min-max polynomials are found by using the Remez algorithm, which is described in [3]. The Remez algorithm is more complicated than the methods used to calculate the constants of Taylor polynomials and interpolation polynomials. Further reading on min-max polynomials is found in [8].

Taylor polynomials
  Calculation of coefficients: a polynomial of degree i needs i! and (d^i/dx^i) f(x) for each coefficient.
  Property: the error approaches zero as the degree of the polynomial approaches infinity.

Least square polynomials
  Calculation of coefficients: solve a system of linear equations of the same degree as the polynomial.
  Property: the RMS of the error function in a given range is optimised to be as small as possible.

Min-max polynomials
  Calculation of coefficients: use the Remez algorithm, which is fairly complicated.
  Property: optimised such that the absolute error function |p(x) − f(x)| has the same value at every local maximum.

Table 3.2: Overview and comparison of types of polynomials

3.2.4 Other polynomials


Other polynomials, such as Chebyshev polynomials and Legendre polynomials, can also be used to approximate functions. They are not as precise as min-max polynomials. They are discussed in, for example, [8].

3.2.5 Using a polynomial for f (x + a) to evaluate f (x)


By using a polynomial for f(x + a) to evaluate f(x), we can often find polynomials with a set of polynomial constants that are better suited for implementation in a fixed point system. In that way, we can reduce the fixed point polynomial error, e_f, significantly.
When we use min-max polynomials, the real polynomial error does not change significantly when we use different values of a and use polynomials for f(x + a) to approximate f(x) for x in a given interval. (This property was observed by trying various values of a to find other polynomial constants for the same function; we have not attempted to prove it mathematically.)

Example 3.2
Table 3.3 lists polynomial constants for three polynomials which can be used to approximate 1/√x such that p(x − a) ≈ 1/√x, x ∈ [0.5, 1]. All three polynomials have the same real polynomial error.

(a) Absolute error of the polynomial approximations on a logarithmic scale (log2(|e|) versus x).

(b) Error e of the polynomial approximations versus x.

Figure 3.1: A comparison of the errors of a fourth degree Taylor polynomial, a fourth degree least square polynomial and a fourth degree min-max polynomial, all optimised for calculating f(x) ≡ 1/x for x ∈ [0.5, 1], such that p(x − 0.75) ≈ f(x), where the error is e = f(x) − p(x − 0.75).
a      Polynomial constants
0      3.230…   −7.600…    12.71…    −12.51…   6.629…   −1.460…
0.5    1.414…   −1.411…     2.060…    −2.910…  2.978…   −1.460…
0.75   1.154…   −0.7699…    0.7660…   −0.8450… 1.152…   −1.460…

Table 3.3: Polynomial constants for 5th degree polynomial approximations of 1/√x.


Let us take a look at equation 3.11. The two terms we are most interested in are x^i · e_{p_i} and p_i · e_{x_i}. Let us first look at p_i · e_{x_i}. We know from section 3.1.1 that e_{x_i} is between 0 and 2 ULPs (we have a 5th degree polynomial, so power 5 is the highest exponent we use). We can get a quick estimate of the largest p_i · e_{x_i} term by simply looking at the size of the polynomial constants, which is the p_i variable in equation 3.11. Even though e_{x_i} differs between terms, we can see that p_i differs more.

Let us now look at the term x^i · e_{p_i}. The variable e_{p_i} is the rounding error we get when we convert the polynomial constant to fixed point format. If we store all the polynomial constants for a polynomial in the same fixed point format, e_{p_i} depends on which fixed point format we use. To store the polynomial constants of the polynomial with a = 0 in a 16 bit fixed point format, we need four integer bits and a sign bit to store numbers between −12.51… and 12.71…, so we need to use Q5.11. The weight of the LSB in Q5.11 is 2^{-11}, so the absolute rounding error would be |e_{p_i}| ≤ 2^{-12}. There are cases where we are lucky and the rounding error is smaller, and we can even tweak the parameters of the Remez algorithm to try to make this rounding error a bit smaller (for example, find a polynomial that is optimised for the range x ∈ [0.49, 1.01] instead of x ∈ [0.5, 1], and see if we were luckier). Even though we can in some cases reduce this rounding error, we will in most cases prefer to store all the polynomial constants in the same fixed point format, and it is improbable that we are lucky enough to get a small rounding error for all the polynomial constants.

Another problem we face when the polynomial constants are large is that the products x^i · p_i also become large. We have to store these products somewhere (register or memory) until they are added together. If the products x^i · p_i are significantly larger than their sum p(x) = Σ_{i=0}^{n} x^i · p_i, we will need to waste many bits on storing integer bits that will not be required in the result. For example, if x ∈ [0.5, 1] and we want to store each x^i · p_i in a 16 bit register, we will need Q5.11 if some p_i = 12.71. The sum p(x − a) is, however, in the range [1, √2], which can be stored in Q2.14 or unsigned Q1.15. The polynomial with a = 0.75 has all its polynomial constants with absolute values smaller than 2, which means that they can be stored in signed Q2.14. The products p_i · (x − a)^i can also be stored in Q2.14, which means that we get a more precise result when we add values in signed Q2.14 to generate a result in the range [1, √2], compared to adding six values in signed Q5.11.

 l    degree 2   degree 4   degree 7   degree 12
−8      2.2        1.67       0.94      −0.18
−7      1.46       0.75      −0.23      −1.75
−6      0.63      −0.32      −1.65      −3.73
−5     −0.32      −1.61      −3.41      −6.28
−4     −1.45      −3.21      −5.7       −9.69
−3     −2.87      −5.31      −8.81     −14.46
−2     −4.81      −8.31     −13.38     −21.64
−1     −8.03     −13.45     −21.41     −34.46

Table 3.4: log2(|e|) of polynomial approximations of 1/√x for polynomials of degrees 2, 4, 7 and 12, optimised for minimum error in the range x ∈ [2^l, 1], where l is given in the leftmost column.

3.3 Using Soft floating point format


A problem with polynomial approximations of many functions is that they are only precise in a limited part of the domain. This is especially true for functions like 1/√x, 1/x and log(x), which approach infinity (positive or negative) as x approaches zero. Table 3.4 shows the error (the real polynomial error) of min-max polynomials of degree 2, 4, 7 and 12, which are optimised to approximate 1/√x over variously large intervals of x. We see that it is easy to find a polynomial that gives good precision in a small part of the domain, but that it is more difficult (requires a higher degree) for larger parts of the domain. We would see similar results for other functions if we made similar tables for them.

We can solve this problem by scaling the function argument so that it is in a desired range. If we convert the argument to soft floating point, such that its mantissa is in the range [0.5, 1), we can evaluate a polynomial of the mantissa and then do the required scaling (or addition, in the case of the logarithm) after we have evaluated the function.

Using a soft floating point format has more advantages. The output range of the polynomial is in many cases smaller, which means that we know which fixed point format we need for the result (for instance, p(x) ≈ 1/x for x ∈ [0.5, 1] results in p(x) ∈ [1, 2], but if x ∈ [2^{-15}, 1] we would have p(x) ∈ [1, 2^{15}]). Another advantage is that if we know that x < 1, we know that x^i · p_i < p_i, so we know which fixed point format we need to store that result with good precision, and we also know that the term x^i · e_{p_i} < e_{p_i} in equation 3.11.
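A minimal C sketch of such a conversion is shown below, assuming an unsigned Q1.15 input and a mantissa returned in Q0.16; the actual soft floating point format used by the kernels (see the STOFLOATW/STOFLOATD proposals in chapter 6) may differ.

#include <stdint.h>
#include <stdio.h>

/* Convert an unsigned Q1.15 value into a soft floating point pair
 * (mant, exp): mant is a Q0.16 mantissa in [0.5, 1) and
 * x = (mant / 2^16) * 2^exp.                                       */
static int to_soft_float(uint16_t x, uint16_t *mant, int *exp)
{
    if (x == 0)
        return -1;              /* zero has no normalised form */
    int e = 1;                  /* reading Q1.15 bits as Q0.16 halves
                                   the value, so start at exponent +1 */
    while (!(x & 0x8000)) {     /* shift left until the MSB is set */
        x <<= 1;
        e--;
    }
    *mant = x;
    *exp = e;
    return 0;
}

int main(void)
{
    uint16_t m;
    int e;
    to_soft_float(0x0300, &m, &e);      /* 0x0300 = 0.0234375 in Q1.15 */
    printf("mantissa %.6f, exponent %d\n", m / 65536.0, e);
    /* prints: mantissa 0.750000, exponent -5  (0.75 * 2^-5 = 0.0234375) */
    return 0;
}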
Chapter 4

Other methods

In this chapter we discuss other methods that can be used to implement functions.
In section 4.1, Newton-Raphson and Goldschmidt iterations are discussed; these are methods we use in our implementations to increase precision after polynomial approximations. In the remaining sections of the chapter, alternative methods that were not used in any of the implementations are discussed, and we speculate on how applicable they are to the ePUMA architecture.

4.1 Newton-Raphson and Goldschmidt


Newton-Raphson division is an iterative method that improves the precision of
an approximation of a reciprocal. It uses two multiplications and one subtraction
operation per iteration. It is a special case of Newton’s method to find roots of
functions. A variation of Newton-Raphson division, that uses one more multipli-
cation per iteration, improves the precision of an approximation of inverse square
root.

4.1.1 Algorithm for reciprocal


The algorithm to increase the precision of a reciprocal approximation is as follows:

t1 := r_i · x
t2 := 2 − t1
r_{i+1} := r_i · t2

where r_i is the result of a previous approximation of 1/x (obtained from a previous iteration or a polynomial evaluation) and r_{i+1} is an improvement. t1 and t2 are temporary variables.
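In double precision the iteration looks as follows (our illustration, not thesis code; the kernels perform the same three operations with Q1.15/Q1.31 fixed point multiplications):

#include <stdio.h>

/* One Newton-Raphson refinement of r ~ 1/x (section 4.1.1):
 * t1 = r*x;  t2 = 2 - t1;  r' = r*t2.                        */
static double nr_recip_step(double r, double x)
{
    double t1 = r * x;
    double t2 = 2.0 - t1;
    return r * t2;
}

int main(void)
{
    double x = 0.7;
    double r = 1.4;                         /* crude guess for 1/0.7 */
    for (int i = 0; i < 3; i++) {
        r = nr_recip_step(r, x);
        printf("iteration %d: r = %.15f\n", i + 1, r);
    }
    /* converges quadratically towards 1/0.7 = 1.428571428571429 */
    return 0;
}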

4.1.2 Opposite sign of product trick


An interesting trick we can use is that when t1 is unsigned Q1.15 or Q1.31, we can calculate t2 = 2 − t1 by pretending t1 is a signed number (in Q1.15 or Q1.31) and taking its negative.

Example 4.1
To understand how this trick works, let us look at the following C code:

#include <stdio.h>

int main()
{
    unsigned char a, b;
    short i;

    for (i = 1; i < 256; i++) {
        a = (unsigned char) i;
        b = (unsigned char) (-(char) a);
        printf("a: %d, b: %d\n", a, b);
    }
}

What this code does is change the sign of the unsigned variable a as if it were a signed variable, and then store the result in b, which is also an unsigned variable. The output is:
a: 1, b: 255
a: 2, b: 254
a: 3, b: 253
...
a: 127, b: 129
a: 128, b: 128
a: 129, b: 127
...
a: 253, b: 3
a: 254, b: 2
a: 255, b: 1

We can see that b = 256 − a, and 256 is two times the weight of the MSB of the unsigned char datatype. When we use Q1.15 or Q1.31, the weight of the MSB is 1, meaning that two times the weight of the MSB is 2.

To understand why, let us go through the steps to calculate 2 − x in a safe way, when x is unsigned Q1.15 (meaning that x ∈ [0, 2 − 2^{-15}]) and we also want our result in unsigned Q1.15. We can follow these steps:

• Convert x to signed Q3.15. The two new bits are both zero valued and the leftmost is the sign bit.

• Find −x by inverting all the bits and adding one in the LSB position. The sign bit will be set (unless x = 0, in which case the negation overflows), as well as our other new bit.

• Calculate 2 − x by adding 2 and −x; the value 2 in signed Q3.15 has the two leftmost bits valued 01 and the rest zeros.

• Since the two leftmost bits of −x are valued 11, and the two leftmost bits of 2 are 01, the result of the addition of 2 and −x will have the two leftmost bits valued 00 (meaning it is a positive value smaller than 2, which is the expected result when one calculates 2 − x where x ∈ (0, 2)).

• We can convert this value back to unsigned Q1.15 by simply ignoring the two leftmost bits, which are zero valued.

• Since we know that the two leftmost bits will be zero if we calculate 2 − x using signed Q3.15, we can just as well take x in unsigned Q1.15, invert all the bits and add one in the LSB position, and get the same result.

This trick is the motivation behind the proposal in section 6.8.1: to extend the instruction set of our architecture such that one can choose to get the negative of the result of a multiplication instruction (either by giving a flag to the instruction or with a new instruction). Then we only need two instructions for each Newton-Raphson iteration, where the first instruction calculates t := −r_i · x and the second instruction calculates r_{i+1} := r_i · t.
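A C sketch of the resulting two-operation iteration is shown below. Here qmul_neg models a multiply instruction with a negate-result flag; it is our stand-in, not the actual Sleipnir instruction, and round-to-nearest Q1.15 multiplication is assumed.

#include <stdint.h>
#include <stdio.h>

/* Unsigned Q1.15 multiply with round-to-nearest. */
static uint16_t qmul_u(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b + (1u << 14)) >> 15);
}

/* -(a*b) reinterpreted as unsigned Q1.15, i.e. 2 - a*b (section 4.1.2). */
static uint16_t qmul_neg(uint16_t a, uint16_t b)
{
    return (uint16_t)(-qmul_u(a, b));
}

int main(void)
{
    uint16_t x = 0x599A;            /* ~0.7 in unsigned Q1.15  */
    uint16_t r = 0xB334;            /* ~1.4, a guess for 1/0.7 */
    uint16_t t = qmul_neg(r, x);    /* t = 2 - r*x             */
    r = qmul_u(r, t);               /* refined reciprocal      */
    printf("r = %.6f (1/0.7 = 1.428571...)\n", r / 32768.0);
    return 0;
}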

4.1.3 Algorithm for inverse square root


The Newton-Raphson algorithm to increase the precision of an inverse square root approximation is as follows:

t1 := r_i · x
t2 := (r_i · t1) / 2
t3 := 1.5 − t2
r_{i+1} := r_i · t3

where r_i is the result of a previous approximation of 1/√x (obtained from a previous iteration or a polynomial evaluation), and r_{i+1} is an improvement of that approximation (with smaller error). t1, t2 and t3 are temporary variables. Note that the division by two is done with a right shift, by setting the scale flag of the multiplication instruction accordingly.
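Again as a double precision illustration (our sketch; the kernels use fixed point, where the division by two comes for free via the scale flag):

#include <stdio.h>

/* One Newton-Raphson refinement of r ~ 1/sqrt(x) (section 4.1.3):
 * t1 = r*x;  t2 = r*t1/2;  t3 = 1.5 - t2;  r' = r*t3.             */
static double nr_invsqrt_step(double r, double x)
{
    double t1 = r * x;
    double t2 = r * t1 / 2.0;
    double t3 = 1.5 - t2;
    return r * t3;
}

int main(void)
{
    double x = 0.7;
    double r = 1.2;                 /* rough guess for 1/sqrt(0.7) */
    for (int i = 0; i < 3; i++)
        r = nr_invsqrt_step(r, x);
    printf("r = %.12f (exact: 1.195228609334)\n", r);
    return 0;
}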

4.1.4 Goldschmidt iteration


Goldschmidt iterations are similar to Newton-Raphson iterations for division and inverse square root. The first iteration is identical to a Newton-Raphson iteration, but the first instruction of every subsequent iteration has a data dependency on the same instruction as the last instruction of the previous iteration (rather than on the last instruction of the previous iteration itself, as in Newton-Raphson iterations). This means that the last instruction of an iteration and the first instruction of the next iteration can be issued at the same time.
As a result, Goldschmidt iterations give fewer pipeline penalties due to data dependencies than Newton-Raphson iterations. Both require the same number of arithmetic operations, which means that they are equally fast when we have enough inputs to fill the pipeline and eliminate pipeline penalties.

4.1.5 Our usage of Goldschmidt iterations

The negative product approach that we use to calculate a Newton-Raphson iteration for the reciprocal reduces the number of instructions needed by one, which is a greater benefit than using a Goldschmidt iteration.
The same approach cannot be used to calculate the inverse square root¹; therefore we use implementations with a Goldschmidt iteration in the cases where we use two iterations (we never use more than two iterations in our implementations). Implementations with Goldschmidt iterations are further discussed in [7] and [9].
This is the algorithm we use to calculate the inverse square root with two iterations:

t1 := r0 · r0 · x/2
t2 := 1.5 − t1
r1 := r0 · t2
t3 := t2 · t2 · t1
t4 := 1.5 − t3
r2 := r1 · t4

Note that r1 and t3 can be computed in parallel. Also note that t3 = t2 · t2 · t1 = r1 · r1 · x/2. The calculations of t1 and t3 must be done with two multiplication instructions each.
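For reference, the sequence can be written out in C (double precision, our sketch, not a thesis kernel):

#include <stdio.h>

/* The two-iteration Goldschmidt sequence from section 4.1.5. r0 is an
 * initial approximation of 1/sqrt(x). Note that r1 and t3 depend only
 * on t2 and earlier values, so they could be issued in parallel.     */
static double goldschmidt_invsqrt(double r0, double x)
{
    double t1 = r0 * r0 * x / 2.0;
    double t2 = 1.5 - t1;
    double r1 = r0 * t2;
    double t3 = t2 * t2 * t1;      /* equals r1 * r1 * x / 2 */
    double t4 = 1.5 - t3;
    return r1 * t4;                /* r2 */
}

int main(void)
{
    printf("%.15f\n", goldschmidt_invsqrt(1.2, 0.7));
    /* exact: 1/sqrt(0.7) = 1.195228609334394 */
    return 0;
}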

4.1.6 Error after iteration

After each iteration, the relative error of our new approximation is the square of the relative error of our previous approximation. This means that we double the number of significant bits in each iteration. This holds for both Newton-Raphson and Goldschmidt iterations.
This can be seen if we rewrite the algorithm so that we begin by calculating the relative error e1. The next iteration will then have the relative error e2:

¹ It is possible to use a similar approach, which is discussed in section 6.8.2.

e1 = r1 · x − 1
t1 = 1 − e1 = 2 − r1 · x
r2 = t1 · r1
e2 = r2 · x − 1
   = (1 − e1) · r1 · x − 1
   = r1 · x − e1 · r1 · x − 1
   = e1 − e1 · r1 · x
   = e1 · (1 − r1 · x)
   = −e1 · e1

Hence the magnitude of the relative error after an iteration is the square of the previous relative error.

Error of Goldschmidt vs. Newton-Raphson


A Goldschmidt iteration gives a worst case error that is larger by approximately half a bit, in comparison to a Newton-Raphson iteration. This is because when a Goldschmidt iteration is used we calculate t3 = t2 · t2 · t1, where both t2 and t1 have rounding errors. The same value is calculated by a Newton-Raphson iteration as r1 · r1 · x/2, where x does not have any rounding error. Because of this, Goldschmidt iterations give slightly less precision.

4.2 Lookup tables and interpolation


Using lookup tables is a popular method to evaluate functions in DSP applications. We can naturally use lookup tables on ePUMA if we want. There is a trade-off between precision and the amount of memory we use to store the lookup table. A convenient method is to use the most significant bits of an input value as an index into a lookup table.

4.2.1 Interpolation
The fastest approach is to get one value from the lookup table and use that value directly. We can improve the precision by getting two consecutive values from the lookup table and applying first degree linear interpolation. A common formula for linear interpolation is:

f(x) ≈ f(x0) + (f(x1) − f(x0)) / (x1 − x0) · (x − x0),   where x0 ≤ x < x1

If we use a few of the input value's MSBs as an index into the lookup table, then x1 − x0 is a constant (it will be some 2^i, where i ∈ Z), and we can find x − x0 by masking out the MSBs that were used as index. Then it is enough to multiply that value with f(x1) − f(x0) and shift the product, and no division will be needed. Hence, linear interpolation can be done in four arithmetic instructions (bit-wise AND, subtraction, multiplication with scaling, and addition).
Another possibility is, instead of fetching f(x0) and f(x1) from the lookup table, to fetch f(x0) and f(x1) − f(x0); then we do not have to calculate the difference, but we will, however, consume more memory.
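The following C sketch puts these pieces together for f(x) = 1/x with a Q1.15 input in [0.5, 1). The table size, the tabulated function and the index split are our choices for illustration, not taken from the thesis.

#include <stdint.h>
#include <stdio.h>

#define IDX_BITS 5                      /* 32 intervals over [0.5, 1)  */
#define TAB_SIZE (1 << IDX_BITS)
#define FRAC_BITS (14 - IDX_BITS)       /* low bits left below the index */

int main(void)
{
    /* Table of f(x0) = 1/x0 at the interval endpoints (one extra
     * entry so that f(x1) exists for the last interval).          */
    double tab[TAB_SIZE + 1];
    for (int i = 0; i <= TAB_SIZE; i++)
        tab[i] = 1.0 / (0.5 + 0.5 * i / TAB_SIZE);

    uint16_t x = 0x599A;                /* ~0.7 in Q1.15 */
    /* for x in [0.5, 1) bit 14 is always set, so the IDX_BITS bits
     * below it select the interval                                 */
    int idx = (x >> FRAC_BITS) & (TAB_SIZE - 1);
    /* the masked-out low bits give (x - x0) / (x1 - x0) in [0, 1) */
    double frac = (x & ((1 << FRAC_BITS) - 1)) / (double)(1 << FRAC_BITS);

    double y = tab[idx] + (tab[idx + 1] - tab[idx]) * frac;
    printf("approx 1/x = %f, exact = %f\n", y, 32768.0 / x);
    return 0;
}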

4.3 Piecewise polynomial interpolation


Another option is to fetch polynomial constants from a lookup table. Then we do
as before, divide the domain of our function into intervals, and for each interval
we have one polynomial which approximates the function we want to calculate.
Spline interpolation is an example of this. When spline interpolation is applied,
polynomials are found that interpolate a given set of data points (and we could
choose a set of datapoints that are on the curve of some function). Another
possibility is to find a polynomial that approximates the given function, for each
interval separately to generate a better polynomial approximation. We can for
example use min-max polynomials where we have one polynomial optimized for
each interval. That method might require more work to generate a lookup table,
but it will give a more precise approximation of a function.

4.3.1 Usage on ePUMA


None of the implementations discussed in this thesis uses a lookup table. However, the method we use for our implementations is related: like the lookup table method, it divides the domain of the function into intervals. For an input x we want to find an i such that x ∈ [2^i, 2^{i+1}), which can be seen as dividing our domain into even intervals on a logarithmic scale. Then we scale our input value to put it into a specified interval (always [0.5, 1)) and afterwards we scale the result accordingly. That approach is especially good when we are dealing with functions that are not defined for x = 0.
Using lookup tables and interpolation could be a good choice for implementing other mathematical functions on our architecture. Our soft floating point approach is not suitable for functions that are not as easy to scale after a polynomial evaluation.
If we implement a lookup table on an ePUMA SIMD core, we need at least half a cycle per lookup, because we can only address one memory location in each local vector memory per instruction. However, we can fetch up to 128 bits per lookup, which means that it would take equal time to look up polynomial constants (assuming the polynomial constants are in total 128 bits or less) and to look up just one value. To achieve two memory lookups per cycle, we need to store a copy of the lookup table in both of the local vector memories.
If we want to look up polynomial constants, we can perhaps not do one polynomial evaluation per cycle, because we do not have the same polynomial constants for every input value², but we can use one POWERSW instruction and one TMAC instruction, which would take two cycles per value to evaluate a polynomial of degree 7 or lower.

4.4 CORDIC
The CORDIC algorithm can be used to calculate trigonometric functions, hyperbolic functions, exponential functions, logarithms and square roots.
The algorithm uses only a small lookup table, adders and shifters. It is, however, an iterative algorithm, and each iteration increases the precision by approximately one bit [10], which means that it would require 16 iterations to calculate a function with approximately 16 correct bits. The CORDIC algorithm is often used in implementations with special hardware, since it only needs simple hardware (adders, shifters and a small lookup table).

4.4.1 Usage on ePUMA


Implementing the CORDIC algorithm on ePUMA has not yet been tried. It is possible to use vector instructions to run the algorithm for multiple input values in parallel. Due to data dependencies, one iteration would probably take at least 10 cycles, but because it is possible to use vector instructions for multiple input values, this algorithm could be a good choice in some implementations. Since the algorithm increases the precision by approximately one bit per iteration, it would take 16 iterations to get approximately 16 correct bits, but fewer bits could be sufficient in some situations.

4.5 Product of parabolas


In [4], a method for hardware implementation of several mathematical functions (for example the logarithm, square root and trigonometric functions) is introduced. The method is based on evaluating several second degree polynomials and multiplying the results. Finding the product of k parabolas is equivalent to evaluating a polynomial of degree 2k.

4.5.1 Usage on ePUMA


The exact method discussed in [4] depends on having customized hardware, but the concept of calculating polynomials as a product of parabolas (or even of other polynomials of lower degree) could be adapted to the ePUMA hardware; it requires an equal number of multiplication operations. It would, however, be at least as complicated to implement a special instruction that evaluates a polynomial as a product of parabolas as the method that we use (which is to calculate the powers of the function argument, multiply the powers by the polynomial constants, and then sum all the products). It would also be challenging to use this approach with minimum fixed point polynomial error.
The method has the advantage that the highest power of the input needed is the second power, which helps because it can be difficult to deal with the large dynamic range of the powers.

² It is currently under consideration how the polynomial instruction will be implemented, and perhaps it will only be possible to evaluate a polynomial for one value per cycle when we always use the same polynomial constants.
Chapter 5

Implementations

This chapter describes the implementations of the functions that were done for this work. Most of the implementations use special instructions which were added to the simulator and then proposed for addition to the instruction set (see chapter 6). If some of the instructions are not implemented, or are implemented differently, the same methods can nevertheless be used. The cycle count would be different, but the precision would be the same, as long as the same method is used and the arithmetic operations are done with the same semantics.

5.1 Method
The functions are implemented in assembly code that is intended to be run on one SIMD coprocessor. The assembly code was tested in a pipeline accurate simulator so that the cycle cost and the precision of the result could be confirmed. For implementations with 16 bit inputs, the simulation was run for every possible input value that the kernel is supposed to support.
For implementations with 32 bit inputs it would be too time-consuming to test all 2^32 input values. Instead, several thousand input values from the entire input range were tested. For some kernels the test inputs were selected at even intervals, but for other kernels all upper words (the 16 MSBs) were tested, each with a (pseudo) random lower word (16 LSBs). We also tried both approaches on some kernels, and they gave the same results.

Some new instructions were added to the instruction set of the simulator for these implementations. Those instructions are discussed and proposed in chapter 6. There are some minor differences between the instruction implementations that were used and the instructions as they are proposed and discussed in chapter 6, because the instructions were reviewed after having been implemented and used in the simulator.
When we test the kernels, a script is used to run the simulator multiple times, with different input values each time. The input value (the argument to the mathematical function) is written to memory location m1[0] before the kernel is run.


5.2 Errors
As was said in the previous section, each kernel is tested in a simulator and we compare the result with a reference value. The reference value is obtained with either Matlab or Python (math module) and is in most cases a value in IEEE754 64 bit floating point format (52 bit mantissa (+1), 11 bit exponent and one sign bit). We give either the worst absolute error or the worst absolute relative error. The absolute error is simply the absolute value of the difference between our result and the reference value. The absolute relative error is the absolute value of the difference between our result and the reference value, divided by the reference value. The error is usually given as a power of 2, because that makes it easy to see how many correct bits we have. If the absolute relative error is 2^−e then we have e correct significant bits in our result. If the absolute error is 2^−e then the error appears in the bit with the weight 2^−e.

Example 5.1
Some result in Q1.15 format has the worst case absolute error 2^−12. Then the bit in the position with the weight 2^−12 (the fourth least significant) and the less significant bits may contain errors, but the other bits are correct.
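To make the two error measures concrete, the following Python sketch (our illustration, not part of the test scripts) computes both measures as powers of 2 for a Q1.15 result against a double precision reference:

import math

def error_bits(result_q15, reference):
    # absolute and relative error of a Q1.15 result, as powers of 2
    value = result_q15 / 2 ** 15
    abs_err = abs(value - reference)
    rel_err = abs_err / abs(reference)
    return math.log2(abs_err), math.log2(rel_err)

# a result that is 9 ULPs off: the absolute error appears near bit 2^-12
abs_e, rel_e = error_bits(round(math.sqrt(0.5) * 2 ** 15) + 9, math.sqrt(0.5))
print(f"absolute error 2^{abs_e:.1f}, relative error 2^{rel_e:.1f}")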

5.3 Estimation of Cycle cost


In the following sections we give estimations of the cycle cost of each kernel. We give both the cycle cost of calculating the function for one input value, and an estimate of how many cycles it takes to evaluate the function per input value when we have multiple input values and want to calculate an approximation for each of them. Each kernel (with one exception) evaluates the function for one input value. The value which is given as cycles for one input value is the number of simulator cycles it takes to run the kernel (from .main to stop) in the simulator.

5.3.1 Cycle cost for multiple inputs


An estimate is also given for how many cycles it would take to evaluate the function for multiple input values. By multiple input values we mean sufficiently many that we can fill the pipeline without issuing any NOPs due to data dependencies, and use vector instructions to make the required calculations in parallel for several input values simultaneously. For those estimates we make the following assumptions:

• Polynomial evaluation takes 1 cycle per value. At this point it is not certain whether the architecture will have support for that. If it is not supported, a polynomial evaluation will take two cycles per value.

• Conversion from fixed point to soft floating point can be done with vector inputs with 4 values per cycle, which results in 0.25 cycles per value. If conversion to soft floating point is only possible for one scalar at a time, it will take 1 cycle per value.

• We do not count overheads, that is, a constant amount of NOPs which is independent of how many input values we have. We might need such NOPs to avoid structural hazards; for example, after the last issued POLYW instruction we need to issue a fixed amount of NOPs before we can issue a multiplication instruction (the POLYW needs to have finished its multiplication stages before).

• We count instructions with 16 bit operands as 0.125 cycles per instruction, because the instructions can be run for eight scalars per cycle. Similarly, we count instructions with 32 bit operands as 0.25 cycles per instruction.
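As a worked example of how these assumptions combine: the first reciprocal kernel in section 5.6.3 issues one 16 bit subtraction and one polynomial evaluation per value, which under the assumptions above costs 0.125 + 1 = 1.125 cycles per value, the figure listed in that kernel's table.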

How many input values?


In order to achieve the cycle cost which is listed for multiple input values, we need enough input values that no NOPs are issued due to dependencies, and vector instructions are always used instead of scalar instructions. Commonly there are about 7 consecutive NOPs, which means that in order to fill the pipeline with vector instructions we need 64 scalars (we replace one scalar word instruction and seven NOPs with eight vector word instructions).

However, sometimes we need to issue NOPs to avoid structural hazards. In many cases this penalty is a fixed amount of cycles, independent of the number of input values we have (meaning that the penalty cycles per value approach zero as the number of input values increases). In other cases we get some penalty cycles due to the limited size of the register file. Often it is necessary to store temporary variables, and it is only possible to store 64 scalars (16-bit) in the register file (assuming we only use the general purpose vector registers); as a result we sometimes get some penalty cycles (around 3-8 cycles) for every 64 scalars or something similar. This means that the estimate of cycle cost for multiple inputs is a bit too optimistic.

5.4 Kernel code


The following sections contain details about the implementations of several kernels. These sections contain segments from the kernel source code; the full source code is found in Appendix A.

5.5 Invalid input handling


The kernels implemented have no handling of invalid inputs. The kernels will only give correct results for a specified domain of inputs. If a kernel gets an input which is not in the domain, it will return an erroneous result (but not crash). The reason we omit invalid input handling is that when no invalid input handling is needed, no time is spent on it, and the function evaluation can be done as fast as possible. Most of the kernels use conversion to soft floating point format. If the format=15u flag is used (see section 6.6.1), the kernels will return f(|x|) for the function f they implement.

5.5.1 Zero input



Of all the functions that were implemented, only √x is defined for x = 0. The kernel implementations do however not support the zero input. It must be checked as a special case in order to support it. The kernels will return some value though (usually the zero power term (DC term) of the polynomial). For the functions that are not defined for x = 0, the input can be checked if it is desired that the kernel returns some special value when it gets a zero input. Such a special value could be something similar to NaN (not a number) or Inf (infinity) in the IEEE754 floating point standard.

5.5.2 Memory and register usage


The usage of memory and registers was not optimized. In most cases the kernels
use very little memory but in some cases registers could have been reused. By
using each register for only one variable it became easier to debug the code.

5.6 Reciprocal
One of the challenges of implementing the reciprocal is the difference between its domain and its range. For inputs smaller than 1, the result is larger than 1, and vice versa. Several kernels were implemented, and more than one solution was used to deal with the domain and range problem. The usage of soft floating point is essential in all the solutions.

5.6.1 Choosing a Polynomial


Here we describe in some detail how we chose the polynomial that we used. We use a very similar method to choose polynomials for the other functions.

5.6.2 Pre-analysis
To select a polynomial to use, we run the Remez algorithm with various parameters. We try various degrees of polynomials and also various values for the constant a (the constant which is subtracted from the input value before calculating the polynomial). We are interested in the real polynomial error and the fixed point polynomial error (discussed in chapter 3). The Remez algorithm (which we use to calculate the polynomial constants) gives the real polynomial error as a by-product. We can get an estimate of the fixed point error by assuming that the error of the calculation of the powers in the polynomial evaluation is one ULP, and then finding the largest value of pi · exi in equation 3.11 in section 3.1.1, where pi is the largest polynomial constant and exi is the weight of the ULP.
The results are shown in figure 5.1. The error of an implementation is the sum of the real polynomial error and the fixed point polynomial error. The difference between the signed and unsigned fixed point polynomial error is whether the value x − a, which we calculate powers of, is a signed or unsigned value. When it is unsigned, we have to make sure it is always a positive value. Since x (the mantissa of the input value) is between 0.5 and 1, this means that the value of a must be 0.5 or smaller.
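The pre-analysis loop can be sketched in Python. This is our reconstruction: a plain least-squares fit on a dense grid stands in for the true Remez min-max fit (so the exact numbers will differ somewhat), and the fixed point error is estimated from the largest constant and the ULP weight, as described above:

import numpy as np

def pre_analysis(a, degree, frac_bits=15):
    # fit p(t) ~ 1/(t + a) for t = x - a, x in [0.5, 1]
    t = np.linspace(0.5 - a, 1.0 - a, 2000)
    coeffs = np.polyfit(t, 1.0 / (t + a), degree)     # stand-in for Remez
    real_err = np.max(np.abs(np.polyval(coeffs, t) - 1.0 / (t + a)))
    # assume ~1 ULP error in each computed power; the largest constant
    # then dominates the fixed point polynomial error
    fixp_err = np.max(np.abs(coeffs)) * 2.0 ** -frac_bits
    return np.log2(real_err), np.log2(fixp_err)

for deg in range(2, 9):
    print(deg, pre_analysis(0.8125, deg))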

Why not Taylor polynomials?


The function 1/(x + 1) has the known Maclaurin series with polynomial constants 1 for all even powers and −1 for all odd powers. That would give us a smaller fixed point polynomial error in comparison to the min-max polynomial, but the real polynomial error would be larger. If we used a 7th degree Maclaurin polynomial for 1/(x + 1) to calculate an approximation of 1/x, the maximum error in the range x ∈ [0.5, 1] would be 2^−7. We could use a Maclaurin polynomial for 1/(x + a) for some other a, but the degree 7 Maclaurin polynomial would still always give a worse maximum error than a 5th degree min-max polynomial.
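The 2^−7 figure follows from the geometric series remainder: with t = x − 1 ∈ [−0.5, 0], truncating 1/(1 + t) = 1 − t + t^2 − . . . after the 7th degree term leaves |R7(t)| = |t|^8/(1 + t), which at t = −0.5 is 2^−8/0.5 = 2^−7.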

[Figure 5.1: Estimates of the real polynomial error and the fixed point polynomial error. The plot shows log2(e) against the polynomial degree (2-8) for three curves: the estimate of the fixed point error (signed), the estimate of the fixed point error (unsigned), and the real error.]

Simulation
We see that the result of the error analysis is that using unsigned Q0.16 with a = 0.5 or using signed Q1.15 with a = 0.8125 should give similarly precise results. Assembly code was written that uses these two polynomials to implement the calculation of 1/x. Then simulations were run where both these polynomials were used to calculate 1/x for every possible input in the range [0.5, 1]. Both simulations returned the result in unsigned Q1.15. The results of the simulations were that the polynomial with a = 0.8125 has a worst case error of 11.1 ULPs and the polynomial with a = 0.5 has a worst case error of 11.5 ULPs. The ULP weight is 2^−15 and 11 ULPs ≈ 2^−11.5, so we see that the result of the simulation matches the error estimate from the pre-analysis well.
The polynomial with a = 0.8125 is used in all implementations.

5.6.3 Kernels
Several kernels that calculate 1/x have been implemented. All of them, except one, use a 5th degree min-max polynomial for 1/(x + 0.8125). The polynomial can give approximately 12 correct bits (the worst case relative error is 2^−11.85). Newton-Raphson iterations can be used to increase the precision. One Newton-Raphson iteration doubles the number of correct bits, since the relative error after an iteration is the square of the error prior to the iteration.
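As an illustration of why one iteration doubles the number of correct bits: the Newton-Raphson update for the reciprocal is r' = r · (2 − x · r), which the kernels below realize with two multiplication instructions per iteration. A floating point Python sketch (ours) of the error behavior:

import math

def nr_reciprocal_step(x, r):
    # one Newton-Raphson step for 1/x; the relative error is squared
    return r * (2.0 - x * r)

x = 0.75
r = (1.0 / x) * (1 + 2 ** -12)    # start with about 12 correct bits
for _ in range(3):
    rel_err = abs(r * x - 1.0)
    print(f"relative error 2^{math.log2(rel_err):.1f}")
    r = nr_reciprocal_step(x, r)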

Kernel 1
Input format: Q1.15 in the range [0.5, 1)
Output format: unsigned Q1.15
Error: max error 11.1 ULPs, or 2^−11.5
Cycles, one input value: 20
Cycles, multiple input values: 1.125 cycles per value

The first kernel uses one polynomial evaluation. It uses a 5th degree polynomial for 1/(x + 0.8125). Before the polynomial is evaluated, 0.8125 is subtracted from the input value; then the polynomial is evaluated using a POLYW instruction. Note that the input range is only [0.5, 1).
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x9800 // -0.8125

// constants in Q4.12 for 1/(x+0.8125)
.cm
0x13B1 0xE7BC 0x1DE8 0xDE01 0x2CD2 0x8E6B 0 0 0

Listing 5.1: Kernel 1



Kernel 2
Input format: Q1.15, all positive values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^−11.85
Cycles, one input value: 23
Cycles, multiple input values: 1.375 cycles per value

This kernel can calculate 1/x for all positive values in Q1.15. It is the same implementation as Kernel 1, except that we first convert the input to soft floating point format. The exponent is left unchanged, and a programmer can use it to scale the result in any way he wants. A programmer can also choose to use the mantissa of the result in a multiplication before making the conversion from floating point.
.main
stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

Listing 5.2: Kernel 2
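The conversion to soft floating point used by this and the following kernels can be modeled in Python as below. This is our sketch; it only captures the mantissa/exponent relation x = m · 2^−e from section 5.7.1, not the instruction's register semantics:

def to_soft_float(x_q15):
    # normalize a positive Q1.15 value: x = m * 2**-e with m in [0.5, 1)
    assert 0 < x_q15 < 2 ** 15
    m, e = x_q15, 0
    while m < 2 ** 14:            # shift left until the mantissa bit is set
        m <<= 1
        e += 1
    return m, e                   # m is still a Q1.15 bit pattern

m, e = to_soft_float(0x0300)      # 0.0234375 = 0.75 * 2**-5
print(hex(m), e)                  # 0x6000 (= 0.75 in Q1.15), 5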



Kernel 3
Input format: Q1.15, all positive values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^−15.0
Cycles, one input value: 35
Cycles, multiple input values: 1.625 cycles per value

Same as Kernel 2 except that after the polynomial evaluation, a Newton-Raphson iteration is used to increase the precision, such that the relative error is at most 2^−15 and the error of the mantissa is at most 1 ULP.
.main
stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
// Newton-Raphson iteration
smulww<scale=15, uu, negres, rnd> vr0.0 vr1.2 vr2.0
5*nop
smulww<scale=15, uu, sat, rnd> vr0.1 vr0.0 vr2.0
5*nop
stop

Listing 5.3: Kernel 3



Kernel 4
Input format: Q1.15, all positive values
Output format: Q16.16
Error: max relative error is 2^−16
Cycles, one input value: 38
Cycles, multiple input values: 2.375 per value (1.875 if 16 bit multiplications are used in the iterations)

Same as Kernel 3 except that after the polynomial evaluation we shift the mantissa by the value 15 − exponent. The constant 15 is stored in memory (location m0[1].sw) and after the conversion to soft floating point the exponent is subtracted from it.
In this kernel 32 bit multiplications (the smuldd instruction) are used in the Newton-Raphson iteration. By using 32 bit multiplications in the Newton-Raphson iteration, we get up to 24 correct significant bits. But we do have an error of up to 1 ULP, and for this input domain and this output range, the smallest outputs have 16 significant bits. Therefore an error of 1 ULP for one of the smallest outputs in this range results in a relative error of 2^−16. Larger values have either 1 ULP error, or 24 correct significant bits. For multiple inputs, this kernel can be modified to give at most 16 correct significant bits, by using 16 bit multiplications in the Newton-Raphson iterations. That reduces the cycle cost for multiple inputs by 0.5 cycles per value.
.main

stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
ssubw vr1.6 m0[1].sw vr1.3
2*nop
scopyw vr1.4 vr1.2
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton-Raphson iteration
smuldd<scale=31, uu, rnd, negres> vr0.0d vr1.2d vr2.0d
5*nop
smuldd<scale=31, uu, sat, rnd> vr0.1d vr0.0d vr2.0d
5*nop

// scale result
slsrd vr0.3d vr0.1d vr1.6
2*nop
stop


.m0
0x9800 15

Listing 5.4: Kernel 4



Kernel 5
Input format: Q1.31, all positive values
Output format: soft floating point, mantissa is unsigned Q1.31
Error: max relative error is 2^−24
Cycles, one input value: 35
Cycles, multiple input values: 2.125 cycles per value

32 bit version, with 32 bit input and soft floating point output. After converting the input value to soft floating point we use its 16 most significant bits as an input to the same 5th degree polynomial. Then we use all 32 bits of the mantissa of the input value in the Newton-Raphson iteration with 32 bit multiplications.
.main
stofloat32d<sout> vr1h m1[0].sd
2*nop
saddw vr1.0 vr1.4 m0[0].sw // subtract
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton-Raphson iteration
smuldd<scale=31, uu, rnd, negres> vr0.0d vr1.2d vr2.0d
5*nop
smuldd<scale=31, uu, sat, rnd> vr0.1d vr0.0d vr2.0d
5*nop
stop

Listing 5.5: Kernel 5

Kernel 5b
Input format: Q1.31, all positive values
Output format: soft floating point, mantissa is unsigned Q1.31
Error: max relative error is 2^−31
Cycles, one input value: 47
Cycles, multiple input values: 2.625 cycles per value

Same as Kernel 5 but with two Newton-Raphson iterations, for more precision.
// ... the following code is added before ``stop'' in kernel 5:
// second iteration
smuldd<scale=31, uu, rnd, negres> vr0.0d vr0.1d vr1.2d
5*nop
smuldd<scale=31, uu, sat, rnd> vr0.1d vr0.0d vr0.1d
5*nop
stop

Listing 5.6: Kernel 5b



Kernel 6
Input format: unsigned Q0.16, all values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^−15
Cycles: 64 values in 120 cycles

A second degree polynomial for 1/x is used, and then two Newton-Raphson iterations. Three vector multiplications and two vector additions are used to evaluate the second degree polynomial (we do not use the poly instruction). The second degree polynomial gives approximately 6 correct bits.

The 120 cycles include overheads. If we use the same approach as before to estimate the number of cycles per value, the result is 1.375 cycles per value (2/8 of a cycle for the conversion to soft floating point, 5/8 for the polynomial evaluation and 2/8 for each Newton-Raphson iteration). 8 cycles are spent on copying an intermediate result from a vector register to memory to use it later. It became rather tricky to decide where to store intermediate results (VRF, LVM1 or LVM2), in order to be able to access them again when needed, in an efficient way. If the vector register file were larger, as is suggested in [5], this copying would not be necessary. These 120 cycles also include 8 NOPs after the last instruction, to wait for the last instruction to finish. If we have more than 64 input values, we can replace these 8 NOPs with copy instructions that copy the results from the vector registers to memory, and after they have finished we can immediately start evaluating a new batch of 64 values. This means that 120 cycles per 64 values is a very realistic estimate of how fast we can find the reciprocal of multiple 16-bit input values.
The source code is in Appendix A, section A.1.7.

5.7 Inverse square root


5.7.1 Method
We use similar methods as before. We find a suitable polynomial which is optimized to give good precision in the range x ∈ [0.5, 1]. We convert our input to soft floating point before we evaluate the polynomial, and either return the result in soft floating point format, so that a programmer can decide how he wants to scale the result, or perhaps use the value in the soft floating point format (for example in a multiplication).

Soft floating point usage


By using soft floating point to implement the inverse square root, we meet a similar problem to the one discussed in section 5.8.2. We convert our input to mantissa and exponent such that

x = m · 2^−e,  m ∈ [0.5, 1],  e ∈ {0, 1, 2, . . . , 15}    (5.1)

And then the inverse square root is:

1/√x = (1/√m) · (1/√(2^−e))    (5.2)
     = (1/√m) · 2^(e/2)        (5.3)

We use two different methods to deal with this. We can either precalculate 2^(e/2) and use e as an offset when addressing a multiplicand, or we can multiply the result of the polynomial calculation with √2 for odd e, or 1 for even e, then right shift the exponent (equivalent to floor division by two) and then return the result as a soft floating point number.

5.7.2 Kernels
We implemented several kernels that calculate the inverse square root. We tried various ways of returning the result; we tried using both Newton-Raphson iterations and Goldschmidt iterations after a 16 bit polynomial evaluation, and we also tried calculating a high degree polynomial with 32 bit precision to get 32 bit results with acceptable precision.
We also try both of the methods discussed in section 5.7.1.

Kernel 1
Input format: Q1.15, in range [0.5, 1)
Output format: unsigned Q1.15
Error: max error is 2^−13.68
Cycles, one input value: 20 cycles
Cycles, multiple input values: 1.125 cycles per value

This kernel uses only a subtraction and a polynomial evaluation. It can only calculate the inverse square root for values in the range [0.5, 1).
The polynomial used has its coefficients stored in Q2.14, which means that we can expect a fixed point polynomial error of approximately 2^−14, which is close to the maximum error in our simulations.
.main

saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x9000

.m1
0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

Listing 5.7: Kernel 1



Kernel 2
Input format: Q1.15, in range [2^−15, 1)
Output format: soft floating point, mantissa is in unsigned Q1.15
Error: maximum relative error is 2^−13.91
Cycles, one input value: 30 cycles
Cycles, multiple input values: 2 cycles per value

This kernel takes any positive Q1.15 number, calculates 1/√x and returns the result in soft floating point format. The input is converted to soft floating point, a polynomial is evaluated and the result is multiplied with √2 or 1, depending on whether the exponent is odd or even, and then the exponent is right shifted by one (floor division by 2).
.main

stofloatw<sout> vr0.0d m1[0].sw
scopyw vr0.3 m0[2].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr0.0 m0[0].sw
slsrw vr0.4 vr0.1 m0[3].sw // shift exponent
sandw vr0.2 vr0.1 m0[3].sw // set flag if exponent is odd number
2*nop
scopyw.ne vr0.3 m0[1].sw
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
smulww<scale=15, uu, sat> vr3.0 vr2.0 vr0.3 // multiply result with 1 or sqrt(2)
5*nop
stop

.m0
0x9000 0xB505 0x8000 0x0001 0 0 0 0

Listing 5.8: Kernel 2



Kernel 2b
Input format: Q1.15, in range [2^−15, 1)
Output format: unsigned Q8.24
Error: maximum relative error is 2^−13.91
Cycles, one input value: 29 cycles
Cycles, multiple input values: 3.375 cycles per value

This kernel converts to soft floating point, evaluates the polynomial and uses the exponent of the floating point representation as an offset to address a multiplicand in memory, to convert the result of the polynomial evaluation to unsigned Q8.24.
.main

stofloatw<sout> vr0.0d m1[0].sw
2*nop
saddw vr1.0 vr0.0 m0[0].sw
3*nop
slslw vsr1.4 vr0.1 m0[1].sw // shift exponent and save to address register ar1
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
9*nop

smulwd<scale=31, uu> vr3.0d vr2.0 m0[ar1+8].sd
8*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0x0001 0 0 0 0 0 0
0x0100 0x0000 0x016A 0x09E6 0x0200 0x0000 0x02D4 0x13CD
0x0400 0x0000 0x05A8 0x279A 0x0800 0x0000 0x0B50 0x4F33
0x1000 0x0000 0x16A0 0x9E66 0x2000 0x0000 0x2D41 0x3CCD
0x4000 0x0000 0x5A82 0x799A 0x8000 0x0000 0xB504 0xF334

Listing 5.9: Kernel 2b



Kernel 3a
Input format: Q1.31, in range [2^−31, 1)
Output format: soft floating point, mantissa is unsigned Q1.31
Error: maximum relative error is 2^−26.58
Cycles, one input value: 51 cycles
Cycles, multiple input values: 3.5 cycles per value

This kernel takes 32 bit inputs, converts to soft floating point and evaluates the polynomial for the 16 MSBs; then one Newton-Raphson iteration is used to increase the precision. The result is then multiplied with √2 if the exponent is odd, the exponent is shifted to the right, and the result is returned in soft floating point format.
.main

stofloat32d<sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// multiply with 1 if exponent is even or sqrt(2) if it is odd
smuldd<scale=31, uu, rnd> vr3.0d vr4.3d vr0.2d
5*nop

stop

.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0xB504 0xF334 0x8000 0x0000

Listing 5.10: Kernel 3a



Kernel 3b
Input format: Q1.31, in range [2^−31, 1)
Output format: soft floating point, mantissa is unsigned Q1.31
Error: maximum relative error is 2^−30.6
Cycles, one input value: 72 cycles
Cycles, multiple input values: 4.5 cycles per value

Same as Kernel 3a but with two Newton-Raphson iterations.


...
// iteration
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// iteration 2
smuldd<scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R*X
5*nop
smuldd<scale=32, uu, rnd> vr5.1d vr4.3d vr5.0d // R*R*X/2
2*nop
ssubd vr5.2d m0[4].sd vr5.1d
5*nop
smuldd<scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d
5*nop

// multiply with 1 if exponent is even or sqrt(2) if it is odd
smuldd<scale=31, uu, rnd> vr3.0d vr5.3d vr0.2d
5*nop
stop

Listing 5.11: Kernel 3b



Kernel 3c
Input format: Q1.31, in range [2^−31, 1)
Output format: soft floating point, mantissa is unsigned Q1.31
Error: maximum relative error is 2^−29.85
Cycles, one input value: 66 cycles
Cycles, multiple input values: 4.5 cycles per value

Same as Kernel 3b but with two Goldschmidt iterations. Note the differences in the cycle cost for one input value and in the errors. These differences are explained in sections 4.1.4 and 4.1.6.
...
// iteration
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R1*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R1*R1*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop

// iteration 2, Goldschmidt
smuldd<scale=31, uu, rnd> vr6.0d vr4.2d vr4.2d // T2*T2

// this next instruction is part of the previous iteration
// but placed here due to data dependencies
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R2 = R1*T2
4*nop
smuldd<scale=31, uu, rnd> vr6.1d vr6.0d vr4.1d // TT1 = T2*T2*T1
2*nop
ssubd vr5.2d m0[4].sd vr6.1d // TT2 = 1.5 - TT1
5*nop
smuldd<scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R3 = R2*TT2 = R1*T2*TT2
5*nop

// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd<scale=31, uu, rnd> vr3.1d vr5.3d vr0.2d
5*nop
stop

Listing 5.12: Kernel 3c
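To see how the Goldschmidt variant relates to Newton-Raphson, here is a floating point Python sketch (our illustration of the structure in listings 5.11 and 5.12). In exact arithmetic the two compute the same value; on the hardware they differ because Goldschmidt reuses already rounded fixed point intermediates instead of going back to the exact input x:

def nr_step(r, x):
    # one Newton-Raphson step for 1/sqrt(x)
    return r * (1.5 - 0.5 * x * r * r)

def two_nr_iterations(x, r):
    # the second step recomputes r*r*x/2 from the exact input x
    return nr_step(nr_step(r, x), x)

def two_goldschmidt_iterations(x, r):
    t1 = 0.5 * x * r * r          # T1 = R1*R1*X/2
    t2 = 1.5 - t1                 # T2
    r2 = r * t2                   # R2 = R1*T2
    tt1 = t2 * t2 * t1            # TT1 reuses already computed values
    return r2 * (1.5 - tt1)       # R3 = R2*TT2

x = 0.7
r0 = x ** -0.5 * (1 + 2 ** -13)   # start with about 13 correct bits
print(two_nr_iterations(x, r0), two_goldschmidt_iterations(x, r0), x ** -0.5)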



Kernel 4
Input format: Q1.31, in range [2^−31, 1)
Output format: returns mantissa in Q2.30 and exponent of input
Error: maximum relative error is 2^−27
Cycles, one input value: 37 cycles
Cycles, multiple input values: 8 cycles per value

Kernel 4 is an experiment with using 32 bit polynomial evaluation. The polynomial used was of degree 11. An instruction that calculates powers 1 to 4 of a 32 bit scalar was used, then powers 5-8 were calculated with a 32 bit scalar to vector double instruction, and then powers 9-12 in the same way. After that, three TMAC instructions are used, that each multiply four of the powers with four polynomial constants and accumulate the sum of the products. Because three TMAC instructions are needed, we stored each set of four polynomial constants with as many significant bits as we could, and as a result we used a different scaling for each TMAC instruction. The difference in precision between using various scalings for the TMAC instructions and using the same scaling for all of them was not big (less than one bit), but it was noticeable, and there is no good reason against using various scalings. Neither a 32 bit power instruction nor a 32 bit TMAC instruction is part of the current version of the instruction set. The purpose of this test was to see if they could be useful. If we compare this result with Kernel 3a, we see that we get similar precision, and it takes fewer cycles for one input value, but more cycles for multiple input values. This kernel does however not multiply the result with √2 for inputs with odd exponents, neither is the exponent right shifted. That would take an additional 6 cycles for one input value. The other instructions needed can be placed where there are now issued NOPs.
The kernel source code is found in Appendix A.

5.8 Square root

5.8.1 Choosing a polynomial
To choose a polynomial to use, we use the same approach that we used for the reciprocal implementation. We did a pre-analysis where we ran the Remez algorithm with various input parameters, to estimate the size of the error we should expect for each polynomial. Simulations were then used to compare polynomials that gave similar results in the pre-analysis.
The result was that we use a 6th degree polynomial for √(x + 0.75). It gives us a maximum error of one ULP.

5.8.2 Soft floating point usage

As in other implementations, we convert our input to soft floating point format and use a polynomial that is optimized to be precise in the range [0.5, 1]. So we convert a 16 bit input x such that

x = m · 2^−e,  m ∈ [0.5, 1],  e ∈ {0, 1, 2, . . . , 15}    (5.4)

To calculate √x we use a polynomial to find √m, such that:

√x = √m · √(2^−e)    (5.5)
   = √m · 2^−e/2     (5.6)

Since 2^e/2 is not an integer when e is odd, we can not use only shifting to perform the multiplication √m · 2^−e/2.
We use two different methods to solve this. In Kernel 2 we shift √m by ⌊e/2⌋, and then multiply with the constant √(1/2) if e is odd.
In Kernel 3 we store the constants 2^−e/2 for every possible value of e in memory. The exponent e is then copied to an address register, and then used as an offset to fetch an operand from memory (using the offset addressing mode), which is then multiplied with √m.
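Both methods can be sketched in floating point Python (our illustration; sqrt_m stands for the polynomial result √m and e for the exponent from the soft floating point conversion):

import math

def sqrt_scale_shift(sqrt_m, e):
    # Kernel 2's method: shift by floor(e/2), multiply by sqrt(1/2) if e is odd
    r = sqrt_m * 2.0 ** -(e // 2)
    return r * math.sqrt(0.5) if e & 1 else r

# Kernel 3's method: a lookup table of 2**(-e/2) for every possible e
TABLE = [2.0 ** (-e / 2) for e in range(16)]

def sqrt_scale_table(sqrt_m, e):
    return sqrt_m * TABLE[e]

m, e = 0.75, 5                    # represents x = 0.75 * 2**-5 = 0.0234375
print(sqrt_scale_shift(math.sqrt(m), e),
      sqrt_scale_table(math.sqrt(m), e),
      math.sqrt(0.75 * 2 ** -5))  # all three agree: 0.15309...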

5.8.3 32 bit version

The most efficient way we have found to calculate √x with 32 bit precision of the result is to first calculate 1/√x by using a polynomial and then one or two Newton-Raphson iterations. That result is then multiplied with x to get √x.
A 10th degree min-max polynomial can give a result with close to 32 correct bits, but it would require all the calculations to be done with 32 bit precision.
5.8.4 Zero input


As mentioned in section 5.5.1, the kernels do not support the zero input. It must be checked as a special case in order to be supported.

5.8.5 Kernels
Six kernels were implemented, three with 16 bit input and output, and three with
32 bit input and output.

Kernel 1
Input format: Q1.15 in the range [0.5, 1]
Output format: unsigned Q0.16
Error: max error 1 ULP (2^−15.6)
Cycles, one input value: 20 cycles
Cycles, multiple input values: 1.125 cycles per value

This kernel calculates √x for x ∈ [0.5, 1]. It uses a 6th degree polynomial for √(x + 0.75). First the constant −0.75 is added to the input, then the polynomial is calculated.
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0xa000 // -0.75

.m1
0 // input value

.cm // constants in Q1.15 sqrt(x+0.75)
0x6EDB 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8 0

Listing 5.13: Kernel 1



Kernel 2
Input format: Q1.15, all positive values
Output format: unsigned Q0.16
Error: max error 1 ULP
Cycles, one input value: 31
Cycles, multiple input values: 2.125 cycles per value

This kernel can calculate √x for all positive values. The input is first converted to soft floating point format and then the polynomial is calculated in the same way as in Kernel 1. Then the first method described in section 5.8.2 is used to convert from soft floating point to fixed point.
.main
stofloatw<sout> vr0.0d m1[0].sw
scopyw vr0.3 m0[2].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr0.0 m0[0].sw // add constant before polynomial evaluation
slsrw vr0.4 vr0.1 m0[3].sw // shift exponent
sandw vr0.2 vr0.1 m0[3].sw // set flag if exponent is odd number
2*nop
scopyw.ne vr0.3 m0[1].sw // overwrite the value 1 with the value sqrt(0.5)
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
10*nop
slsrw vr1.1 vr0.3 vr0.4 // shift result by floor(exponent / 2)
2*nop

// multiply result with 1 (even exponent) or sqrt(0.5) (odd exponent)
smulww<scale=15, uu, rnd> vr3.0 vr1.1 vr2.0

5*nop
stop

Listing 5.14: Kernel 2



Kernel 3
Input format: Q1.15, all positive values
Output format: unsigned Q0.16
Error: max error 1 ULP
Cycles, one input value: 28
Cycles, multiple input values: 3.625 cycles per value

Same as Kernel 2 except the other method described in section 5.8.2 is used. Note
that Kernel 3 is faster for one input value but slower for multiple input values.
.main

stofloatw<sout> vr0.0d m1[0].sw // convert to soft float
2*nop
saddw vr1.0 vr0.0 m0[0].sw // add -0.75
3*nop
scopyw vsr1.4 vr0.1 // copy exponent to address register to use as offset

// evaluate polynomial
polyw<start=0, scale=15, scale2=15, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
9*nop
smulww<scale=15, uu, rnd> vr3.0 vr2.0 m1[ar1+8].sw // scale output
8*nop
stop


.m0
0x8400

.m1
0 0 0 0 0 0 0 0
0x8000 0x5a82 0x4000 0x2d41 0x2000 0x16a1 0x1000 0x0b50
0x0800 0x05a8 0x0400 0x02d4 0x0200 0x016a 0x0100 0x00b5
// m1[8] - m1[23] contains a table of sqrt(0.5)^e for various e.

Listing 5.15: Kernel 3



Kernel 4a
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: max error is approximately 2^−26
Cycles, one input value: 57
Cycles, multiple input values: 4.25 cycles per value

This kernel calculates the square root with 32 bit input and output. It uses a polynomial to calculate 1/√x. Then it uses one Newton-Raphson iteration to increase the precision.
As in the reciprocal implementation, the polynomial is evaluated for the 16 MSBs using 16 bit precision for the calculations. Then the Newton-Raphson iteration is done with 32 bit operations.
.main

stofloat32d<sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw // subtract constant
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// convert inverse square root to square root
smuldd<scale=31, uu, rnd> vr3.0d vr4.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even or sqrt(2) if it is odd
smuldd<scale=30, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

Listing 5.16: Kernel 4a



Kernel 4b
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: relative error 2^−29.6
Cycles, one input value: 72
Cycles, multiple input values: 5.25 cycles per value

Same as Kernel 4a, except with two Goldschmidt iterations.


...
// iteration
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R0*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R0*R0*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop

// iteration 2
smuldd<scale=31, uu, rnd> vr6.0d vr4.2d vr4.2d // T2*T2
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R1 = R0*T2
4*nop
smuldd<scale=31, uu, rnd> vr6.1d vr6.0d vr4.1d // TT1 = T2*T2*T1
2*nop
ssubd vr5.2d m0[4].sd vr6.1d // TT2 = 1.5 - TT1
5*nop
smuldd<scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R2 = R1*TT2
5*nop

smuldd<scale=30, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd<scale=31, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

Listing 5.17: Kernel 4b



Kernel 4c
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: relative error 2^−30.1
Cycles, one input value: 78
Cycles, multiple input values: 5.25 cycles per value

Same as Kernel 4b except that it uses two Newton-Raphson iterations rather than two Goldschmidt iterations. The difference is more pipeline penalties but slightly better precision.
The difference in precision is due to the fact that when we start the second Newton-Raphson iteration we calculate r1 · r1 · x/2, where x is our input value and r1 is the result of the previous iteration. The value x has no rounding error. When we use a Goldschmidt iteration, we do a similar multiplication of three values, but all three can have rounding errors. Therefore the worst case error with Goldschmidt iterations is about half a bit larger than when we use Newton-Raphson iterations.
...
// iteration 1
smulwd<scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R1*X
5*nop
smulwd<scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R1*R1*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop
5*nop
smulwd<scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R2 = R1*T2
5*nop

// iteration 2
smuldd<scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R2*X
5*nop
smuldd<scale=32, uu, rnd> vr5.1d vr5.0d vr4.3d // TT1 = R2*R2*X/2
2*nop
ssubd vr5.2d m0[4].sd vr5.1d // TT2 = 1.5 - TT1
5*nop
smuldd<scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R3 = R2*TT2
5*nop

smuldd<scale=30, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd<scale=31, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

Listing 5.18: Kernel 4c



Kernel 5
Input format: Q1.31, all positive values
Output format: unsigned Q0.32
Error: 2 ULP
Cycles, one input value: 81
Cycles, multiple input values: 5.5 cycles per value

Same as kernel 4b with a shift instruction in the end, to return the result in
unsigned Q0.32 rather than soft floating point format.

5.9 Logarithms
Several different kernels to calculate logarithms have been written. As before, we convert to soft floating point format and then use polynomial evaluation. The Newton-Raphson method is not as convenient for improving the precision of logarithms as it is in the special cases of the reciprocal and the inverse square root, where we could use only multiplication and subtraction.

5.9.1 Soft floating point usage


As before, we use a polynomial that is optimized to have minimum error for the input range [0.5, 1]. To calculate log2(x), we convert x to soft floating point format such that:

x = m · 2^e

where m is in the interval [0.5, 1] and e is an integer (not to be confused with Euler's number). We have a polynomial

p(m) ≈ log2(m),  m ∈ [0.5, 1]    (5.7)

And then we can calculate log2(x) as

log2(m · 2^e) = p(m) + e    (5.8)

If we only need the integer part of the logarithm, we do not need to evaluate p(m).
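A floating point Python sketch of equation 5.8 (our illustration; math.log2 stands in for the polynomial p, so that only the soft floating point bookkeeping is shown):

import math

def log2_soft_float(x):
    # normalize x in (0, 1) to m in [0.5, 1) with x = m * 2**e, e <= 0
    e = 0
    m = x
    while m < 0.5:
        m *= 2.0
        e -= 1
    return math.log2(m) + e       # p(m) would replace math.log2(m)

print(log2_soft_float(0.15625), math.log2(0.15625))   # both -2.678...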

5.9.2 Inputs in other fixed point formats


The kernels that were implemented all take inputs in Q1.15 format. If the input value is in some other format, the same kernel can be used and a constant is added to the result.

Example 5.2

For example, if we want to calculate log2(x) and the input value x is in Q3.13, we use a kernel for log2 and we let the kernel think that the input value is in Q1.15. To compensate, we add the constant 2 to the result. The general rule is that we add (i − 1)/log2(b) to the result, where i is the number of integer bits in the fixed point format of the input and b is the base of the logarithm we are calculating.

5.9.3 Kernels
Kernel 1
Input format: Q1.15 in the interval [0.5, 1)
Output format: signed Q1.15
Error: max relative error is 2^−10
Cycles, one input value: 17
Cycles, multiple input values: 1 cycle per value

This kernel is just one POLYW instruction and is only intended to work for inputs in the range [0.5, 1). It can be used if the programmer wants to do all the required scaling himself.
.main
polyw<start=0, scale=15, scale2=11, sign=us, rnd1, rnd2> vr2.0 m1[0].sw cm[0]
14*nop
stop

// polynomial constants in signed Q5.11
.cm
0xE199 0x5178 0x8E60 0x6865 0xCAAE 0x0B7D 0 0

Listing 5.19: Kernel 1

Kernel 1b
Input format: Q1.15 in the interval [0.5, 1)
Output format: signed Q1.15
Error: max relative error is 2^−13.78
Cycles, one input value: 20
Cycles, multiple input values: 1.125 cycles per value

Like Kernel 1, this kernel also only works for inputs in the range [0.5, 1). A constant is subtracted from the input before the polynomial evaluation. This results in better precision but a slightly longer computation time.
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x8000

Listing 5.20: Kernel 1b



Kernel 2
Input format: Q1.15, all positive values
Output format: signed Q5.11
Error: max relative error is 2^−11.7
Cycles, one input value: 29
Cycles, multiple input values: 1.625 cycles per value

This kernel can take all positive Q1.15 inputs (but not zero). For the domain x ∈ [2^−15, 1], log2(x) has the range [−15, 0]. For that range 4 integer bits and a sign bit are needed, which means that we have to use signed Q5.11 for the result value.
If the input is in some other fixed point format, an integer can be added to the result to compensate for it.
This kernel converts the input to soft floating point format before the polynomial evaluation, and then adds the exponent to the result (actually the exponent is subtracted, because it is negative). The exponent needs to be shifted to the correct position so that it can be added to the result (the shifting converts the exponent from Q16.0 to Q5.11).
.main

stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
slslw vr1.4 vr1.3 m0[1].sw
3*nop
// vr2.0 is Q5.11, vr1.0 is Q1.15, cm[0] is Q2.14 (15+14-11 = 18)
polyw<start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b

Listing 5.21: Kernel 2



Kernel 3 - natural logarithm

Input format: Q1.15, all positive values
Output format: signed Q5.11
Error: max error is 2^−10.96
Cycles, one input value: 29
Cycles, multiple input values: 1.625 cycles per value

A simple way to calculate logarithms with other bases is to multiply the result with the constant 1/log2(b), where b is the new base, since log_b(x) = log2(x)/log2(b).
It is also possible to modify the log2(x) kernel so that the polynomial constants are scaled by 1/log2(b). The exponent must also be scaled:

log_b(m · 2^e) = log2(m · 2^e)/log2(b) = p(m)/log2(b) + e/log2(b)    (5.9)

Equation 5.9 is a modification of equation 5.8. We can calculate p(m)/log2(b) by scaling the polynomial constants (e is here the exponent and must not be confused with Euler's number, the base of the natural logarithm). In Kernel 2 we needed a shift instruction to scale the exponent. We can replace that shift instruction with a multiplication instruction, which means that we can calculate any other logarithm in the same amount of time.
It is also possible to use a polynomial for log_b(x − a), x ∈ [0.5, 1), and use that instead. The term e/log2(b) can be calculated in some of the empty slots where NOPs are issued while waiting for the results of a previous instruction before issuing the next instruction. In that way such an implementation should take an equally long time.
The polynomial constants are the same as in the log2(x) implementation except scaled with 1/log2(e), where e is the base of the natural logarithm (Euler's number).
Similar changes can be made to the 32 bit version of log2(x), and the same method can be used to implement log10(x) or a logarithm of some other base.
.main
stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww<scale=4, ss, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(e)
3*nop
polyw<start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x58B9

Listing 5.22: Kernel 3
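The scaling of the constants is easy to script. The following Python lines (our illustration) reproduce the exponent multiplier 0x58B9 stored in m0[2] of the kernel, which is 1/log2(e) = ln(2) in Q1.15; the polynomial constants themselves are scaled the same way before being quantized:

import math

def scale_constants(log2_constants, base):
    # scale constants of a log2 polynomial to give log_base instead
    k = 1.0 / math.log2(base)
    return [c * k for c in log2_constants]

# the exponent multiplier for the natural logarithm, truncated to Q1.15
print(hex(int((1 / math.log2(math.e)) * 2 ** 15)))   # 0x58b9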



Kernel 4 - base 10 logarithm

Input format: Q1.15, all positive values
Output format: signed Q4.12
Error: maximum absolute error: 2^−12.6
Cycles, one input value: 29
Cycles, multiple input values: 1.625 cycles per value

Implemented in the same way as the natural logarithm. For the input domain [2^−15, 1] the output range is [−4.515, 0]. This means that the result can be stored in Q4.12. A polynomial can give log10(m) with a maximum error of 2^−15, which means that the maximum error of the result is a rounding error, because we have to use Q4.12 to be able to fit all possible results.
.main

stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww<scale=5, su, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(10)
3*nop
polyw<start=0, scale=15, scale2=19, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x9A21

Listing 5.23: Kernel 4



Kernel 5 - 32 bit version

Input format: Q1.31, all positive values
Output format: soft floating point, mantissa is Q1.31
Error: maximum absolute error 2^−28.12
Cycles, one input value: 37
Cycles, multiple input values: 7.75 cycles per value

A 32 bit version of log2 was also implemented. It was done by using an 11th degree polynomial, which was calculated by using one POWERSD instruction which gave [x^1, x^2, x^3, x^4]; then two double scalar, double vector multiplications were used, to first multiply x^4 with [x^1, x^2, x^3, x^4] and then multiply x^8 with [x^1, x^2, x^3, x^4]. Then three triangular MAC instructions were used to multiply the powers of x with the polynomial constants, and finally accumulate the products. The result is returned in soft floating point format, so the user can decide how to scale the result. The mantissa is in vr4.3d, which is the last scalar double in vector register 4. The exponent is in vr3.6. It is an integer (Q16.0), but vr3.7 contains zero, so vr3.3d contains the exponent in Q16.16; it can thus be shifted to any 32 bit format and added to the mantissa after the mantissa has been shifted to the same fixed point format.
This method gives the mantissa with a maximum error of 2^−28.26, which is consistent with the fact that the polynomial constants are stored in Q3.29, which has a ULP weight of 2^−29.
The cycle count of this kernel depends on what support there will be for 32 bit polynomial evaluation. This kernel uses a double TMAC instruction which was added to the simulator, but which will perhaps be implemented as two instructions (see section 6.4).

The kernel source code is found in Appendix A.


Chapter 6

Instruction proposals

This chapter contains a discussion of instructions that we suggest be considered for addition to the instruction set, in order to be able to evaluate functions faster. Before this work began, it had been suggested to let the architecture support calculation of the powers of a scalar, and support for evaluating polynomials; here we describe in some detail how the instructions can be implemented.


6.1 POWERSW
This instruction calculates powers two through eight of a scalar word.
The idea is that this instruction can be used to calculate powers of a scalar value
and then the result can be used to evaluate a polynomial. Since the datapath
of the SIMD core already has 16 multipliers, it is possible to create a pipelined
special instruction that calculates powers two through eight for a scalar value, and
if several consecutive POWERSW instructions are issued, it is possible to complete
one instruction per cycle.

6.1.1 Implementation
The instruction will have 3 multiplication pipeline stages and 4 ALU pipeline stages. Table 6.1 shows what is performed in each stage. The sat() function performs saturation if the saturation flag is set, the >> operator is a right shift, scale is the value of the scale flag, and the round() function performs rounding if the rounding flag is set.

MUL stage 1: x2 := x1 · x1
ALU stage 1: x2 := sat(round(x2 >> scale)); zero/sign extend, and select operands for next mul stage
MUL stage 2: x3 := x1 · x2, x4 := x2 · x2
ALU stage 2: x3 := sat(round(x3 >> scale)), x4 := sat(round(x4 >> scale)); zero/sign extend, and select operands for next mul stage
MUL stage 3: x5 := x4 · x1, x6 := x4 · x2, x7 := x4 · x3, x8 := x4 · x4
ALU stage 3: xi := sat(round(xi >> scale)), ∀i ∈ {5, 6, 7, 8}
ALU stage 4: choose output, set flags, etc.

Table 6.1: The pipeline stages of the POWERSW instruction
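As a reference model, the following Python function mimics the stages in table 6.1. This is our reconstruction: rounding is modeled as adding half a ULP before the shift, saturation and flag handling are omitted, and 0x7fff is used as the zero power for signed Q1.15 as in table 6.4:

def powersw(x, scale=15, rnd=True, first=1):
    # model of POWERSW: powers x^1..x^8 (or x^0..x^7) of a Q1.15 scalar
    def mul(a, b):                       # one multiply + ALU scaling stage
        p = a * b
        if rnd:
            p += 1 << (scale - 1)
        return p >> scale
    x2 = mul(x, x)                       # MUL stage 1
    x3, x4 = mul(x, x2), mul(x2, x2)     # MUL stage 2
    x5, x6 = mul(x4, x), mul(x4, x2)     # MUL stage 3
    x7, x8 = mul(x4, x3), mul(x4, x4)
    powers = [x, x2, x3, x4, x5, x6, x7, x8]
    return powers if first else [0x7FFF] + powers[:7]

pw = powersw(0x6000)        # 0.75 in signed Q1.15
print([hex(p) for p in pw]) # 0.75, 0.5625, 0.421875, ...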

dst: 128 bit (8x16 bit)
src0: 16 bit

Table 6.2: Operands of the POWERSW instruction



6.1.2 Flags
The flags of the instruction are shown in table 6.3.

first (0,1): if valued 0, return powers 0-7; if valued 1, return powers 1-8
sign (s,u): signed or unsigned multiplications
scale (0-16): scaling after each multiplication
rnd (toggle): round after each multiplication
sat (toggle): saturate after each multiplication

Table 6.3: Flags of the POWERSW instruction

6.1.3 Design decisions

There are some design decisions which must be made before adding this instruction to the instruction set. The details of the design of the datapath are currently under consideration, as are decisions regarding what type of hardware is needed in each pipeline stage.

Scale flag
If the scale flag is supposed to be able to take any value in the range 0-16, there would need to be a barrel shifter in the ALU stage that performs that scaling. There will probably only be a barrel shifter in the latter of the two ALU stages in the datapath. For all the implementations in this thesis where this instruction was used, the value of the scale flag was 15 or 16 (if we multiply two numbers in Q1.15 and want the result in Q1.15, the scale flag is set to 15; similarly, if we multiply two Q0.16 values we set the scale flag to 16 to get a result in Q0.16). This flexibility is perhaps not needed and it would be enough to always shift the result by 15. The additional option to also be able to shift by 16 could give better precision in some cases, but that flexibility can be sacrificed to make the implementation simpler.

Return value
We can also consider whether we want to sacrifice the flexibility of being able to choose to return either powers 1-8 or powers 0-7. It depends on how adaptable the multiplexing hardware in the last ALU stage is. The reason for wanting to return powers 0-7 is that we can then directly use one TMAC instruction to multiply powers 0-7 with the polynomial constants and accumulate the products. If we only have the option to return powers 1-8, it will be necessary to add the first polynomial constant (the one for the zero power) separately (except in the case where the constant's value is zero). When we want to evaluate a polynomial for multiple input values, we can use one addition instruction to do this addition for eight values, which results in an additional 1/8 clock cycle per value. When we have one or very few input values, we can in most cases issue an instruction to preload this constant into the accumulator register where a NOP would otherwise be issued (due to a data dependency).

Power zero
How do we return power 0 for unsigned Q0.16 or signed Q1.15? We can use 0xffff and 0x7fff (0xffff in Q0.16 equals 1 − 2^−16, and 0x7fff in signed Q1.15 equals 1 − 2^−15). Table 6.4 shows what values should be returned as the zero power, based on the scale flag and the sign flag. Note that if it is decided to only return powers 1-8, it is not necessary to consider this. Also, if we only use the scaling value 15, or only the values 15 and 16, there are only 2 or 3 options.

Fixed point format, scale flag value, power 0 value:
Q0.16: 16: 0xffff
Q1.15: 15: 0x8000 if sign=='u', 0x7fff if sign=='s'
Q2.14: 14: 0x4000
Q3.13: 13: 0x2000
Q4.12: 12: 0x1000
etc.

Table 6.4: The value to return as the zero power in the POWERSW instruction

Combine ALU stages 3 and 4?


Perhaps the last ALU stage is not needed, and ALU stages 3 and 4 can be just one stage.

Selection of multipliers
The most important (and obvious) thing we need to consider when we select the multipliers to use for each multiplication stage is to never use the same multiplier more than once per instruction. Another thing we want to consider, if we extend the POWERSW instruction to a POLYW instruction (discussed below), is that we might want the last multiplication stage of the POLYW instruction to be like a TMACO instruction, and maybe it is thus possible to reuse some of the control logic of the TMACO to implement the POLYW. That means that the set of multipliers we use for POWERSW should be the set of multipliers we do not use for the TMACO instruction. Another advantage of using the set of multipliers which is not used for TMACO is that we can issue a TMACO instruction after a POWERSW instruction such that it will enter the write back stage immediately one cycle after the POWERSW instruction enters the write back stage (but there may not be a data dependency between the last POWERSW and the first TMACO if we issue them like that).

6.2 POWERSD
This instruction calculates powers 1-4 of a scalar double. It is also a special instruction, similar to POWERSW. Since we only calculate up to power 4, we only need two multiplication datapath stages. The operands, flags and pipeline stages are shown in tables 6.5, 6.6 and 6.7.

dst: 128 bit (4x32 bit)
src0: 32 bit

Table 6.5: Operands of the POWERSD instruction

first (0,1): value 0 to return powers 0-3, value 1 to return powers 1-4
sign (s,u): signed or unsigned multiplications
scale (0-16): scaling after each multiplication
rnd (toggle): round after each multiplication
sat (toggle): saturate after each multiplication

Table 6.6: Flags of the POWERSD instruction

MUL stage 1: four multipliers calculate x2.
ALU stage 1: combine the four products to construct the x2 double word, scale and round.
MUL stage 2: four multipliers calculate x3, four multipliers calculate x4.
ALU stage 2: same as ALU stage 1, but for x3 and x4.
ALU stage 3: set flags, write back etc.

Table 6.7: Pipeline stages of the POWERSD instruction

6.2.1 Design decisions

• Do we perhaps need 2 ALU stages to combine the results of 4 multiplications into one 32 bit result?

• The only function implementation in this thesis that depends on this instruction is the 32 bit version of the logarithm. (We also tried to use it to implement the inverse square root, but as is mentioned in chapter 5, we can use 16 bit polynomial evaluations followed by Newton-Raphson iterations to implement the inverse square root with good precision.) If there is no need to evaluate logarithms or other functions, this instruction is not needed. The CORDIC algorithm can also be used to get 32 bit precision for various functions, but it can be rather slow (see section 4.4).

6.3 POLYW
This instruction evaluates a polynomial for a scalar value. It is a special instruc-
tion that works like a POWERSW instruction followed by a TMAC instruction.
Seven multipliers are used to calculate the powers and eight multipliers are used
to calculate the triangular MAC, so this instruction needs fifteen multipliers in
total. The datapath has sixteen multipliers, as well as enough ALU hardware to
evaluate one polynomial per cycle.
Implementing the control of this instruction would be a challenge though. The
POWERSW instruction will most likely be implemented by adding a special in-
struction decoder. It would not require much further work to also add support for
this instruction to the special decoder, because the POLYW instruction is identical
to the POWERSW instruction until after the third multiplication pipeline stage.
The pipeline stages are shown in table 6.8.
The operands and flags are in tables 6.9 and 6.10.

Same first 6 stages as for POWERSW
MUL stage 4        p_i := c_i · x^i, ∀i ∈ {1, 2, 3, 4, 5, 6, 7, 8}
ALU stage 4 and 5  return (Σ_i p_i) >> scale2

Table 6.8: The pipeline stages of the POLYW instruction. The same notations are
used as in table 6.1; c_i are the polynomial constants.

dst 16 bit
src0 16 bit
src1 128 bit (8x16 bit)

Table 6.9: Operands of POLYW instruction

name    values       function
start   0,1          same as in the POWERSW instruction
sign    uu,us,su,ss  first letter: signed or unsigned powers; second letter:
                     signed or unsigned polynomial constants
scale1  0-16         scaling during calculation of powers
scale2  0-16         scaling of final result
rnd1    toggle       round during calculation of powers
rnd2    toggle       round final result
sat     toggle       saturate after accumulation

Table 6.10: Suggested flags for the POLYW instruction



6.3.1 Design decisions


Loading of polynomial constants
A challenge regarding this instruction is loading the source operand with the
polynomial constants. They are not needed until after eight pipeline stages (only
counting datapath pipeline stages). Since it is a vector word source operand, eight
128 bit pipeline registers would be needed to store this source operand between
the operand fetch stage and the pipeline stage where it is needed. A solution that
will likely be used is to load the operand into a special register. The multiplicands
for the fourth multiplication stage will then be fetched from that special register.
This is a good solution because in very many cases where we want to evaluate one
polynomial per cycle, we use the same polynomial constants. In the case where
the same polynomial constants would not be used repeatedly in a sequence (for
example if we use piecewise polynomials, as discussed in section 4.3), one
POWERSW and one TMAC instruction would be used instead, and throughput
would be two cycles per polynomial evaluation.

One or two round flags?


It may not be necessary to have two round flags. The thought behind having two
round flags is that rnd1 would act like the round flag of a POWERSW instruction
and rnd2 like the round flag of a TMAC instruction. But we can just as well
choose to round in neither or both cases.
We ran a few simulations where a function was calculated using a polynomial,
and all four combinations of the two round flags were tried. The result was
that setting both flags gave the most accurate result. An unexpected result was
that the second most accurate approach was to set neither of the two round flags.
This could have been specific to the function or the polynomial used in that test
case, but we did not investigate further what caused that result.

Two scale flags/values are necessary


It is however necessary to be able to use two different values for scaling. Similar
to the idea behind the two round flags, the first scale flag acts like the scale
flag of a POWERSW instruction and the second like the scale flag of the TMAC
instruction. The first scale flag is chosen based on the fixed point format
of the argument of the polynomial. The second scale flag is chosen based on the
fixed point format of the powers of the argument, the polynomial constants, and
the desired fixed point format of the output, as the sketch below illustrates.
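As a small worked example (a minimal C sketch; the helper mul_q and the chosen
formats are our own illustration, not part of the instruction set), each scale value
is simply the number of excess fraction bits in the raw product, i.e.
frac(src0) + frac(src1) − frac(dst):

#include <stdint.h>
#include <stdio.h>

/* Our own helper, not an instruction: multiply two fixed point values
 * and drop the excess fraction bits with the given scale. */
static int16_t mul_q(int16_t a, int16_t b, int scale)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* full precision product */
    return (int16_t)(prod >> scale);
}

int main(void)
{
    /* scale1: argument x in Q1.15, powers kept in Q1.15
       -> scale1 = 15 + 15 - 15 = 15 */
    int16_t x  = 0x6000;                /* 0.75 in Q1.15 */
    int16_t x2 = mul_q(x, x, 15);       /* 0.5625 in Q1.15 */

    /* scale2: power in Q1.15 times constant in Q4.12, output in Q4.12
       -> scale2 = 15 + 12 - 12 = 15 */
    int16_t c = 0x1000;                 /* 1.0 in Q4.12 */
    int16_t p = mul_q(x2, c, 15);       /* 0.5625 in Q4.12 (0x0900) */

    printf("x^2 = 0x%04x, p = 0x%04x\n", (uint16_t)x2, (uint16_t)p);
    return 0;
}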

Range of scale values


For this instruction, we also need to consider what is discussed in section 6.1.3,
that is, to always use the same scaling after multiplications when calculating
the powers of the argument. The decision taken regarding the scale flag of the
POWERSW instruction will most likely apply to the first scale flag of the POLYW
instruction as well.

Opcode size
If we had two scale flags that can both take values between 0 and 16, each would
need 5 bits. This could be pushing the limits of the opcode size, but the size of
the opcode has not yet been decided.

Use sign-magnitude for powers?


To improve precision, it is possible with some hardware modifications to store the
sign bit of the powers of the argument separately. In that way we can have the
polynomial input in signed Q1.15, but use its absolute value in unsigned Q0.16
for all the calculations of the powers, and in the end apply the sign bit to the
odd powers (either before or after the last multiplication, where the powers of x
are multiplied with the polynomial constants, but it must be before the addition of
the products). This requires a more complicated implementation but gives slightly
better precision (a smaller fixed point polynomial error). The sketch below
illustrates the idea.
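A minimal C model of the idea (our own illustration, not the actual datapath):
the powers are computed from |x| in unsigned Q0.16, and the sign of x is re-applied
to the odd powers only, before the products with the polynomial constants are
summed.

#include <stdint.h>

/* Sketch (ours): powers of |x| in unsigned Q0.16, sign re-applied to
 * the odd powers. Results are widened to int32_t to keep the sign. */
static void powers_sign_magnitude(int16_t x, int32_t pow_q0_16[4])
{
    int neg = x < 0;
    /* |x| in unsigned Q0.16: one more fraction bit than signed Q1.15 */
    uint32_t m = (uint32_t)(neg ? -(int32_t)x : (int32_t)x) << 1;

    uint32_t p = m;                           /* x^1 */
    for (int i = 0; i < 4; i++) {
        int odd = (i & 1) == 0;               /* p holds x^(i+1) here */
        pow_q0_16[i] = (neg && odd) ? -(int32_t)p : (int32_t)p;
        p = (uint32_t)(((uint64_t)p * m) >> 16);  /* next power, Q0.16 */
    }
}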

POLYD - a 32 bit version


Should we implement a POLYD instruction that calculates a 3rd or 4th degree
polynomial as a combination of POWERSD and TMACDO? POWERSD would
use 12 multipliers and TMACDO would use all 16 multipliers. It is therefore not
possible to issue one POLYD instruction per cycle, and then we can just as well
issue two instructions. The only gain of implementing this instruction would be
to avoid the pipeline penalty between POWERSD and TMACDO due to data
dependency when we have one or few input values.

6.4 TMACDO
This instruction is a triangular MAC of two double vectors: it accumulates the
result into an accumulation register and outputs the result to the destination
operand. The return value is: src0[0:31]·src1[0:31] + src0[32:63]·src1[32:63] +
src0[64:95]·src1[64:95] + src0[96:127]·src1[96:127] + accregval, where accregval is
the value already in the accumulator register that is used. We could also use an
instruction called TMACD which is identical, except that it does not output the
result to a destination operand (it only updates the accumulator register).
The main usage in this thesis is to use this instruction together with POWERSD
to evaluate polynomials, but it could of course also be used to calculate dot
products, convolutions and other DSP tasks which use a triangular MAC. A C
reference model is sketched below the operand table.

dst 32 bit
src0 128 bit, 4 · 32bit
src1 128 bit, 4 · 32bit

Table 6.11: Operands of TMACDO instruction
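A C reference model of these semantics (our own sketch; the scaling, rounding
and saturation flags are omitted):

#include <stdint.h>

/* Reference model (ours) of TMACDO: sum four 32x32 bit products on top
 * of the value already in the accumulator register. */
static int64_t tmacdo(const int32_t src0[4], const int32_t src1[4],
                      int64_t accregval)
{
    int64_t acc = accregval;
    for (int i = 0; i < 4; i++)
        acc += (int64_t)src0[i] * (int64_t)src1[i];
    return acc;   /* TMACD would only keep this in the accumulator */
}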

6.4.1 Design decisions


• This instruction requires a total of 15 additions (three additions for each of
the four 32 bit multiplications and then three additions to accumulate). Can
that be done in two ALU stages?

• If it is not possible to implement this instruction easily due to the number of
dependent additions, we can achieve the same result in two cycles per vector
(plus possible data dependency penalties) by implementing an instruction
that calculates the sum of a double vector. We would then first use a double
vector multiplication instruction and secondly calculate the sum.

6.5 SSUMD
This instruction calculates the sum of a double vector. It is intended to be used
after multiplying two double vectors, if we will not be able to implement the
TMACDO instruction.

dst 32 bit
src0 128 bit (4 · 32bit)

Table 6.12: Operands of SSUMD instruction

6.5.1 Design decisions


• Can we make an instruction that calculates the sums of two vectors, by taking
two double vector source operands and returning a half vector to the destination
operand? It would depend on how many and how big adders we will have in
each datapath stage, but that is currently being decided.
If that is possible, we can replace two TMACDO instructions with two double
vector multiplications and only one SUM instruction that calculates the sums
of both vectors.

• Can we accumulate the results of SSUMD instructions (or a similar instruction
with a different mnemonic and opcode) in an accumulation register? This would
be useful when we calculate higher degree polynomials.

6.6 Soft floating point instructions


6.6.1 STOFLOATW
This instruction converts a scalar word to soft floating point format.
It returns a mantissa (a scaled version of the input) in the upper word, and the
exponent in the lower word, of a double. Operands and flags are in tables 6.13 and
6.14.

dst 32 bit
src 16 bit

Table 6.13: Operands of STOFLOATW instruction

name values function


sign u,s signed or unsigned input
format 16u, 15u, 15s format of the mantissa

Table 6.14: Flags of STOFLOATW instruction

The format flag selects the fixed point format of the mantissa. It can be unsigned
Q0.16, unsigned Q1.15 (which gives the absolute value if the input is negative),
or signed Q1.15.³ For the implementation of the unsigned Q1.15 format, the sign
bit can be stored in the same word as the exponent, because the exponent only
needs 5 bits (values between 0 and 31).
The unsigned Q0.16 format always has the MSB set to 1 and the unsigned Q1.15
format always has the two MSBs valued 01. This is because the mantissa is always
a value between 0.5 and 1.
When the sign flag is 's', the exponent value in the output equals the number of
leading sign bits minus one. When the sign flag is 'u', the exponent equals the
number of leading zeros.
Another way of viewing this is that sign='s' assumes Q1.15 input and sign='u'
assumes Q0.16 input, and both have exponent 0 when the input's absolute value
is between 0.5 and 1. A small C model of the conversion is sketched below.
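The conversion is easy to model in C (a sketch with our own naming, for the
sign='u', format=16u case):

#include <stdint.h>

/* Sketch (ours) of STOFLOATW with sign='u' and format=16u: shift until
 * the MSB is 1 so the mantissa lies in [0.5, 1) as unsigned Q0.16; the
 * shift count is the exponent. For sign='s' one would instead count
 * leading sign bits minus one. */
static void stofloatw_u(uint16_t x, uint16_t *mant, int16_t *exp)
{
    int e = 0;
    while (x != 0 && (x & 0x8000u) == 0) {    /* count leading zeros */
        x <<= 1;
        e++;
    }
    *mant = x;
    *exp  = (int16_t)e;
}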

Design decisions
• Perhaps a better mnemonic could be chosen for the instruction.

• Perhaps a better name can be found for the format flag as well as its values.

³ The simulator implementation of STOFLOATW had a flag called sout (short for signed
output) instead of the format flag. If the flag is set, the result is the same as for format=15s;
when it is not set, the result is the same as for format=16u.

• Even though format=15s was used in all the implementations (where the flag is
named sout in the implementations' source code), the 15u format is probably
more suitable for most applications. It is probably enough to only allow the
format values 16u and 15u.

• It is also possible to sacrifice the flexibility of the 16u option and always
use the 15u output format (and hence no flag is needed). The tradeoff is that
one bit is lost from the input if the input format is unsigned Q0.16.

• A sign bit can be stored in the same word as the exponent. The shift
instruction ignores all the bits of the shift operand except the relevant LSBs,
and therefore this would not affect the conversion from soft floating point
format to fixed point format, which is often done with a shift instruction.

6.6.2 STOFLOATD
Converts a double scalar to soft floating point. The first two words of the output
contain a 32 bit mantissa, the third word contains the exponent and the last word
is not used. Operands and flags are in tables 6.15 and 6.16.

dst 64 bit
src0 32 bit

Table 6.15: Operands of the STOFLOATD instruction

name values function


sign u,s signed or unsigned input
format 32u, 31u, 31s format of output
clearlast toggle set the last (unused) word to zero
(otherwise leave it unchanged)

Table 6.16: Flags of the STOFLOATD instruction



6.6.3 TOFLOATADD
These are several instructions that convert to floating point and then add the
second source operand to the mantissa.
Almost all implementations in this thesis that used polynomial evaluation included
adding a constant to the mantissa before the polynomial evaluation, so this
instruction would reduce pipeline delay. The instruction would perhaps need too
many hardware adjustments to be worth adding: it requires first counting leading
zeros, then shifting, and then an addition. Another possibility would be to do this
as a long datapath instruction, but since this instruction starts by counting
leading zeros, there would need to be zero counting hardware in the first ALU
stage.

      STOFLOATADDW  STOFLOATADDD  VMANTADDW  VMANTADDD
dst   32 bit        64 bit        128 bit    128 bit
src0  16 bit        32 bit        16 bit     32 bit
src1  16 bit        32 bit        16 bit     32 bit

Table 6.17: Operands of the TOFLOATADD instructions. The VMANT instructions
only return the mantissas of vectors, as discussed in section 6.6.4.

6.6.4 Converting vectors to soft floating point


If we want to convert a vector to soft floating point, we have the problem that if
the input is eight 16 bit words, the result would need to be eight 32 bit words,
and 128 bit is the maximum size of destination operands.
There are two ways to implement instructions which convert a vector to soft
floating point format.
The first is to take a half vector source operand and return a full vector to the
destination operand.
The second is to use one instruction to calculate only the mantissas of a vector,
and then use another instruction to count the leading zeros of the same vector
to get the exponent part of the soft floating point format.

The method of using one instruction to get mantissas and another to get exponents
is more convenient, because it keeps all the mantissas in one vector word and all
the exponents in another vector word. They are then easier to use as operands to
other vector instructions.

The only advantage of the half vector version over the full vector version is in
the case where we only need to convert one half vector. In all other cases we
would need to issue two instructions per vector anyway.
These instructions are useful when we need to calculate polynomials for a series
of values. They require that we have more than one instance of count leading
zeros hardware.

An instruction called VMANTW was implemented in the simulator; it returns
the mantissas of a vector word.

6.6.5 Conversion from soft floating point format


Conversion from a soft floating point format can be done by shifting the mantissa
by the exponent.
There is no need for a special instruction for that. However, adding a scale
flag to the shift instruction can be convenient for the programmer, either to
choose the fixed point format of the result, or because the value was in some fixed
point format (known only by the programmer) before it was converted to soft
floating point, and the programmer might want to do some scaling to compensate
for that. This is mentioned in section 6.8.3; a sketch of the computation is
given below.
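A small C model of the conversion back (our own naming; the scale constant
selects the fixed point format of the result, and the shift direction follows the
sign of exponent + scale):

#include <stdint.h>

/* Sketch (ours): convert soft floating point back to fixed point by
 * shifting the mantissa by the exponent plus a constant scale. A
 * negative sum turns the right shift into a left shift. */
static uint16_t fromfloat_u(uint16_t mant, int16_t exp, int scale)
{
    int s = exp + scale;
    return (s >= 0) ? (uint16_t)(mant >> s) : (uint16_t)(mant << -s);
}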

6.7 Powers with integer exponent


If there is a need to efficiently calculate powers with integer exponents, we could
create a special instruction for that kind of calculation.
This instruction is not necessary, but it would speed up this type of calculation
with the use of special registers and/or pipeline forwarding to reduce pipeline
delay (in a similar way to the powers instruction).

This pseudo assembly code calculates x^e, where e is an integer.


set res = 1

# loop begins
AND dst e 1      // check if LSB is zero and set zero condition flag
                 // we do not use the result of the AND operation, only need the flag
MUL.nz res res x // multiply result with x if the zero flag was not set
MUL x x x        // replace x with x * x
RSHIFT e e 1     // shift e by one to the right
# loop end

Listing 6.1: Pseudo assembly code for powers with an integer exponent


What this code does is calculate x^e as:

x^e = x^e[0] · (x^2)^e[1] · (x^4)^e[2] · (x^8)^e[3] · . . .     (6.1)

where e[i] is the i-th bit of e.


The loop can be run either until e is zero, or as many times as the maximum
number of bits in e (for example, if we know that e < 128 we let the loop run 7
times). The advantage of running the loop a fixed number of times is that we can
unroll it if we want, and it takes a fixed amount of time to execute. Running the
loop until e = 0 can be faster but takes a variable amount of time, which can be a
disadvantage in real time applications (the worst case execution time is, however,
not difficult to predict).

Note that x and res can be vectors (or any numerical datatype for that matter).
If e is a constant known at compile time, we do not need the AND and RSHIFT
instructions; we only need the MUL x x x and MUL res res x when appropriate.
That is, it is not necessary to check the condition, since we know at compile time
which multiplications will be needed. A C model of the loop is given below.
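For reference, the same loop written in C (a direct model of listing 6.1; fixed
point scaling is left out for clarity):

#include <stdint.h>

/* Square-and-multiply, one exponent bit per iteration (cf. listing 6.1
 * and equation 6.1). */
static uint32_t ipow(uint32_t x, unsigned e)
{
    uint32_t res = 1;
    while (e != 0) {
        if (e & 1)        /* the MUL.nz step: only when the LSB is set */
            res *= x;
        x *= x;           /* MUL x x x */
        e >>= 1;          /* RSHIFT e e 1 */
    }
    return res;
}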

There are several ways to implement a special instruction that uses the currently
existing hardware in the datapath to speed up these calculations. The two
multiplications and the right shift can all be performed in parallel, and the AND
instruction can be removed by using the LSB of the exponent (and its shifted
versions) directly as a select bit to a multiplier. It would however be relatively
complicated to implement the control for such a special instruction, so a simpler
and slower approach might be chosen instead.

Fractional exponents
The method described here calculates integer powers, but it can be used in
combination with polynomial approximation to calculate powers where the exponent
is a fractional number.
The polynomial is used to approximate the power of the fractional part of the
exponent. It is however only possible to use polynomial approximations for the
fractional part when the base is a constant known at compile time, since different
bases of the power need different sets of polynomial constants.
The calculation would be done as:

k^e = k^(i+f) = k^i · k^f     (6.2)

where e is a fractional variable with integer part i and fractional part f, such
that e = i + f. k^i is calculated with the method described above for powers
with integer exponents, and k^f is calculated with polynomial approximation, as
sketched below.
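A sketch of this split (our own illustration; the Q4.12 exponent format and the
powf stand-in for the polynomial approximation are assumptions made for the
example):

#include <stdint.h>
#include <math.h>

/* Sketch (ours) of equation 6.2 for a compile time base k: split the
 * Q4.12 exponent e into integer part i and fractional part f; k^i uses
 * the integer power loop, and k^f stands in for the polynomial
 * approximation (modelled here with powf). */
static float pow_fixed_exp(float k, uint16_t e_q4_12)
{
    unsigned i = e_q4_12 >> 12;                  /* integer part */
    float    f = (e_q4_12 & 0x0fffu) / 4096.0f;  /* fractional part */

    float ki = 1.0f;
    while (i--) ki *= k;                         /* k^i, cf. listing 6.1 */
    return ki * powf(k, f);                      /* k^f: polynomial in practice */
}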

Variable bases
Another similar method can be used to calculate powers with variable fractional
exponents and variable bases. It is however challenging to use in a fixed point
system.
We factorize the exponent into an integer and a fractional constant. The
fractional constant is the value of the LSB. For example, if we have a variable
exponent with four fractional bits, we can factorize 10.1011 as 101011 · 0.0001.
If we want to calculate b^10.1011 we can calculate b^101011 as a power with an
integer exponent and then use polynomial approximation to calculate
(b^101011)^0.0001, with a polynomial which approximates the function f(x) ≡ x^0.0001.

6.8 Other
6.8.1 Opposite signed products after multiplication instructions
This feature was described in section 4.1.2. It is used to speed up Newton-Raphson
division by returning the product with its opposite sign after a multiplication
instruction.
This feature was implemented in the simulator by adding a flag to multiplication
instructions that makes them return the product with the opposite sign. It could
also be implemented as a separate instruction, depending on which is easier to
implement in the instruction decoder, or on which approach is more consistent with
the rest of the instruction set.
Since a multiplication is already a long datapath instruction, it would cost
little to use one of the ALU stages to return the negative of the product.

Another option would be to only (bitwise) invert the product. Inverting a
two's complement value gives its negative minus one LSB (~x = −x − 1). That
solution would produce worse results when the error we want to correct using the
Newton-Raphson iteration is very small (about 1-4 LSB weights).

6.8.2 Special multiplication for inverse square root iterations
This is a feature similar to the opposite signed product of multiplication
instructions, but it is used for Newton-Raphson iterations for the inverse square
root. The Newton-Raphson iteration for the inverse square root is:

r_{i+1} = r_i · (1.5 − (r_i · r_i · x) / 2)

where r_i is an approximation of the inverse square root and r_{i+1} is an
improvement of that approximation.
This would be implemented as a long datapath instruction which multiplies
two operands and then subtracts the product from 1.5.
This feature is slightly more complicated than the opposite signed product
feature, because the constant 1.5 must be loaded into the ALU stage.
Multiplication instructions already have two source operands, which must be
loaded, so the additional loading of the 1.5 constant would need to be hardwired
or something similar. A C model of one iteration is sketched below.
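A C model of one iteration (our own sketch in unsigned Q2.30 and Q0.32 formats;
the subtraction from 1.5 is the step the proposed fused instruction would perform):

#include <stdint.h>

/* Sketch (ours) of r' = r * (1.5 - r*r*x/2) with r in unsigned Q2.30
 * and x in unsigned Q0.32. For mantissas x in [0.5, 1), r stays in
 * (1, sqrt(2)], so Q2.30 does not overflow. */
static uint32_t invsqrt_step(uint32_t r_q2_30, uint32_t x_q0_32)
{
    uint32_t rr  = (uint32_t)(((uint64_t)r_q2_30 * r_q2_30) >> 30); /* r*r     */
    uint32_t rrx = (uint32_t)(((uint64_t)rr * x_q0_32) >> 33);      /* r*r*x/2 */
    uint32_t t   = (3u << 29) - rrx;  /* 1.5 - r*r*x/2 (1.5 = 3*2^29 in Q2.30) */
    return (uint32_t)(((uint64_t)r_q2_30 * t) >> 30);               /* r * t   */
}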

6.8.3 Scale flag to shift instructions


When converting between soft floating point format and some fixed point format, it
is useful to be able to shift a source operand (the mantissa of the soft floating
point format) by the sum of a variable (the exponent of the soft floating point
format) and a constant. The constant can be given as a scale flag to the
instruction, and the value of this constant depends on which fixed point format we
want the result to be in.
Preferably we would want to shift left or right depending on the sign of the
sum of the variable and the constant, because sometimes we would want a scale
flag with a negative value.
As a result, perhaps the best way to implement this is to add a new instruction
for the purpose.

6.8.4 Scale flag to add, sub, and other trivial arithmetic instructions
We might want to shift a result after some other arithmetic instruction. It is
useful for converting between fixed point formats without having to issue a shift
operation. A round flag would then also be needed. We did not come across many
cases where this is necessary, but if scaling and rounding can easily be done after
arithmetic operations in the same cycle, the cost of adding this feature is not big.

6.8.5 Long datapath version of short datapath instructions
This was suggested in [5] in order to reduce pipeline delay caused by structural
hazards. During the implementation of the kernels, we came across cases where
this solution would prevent pipeline delay. It is useful in any case where there
is a mixture of short datapath instructions and long datapath instructions, if
there are no data dependencies that cause pipeline delays.
Adding a flag to any short datapath instruction to make it a long datapath
instruction could be one way to implement this solution.
Chapter 7

Results

In this chapter we summarize our results. The results are divided into two sections:
a summary of the kernels we have implemented, and a summary of the features
we propose to add, or to consider adding, to the instruction set architecture.

7.1 Function kernels


A summary of the kernels that were implemented for each function is found
in the following tables. Table 7.2 lists the kernels that approximate reciprocals.
Similarly, tables 7.3, 7.4 and 7.5 list the kernels which implement the inverse
square root, square root and logarithms. Table 7.1 lists the abbreviations used in
the tables.
The main differences between the kernels for each function are the precision and
cycle cost, and sometimes there is more than one type of input and output
format. More details about each kernel, for example the methods it uses, are
found in chapter 5.

uQi.f  unsigned Qi.f
sQi.f  signed Qi.f
abs    maximum absolute error
rel    maximum relative error
sf     soft floating point format

Table 7.1: Abbreviations used in the following tables.


Kernel name  Input        Domain       Output     Error         Cycles        Cycles
                                                                (one input)   (multiple inputs)
Kernel 1     Q1.15        [0.5, 1)     uQ1.15     abs 2^-11.5   20            1.125
Kernel 2     Q1.15        [2^-15, 1)   sf uQ1.15  rel 2^-11.85  23            1.375
Kernel 3     Q1.15        [2^-15, 1)   sf uQ1.15  rel 2^-15.0   35            1.625
Kernel 4     Q1.15        [2^-15, 1)   Q16.16     rel 2^-16     38            2.375
Kernel 5     Q1.31        [2^-31, 1)   sf uQ1.31  rel 2^-24     35            2.125
Kernel 5b    Q1.31        [2^-31, 1)   sf uQ1.31  rel 2^-31     47            2.625
Kernel 6     64 x uQ0.16  [2^-16, 1)   sf uQ1.15  rel 2^-15     N/A           1.875

Table 7.2: A summary of the kernels that implement the reciprocal

Kernel name  Input  Domain      Output     Error         Cycles        Cycles
                                                         (one input)   (multiple inputs)
Kernel 1     Q1.15  [0.5, 1)    uQ1.15     abs 2^-13.68  20            1.125
Kernel 2     Q1.15  [2^-15, 1)  sf uQ1.15  rel 2^-13.91  30            2.0
Kernel 2b    Q1.15  [2^-15, 1)  sf uQ1.15  rel 2^-13.91  29            3.375
Kernel 3a    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-26.58  51            3.5
Kernel 3b    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-30.6   72            4.5
Kernel 3c    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-29.85  66            4.5
Kernel 4     Q1.31  [2^-31, 1)  sf uQ2.30  rel 2^-27     37            8

Table 7.3: A summary of the kernels that implement the inverse square root

Kernel name  Input  Domain      Output     Error         Cycles        Cycles
                                                         (one input)   (multiple inputs)
Kernel 1     Q1.15  [0.5, 1)    uQ0.16     abs 2^-15.6   20            1.125
Kernel 2     Q1.15  [2^-15, 1)  uQ0.16     abs 1 ULP     31            2.125
Kernel 3     Q1.15  [2^-15, 1)  uQ0.16     abs 1 ULP     28            3.625
Kernel 4a    Q1.31  [2^-31, 1)  sf uQ0.32  abs 2^-26     57            4.25
Kernel 4b    Q1.31  [2^-31, 1)  sf uQ0.32  rel 2^-29.6   72            5.25
Kernel 4c    Q1.31  [2^-31, 1)  sf uQ0.32  rel 2^-30.1   78            5.25
Kernel 5     Q1.31  [2^-31, 1)  uQ0.32     abs 2 ULP     81            5.5

Table 7.4: A summary of the kernels that implement the square root



Kernel name  Base  Input  Domain      Output     Error         Cycles        Cycles
                                                               (one input)   (multiple inputs)
Kernel 1     2     Q1.15  [0.5, 1)    sQ1.15     rel 2^-10     17            1
Kernel 1b    2     Q1.15  [2^-15, 1)  sQ1.15     rel 2^-13.78  20            1.125
Kernel 2     2     Q1.15  [2^-15, 1)  sQ5.11     rel 2^-11.7   29            1.625
Kernel 3     e     Q1.15  [2^-15, 1)  sQ5.11     abs 2^-10.96  29            1.625
Kernel 4     10    Q1.15  [2^-15, 1)  sQ4.12     abs 2^-12.6   29            1.625
Kernel 5     2     Q1.31  [2^-31, 1)  sf sQ1.31  abs 2^-28.12  37            7.75

Table 7.5: A summary of the kernels that implement logarithms



7.2 Proposed features


Here we summarize the instructions and features that were discussed in chapter 6.
We have tried to evaluate how important it is to implement each instruction or
feature.

POWERSW
Special instruction that calculates powers 1-8 of a 16 bit scalar.

Add to architecture: Important

Comments: All the implementations in this thesis depend on it.
Usage and benefits: To calculate polynomials. Makes it possible to evaluate a
polynomial (degree 8 or less) for a scalar in 2 cycles per input, or 1 cycle per
scalar if extended to a POLYW instruction.
Other: We have given details of how to implement this feature. An important
result is that we can use a fixed amount of scaling after each multiplication
stage.

POWERSD
Special instruction that calculates powers 1-4 of a 32 bit scalar

Add to architecture: Maybe

Comments: Important for 32 bit polynomial approximations. These are not
needed to implement 1/x, 1/√x and √x with 32 bit precision, but they are
needed to implement logarithms and other functions with 32 bit precision.
Usage and benefits: Speeds up evaluations of 32 bit polynomials.

POLYW
Special instruction that calculates a polynomial of degree 8 or less, for a 16-bit
scalar.

Add to architecture: Preferably


Comments: Complicated to implement, but significant
benefits.
Usage and benefits: Makes it possible to evaluate a polynomial in
one cycle per 16-bit input value, instead of 2
cycles per input.

TMACDO
Triangular multiplication and accumulation of two vector double words.

Add to architecture: Maybe


Comments: Needs many additions which are perhaps not
possible to complete in two ALU stages.
Usage and benefits: Used after POWERSD to evaluate polynomi-
als. Can also be used for other common DSP
usages where triangular MAC are used.

SSUMD
Adds all the four scalars in a vector double word and returns the sum.

Add to architecture: If POWERSD is implemented and TMACDO is not, or if
it is needed for other purposes.
Comments: One VMULDD and one SSUMD can replace one TMACDO.
Usage and benefits: See the entry for TMACDO.

STOFLOATW
Converts a scalar word to soft floating point format

Add to architecture: Important


Comments: Needs count leading zeros/ones hardware
Usage and benefits: Important for polynomial approximations.

STOFLOATD
Convert scalar double word to soft floating point format.

Add to architecture: Important


Comments: Needs count leading zeros/ones hardware
Usage and benefits: Important for polynomial approximations for
32 bit scalars.

VMANTW
Returns the mantissas of 8 scalars in a vector word.

Add to architecture: Preferable


Comments: Needs multiple instances of count leading zeros/ones hardware
Usage and benefits: Used before polynomial approximations of
multiple input values.

VMANTD
Returns the mantissas of 4 scalars in a vector double word.

Add to architecture: Preferable


Comments: Needs multiple instances of count leading ze-
ros/ones hardware
Usage and benefits: Used before polynomial approximations of
multiple 32 bit input values.

Count leading zeros/sign bit


Returns the number of leading zeros, in both vector and scalar versions, for both
16 bit and 32 bit inputs.

Add to architecture: Important


Comments: Needs multiple instances of count leading ze-
ros/ones hardware. These instructions are
commonly found in instruction sets and will
probably be needed anyway.
Usage and benefits: For our implementations these instructions are important
to convert multiple inputs to soft floating point format, in combination with
VMANTW and VMANTD.

TOFLOATADD
Four instructions, listed in table 6.17, that convert inputs to soft floating point
format and add a constant to the mantissa.

Add to architecture: Not necessary but preferable


Comments: A feature that will be used if it is implemented, but can be
skipped if the cost is too big.
Usage and benefits: Reduce the pipeline delay of polynomial ap-
proximations (about 2-4 cycles), as well as re-
duce the cycle cost for multiple inputs (0.125
for multiple 16 bit inputs and 0.25 for multiple
32 bit inputs).

Powers with integer exponents


Special instruction to speed up the calculation of powers with integer exponents.
We have not yet spent much time considering the best approach to implement
this, or investigating whether there is much demand for it.

Add to architecture: Depends on whether there is demand for it
Comments:
Usage and benefits: Used to speed up the calculation of powers with integer
exponents.

Opposite signed products of multiplication instructions

Multiplication instructions that return products with the opposite sign.

Add to architecture: Important

Comments:
Usage and benefits: Significantly speeds up the Newton-Raphson iteration for
1/x (by 33%).

Special multiplication for inverse square root iterations

A multiplication instruction which also subtracts the product from the constant
1.5.

Add to architecture: Preferable

Comments: If it is not too complicated, this instruction will be used to
improve the precision of square roots and inverse square roots.
Usage and benefits: Significantly speeds up the Newton-Raphson iteration for
the inverse square root (by 25%).

Scale flag to shift instruction


Adds the possibility to shift a value by the sum of a variable and a constant
given as a scale flag.

Add to architecture: Preferable


Comments: Used to shift by the sum of a variable and a
constant.
Usage and benefits: Used to change between fixed point formats.

Scale flag to simple arithmetic instructions

Adds the possibility to scale the result of a simple arithmetic instruction.

Add to architecture: Maybe

Comments: There are shifters in the ALU datapath, but maybe the critical
path would be too long to implement this easily. The alternative is to issue a
shift instruction after the ALU instruction.
Usage and benefits: Used to change between fixed point formats.

Long datapath versions of short datapath instructions

Adds the possibility to issue short datapath instructions as long datapath
instructions.

Add to architecture: Preferable

Comments: If a neat solution for implementing this feature without much
extra cost is found, it should definitely be added.
Usage and benefits: Reduces pipeline delays due to structural hazards.
Chapter 8

Conclusions

Our goal was to implement elementary functions on an ePUMA SIMD core. We
wanted our implementations to be as fast and as precise as possible, while adding
as little hardware as possible.
Since there are 16 multipliers in the datapath, the initial idea was to implement an
accelerated special instruction, which uses the 16 multipliers efficiently, to evaluate
Taylor polynomials that approximate the functions.

It soon became clear that Taylor polynomials alone could not be used to
implement the functions, because they only give good precision in a limited
interval of the function's domain (when we use a finite number of terms), and very
often the polynomial coefficients, as well as the powers of the input, are of various
magnitudes that require a big dynamic range, which is difficult to deal with in
fixed point arithmetic.

In order to use polynomial approximations for the functions discussed, it is
necessary to define the interval of the domain which the polynomial should work
on. By converting the input to soft floating point we define that interval to be
x ∈ [0.5, 1), and we can find polynomials of a reasonable degree which give
reasonable precision. We also found that min-max polynomials are better suited
for our application than Taylor polynomials.

The Remez algorithm was used to calculate the polynomial coefficients for the
min-max polynomials. We found a method to roughly estimate the error a given
min-max polynomial would give if it were calculated in a fixed point system. We
could then tweak the input parameters (the polynomial degree, and the constant
we have called a) of the Remez algorithm to find the polynomial that would give
us the best precision.

Newton-Raphson and Goldschmidt iterations were also used to improve the
precision of the results of 1/x and 1/√x.


Based on our implementations, we made suggestions for instructions and features
which should be added, or considered for addition, to the architecture. We also gave
some details about how to implement the calculations of powers and polynomials.
As an example, we found that if we convert an input to soft floating point
format before calculating the powers, the dynamic range of the powers is
smaller and we can use a fixed amount of scaling after each multiplication step
(rather than needing a barrel shifter).

Our goal was to add as little hardware as possible, but we conclude that in order
to efficiently implement conversion to soft floating point format, it is necessary to
have hardware that counts leading zeros and leading ones (leading sign bits). This
type of hardware is relatively small and inexpensive to add to the ALU.

Several kernel source codes were written for each of the four functions. The kernels
differ in precision and execution time, and both 16 bit and 32 bit kernels were
implemented. The precision and execution times of the kernels were reasonable.

Future work
The development of the ePUMA platform will continue. When it has been decided
which features discussed in chapter 6 will be implemented and how, it will become
necessary to modify the kernel source codes accordingly. The methods discussed
in this thesis can also be applied to implement other mathematical functions.
Bibliography

[1] ePUMA research team. ePUMA platform hardware architecture. Unpublished,
Linköping University, Computer Engineering, The Institute of Technology,
2010.

[2] ePUMA research team. Sleipnir instruction set manual. Unpublished,


Linköping University, Computer Engineering, The Institute of Technology,
2010.

[3] W. Fraser. A survey of methods of computing minimax and near-minimax


polynomial approximations for functions of a single independent variable. J.
ACM, 12:295–314, July 1965.

[4] E. Hertz and P. Nilsson. A methodology for parabolic synthesis of unary


functions for hardware implementation. In Signals, Circuits and Systems,
2008. SCS 2008. 2nd International Conference on, pages 1 –6, nov. 2008.

[5] Andreas Karlsson. Algorithm adaptation and optimization of a novel DSP
vector co-processor. Master thesis, Linköping University, Computer Engineering,
The Institute of Technology, 2010.

[6] Dake Liu. Embedded DSP Processor Design. Morgan Kaufmann Publishers
Inc., 2008.

[7] Peter Markstein. Software division and square root using Goldschmidt’s
algorithms. In 6th Conference on Real Numbers and Computers, pages
146–157, 2004.

[8] Jean-Michel Muller. Elementary Functions: Algorithms and Implementation.


Birkhäuser Boston, 2005.

[9] J.-A. Pineiro and J.D. Bruguera. High-speed double-precision computation of


reciprocal, division, square root, and inverse square root. Computers, IEEE
Transactions on, 51(12):1377 – 1388, December 2002.

[10] Ken Turkowski. Fixed-point trigonometry with cordic iterations. In Graphics


gems, pages 494–497. Academic Press Professional, Inc., 1990.


[11] Jian Wang, Joar Sohl, Olof Kraigher, and Dake Liu. ePUMA: A novel embed-
ded parallel DSP platform for predictable computing. In Education Technol-
ogy and Computer (ICETC), 2010 2nd International Conference on, volume 5,
pages V5–32 –V5–35. Institute of Electrical and Electronics Engineers, Inc.,
jun. 2010.
Appendix A

Kernel source codes

A.1 Reciprocal
A.1.1 Kernel 1
1 // input m1 [0]. sw signed Q1 .15 range 0.5 -1
2 // output vr2 .0 unsigned Q1 .15
3 // max error 11.1 * ulp
4 // 20 cycles
5
6 . main
7 saddw vr1 .0 m1 [0]. sw m0 [0]. sw
8 4* nop
9 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
10 12* nop
11 stop
12
13 . m0
14 0 x9800
15
16 . m1
17 0
18
19 // constants in Q4 .12 for 1/( x +0.8125)
20 . cm
21 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0

Listing A.1: Kernel 1

A.1.2 Kernel 2
1 // input m1 [0]. sw signed Q1 .15
2 // output : softfloat
3 // mantissa : vr2 .0 unsigned Q1 .15
4 // exponent : vr1 .3 ( integer )
5 //
6 // max relative error size = 2^ -11.855880
7 // 23 cycles
8


9
10 . main
11
12 stofloatw < sout > vr1 .1 d m1 [0]. sw
13 2* nop
14 saddw vr1 .0 vr1 .2 m0 [0]. sw
15 4* nop
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
17 12* nop
18 stop
19
20 . m0
21 0 x9800
22
23 . m1
24 0
25
26 . cm
27 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0

Listing A.2: Kernel 2

A.1.3 Kernel 3

1 // input in signed Q1 .15 m1 [0]. sw


2 // output soft float
3 // mantissa vr0 .1 unsigned Q1 .15
4 // exponent vr1 .3 ( integer )
5 // max relative error size = 2^ -15
6 // 35 cycles
7
8
9 . main
10 stofloatw < sout > vr1 .1 d m1 [0]. sw
11 2* nop
12 saddw vr1 .0 vr1 .2 m0 [0]. sw
13 4* nop
14 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
15 12* nop
16 // Newton-Raphson iteration
17 smulww < scale =15 , uu , negres , rnd > vr0 .0 vr1 .2 vr2 .0
18 5* nop
19 smulww < scale =15 , uu , sat , rnd > vr0 .1 vr0 .0 vr2 .0
20 5* nop
21 stop
22
23 . m0
24 0 x9800
25
26 . m1
27 0
28
29 . cm
30 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0

Listing A.3: Kernel 3



A.1.4 Kernel 4
1 // input in signed Q1 .15 m1 [0]. sw
2 // output unsigned Q16 .16 soft float mantissa vr0 .1 exp vr1 .3
3 // max relative error size = 2^ -16 ( -15.9997) , max abs error 2.28 ulp
4 // 38 cycles
5
6
7 . main
8
9 stofloatw < sout > vr1 .1 d m1 [0]. sw
10 2* nop
11 saddw vr1 .0 vr1 .2 m0 [0]. sw
12 ssubw vr1 .6 m0 [1]. sw vr1 .3
13 2* nop
14 scopyw vr1 .4 vr1 .2
15
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
17 12* nop
18
19 // Newton-Raphson iteration
20 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
21 5* nop
22 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
23 5* nop
24
25 // scale result
26 slsrd vr0 .3 d vr0 .1 d vr1 .6
27 2* nop
28 stop
29
30
31 . m0
32 0 x9800 15
33
34
35 . m1
36 0
37
38 . cm
39 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0

Listing A.4: Kernel 4

A.1.5 Kernel 5
1 // input in signed Q1 .31 m1 [0]. sd
2 // output soft float mantissa vr0 .1 d exp vr1 .6
3 // max relative error size = 2^ -24
4 // 35 cycles
5
6
7 . main
8
9 stofloat32d < sout > vr1h m1 [0]. sd
10 2* nop
11
12 saddw vr1 .0 vr1 .4 m0 [0]. sw // subtract

13
14 4* nop
15
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
17 12* nop
18
19 // Newton-Raphson iteration
20 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
21 5* nop
22 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
23 5* nop
24
25
26
27 stop
28
29
30 . m0
31 0 x9800 15
32
33
34 . m1
35 0
36
37 . cm
38 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0

Listing A.5: Kernel 5

A.1.6 Kernel 5b
1 // input in signed Q1 .31 m1 [0]. sd
2 // output soft float mantissa vr0 .1 d exp vr1 .6
3
4 // max relative error size = 2^ -31
5 // 47 cycles
6 // 5 degree polynomial + two newton rapson .
7
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 2* nop
13
14 saddw vr1 .0 vr1 .4 m0 [0]. sw // subtract
15
16 4* nop
17
18 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
19 12* nop
20
21 // Newton-Raphson iteration
22 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
23 5* nop
24 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
25 5* nop
26
27
28 // second iteration
29 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr0 .1 d vr1 .2 d

30 5* nop
31 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr0 .1 d
32 5* nop
33
34
35 // convert from soft float to Q16 .16
36 // 2* nop
37 stop
38
39
40 . m0
41 0 x9800 15
42
43
44 . m1
45 0
46
47 . cm
48 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0

Listing A.6: Kernel 5b

A.1.7 Kernel 6
1 // vector reciprocal
2 // calculates reciprocal of 8 vector words stored in m0 [0]. vw - m0 [56]. vw in Q1 .15 format
3 // method works only in the range [0.5 ,1]
4 // reciprocal is returned in Q2 .14 format to vr0 - vr7
5
6
7 . main
8
9
10 // get mantissa
11 vmantw <u > vr0 m0 [0]. vw
12 vmantw <u > vr1 m0 [8]. vw
13 vmantw <u > vr2 m0 [16]. vw
14 vmantw <u > vr3 m0 [24]. vw
15 vmantw <u > vr4 m0 [32]. vw
16 vmantw <u > vr5 m0 [40]. vw
17 vmantw <u > vr6 m0 [48]. vw
18 vmantw <u > vr7 m0 [56]. vw
19
20 vexpow <u > m1 [64]. vw m0 [0]. vw
21 vexpow <u > m1 [72]. vw m0 [8]. vw
22 vexpow <u > m1 [80]. vw m0 [16]. vw
23 vexpow <u > m1 [88]. vw m0 [24]. vw
24 vexpow <u > m1 [96]. vw m0 [32]. vw
25 vexpow <u > m1 [104]. vw m0 [40]. vw
26 vexpow <u > m1 [112]. vw m0 [48]. vw
27 vexpow <u > m1 [120]. vw m0 [56]. vw
28
29
30
31 vcopy m0 [128]. vw vr0
32 vcopy m0 [136]. vw vr1
33 vcopy m0 [144]. vw vr2
34 vcopy m0 [152]. vw vr3
35 vcopy m0 [160]. vw vr4
36 vcopy m0 [168]. vw vr5
37 vcopy m0 [176]. vw vr6

38 vcopy m0 [184]. vw vr7


39
40 // vexpow <u >
41 3* nop
42
43 // calculate squares
44 vmulww < scale =16 , uu > m0 [64]. vw vr0 vr0
45 vmulww < scale =16 , uu > m0 [72]. vw vr1 vr1
46 vmulww < scale =16 , uu > m0 [80]. vw vr2 vr2
47 vmulww < scale =16 , uu > m0 [88]. vw vr3 vr3
48 vmulww < scale =16 , uu > m0 [96]. vw vr4 vr4
49 vmulww < scale =16 , uu > m0 [104]. vw vr5 vr5
50 vmulww < scale =16 , uu > m0 [112]. vw vr6 vr6
51 vmulww < scale =16 , uu > m0 [120]. vw vr7 vr7
52
53
54 // multiply x * c1
55 vmulww < scale =16 , us > m1 [0]. vw vr0 cm [2]
56 vmulww < scale =16 , us > m1 [8]. vw vr1 cm [2]
57 vmulww < scale =16 , us > m1 [16]. vw vr2 cm [2]
58 vmulww < scale =16 , us > m1 [24]. vw vr3 cm [2]
59 vmulww < scale =16 , us > m1 [32]. vw vr4 cm [2]
60 vmulww < scale =16 , us > m1 [40]. vw vr5 cm [2]
61 vmulww < scale =16 , us > m1 [48]. vw vr6 cm [2]
62 vmulww < scale =16 , us > m1 [56]. vw vr7 cm [2]
63
64
65 // multiply x * x * c2
66 vmulww < scale =16 , us > vr0 m0 [64]. vw cm [3]
67 vmulww < scale =16 , us > vr1 m0 [72]. vw cm [3]
68 vmulww < scale =16 , us > vr2 m0 [80]. vw cm [3]
69 vmulww < scale =16 , us > vr3 m0 [88]. vw cm [3]
70 vmulww < scale =16 , us > vr4 m0 [96]. vw cm [3]
71 vmulww < scale =16 , us > vr5 m0 [104]. vw cm [3]
72 vmulww < scale =16 , us > vr6 m0 [112]. vw cm [3]
73 vmulww < scale =16 , us > vr7 m0 [120]. vw cm [3]
74
75
76 // p = c0 + x * x * c2
77 5* nop
78 vaddw vr0 cm [1] vr0
79 vaddw vr1 cm [1] vr1
80 vaddw vr2 cm [1] vr2
81 vaddw vr3 cm [1] vr3
82 vaddw vr4 cm [1] vr4
83 vaddw vr5 cm [1] vr5
84 vaddw vr6 cm [1] vr6
85 vaddw vr7 cm [1] vr7
86
87
88 // then add x * c1
89 vaddw vr0 m1 [0]. vw vr0
90 vaddw vr1 m1 [8]. vw vr1
91 vaddw vr2 m1 [16]. vw vr2
92 vaddw vr3 m1 [24]. vw vr3
93 vaddw vr4 m1 [32]. vw vr4
94 vaddw vr5 m1 [40]. vw vr5
95 vaddw vr6 m1 [48]. vw vr6
96 vaddw vr7 m1 [56]. vw vr7
97
98
99
100
101 // start iteration ( Newton-Raphson )
102 // iteration 1

103 // c = -x * i
104 // c : unsigned Q1 .15 i : signed Q4 .12 x : unsigned Q0 .16
105 vmulww < scale =13 , us , negres > m1 [0]. vw m0 [128]. vw vr0
106 vmulww < scale =13 , us , negres > m1 [8]. vw m0 [136]. vw vr1
107 vmulww < scale =13 , us , negres > m1 [16]. vw m0 [144]. vw vr2
108 vmulww < scale =13 , us , negres > m1 [24]. vw m0 [152]. vw vr3
109 vmulww < scale =13 , us , negres > m0 [96]. vw m0 [160]. vw vr4
110 vmulww < scale =13 , us , negres > m0 [104]. vw m0 [168]. vw vr5
111 vmulww < scale =13 , us , negres > m0 [112]. vw m0 [176]. vw vr6
112 vmulww < scale =13 , us , negres > m0 [120]. vw m0 [184]. vw vr7
113
114 // i = i * c or i_2 = i_1 * c
115 // i2 : unsigned Q1 .15
116 3* nop
117 vmulww < scale =12 , su > vr0 vr0 m1 [0]. vw
118 vmulww < scale =12 , su > vr1 vr1 m1 [8]. vw
119 vmulww < scale =12 , su > vr2 vr2 m1 [16]. vw
120 vmulww < scale =12 , su > vr3 vr3 m1 [24]. vw
121 vmulww < scale =12 , su > vr4 vr4 m0 [96]. vw
122 vmulww < scale =12 , su > vr5 vr5 m0 [104]. vw
123 vmulww < scale =12 , su > vr6 vr6 m0 [112]. vw
124 vmulww < scale =12 , su > vr7 vr7 m0 [120]. vw
125
126
127
128 // iteration 2
129 // c = -( x * i )
130 // c : Q1 .15 u x : Q0 .16 u i ; Q1 .15 u
131 vmulww < scale =16 , uu , negres , rnd > m1 [0]. vw m0 [128]. vw vr0
132 vmulww < scale =16 , uu , negres , rnd > m1 [8]. vw m0 [136]. vw vr1
133 vmulww < scale =16 , uu , negres , rnd > m1 [16]. vw m0 [144]. vw vr2
134 vmulww < scale =16 , uu , negres , rnd > m1 [24]. vw m0 [152]. vw vr3
135 vmulww < scale =16 , uu , negres , rnd > m0 [96]. vw m0 [160]. vw vr4
136 vmulww < scale =16 , uu , negres , rnd > m0 [104]. vw m0 [168]. vw vr5
137 vmulww < scale =16 , uu , negres , rnd > m0 [112]. vw m0 [176]. vw vr6
138 vmulww < scale =16 , uu , negres , rnd > m0 [120]. vw m0 [184]. vw vr7
139
140 // i = i * c
141 //15+15 -15
142 3* nop
143 vmulww < scale =15 , uu , rnd , sat > vr0 vr0 m1 [0]. vw
144 vmulww < scale =15 , uu , rnd , sat > vr1 vr1 m1 [8]. vw
145 vmulww < scale =15 , uu , rnd , sat > vr2 vr2 m1 [16]. vw
146 vmulww < scale =15 , uu , rnd , sat > vr3 vr3 m1 [24]. vw
147 vmulww < scale =15 , uu , rnd , sat > vr4 vr4 m0 [96]. vw
148 vmulww < scale =15 , uu , rnd , sat > vr5 vr5 m0 [104]. vw
149 vmulww < scale =15 , uu , rnd , sat > vr6 vr6 m0 [112]. vw
150 vmulww < scale =15 , uu , rnd , sat > vr7 vr7 m0 [120]. vw
151
152 5* nop
153
154
155 stop
156
157
158
159
160 . m0
161 0 x4000 0 x4100 0 x4200 0 x4300 0 x4400 0 x4500 0 x4600 0 x4700
162 0 x4800 0 x4900 0 x4a00 0 x4b00 0 x4c00 0 x4d00 0 x4e00 0 x4f00
163 0 x5000 0 x5100 0 x5200 0 x5300 0 x5400 0 x5500 0 x5600 0 x5700
164 0 x5800 0 x5900 0 x5a00 0 x5b00 0 x5c00 0 x5d00 0 x5e00 0 x5f00
165 0 x6000 0 x6100 0 x6200 0 x6300 0 x6400 0 x6500 0 x6600 0 x6700
166 0 x6800 0 x6900 0 x6a00 0 x6b00 0 x6c00 0 x6d00 0 x6e00 0 x6f00
167 0 x7000 0 x7100 0 x7200 0 x7300 0 x7400 0 x7500 0 x7600 0 x7700

168 0 x7800 0 x7900 0 x7a00 0 x7b00 0 x7c00 0 x7d00 0 x7e00 0 x7f00


169
170 // remez constants for 2 nd degree polynomial of 1/ x in the range
171 // [0.5 ,1]
172 // 4.2523 , -5.8378 , 2.5946
173 // in Q4 .13 fixedpooint format 0 x4409 0 xa298 0 x2983
174
175 // 1.9853 -3.3137 2.7452
176 // in Q3 .13 fixedpoint format : 3 f87 95 f7 57 d8
177
178
179 . cm
180 0 x0000 0 0 0 0 0 0 0
181 0 x4409 0 x4409 0 x4409 0 x4409 0 x4409 0 x4409 0 x4409 0 x4409
182 0 xa298 0 xa298 0 xa298 0 xa298 0 xa298 0 xa298 0 xa298 0 xa298
183 0 x2983 0 x2983 0 x2983 0 x2983 0 x2983 0 x2983 0 x2983 0 x2983
184
185 0 xc000 0 xc000 0 xc000 0 xc000 0 xc000 0 xc000 0 xc000 0 xc000
186 0 x3f87 0 x3f87 0 x3f87 0 x3f87 0 x3f87 0 x3f87 0 x3f87 0 x3f87
187 0 x95f7 0 x95f7 0 x95f7 0 x95f7 0 x95f7 0 x95f7 0 x95f7 0 x95f7
188 0 x57d8 0 x57d8 0 x57d8 0 x57d8 0 x57d8 0 x57d8 0 x57d8 0 x57d8

Listing A.7: Kernel 6

A.2 Inverse square root


A.2.1 Kernel 1
1 // calculates invsqrt ( x ) of the value in m1 [0]
2 // for x in range 0 x4000 0 x8000 in Q1 .15
3 // returns result in unsigned Q1 .15 to vr2 .0
4 // max error 2 ULP
5 // max relative error 2^ -14.2
6 // 20 cycles
7
8 . main
9
10 saddw vr1 .0 m1 [0]. sw m0 [0]. sw
11 4* nop
12 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
13 12* nop
14 stop
15
16 . m0
17 0 x9000
18
19 . m1
20 0
21
22 // constants in Q2 .14 invsqrt ( x +0.875)
23 . cm
24 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.8: Kernel 1

A.2.2 Kernel 2

1 // calculates invsqrt ( x ) of the value in m1 [0]. sw


2 // for x in range 0 x0001 0 x8000 in Q1 .15
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr0 .4
5 // max relative error is 2^ -13.91
6 // 30 cycles
7
8 . main
9
10 stofloatw < sout > vr0 .0 d m1 [0]. sw
11 scopyw vr0 .3 m0 [2]. sw // copy value 1 in Q1 .15
12 nop
13 saddw vr1 .0 vr0 .0 m0 [0]. sw
14 slsrw vr0 .4 vr0 .1 m0 [3]. sw // shift exponent
15 sandw vr0 .2 vr0 .1 m0 [3]. sw // set flag if exponent is odd number
16 2* nop
17 scopyw . ne vr0 .3 m0 [1]. sw
18 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
19 12* nop
20 smulww < scale =15 , uu , sat > vr3 .0 vr2 .0 vr0 .3 // multiply result
with 1 or sqrt ( x )
21 5* nop
22 stop
23
24
25 . m0
26 0 x9000 0 xB505 0 x8000 0 x0001 0 0 0 0
27
28 . m1
29 0
30
31 // constants in Q2 .14 invsqrt ( x +0.875)
32 . cm
33 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.9: Kernel 2

A.2.3 Kernel 2b
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x0001 0 x8000 in Q1 .31 x
3 // returns result in unsigned Q8 .24 to vr3 .0 d
4 // max relative error is 2^ -13.91
5 // 29 cycles
6
7 . main
8
9 stofloatw < sout > vr0 .0 d m1 [0]. sw
10 2* nop
11 saddw vr1 .0 vr0 .0 m0 [0]. sw
12 3* nop
13 slslw vsr1 .4 vr0 .1 m0 [1]. sw
14 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
15 9* nop
16
17 smulwd < scale =31 , uu > vr3 .0 d vr2 .0 m0 [ ar1 +8]. sd
18 8* nop
19 stop
20

21 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd


22 . m0
23 0 x9000 0 x0001 0 0 0 0 0 0
24 0 x0100 0 x0000 0 x016A 0 x09E6 0 x0200 0 x0000 0 x02D4 0 x13CD
25 0 x0400 0 x0000 0 x05A8 0 x279A 0 x0800 0 x0000 0 x0B50 0 x4F33
26 0 x1000 0 x0000 0 x16A0 0 x9E66 0 x2000 0 x0000 0 x2D41 0 x3CCD
27 0 x4000 0 x0000 0 x5A82 0 x799A 0 x8000 0 x0000 0 xB504 0 xF334
28
29 . m1
30 0
31
32
33
34 // degree 6
35 // 0 x446B 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
36
37
38 // constants in Q2 .14 invsqrt ( x +0.875)
39 . cm
40 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.10: Kernel 2b

A.2.4 Kernel 3a
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -26.58
6 // ( tested with approx 29 thousand values in the entire range
7 // 51 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18 scopyd . ne vr0 .2 d m0 [8]. sd
19 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
20 12* nop
21
22 // iteration
23 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
24 5* nop
25 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
26 2* nop
27 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
28 5* nop
29 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
30 5* nop
31
32
33 // multiply with 1 if exponent is even or sqrt (2) if it is odd
34 smuldd < scale =31 , uu , rnd > vr3 .0 d vr4 .3 d vr0 .2 d
35 5* nop
36

37 stop
38
39 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
40 . m0
41 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
42 0 xB504 0 xF334 0 x8000 0 x0000
43
44
45 . m1
46 0
47
48
49 // constants in Q2 .14 invsqrt ( x +0.875)
50 . cm
51 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.11: Kernel 3a

A.2.5 Kernel 3b
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31 x
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .6
5 // max relative error is 2^ -30.6 tested with 29000 values
6 // 72 cycles
7
8 . main
9
10 stofloat32d < sout > vr1h m1 [0]. sd
11 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
12 nop
13 saddw vr1 .0 vr1 .4 m0 [0]. sw
14 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
15 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
16 2* nop
17 scopyd . ne vr0 .2 d m0 [8]. sd
18 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
19 12* nop
20
21 // iteration
22 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
23 5* nop
24 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
25 2* nop
26 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
27 5* nop
28 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
29 5* nop
30
31 // iteration 2
32 smuldd < scale =31 , uu , rnd > vr5 .0 d vr4 .3 d vr1 .2 d // R * X
33 5* nop
34 smuldd < scale =32 , uu , rnd > vr5 .1 d vr4 .3 d vr5 .0 d // R * R * X /2
35 2* nop
36 ssubd vr5 .2 d m0 [4]. sd vr5 .1 d
37 5* nop
38 smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d
39 5* nop
40
41 // multiply with 1 if exponent is even or sqrt (2) if it is odd

42 smuldd < scale =31 , uu , rnd > vr3 .0 d vr5 .3 d vr0 .2 d


43 5* nop
44
45 stop
46
47 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
48 . m0
49 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
50 0 xB504 0 xF334 0 x8000 0 x0000
51
52
53 . m1
54 0
55
56
57 // degree 6
58 // 0 x446B 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
59
60
61 // constants in Q2 .14 invsqrt ( x +0.875)
62 . cm
63 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.12: Kernel 3b

A.2.6 Kernel 3c
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -29.85
6 // ( tested with approx 30 thousand values in the entire range
7 // 66 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18
19 scopyd . ne vr0 .2 d m0 [8]. sd
20 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0
vr1 .0 cm [0]
21 12* nop
22
23 // iteration
24 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R1 * X
25 5* nop
26 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // T1 = R1 * R1 * X /2
27 2* nop
28 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d // T2 = 1.5 - T1
29 5* nop
30
31 // iteration 2 Goldschmidt
32 smuldd < scale =31 , uu , rnd > vr6 .0 d vr4 .2 d vr4 .2 d // T2 * T2
33

34 // this next instruction is part of the previous iteration


35 // but placed here due to data dependencies
36 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d // R2 = R1 * T2
37 4* nop
38 smuldd < scale =31 , uu , rnd > vr6 .1 d vr6 .0 d vr4 .1 d // TT1 = T2 * T2 * T1
39 2* nop
40 ssubd vr5 .2 d m0 [4]. sd vr6 .1 d // TT2 = 1.5 - TT1
41 5* nop
42 smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d // R3 = R2 * TT2 =
R1 * T2 * TT2
43 5* nop
44
45 // multiply with 1 if exponent is even , 2^ -0.5 if odd .
46 smuldd < scale =31 , uu , rnd > vr3 .1 d vr5 .3 d vr0 .2 d
47 5* nop
48 stop
49
50 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
51 . m0
52 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
53 0 xB504 0 xF334 0 x8000 0 x0000
54
55
56 . m1
57 0
58
59 // constants in Q2 .14 invsqrt ( x +0.875)
60 . cm
61 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0

Listing A.13: Kernel 3c
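
Kernel 3c saves the R*X product of the second iteration by reusing T1 and T2: since R1 = R0*T2, the new half-residual R1*R1*X/2 equals T2*T2*T1, so the second step needs no fresh multiply with X. A small C sketch (reference model only, with a placeholder seed value) demonstrates the identity:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.8;
    double r0 = 1.0 / sqrt(0.875);  /* placeholder seed */

    double t1 = 0.5 * r0 * r0 * x;  /* T1 = R0*R0*X/2 */
    double t2 = 1.5 - t1;           /* T2 = 1.5 - T1   */
    double r1 = r0 * t2;            /* R1 = R0*T2      */

    /* fused second step: R1*R1*X/2 = (R0*T2)^2 * X/2 = T2*T2*T1 */
    double tt1 = t2 * t2 * t1;      /* TT1             */
    double tt2 = 1.5 - tt1;         /* TT2 = 1.5 - TT1 */
    double r2 = r1 * tt2;           /* R2 = R1*TT2 = R0*T2*TT2 */

    printf("r2 = %.12f, exact = %.12f\n", r2, 1.0 / sqrt(x));
    return 0;
}

Dropping the dependency on R1*X is what lets the kernel overlap the two iterations and save six cycles compared to Kernel 3b.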

A.2.7 Kernel 4
// calculates invsqrt(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x7fffffff in Q1.31
// uses a degree-11 polynomial
// returns result in soft float format
// mantissa in vr4.3d and exponent in vr3.6
// max log2(relative error) is -27
// 37 cycles

.main

stofloat32d <sout> vr3h m1[0].sd
2*nop
saddd vr3.0d vr3.2d m0[0].sd
tmacdo <scale=30, rnd, ss> vr5.0d m0[8].vd cm[3] // dc term to accumulator
4*nop
powersd <scale=31, start=1, s, rnd> vr0 vr3.0d
7*nop
svmuldd <scale=31, rnd, ss> vr1 vr0.3d vr0
tmacdo <scale=32, rnd, ss> vr4.0d vr0 cm[4]
4*nop
svmuldd <scale=31, rnd, ss> vr2 vr1.3d vr0
tmacdo <scale=32, rnd, ss> vr4.1d vr1 cm[5]
3*nop
tmacdo <scale=29, rnd, ss> vr4.3d vr2 cm[2]
6*nop

stop

.m0
0x9000 0x0000 0x111A 0xCEE5 0x107b 0x2e11 0 0
0x4000 0 0 0 0 0 0 0

.m1
0 0 0x107b 0x2e11 0x21f 0x420f

// constants in invsqrt(x+0.875)
.cm
0xF639 0xD2EC 0x0860 0xB952 0xF805 0x6766 0x07FA 0x36B6
0xF7CA 0x23F3 0x08B6 0x97D1 0xF78C 0xA1AD 0x081D 0xADE6
0xDCCA 0x9833 0xBC42 0xF6B5 0x94E7 0x8F5B 0 0
0x446B 0x3B95 0 0 0 0 0 0
0xB1CE 0x975D 0x4305 0xCA90 0xC02B 0x3B30 0x3FD1 0xB5B0
0xBE51 0x1F98 0x45B4 0xBE8A 0xBC65 0x0D66 0x40ED 0x6F2D

Listing A.14: Kernel 4
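
Kernel 4 avoids the iterations entirely and evaluates a degree-11 polynomial as three length-4 dot products against the power vectors (x..x^4), x^4*(x..x^4) and x^8*(x..x^4), which is the work done by powersd, svmuldd and the tmacdo accumulations. A C sketch of the same splitting follows; the all-ones coefficients are placeholders, not the kernel's Q-format constants:

#include <stdio.h>

int main(void) {
    double c[12] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; /* placeholder c0..c11 */
    double x = 0.1;
    double p[4], q[4], s[4];

    p[0] = x; p[1] = x * x; p[2] = p[1] * x; p[3] = p[2] * x; /* powersd: x..x^4 */
    for (int i = 0; i < 4; i++) q[i] = p[3] * p[i];           /* x^5..x^8  */
    for (int i = 0; i < 4; i++) s[i] = q[3] * p[i];           /* x^9..x^12 */

    double acc = c[0];                                        /* dc term */
    for (int i = 0; i < 4; i++) acc += c[1 + i] * p[i];
    for (int i = 0; i < 4; i++) acc += c[5 + i] * q[i];
    for (int i = 0; i < 3; i++) acc += c[9 + i] * s[i];       /* x^12 coeff is zero */
    printf("p(x) = %.12f\n", acc);
    return 0;
}

Unlike Horner's rule, this blocked scheme exposes four independent multiplications per step, which matches the SIMD datapath and explains the low 37-cycle count.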

A.3 Square root


A.3.1 Kernel 1
// calculates sqrt(x) of the value in m1[0]
// for x in range 0x4000 to 0x8000 in Q1.15
// returns result in Q2.14 to vr2.0

.main

saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0xa000

.m1
0

// constants in Q1.15 sqrt(x+0.75)
.cm
0x6EDB 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8 0

Listing A.15: Kernel 1
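
As a reference for the polynomial step, the listed Q1.15 constants can be evaluated directly. The sketch below uses Horner's rule in double precision; the evaluation order and the internal scaling/rounding of polyw are assumptions and are not modelled exactly:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* the kernel's Q1.15 coefficients for sqrt(t + 0.75), cm[0] first */
    const int16_t cm[7] = { 0x6EDB, 0x49E7, (int16_t)0xE75D, 0x105F,
                            (int16_t)0xF26D, 0x0E65, (int16_t)0xF0E8 };
    double x = 0.8;                /* input in [0.5, 1.0) */
    double t = x - 0.75;           /* matches saddw with m0[0] = 0xa000 (-0.75) */
    double p = 0.0;
    for (int i = 6; i >= 0; i--)   /* Horner evaluation of the degree-6 polynomial */
        p = p * t + cm[i] / 32768.0;
    printf("p = %.6f, sqrt(x) = %.6f\n", p, sqrt(x));
    return 0;
}

The constant term 0x6EDB is sqrt(0.75) in Q1.15 and 0x49E7 is its derivative 1/(2*sqrt(0.75)), so the coefficients are a fitted expansion around the midpoint of the input range.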

A.3.2 Kernel 2
// calculates sqrt(x) of the value in m1[0]
// for x in range 0x0001 to 0x8000 in Q1.15
// returns result in unsigned Q0.16 to vr3.0

.main

stofloatw <sout> vr0.0d m1[0].sw
scopyw vr0.3 m0[2].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr0.0 m0[0].sw // add constant before polynomial evaluation
slsrw vr0.4 vr0.1 m0[3].sw // shift exponent
sandw vr0.2 vr0.1 m0[3].sw // set flag if exponent is odd number
2*nop
scopyw.ne vr0.3 m0[1].sw // overwrite the value 1 with the value sqrt(0.5)
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
10*nop
slsrw vr1.1 vr0.3 vr0.4 // shift result by floor(exponent / 2)
2*nop

// multiply result with 1 (even exponent) or sqrt(0.5) (odd exponent)
smulww <scale=15, uu, rnd> vr3.0 vr1.1 vr2.0

5*nop
stop

.m0
0xa000 0x5a82 0x8000 0x0001

.m1
0

// constants in Q1.15 sqrt(x+0.75)
.cm
0x6EDB 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8 0

Listing A.16: Kernel 2
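
The structure of Kernel 2 in plain terms: normalise x = m * 2^-e with m in [0.5, 1), take the square root of the mantissa with the polynomial, shift by floor(e/2), and fix up odd exponents with a multiply by sqrt(0.5), since 2^(-e/2) = 2^(-floor(e/2)) * sqrt(0.5) when e is odd. A double-precision reference model, with sqrt(m) standing in for the polynomial step:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0123;                  /* assumes x > 0 */
    int e = 0;
    double m = x;
    while (m < 0.5) { m *= 2.0; e++; }  /* models stofloatw's normalisation */
    double s = sqrt(m);                 /* stands in for polyw on m - 0.75 */
    double r = ldexp(s, -(e >> 1));     /* shift result by floor(e/2) */
    if (e & 1) r *= sqrt(0.5);          /* odd-exponent fixup */
    printf("%.9f vs %.9f\n", r, sqrt(x));
    return 0;
}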

A.3.3 Kernel 3
// calculates sqrt(x) of the value in m1[0]
// for x in range 0x0001 to 0x8000 in Q1.15
// returns result in Q2.14 to vr3.0

.main

stofloatw <sout> vr0.0d m1[0].sw // convert to soft float
2*nop
saddw vr1.0 vr0.0 m0[0].sw // add -0.75
3*nop
scopyw vsr1.4 vr0.1 // copy exponent to address register to use as offset

// evaluate polynomial
polyw <start=0, scale=15, scale2=15, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
9*nop
smulww <scale=15, uu, rnd> vr3.0 vr2.0 m1[ar1+8].sw // scale output
8*nop
stop

.m0
0x8400

.m1
0 0 0 0 0 0 0 0
0x8000 0x5a82 0x4000 0x2d41 0x2000 0x16a1 0x1000 0x0b50
0x0800 0x05a8 0x0400 0x02d4 0x0200 0x016a 0x0100 0x00b5
// m1[8] - m1[23] contains a table of sqrt(0.5)^e for various e.

// degree 6
0x6EDA 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8

// constants in Q1.14 sqrt(x+0.75)
.cm
0x7DFD 0x4106 0xEF5B 0x0A3E 0x0107 0x0EA6 0 0
0x6EDB 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8 0

Listing A.17: Kernel 3
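
Kernel 3 replaces the odd/even flag and the two-step scaling of Kernel 2 with a single table lookup: m1[8..23] holds sqrt(0.5)^e, so one multiply applies both the halved shift and the odd-exponent correction at once. Reference model (again with sqrt(m) in place of the polynomial):

#include <math.h>
#include <stdio.h>

int main(void) {
    double tab[16];
    for (int e = 0; e < 16; e++)        /* models the sqrt(0.5)^e table in m1 */
        tab[e] = pow(0.5, 0.5 * e);
    double x = 0.0123;                  /* assumes x > 0 */
    int e = 0;
    double m = x;
    while (m < 0.5) { m *= 2.0; e++; }
    double r = sqrt(m) * tab[e];        /* one multiply replaces shift + fixup */
    printf("%.9f vs %.9f\n", r, sqrt(x));
    return 0;
}

Trading the conditional copy and the extra shift for one indexed multiply is what shortens the critical path of this variant.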

A.3.4 Kernel 4a
// calculates sqrt(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x80000000 in Q0.32
// returns result in soft floating point, mantissa in vr3.0
// exponent in vr1.7
// max relative error is approx. 2^-26.58
// (tested with approx. 29 thousand values in the entire range)
// 57 cycles

.main

stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// iteration 2
// smuldd <scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R*X
// 5*nop
// smuldd <scale=32, uu, rnd> vr5.1d vr4.3d vr5.0d // R*R*X/2
// 2*nop
// ssubd vr5.2d m0[4].sd vr5.1d
// 5*nop
// smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d
// 5*nop

// multiply with 1 if exponent is even or sqrt(2) if it is odd
// smuldd <scale=31, uu, rnd> vr3.0d vr4.3d vr0.2d
// 5*nop
smuldd <scale=31, uu, rnd> vr3.0d vr4.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
smuldd <scale=30, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0x5A82 0x799A 0x8000 0x0000

.m1
0

// degree 6
// 0x446B 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

Listing A.18: Kernel 4a
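
Kernels 4a to 4c and 5 obtain the square root from the inverse square root pipeline with one extra multiply, x * x^-0.5 = x^0.5, so the polynomial seed and the Newton-Raphson iterations are shared with the invsqrt kernels instead of requiring a separate approximation. A trivial double-precision model of the final step:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.7;
    double r = 1.0 / sqrt(x); /* stands in for the seeded NR iterations */
    double s = x * r;         /* final smuldd: x * x^-0.5 = x^0.5 */
    printf("s = %.12f, sqrt = %.12f\n", s, sqrt(x));
    return 0;
}

Since the multiply by x is exact up to rounding, the relative error of the square root is essentially the relative error of the inverse square root estimate.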

A.3.5 Kernel 4b
// calculates sqrt(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x80000000 in Q1.31
// returns result in soft floating point, mantissa in vr3.0
// exponent in vr1.7
// max relative error is 2^-29.68
// (tested with approx. 29 thousand values in the entire range)
// 72 cycles

.main

stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R0*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R0*R0*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop
// iteration 2
smuldd <scale=31, uu, rnd> vr6.0d vr4.2d vr4.2d // T2*T2

smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R1 = R0*T2
4*nop

smuldd <scale=31, uu, rnd> vr6.1d vr6.0d vr4.1d // TT1 = T2*T2*T1

2*nop
ssubd vr5.2d m0[4].sd vr6.1d // TT2 = 1.5 - TT1
5*nop
smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R2 = R1*TT2
5*nop

smuldd <scale=30, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd <scale=31, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0x5A82 0x799A 0x8000 0x0000

.m1
0

// degree 6
// 0x446B 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

Listing A.19: Kernel 4b

A.3.6 Kernel 4c
// calculates sqrt(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x80000000 in Q1.31
// returns result in soft floating point, mantissa in vr3.0
// exponent in vr1.7
// max relative error is 2^-30.1
// (tested with approx. 29 thousand values in the entire range)
// 78 cycles

.main

stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration 1
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R1*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R1*R1*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R2 = R1*T2
5*nop

// iteration 2
smuldd <scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R2*X
5*nop
smuldd <scale=32, uu, rnd> vr5.1d vr5.0d vr4.3d // TT1 = R2*R2*X/2
2*nop
ssubd vr5.2d m0[4].sd vr5.1d // TT2 = 1.5 - TT1
5*nop
smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R3 = R2*TT2
5*nop

smuldd <scale=30, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd <scale=31, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0x5A82 0x799A 0x8000 0x0000

.m1
0

// degree 6
// 0x446B 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

Listing A.20: Kernel 4c

A.3.7 Kernel 5
// calculates sqrt(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x80000000 in Q1.31
// returns fixed point result in Q1.31 to vr3.2d
// (soft float mantissa in vr3.1d, exponent in vr1.7)
// max error 2.47 ULP
// 51 cycles

.main

stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// iteration 2
smuldd <scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R*X
5*nop
smuldd <scale=32, uu, rnd> vr5.1d vr4.3d vr5.0d // R*R*X/2
2*nop
ssubd vr5.2d m0[4].sd vr5.1d
5*nop
smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d
5*nop

// multiply with 1 if exponent is even or sqrt(2) if it is odd
// smuldd <scale=31, uu, rnd> vr3.0d vr4.3d vr0.2d
// 5*nop
smuldd <scale=31, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd.
smuldd <scale=30, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
slsrd <rnd> vr3.2d vr3.1d vr1.7
2*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0x5A82 0x799A 0x8000 0x0000

.m1
0

// degree 6
// 0x446B 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0

Listing A.21: Kernel 5
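
Kernel 5 differs from Kernels 4a to 4c mainly in the final slsrd<rnd>, which folds the soft-float result back into Q1.31 by a rounded right shift with the halved exponent. A C sketch of that conversion; the function name and the exact rounding behaviour of slsrd<rnd> are assumptions:

#include <stdint.h>
#include <stdio.h>

/* shift the Q1.31 mantissa right by e with round-to-nearest */
static uint32_t softfloat_to_q31(uint32_t mant, unsigned e) {
    if (e == 0) return mant;
    return (uint32_t)(((uint64_t)mant + (1ull << (e - 1))) >> e);
}

int main(void) {
    /* mantissa 0.75 in Q1.31 shifted right by 3 gives 0.09375 */
    printf("0x%08x\n", (unsigned)softfloat_to_q31(0x60000000u, 3));
    return 0;
}

Producing a plain fixed-point value is also why this kernel's accuracy is quoted in ULP rather than as a relative error.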

A.4 Logarithms
A.4.1 Kernel 1

// calculates log_2(x) of the value in m1[0]
// for x in range 0x4000 to 0x8000 in Q1.15
// returns result in Q1.15 to vr2.0
// 17 cycles
// max error is 2^-9.74

.main

polyw <start=0, scale=15, scale2=11, sign=us, rnd1, rnd2> vr2.0 m1[0].sw cm[0]
14*nop
stop

// polynomial constants in signed Q5.11
.cm
0xE199 0x5178 0x8E60 0x6865 0xCAAE 0x0B7D 0 0

Listing A.22: Kernel 1

A.4.2 Kernel 1b

// calculates log_2(x) of the value in m1[0]
// for x in range 0x4000 to 0x8000 in Q1.15
// returns result in Q1.15 to vr2.0
// 17 cycles
// max error is 2^-9.74

.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x8000

// polynomial constants in Q2.14
.cm
0x0000 0x5C70 0xD416 0x2FEA 0x20F2 0x5BE7 0 0

Listing A.23: Kernel 1b



A.4.3 Kernel 2
// calculates log_2(x) of the value in m1[0]
// for x in range 0x0000 to 0x8000 in Q1.15
// returns result in Q5.11 to vr3.0

.main

stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
slslw vr1.4 vr1.3 m0[1].sw
3*nop
// vr2.0 is Q5.11, vr1.0 is Q1.15, cm[0] is Q2.14 (15+14-11 = 18)
polyw <start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b

.m1
0

// constants in Q2.14 log2(x+1)
.cm
0x0000 0x37A7 0xE590 0x1CD9 0x13D6 0x3754 0 0
0x0000 0x5C70 0xD416 0x2FEA 0x20F2 0x5BE7 0 0

Listing A.24: Kernel 2
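
The full-range logarithm kernels rest on the identity log2(m * 2^-e) = log2(m) - e: the polynomial supplies log2(m) for the normalised mantissa (evaluated on m - 1 with log2(x+1) constants) and the ssubw subtracts the pre-shifted exponent. A double-precision reference model, with log2(m) standing in for the polynomial:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0375;                  /* assumes x > 0 */
    int e = 0;
    double m = x;
    while (m < 0.5) { m *= 2.0; e++; }  /* x = m * 2^-e, m in [0.5, 1) */
    double p = log2(m);                 /* stands in for polyw on m - 1 */
    double r = p - e;                   /* ssubw: log2(x) = log2(m) - e */
    printf("%.9f vs %.9f\n", r, log2(x));
    return 0;
}

The slslw that shifts the exponent simply aligns the integer e with the Q5.11 output format before the subtraction.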

A.4.4 Kernel 3
// calculates log(x) (natural log) of the value in m1[0]
// for x in range 0x0000 to 0x8000 in Q1.15
// returns result in Q5.11 to vr3.0

.main

stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww <scale=4, ss, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(e)
3*nop
polyw <start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x58B9

.m1
0

// constants in Q2.14 ln(x+1)
.cm
0x0000 0x4013 0xE190 0x2136 0x16D6 0x3FB3 0 0
0x8000

Listing A.25: Kernel 3
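
The natural-log kernel reuses the same reduction and only changes the scale of both parts: ln(x) = ln(m) - e * ln(2), so the polynomial constants approximate ln(x+1) instead of log2(x+1), and the exponent is multiplied by 0x58B9, which is ln(2) in Q1.15 (the comment's 1/log_2(e) is the same constant). In a double-precision model:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.0375;                  /* assumes x > 0 */
    int e = 0;
    double m = x;
    while (m < 0.5) { m *= 2.0; e++; }
    /* log(m) stands in for polyw with the ln(x+1) constants */
    double r = log(m) - e * 0.6931471805599453;
    printf("%.9f vs %.9f\n", r, log(x));
    return 0;
}

The log10 kernel that follows is the same scheme with log10(x+1) constants and the exponent scaled by 1/log_2(10).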

A.4.5 Kernel 4
// calculates log10(x) of the value in m1[0]
// for x in range 0x0001 to 0x8000 in Q1.15
// returns result in Q4.12 to vr3.0

.main

stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww <scale=5, su, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(10)
3*nop
polyw <start=0, scale=15, scale2=19, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x9A21

.m1
0

// constants in signed Q0.16 log10(x+1)
.cm
0x0000 0x6F4F 0xCB1F 0x39B2 0x27AB 0x6EA9 0 0

Listing A.26: Kernel 4

A.4.6 Kernel 5
// calculates log_2(x) of the value in m1[0].sd
// for x in range 0x00000001 to 0x7fffffff in Q1.31
// uses a degree-11 polynomial
// returns result in soft float format
// mantissa in vr4.3d and exponent in vr3.6 in signed Q1.31
//
// 37 cycles

.main

stofloat32d <sout> vr3h m1[0].sd
2*nop
saddd vr3.0d vr3.2d m0[0].sd
tmacdo <scale=28, rnd, ss> vr5.0d m0[8].vd cm[3] // dc term to accumulator

4*nop
powersd <scale=31, start=1, s, rnd> vr0 vr3.0d
7*nop
svmuldd <scale=31, rnd, ss> vr1 vr0.3d vr0
tmacdo <scale=29, rnd, ss> vr4.0d vr0 cm[0]
4*nop

svmuldd <scale=31, rnd, ss> vr2 vr1.3d vr0
tmacdo <scale=29, rnd, ss> vr4.1d vr1 cm[1]
3*nop
tmacdo <scale=29, rnd, ss> vr4.3d vr2 cm[2]
6*nop

stop

.m0
0x9800 0x0 0 0 0 0 0 0
0x4000 0

.cm
0x38d1 0xead0 0xdd08 0xaae6 0x1cb0 0xb0f0 0xe584 0xb35
0x1a12 0x7262 0xe552 0x2a11 0x1c75 0x8219 0xdfd5 0x82fa
0x19b7 0x7a30 0xea6c 0xbe87 0x7fff 0xffff 0x0 0x0
0xf66a 0x8e 0 0 0 0 0 0

Listing A.27: Kernel 5


Appendix B

Simulator instruction set

The following table lists the instructions that were supported by the simulator used for the implementations in this thesis. The purpose of including this list here is to aid the reading of the kernel source codes.
Since both the simulator and the instruction set architecture are still under development, not all instructions of the architecture have been implemented yet, and some of the instructions in the simulator are outdated or have been changed. The instruction set architecture is documented in the Sleipnir Instruction Set Manual [2].
Some of the instructions were added to the simulator for the purpose of this thesis work, and some of the instructions are experimental and will not be part of the final instruction set. The table might contain errors.

callq Call immediate address.

dbr2bf Double Radix-2 Butterfly
hvcopy Half vector copy
intq interrupt master
jmpq jump to immediate address.
nop no operation
polyw Evaluates a polynomial for value in src0, by using constants in src1
powersd Calculates powers of a double scalar and returns 4 powers to a vector word
powersw Calculates powers of a scalar word and returns 8 powers to a vector word
r4bf Radix-4 Butterfly
repeat hardware looping
repeatr hardware looping with repeat register
ret return from subroutine
sabsd scalar absolute double word
sabsdd scalar absolute difference double word
sabsdw scalar absolute difference word
sabsw scalar absolute word
saddb scalar byte addition
saddd scalar double word addition
saddw scalar word addition
sandd scalar logic AND double word
sandw scalar logic AND word
savgd scalar average double word
savgw scalar average word
scabs2 scalar complex squared absolute
scadd scalar complex addition
scmac scalar multiply and accumulate complex
scmpd scalar double word comparison
scmpw scalar word comparison
scmul scalar multiplication complex to complex
scopyd scalar copy double word
scopyw scalar copy word
scopywq scalar copy word immediate
scsub scalar complex subtraction
sctzd Count leading zeros of double
sctzw Count leading zeros of word
sinw Scalar input word
slsld scalar logical left shift double
slslw scalar logical left shift word
slsrd scalar logical right shift double
slsrw scalar logical right shift word
smacw scalar multiply and accumulate word
smaxw scalar maximum word
sminw scalar minimum word
smuldd scalar multiplication double word to double word
smulw scalar multiplication word to word
smulwd scalar multiplication word to double word
smulww scalar multiplication word to word
sord scalar logic OR double word
sorw scalar logic OR word
soutw Scalar output word
ssubb scalar byte subtraction
ssubd scalar double word subtraction
ssubw scalar word subtraction
stofloat32d Converts a scalar double to soft floating point format. Returns a half vector: words 0 and 1 are the mantissa (32 bits), word 2 is the exponent, word 3 is unchanged.
stofloatd Converts a scalar double to soft floating point format. Higher word is mantissa, lower is exponent.
stofloatw Converts a 16 bit word into soft floating point format. Higher word is mantissa, lower is exponent.
stop stop execution
sumo Sum of a vector
svmuldd scalar to vector multiplication, double word to double word
svsubw scalar to vector word subtraction
sxord scalar logic XOR double word
sxorw scalar logic XOR word
tcmac Triangular multiply and accumulate
tcmaca Triangular multiply and accumulate
tcmaco Triangular multiply and accumulate
tmac Triangular multiply and accumulate
tmaca Triangular multiply and accumulate
tmacawd Triangular multiply and accumulate
tmacdo Triangular multiply and accumulate
tmaco Triangular multiply and accumulate
tmacowd Triangular multiply and accumulate
tmacwd Triangular multiply and accumulate
vacrclr clear vector accumulator
vacrow output vector accumulator word
vacrset Set vector accumulator
vaddb vector byte addition
vaddd vector double word addition
vaddw vector word addition
vand Vector logic AND
vcadd vector complex addition
vcmac vector multiply and accumulate complex
vcmpd Vector double word comparison
vcmpw Vector word comparison
vcmul vector multiplication complex to complex
vcopy Vector copy
vcsub vector complex subtraction
vctzd Count leading zeros of double
vctzw Count leading zeros of vector word
vlsld vector logical left shift double
vlslw vector logical left shift word
vlsrd vector logical right shift double
vlsrw vector logical right shift word
vmacw vector multiply and accumulate word
vmantw Converts a vector word to soft floating point format. Returns only the mantissas in a vector word
vmuldd vector multiplication double word to double word
vmulw vector multiplication word to word. Calculates the square of one operand
vmulwd vector multiplication word to double word
vmulww vector multiplication word to word
vor Vector logic OR
vscmac vector scalar multiply and accumulate complex
vscmaco vector scalar multiply and accumulate complex and output
vsmacw vector multiply with scalar and accumulate word
vsubw vector word subtraction
vxor Vector logic XOR
