Fixed Point Implementation of Elementary Functions
Master's thesis (Examensarbete)
Orri Tómasson
LiTH-ISY-EX--10/4399--SE
Linköping 2010
Title: Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor
Keywords: SIMD, DSP, mathematical functions, elementary functions, polynomial approximation, fixed-point arithmetic
Abstract
This thesis is about implementing the functions 1/x, 1/√x, √x and log(x) on a DSP platform.
A multi-core DSP platform that consists of one master processor core and several
SIMD coprocessor cores is currently being designed by a team at the Computer
Engineering Department of Linköping University.
The SIMD coprocessors’ arithmetic logic unit (ALU) has 16 multipliers to sup-
port vector multiplication instructions. By efficiently using the 16 multipliers, it
is possible to evaluate polynomials very fast. The ALU does not have (hardware)
support for floating point arithmetic, so the challenge is to get good precision by
using fixed point arithmetic.
Precise and fast solutions to implement the mathematical functions are found
by converting the fixed point input to a soft floating point format before poly-
nomial approximation, choosing a polynomial based on an error analysis of the
polynomial approximation, and using Newton-Raphson or Goldschmidt iterations
to improve the precision of the polynomial approximations.
Finally, suggestions are made for changes and additions to the instruction set architecture that would make the implementations faster while efficiently using the existing hardware.
Contents
1 Introduction 1
1.1 Fixed point representation . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The scope of this work . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Report outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3 Polynomial Approximations 9
3.1 Two types of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 The fixed point polynomial error . . . . . . . . . . . . . . . 10
3.2 Choosing a polynomial . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Taylor polynomials . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.2 Interpolation polynomials . . . . . . . . . . . . . . . . . . . 12
3.2.3 Min-max polynomials . . . . . . . . . . . . . . . . . . . . . 12
3.2.4 Other polynomials . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.5 Using a polynomial for f (x + a) to evaluate f (x) . . . . . . 13
3.3 Using Soft floating point format . . . . . . . . . . . . . . . . . . . . 16
4 Other methods 17
4.1 Newton-Raphson and Goldschmidt . . . . . . . . . . . . . . . . . . 17
4.1.1 Algorithm for reciprocal . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Opposite sign of product trick . . . . . . . . . . . . . . . . . 17
4.1.3 Algorithm for inverse square root . . . . . . . . . . . . . . . 19
4.1.4 Goldschmidt iteration . . . . . . . . . . . . . . . . . . . . . 19
4.1.5 Our usage of Goldschmidt iterations . . . . . . . . . . . . . 20
4.1.6 Error after iteration . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Lookup tables and interpolation . . . . . . . . . . . . . . . . . . . . 21
4.2.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Piecewise polynomial interpolation . . . . . . . . . . . . . . . . . . 22
4.3.1 Usage on ePUMA . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Implementations 25
5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.2 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Estimation of Cycle cost . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.1 Cycle cost for multiple inputs . . . . . . . . . . . . . . . . . 26
5.4 Kernel code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5 Invalid input handling . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5.1 Zero input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5.2 Memory and register usage . . . . . . . . . . . . . . . . . . 28
5.6 Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6.1 Choosing a Polynomial . . . . . . . . . . . . . . . . . . . . . 29
5.6.2 Pre-analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.6.3 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.7 Inverse square root . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.7.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.7.2 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.8 Square root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.1 Choosing a polynomial . . . . . . . . . . . . . . . . . . . . . 45
5.8.2 Soft floating point usage . . . . . . . . . . . . . . . . . . . . 45
5.8.3 32 bit version . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.4 Zero input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.8.5 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.9 Logarithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.9.1 Soft floating point usage . . . . . . . . . . . . . . . . . . . . 53
5.9.2 Inputs in other fixed point formats . . . . . . . . . . . . . . 53
5.9.3 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Instruction proposals 59
6.1 POWERSW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.2 Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1.3 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 POWERSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 POLYW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 TMACDO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 SSUMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.5.1 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Soft floating point instructions . . . . . . . . . . . . . . . . . . . . 71
6.6.1 STOFLOATW . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.6.2 STOFLOATD . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.6.3 TOFLOATADD . . . . . . . . . . . . . . . . . . . . . . . . 73
6.6.4 Converting vectors to soft floating point . . . . . . . . . . . 73
6.6.5 Conversion from soft floating point format . . . . . . . . . . 74
6.7 Powers with integer exponent . . . . . . . . . . . . . . . . . . . . . 75
6.8 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.8.1 Opposite signed products after multiplication instructions . 76
6.8.2 Special multiplication for inverse square root iterations . . . 77
6.8.3 Scale flag to shift instructions . . . . . . . . . . . . . . . . . 77
6.8.4 Scale flag to add, sub, and other trivial arithmetic instructions 77
6.8.5 Long datapath version of short datapath instructions . . . . 78
7 Results 79
7.1 Function kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Proposed features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8 Conclusions 89
Bibliography 91
Chapter 1
Introduction
The goal of this thesis work is to implement some elementary mathematical func-
tions as fast as possible and as precise as possible on a new DSP platform called
ePUMA, which is being designed and researched at the Computer Engineering
Department of Linköping University.
The main theme in processor design for the last few years has been to increase
the number of processing cores with increased parallelism rather than increasing
the clock frequency. ePUMA's approach to this is to have one master control processor and eight SIMD coprocessors. Its architecture is described in more detail in [11], [1] and [2].
One of the trade-offs when designing a multi-core system is the number of cores
versus size of each core. A larger core has more hardware and hence hardware
support for more features, while more cores can perform more tasks in parallel.
In order to keep each coprocessor small, the SIMD coprocessor cores support neither division nor floating point arithmetic, which is not uncommon for DSP architectures.
The SIMD cores do, however, have hardware support for vector multiplication: each SIMD core has sixteen 16 bit multipliers.
This large number of multipliers can be used to evaluate polynomials. By using the multipliers efficiently, we can evaluate one polynomial of degree eight (or lower) per cycle.
1.1.1 Notation
In this thesis we use the notation Qi.f to indicate a fixed point format that has
i integer bits and f fractional bits. This is equivalent to what would be called
Q2 (i, f ) in [6] (where the subscripted 2 indicates that a base 2 number system is
being used).
Both signed (two's complement) and unsigned fixed point formats are used. Signed is our default format: we always state explicitly when a format is unsigned, so if we do not say whether a fixed point format is signed or unsigned, it is signed (or the distinction is irrelevant).
A signed Qi.f fixed point format has i bits before the radix point. When two's complement is used, the leftmost bit is a sign bit and it is counted among the i integer bits. The integer part of unsigned Qi.f takes values in the range [0, 2^i − 1] and the integer part of signed Qi.f takes values in the range [−2^(i−1), 2^(i−1) − 1]. Another way to view this is that the leftmost bit of signed Qi.f has the weight −2^(i−1), while the leftmost bit of unsigned Qi.f has the weight 2^(i−1).
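As a concrete illustration, the Qi.f interpretation can be sketched in C. The helper names are ours and not part of any ePUMA toolchain; a real kernel would additionally saturate on overflow.

```c
#include <stdint.h>

/* Value of a signed Qi.f word: the raw integer divided by 2^f.
   For Q1.15 (i = 1, f = 15) the representable range is [-1, 1 - 2^-15]. */
static double q_to_double(int32_t raw, int f) {
    return (double)raw / (double)(1 << f);
}

/* Convert a real number to Qi.f with rounding to nearest.
   No overflow handling; a real kernel would saturate. */
static int32_t double_to_q(double v, int f) {
    return (int32_t)(v * (1 << f) + (v >= 0.0 ? 0.5 : -0.5));
}
```

For example, 0.5 in Q1.15 is the raw word 0x4000, and −1 is the raw word −32768.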
Chapter 5 lists the function kernels that were implemented and explains them
in some detail.
Chapter 7 lists the results. The results consist of a summary of the kernels that were implemented and a list of instructions and features that we suggest should be added, or considered for addition, to the architecture.
The SIMD cores have both program memory (PM) and data memory (DM), which is a vector memory. The SIMD cores can exchange data through a central DMA controller and an interconnection network, depicted in figure 2.1 [11].
Figure 2.1: The ePUMA master-multi-SIMD architecture (the master processor and DMA controller connected to SIMD cores N1–N8; figure inspired by Figure 1 in [11])
A Brief overview of the ePUMA architecture
The SIMD core has eight general purpose 128 bit vector registers (called vr0-vr7). A 128 bit vector can be a vector word, which is eight 16 bit words; a vector double word, which is four 32 bit double words; or a complex vector, which holds four complex numbers, each with a 16 bit real part and a 16 bit imaginary part. The core also has two scratchpad memories.
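The three views of a vector register can be modelled roughly as the following C union. This is a hypothetical host-side model for illustration only; the actual register file is hardware.

```c
#include <stdint.h>

/* A hypothetical C model of one 128-bit vector register and its three
   views: eight 16-bit words, four 32-bit double words, or four complex
   numbers with 16-bit real and imaginary parts. */
typedef struct { int16_t re, im; } cword_t;

typedef union {
    int16_t w[8];     /* vector word: eight 16 bit words         */
    int32_t d[4];     /* vector double word: four 32 bit words   */
    cword_t c[4];     /* complex vector: four (16+16) bit values */
} vreg_t;
```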
Each SIMD core also has two local vector memories (LVM). Both of them can
be accessed simultaneously, which means that we can make two memory accesses
(read or write) per cycle, one to each memory.
Further description of the ePUMA architecture is found in [11], [5] and [1].
2.1.1 Datapath
Figure 2.2: The ePUMA datapath, including the vector accumulator (the figure taken with permission from [1])
The datapath is further documented in [2].
Example 2.1
This example shows assembly code for the ePUMA SIMD core, followed by explanations of some parts of it. The code is Kernel 3 from section 5.6.3.
.main
stofloatw<sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
4*nop
polyw<start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton-Raphson iteration
smulww<scale=15, uu, negres, rnd> vr0.0 vr1.2 vr2.0
5*nop
smulww<scale=15, uu, sat, rnd> vr0.1 vr0.0 vr2.0
5*nop
stop

.m0
0x9800
.m1
0
.cm
0x13B1 0xE7BC 0x1DE8 0xDE01 0x2CD2 0x8E6B 0 0 0
• The .main keyword marks the beginning of the main function of the kernel.
• The .m0, .m1 and .cm keywords indicate data that should be written to the local vector memories (.m0 and .m1) and the constant memory (.cm) before execution of the kernel.
• X*nop issues X NOPs (usually needed because of data dependencies).
• Double slash (//) defines a comment.
• stofloatw, saddw, polyw and smulww are instruction mnemonics.
• Flags are given to instructions in angled brackets (<>) between the instruc-
tion mnemonic and its operands.
• vrX refers to the entire vector register number X, vrX.Y refers to scalar word number Y in vector register X, and vrX.Yd refers to scalar double word number Y.
• m0[X] refers to memory word number X. The suffixes .sw, .sd, .vw and .vd are used to refer to a scalar word, scalar double word, vector word or vector double word.
• The prefix 0x before a constant indicates hexadecimal number format.
Chapter 3
Polynomial Approximations
Polynomial approximations are used as a part of all the methods discussed in this
thesis to approximate elementary mathematical functions. Polynomial approxi-
mations can be described with the formula:
f(x) ≈ p(x) ≡ Σ_{i=0}^{n} p_i · x^i        (3.1)
If p(x − a) is calculated using a fixed point system, our result will differ from
the value of p(x − a), because all arithmetic operations are performed with a finite
amount of bits.
This difference between our result and p(x − a) will be referred to as the fixed point polynomial error, defined as
e_f(x) = p(x − a) − r(x)        (3.3)
where r(x) is the result of calculating the polynomial value using a fixed point system. We can then see that e_r(x) + e_f(x) = f(x) − r(x).
In addition, each term has a rounding error, which is between −LSB/2 and LSB/2.
If we look at equation 3.11 we see that the term x^i · e_pi depends on the size of x^i and of e_pi, which is the rounding error of the polynomial constant when it is converted to the fixed point format. The term p_i · e_xi depends on the size of the polynomial constant and the error of the calculation of x^i (which is discussed in the next section). This term tells us that the error of our polynomial approximation depends heavily on the magnitude of the polynomial constants. The term e_pi · e_xi is in most cases insignificant.
Example 3.1
We store the powers of x in 16 bit data-words in the Q1.15 fixed point format. That means that the weight of the LSB is 2^−15, so LSB_x/2 = 2^−16. If we want to evaluate a polynomial with large polynomial constants, for example p_i = 1000, then the error caused by the p_i · e_xi term is 1000 · 2^−16 ≈ 0.015 ≈ 2^−6.
e_x2 = e_rnd
and
−LSB/2 ≤ e_x2 ≤ LSB/2
Then we can calculate the error of x̂^3:
x̂^3 = x^3 + e_x3 = x · (x^2 + e_x2) + e_rnd = x^3 + x · e_x2 + e_rnd
⇒ e_x3 = x · e_x2 + e_rnd
and then the error of x̂^4:
x̂^4 = (x^2 + e_x2) · (x^2 + e_x2) + e_rnd = x^4 + 2 · x^2 · e_x2 + e_x2^2 + e_rnd
⇒ e_x4 = 2 · x^2 · e_x2 + e_x2^2 + e_rnd
and more generally for the subsequent powers:
x̂^c = (x^a + e_xa) · (x^b + e_xb) = x^(a+b) + x^a · e_xb + x^b · e_xa + e_xa · e_xb
⇒ e_xc = x^a · e_xb + x^b · e_xa + e_xa · e_xb
where a + b = c.
The general result is that if x ≤ 1, the worst case error is half an LSB for power 2 and increases by half an LSB for each further power of x. Simulations have been done that confirm that the errors of the powers of x are smaller than this worst case estimate.
We can see that when we calculate some power of x, the error of the result is a function of the rounding errors of the calculations of the lower powers. If we assume that the rounding error of each multiplication is uniformly random in the range [−0.5, 0.5] ULP (it is not truly uniformly random), the probability that we get the worst case error when we calculate some power of x decreases as the exponent increases. This is because the rounding errors of all the previous results, for the calculations of the lower powers, must all be worst case rounding errors with the same sign to produce a result with worst case error. Table 3.1 shows the maximum and minimum error we get when we calculate powers for all values of x ∈ [0.5, 1] in Q1.15.
Table 3.1: Error of powers of x. The second column shows the maximum error of each power, when we have calculated the powers for all possible Q1.15 values in the simulator. The third column shows an estimate of the worst case error we could expect to see after the simulation.
The constants for min-max polynomials are found by using the Remez algorithm, which is described in [3]. The Remez algorithm is more complicated than the methods used to calculate the constants of Taylor polynomials and interpolation polynomials. Further reading on min-max polynomials is found in [8].
Example 3.2
Table 3.2 lists polynomial constants for three polynomials which can be used to approximate 1/√x such that p(x − a) ≈ 1/√x, x ∈ [0.5, 1]. All the three polynomials
Figure 3.1: A comparison of the errors of a fourth degree Taylor polynomial, a fourth degree least square polynomial and a fourth degree Min-Max polynomial, all optimised for calculating f(x) ≡ 1/x for x ∈ [0.5, 1], such that p(x − 0.75) ≈ f(x); the error is e = f(x) − p(x − 0.75). (The figure plots log2(|e|) and e against x.)
3.2 Choosing a polynomial 15
Table 3.2: Polynomial constants for the three polynomials approximating 1/√x (see Example 3.2).
a      Polynomial constants
0      3.230…   −7.600…   12.71…    −12.51…   6.629…   −1.460…
0.5    1.414…   −1.411…   2.060…    −2.910…   2.978…   −1.460…
0.75   1.154…   −0.7699…  0.7660…   −0.8450…  1.152…   −1.460…
Table 3.4: log2(|e|) of polynomial approximations of 1/√x for polynomials of degrees 2, 4, 7 and 12, optimized for minimum error in the range x ∈ [2^l, 1], where l is given in the leftmost column.
Using a soft floating point format has further advantages. The output range of the polynomial is in many cases smaller, which means that we know which fixed point format we need for the result (for instance p(x) ≈ 1/x for x ∈ [0.5, 1] results in p(x) ∈ [1, 2], but if x ∈ [2^−15, 1] we would have p(x) ∈ [1, 2^15]).
Another advantage is that if we know that x < 1, we know that x^i · p_i < p_i, so we know which fixed point format we need to store that result with good precision, and we also know that the term x^i · e_pi < e_pi in equation 3.11.
Chapter 4
Other methods
In this chapter we discuss other methods that can be used to implement the functions. In section 4.1, Newton-Raphson and Goldschmidt iterations are discussed; these are the methods we use in our implementations to increase precision after polynomial approximation. In the remaining sections of the chapter, alternative methods that were not used in any of the implementations are discussed, and we consider how applicable they are to the ePUMA architecture.
Example 4.1
To understand how this trick works, let us look at the following example C code:
#include <stdio.h>

int main()
{
    unsigned char a, b;
    short i;

    for (i = 1; i < 256; i++) {
        a = (unsigned char) i;
        b = (unsigned char) (-(char) a);
        printf("a: %d, b: %d\n", a, b);
    }
}
What this code does is change the sign of the unsigned variable a as if it were a signed variable, and then store the result in b, which is also an unsigned variable.
The output is:
a: 1, b: 255
a: 2, b: 254
a: 3, b: 253
...
a: 127, b: 129
a: 128, b: 128
a: 129, b: 127
...
a: 253, b: 3
a: 254, b: 2
a: 255, b: 1
We can see that b = 256 − a. 256 is two times the weight of the MSB of the unsigned char datatype. When we use Q1.15 or Q1.31, the weight of the MSB is 1, meaning that two times the weight of the MSB is 2.
• Convert x to signed Q3.15. The two new bits are both zero valued and the
leftmost is the sign bit.
• Find −x by inverting all the bits and adding one in the LSB position. The sign bit will be set (unless x = 0, in which case it overflows), as well as our other new bit.
• Calculate 2 − x by adding 2 and −x; the value 2 in signed Q3.15 has the two leftmost bits valued 01 and the rest zeros.
• Since the two leftmost bits of −x are valued 11, and the two leftmost bits of 2 are 01, the result of the addition of 2 and −x will have the two leftmost bits valued 00 (meaning it is a positive value smaller than 2, which is the expected result when one calculates 2 − x where x ∈ [0, 2⟩).
• We can convert this value back to unsigned Q1.15 by simply ignoring the two leftmost bits, which are zero valued.
• Since we know that the two leftmost bits will be zero if we calculate 2 − x using signed Q3.15, we can just as well take x in unsigned Q1.15, invert all the bits and add one in the LSB position to get the same result.
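The steps above can be sketched in C. The hypothetical helpers below compute 2 − x both through a wider signed format and through the bit-negation trick; the low 16 bits come out identical.

```c
#include <stdint.h>

/* 2 - x for x in unsigned Q1.15, computed two ways: via signed Q3.15
   arithmetic in a wider word, and via the trick of simply negating the
   unsigned bit pattern (invert all bits, add one). */
static uint16_t two_minus_x_wide(uint16_t x) {
    int32_t xs  = (int32_t)x;          /* x reinterpreted as signed Q3.15   */
    int32_t two = 2 << 15;             /* the value 2 in signed Q3.15       */
    return (uint16_t)(two - xs);       /* low 16 bits hold the Q1.15 result */
}

static uint16_t two_minus_x_trick(uint16_t x) {
    return (uint16_t)(~x + 1);         /* same bits as (uint16_t)-x */
}
```

For x = 0.5 (raw 0x4000) both give 0xC000, which is 1.5 = 2 − 0.5 in unsigned Q1.15.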
This trick is the motivation behind the proposal in section 6.8.1 to extend the instruction set of our architecture such that one can choose to get the negative of the result of a multiplication instruction (either by giving a flag to the instruction or with a new instruction). Then we only need two instructions for each Newton-Raphson iteration, where the first instruction calculates t := −r_i · x and the second calculates r_{i+1} := r_i · t.
t1 := r_i · x
t2 := (r_i · t1)/2
t3 := 1.5 − t2
r_{i+1} := r_i · t3
As a result, Goldschmidt iterations give fewer pipeline penalties due to data dependencies than Newton-Raphson iterations. But both require the same number of arithmetic operations, which means that they are equally fast when we have enough inputs to fill the pipeline and eliminate the pipeline penalties.
t1 := r0 · r0 · x/2
t2 := 1.5 − t1
r1 := r0 · t2
t3 := t2 · t2 · t1
t4 := 1.5 − t3
r2 := r1 · t4
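The six-step sequence above can be checked numerically. The sketch below runs it in double precision; the real kernels work in fixed point, so this only illustrates the convergence, not the precision behaviour.

```c
#include <math.h>

/* Two Goldschmidt iterations for 1/sqrt(x), following the sequence
   above, run in double precision to illustrate the convergence.
   r0 is an initial approximation, e.g. from a polynomial. */
static double invsqrt_goldschmidt(double x, double r0) {
    double t1 = r0 * r0 * x / 2.0;
    double t2 = 1.5 - t1;
    double r1 = r0 * t2;          /* first refined approximation  */
    double t3 = t2 * t2 * t1;     /* equals x * r1 * r1 / 2       */
    double t4 = 1.5 - t3;
    return r1 * t4;               /* r2, the second approximation */
}
```

Starting from the rough guess r0 = 1.2 for x = 0.7, two iterations already agree with 1/√0.7 ≈ 1.1952286 to about seven decimals.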
After each iteration the relative error of our new approximation is the square of
the relative error of our previous approximation. This means that we double the
amount of significant bits in each iteration. This goes for both Newton-Raphson
and Goldschmidt iterations.
This can be seen if we rewrite the algorithm so that we begin by calculating the relative error e1; the next iteration will then have relative error e2:
e1 = r1 · x − 1
t1 = 1 − e1 = 2 − r1 · x
r2 = t1 · r1
e2 = r2 · x − 1
   = (1 − e1) · r1 · x − 1
   = r1 · x − e1 · r1 · x − 1
   = e1 − e1 · r1 · x
   = e1 · (1 − r1 · x)
   = −e1 · e1
Hence the magnitude of the relative error after an iteration is the square of the previous relative error.
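The derivation can be verified numerically. The sketch below performs one Newton-Raphson reciprocal step in double precision; with e = r·x − 1 as the relative error, the error goes from e1 to −e1·e1 as derived. The helper name is ours.

```c
#include <math.h>

/* One Newton-Raphson step for 1/x: r2 = r1 * (2 - r1*x).
   With e = r*x - 1 as relative error, e2 = -e1*e1. */
static double recip_step(double x, double r1) {
    return r1 * (2.0 - r1 * x);
}
```

For x = 0.75 and r1 = 1.3 the initial relative error is e1 = 1.3 · 0.75 − 1 = −0.025, and after one step the error is −e1² = −0.000625.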
4.2.1 Interpolation
The fastest approach is to get one value from the lookup table, and use that value
directly. We can improve the precision by getting two consecutive values from the
lookup table and applying first degree linear interpolation. A common formula for
linear interpolation is:
f(x) ≈ f(x0) + ((f(x1) − f(x0)) / (x1 − x0)) · (x − x0), where x0 ≤ x < x1
If we use a few of the input value's MSBs as an index into the lookup table, then x1 − x0 is a constant (it will be some 2^i where i ∈ Z), and we can find x − x0 by masking out the MSBs that were used as the index. Then it is enough to multiply that value by f(x1) − f(x0) and shift the product; no division is needed. Hence, linear interpolation can be done in four arithmetic instructions (bit-wise AND, subtraction, multiplication with scaling, and addition).
Another possibility is, instead of fetching f(x0) and f(x1) from the lookup table, to fetch f(x0) and f(x1) − f(x0); then we do not have to calculate the difference, but we consume more memory.
evaluation per cycle, since we do not have the same polynomial constants for every input value (see the footnote below). But we can use one POWERSW instruction and one TMAC, which would take two cycles per value to evaluate a polynomial of degree 7 or lower.
4.4 CORDIC
The CORDIC algorithm can be used to calculate trigonometric functions, hyper-
bolic functions, exponential functions, logarithms and square roots.
The algorithm uses only a small lookup table, adders and shifters. It is, however, an iterative algorithm, and each iteration increases the precision by approximately one bit [10], which means that it would require 16 iterations to calculate a function with approximately 16 correct bits. The CORDIC algorithm is often used in implementations with special hardware, since it only needs simple hardware (adders, shifters and a small lookup table).
Footnote: perhaps it will only be possible to evaluate one polynomial per cycle when we always use the same polynomial constants.
products of parabolas, as in the method that we use (which is to calculate the powers of the function argument, multiply the powers by the polynomial constants, and then sum all the products). It would also be challenging to use this approach with minimum fixed point polynomial error.
The method has the advantage that the highest power of the input which is needed is the second power; this matters because it can be difficult to deal with the big dynamic range of the powers.
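The powers-then-sum method mentioned above can be sketched in double precision. In the actual kernels the powers are formed in parallel by the 16 multipliers and accumulated in fixed point; this sketch only shows the arithmetic structure.

```c
/* Evaluate p(x) = p_0 + p_1*x + ... + p_n*x^n by first forming the
   powers of x and then summing the products, mirroring how the
   multipliers are used (double precision sketch; n <= 15). */
static double poly_eval(const double *p, int n, double x) {
    double powx[16];
    double acc = 0.0;
    powx[0] = 1.0;
    for (int i = 1; i <= n; i++)
        powx[i] = powx[i - 1] * x;    /* the powers of the argument   */
    for (int i = 0; i <= n; i++)
        acc += p[i] * powx[i];        /* multiply by constants and sum */
    return acc;
}
```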
Chapter 5
Implementations
This chapter describes the implementations of the functions that were done for this work. Most of the implementations use special instructions which were added to the simulator and then proposed for addition to the instruction set (see chapter 6). If some of these instructions are not implemented, or are implemented differently, the same methods can nevertheless be used. The cycle count would be different, but the precision would be the same as long as the same method is used and the arithmetic operations are performed with the same semantics.
5.1 Method
The functions are implemented in assembly code that is intended to be run on one
SIMD coprocessor. The assembly code was tested in a pipeline accurate simulator
so that the cycle cost and precision of the result could be confirmed. For imple-
mentations with 16 bit inputs, the simulation was run for every possible input
value which the kernel is supposed to support.
For implementations with 32 bit inputs it would be too time-consuming to test all 2^32 input values. Instead, several thousand input values from the entire input range were tested. For some kernels the test inputs were selected at even intervals; for other kernels all upper words (the 16 MSBs) were tested, each with a (pseudo) random lower word (16 LSBs). We also tried both approaches on some kernels, and they gave the same results.
Some new instructions were added to the instruction set of the simulator for these implementations. These instructions are discussed and proposed in chapter 6. There are some minor differences between the instruction implementations that were used and the instructions as they are proposed and discussed in chapter 6, because the instructions were reviewed after having been implemented and used in the simulator.
When we test the kernels, a script is used to run the simulator multiple times for
different input values each time. The input value (the argument to the mathemat-
ical function) is written to memory location m1[0] before the kernel is run.
5.2 Errors
As was said in the previous section, each kernel is tested in a simulator and the result is compared with a reference value. The reference value is obtained with either Matlab or Python (math module) and is in most cases a value in IEEE 754 64 bit floating point format (52 bit mantissa (+1), 11 bit exponent and one sign bit). We give either the worst absolute error or the worst absolute relative error. The absolute error is simply the absolute value of the difference between our result and the reference value. The absolute relative error is the absolute value of the difference between our result and the reference value, divided by the reference value. The error is usually given as a power of 2, because that makes it easy to see how many correct bits we have. If the absolute relative error is 2^−e then we have e correct significant bits in our result. If the absolute error is 2^−e then the error appears in the bit with the weight 2^−e.
Example 5.1
Some result in Q1.15 format has the worst case absolute error 2^−12. Then the bit in the position with the weight 2^−12 (the fourth least significant) and the less significant bits contain error, but the other bits are correct.
• Polynomial evaluation takes 1 cycle per value. At this point it is not certain whether the architecture will have support for that; if it does not, a polynomial evaluation will take two cycles per value.
• Conversion from fixed point to soft floating point can be done with vector inputs at 4 values per cycle, which results in 0.25 cycles per value. If conversion to soft floating point is only possible for one scalar at a time, it will take 1 cycle per value.
input handling is needed, no time is spent on it, and the function evaluation can be done as fast as possible. Most of the kernels use conversion to soft floating point format. If the format=15u flag is used (see section 6.6.1), the kernels will return f(|x|) for the function f they implement.
5.6 Reciprocal
One of the challenges of implementing the reciprocal is the difference between its domain and its range. For inputs smaller than 1 the result is larger than 1, and vice versa. Several kernels were implemented, and more than one solution was used to solve this domain and range problem. The usage of soft floating point is essential in all the solutions.
5.6.2 Pre-analysis
To select a polynomial we run the Remez algorithm with various parameters. We try various degrees of polynomials and also various values for the constant a (the constant which is subtracted from the input value before calculating the polynomial). We are interested in both the real polynomial error and the fixed point polynomial error (discussed in chapter 3). The Remez algorithm (which we use to calculate the polynomial constants) gives the real polynomial error as a by-product. We can get an estimate of the fixed point error by assuming the error of the calculation of the powers in the polynomial evaluation is one ULP and then finding the largest value of p_i · e_xi in equation 3.11 in section 3.1.1, where p_i is the largest polynomial constant and e_xi is the weight of the ULP.
The results are shown in figure 5.1. The error of an implementation is the sum
of the real polynomial error and the fixed point polynomial error. The difference
between the signed and unsigned fixed point polynomial error is whether the value
x − a which we calculate powers of, is a signed or unsigned value. When it is
unsigned, we have to make sure it is always a positive value. Since x (the mantissa
of the input value) is between 0.5 and 1, it means that the value of a must be 0.5
or smaller.
for some other a, but the degree 7 Maclaurin polynomial would still always give a
worse maximum error than a 5th degree min-max polynomial.
Figure 5.1: Estimates of the real polynomial error and the fixed point polynomial error (log2(e) plotted against polynomial degree 2–8, for the signed and unsigned fixed point error estimates and the real error)
Simulation
We see from the error analysis that using unsigned Q0.16 with a = 0.5 or using signed Q1.15 with a = 0.8125 should give similarly precise results. Assembly code was written that uses these two polynomials to implement the calculation of 1/x. Then simulations were run where both polynomials were used to calculate 1/x for every possible input in the range [0.5, 1]. Both simulations returned the result in unsigned Q1.15. The results of the simulations were that the polynomial with a = 0.8125 has a worst case error of 11.1 ULPs and the polynomial with a = 0.5 has a worst case error of 11.5 ULPs. The ULP's weight is 2^−15 and 11 ULPs ≈ 2^−11.5, so the result of the simulation matches the error estimate from the pre-analysis well.
The polynomial with a = 0.8125 is used in all implementations.
5.6.3 Kernels
Several kernels that calculate 1/x have been implemented. All of them, except one,
use a 5th degree min-max polynomial for 1/(x + 0.8125). The polynomial gives
approximately 12 correct bits (the worst case relative error is 2^-11.85). Newton-
Raphson iterations can be used to increase the precision. One Newton-Raphson
iteration doubles the number of correct bits, since the relative error after an
iteration is the square of the error prior to the iteration.
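The error-squaring claim can be checked with a small float-level sketch (not the fixed point kernel itself): one iteration of r ← r(2 − xr) turns a relative error of ε into ε².

```python
x = 0.75
r = (1 / x) * (1 + 2**-12)   # start value with a relative error of 2**-12
r = r * (2 - x * r)          # one Newton-Raphson iteration
# relative error is now (2**-12)**2 = 2**-24
print(abs(r * x - 1))
```

Algebraically, if r = (1 + ε)/x then r(2 − xr) = (1 − ε²)/x, which is why one iteration doubles the number of correct bits.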
5.6 Reciprocal 31
Kernel 1
Input format: Q1.15 in the range [0.5, 1)
Output format: unsigned Q1.15
Error: max error 11.1 ULPs (≈ 2^-11.5)
Cycles, one input value: 20
Cycles, multiple input values: 1.125 cycles per value
The first kernel uses one polynomial evaluation, with a 5th degree polynomial
for 1/(x + 0.8125). Before the polynomial is evaluated, 0.8125 is subtracted from
the input value; then the polynomial is evaluated using a POLYW instruction.
Note that the input range is only [0.5, 1].
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x9800 // -0.8125

// constants in Q4.12 for 1/(x+0.8125)
.cm
0x13B1 0xE7BC 0x1DE8 0xDE01 0x2CD2 0x8E6B 0 0 0
Kernel 2
Input format: Q1.15, all positive values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^-11.85
Cycles, one input value: 23
Cycles, multiple input values: 1.375 cycles per value
This kernel can calculate 1/x for all positive values in Q1.15. It is the same
implementation as Kernel 1, except that the input is first converted to soft
floating point format. The exponent is left unchanged, so the programmer can
use it to scale the result in any way, or use the mantissa of the result in a
multiplication before converting back from floating point.
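The conversion to soft floating point can be sketched as follows. This is a minimal illustration under our own conventions (not the exact stofloatw semantics): normalize the positive Q1.15 integer until the mantissa lies in [0.5, 1), counting the shifts as the exponent.

```python
def to_soft_float(v):
    """Normalize a positive Q1.15 integer v: returns (mantissa, e) with the
    mantissa in [0x4000, 0x8000), so that
    v * 2**-15 == (mantissa * 2**-15) * 2**-e."""
    e = 0
    while v < 0x4000:   # mantissa not yet in [0.5, 1)
        v <<= 1
        e += 1
    return v, e

m, e = to_soft_float(0x0800)   # 0.0625 in Q1.15
print(hex(m), e)               # 0x4000 3, i.e. 0.5 * 2**-3
```

On the real hardware this is done by the stofloatw instruction; the loop here only illustrates the normalization.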
.main
stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop
Kernel 3
Input format: Q1.15, all positive values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^-15.0
Cycles, one input value: 35
Cycles, multiple input values: 1.625 cycles per value
Kernel 4
Input format: Q1.15, all positive values
Output format: Q16.16
Error: max relative error is 2^-16
Cycles, one input value: 38
Cycles, multiple input values: 2.375 per value (1.875 if 16 bit multiplications
are used in the iterations)
Same as Kernel 3 except that after the polynomial evaluation we shift the
mantissa by the value 15 − exponent. The constant 15 is stored in memory
(location m0[1].sw) and after the conversion to soft floating point the exponent
is subtracted from it.
In this kernel 32 bit multiplications (the smuldd instruction) are used in the
Newton-Raphson iteration. By using 32 bit multiplications, we get up to 24
correct significant bits. But we do have an error of up to 1 ULP, and for this
input domain and this output range, the smallest outputs have 16 significant
bits. Therefore an error of 1 ULP for one of the smallest outputs in this range
results in a relative error of 2^-16. Larger values have either 1 ULP error or
24 correct significant bits. For multiple inputs, this kernel can be modified to
give at most 16 correct significant bits by using 16 bit multiplications in the
Newton-Raphson iterations, which reduces the cycle cost for multiple inputs by
0.5 cycles per value.
.main
stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
ssubw vr1.6 m0[1].sw vr1.3
2*nop
scopyw vr1.4 vr1.2
polyw <start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton-Raphson iteration
smuldd <scale=31, uu, rnd, negres> vr0.0d vr1.2d vr2.0d
5*nop
smuldd <scale=31, uu, sat, rnd> vr0.1d vr0.0d vr2.0d
5*nop

// scale result
slsrd vr0.3d vr0.1d vr1.6
2*nop
stop

.m0
0x9800 15
Kernel 5
Input format: Q1.31, all positive values
Output format: soft floating point, mantissa is unsigned Q1.31
Error: max relative error is 2^-24
Cycles, one input value: 35
Cycles, multiple input values: 2.125 cycles per value
This is the 32 bit version, with 32 bit input and soft floating point output. After
converting the input value to soft floating point we use its 16 most significant
bits as input to the same 5th degree polynomial. Then we use all 32 bits of the
mantissa of the input value in the Newton-Raphson iteration, with 32 bit
multiplications.
.main
stofloat32d <sout> vr1h m1[0].sd
2*nop
saddw vr1.0 vr1.4 m0[0].sw // subtract
4*nop
polyw <start=0, scale=15, scale2=12, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// Newton-Raphson iteration
smuldd <scale=31, uu, rnd, negres> vr0.0d vr1.2d vr2.0d
5*nop
smuldd <scale=31, uu, sat, rnd> vr0.1d vr0.0d vr2.0d
5*nop
stop
Kernel 5b
Input format: Q1.31, all positive values
Output format: soft floating point, mantissa is unsigned Q1.31
Error: max relative error is 2^-31
Cycles, one input value: 47
Cycles, multiple input values: 2.625 cycles per value
Same as kernel 5 but with two Newton-Raphson iterations, for more precision.
// ... the following code is added before "stop" in kernel 5:
// second iteration
smuldd <scale=31, uu, rnd, negres> vr0.0d vr0.1d vr1.2d
5*nop
smuldd <scale=31, uu, sat, rnd> vr0.1d vr0.0d vr0.1d
5*nop
stop
Kernel 6
Input format: unsigned Q0.16, all values
Output format: soft floating point, mantissa is unsigned Q1.15
Error: max relative error is 2^-15
Cycles: 64 values in 120 cycles
A second degree polynomial for 1/x is used, followed by two Newton-Raphson
iterations. Three vector multiplications and two vector additions are used to
evaluate the second degree polynomial (we do not use the poly instruction). The
second degree polynomial gives approximately 6 correct bits.
The 120 cycles include overheads. If we use the same approach as before to
estimate the number of cycles per value, the result is 1.375 cycles per value
(2/8 for the conversion to soft floating point, 5/8 for the polynomial evaluation,
and 2/8 for each Newton-Raphson iteration). 8 cycles are spent on copying an
intermediate result from a vector register to memory to use it later. It became
rather tricky to decide where to store intermediate results (VRF, LVM1 or LVM2)
in order to be able to access them again efficiently when needed. If the vector
register file were larger, as is suggested in [5], this copying would not be necessary.
The 120 cycles also include 8 NOPs after the last instruction, to wait for it to
finish. If we have more than 64 input values, we can replace these 8 NOPs with
copy instructions that copy the results from the vector registers to memory, and
after they have finished we can immediately start evaluating a new batch of 64
values. This means that 120 cycles per 64 values is a very realistic estimate of
how fast we can find reciprocals of multiple 16-bit input values.
The source code is in Appendix A, section A.1.7.
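The precision cascade (about 6 correct bits from the quadratic, squared by each iteration) can be illustrated in floats. This sketch uses a Taylor quadratic about 0.75 instead of the thesis' min-max coefficients, so the starting accuracy differs slightly from the kernel's:

```python
def recip16(x):
    """Quadratic approximation of 1/x on [0.5, 1] (Taylor about 0.75),
    refined by two Newton-Raphson iterations."""
    t = x - 0.75
    r = 4/3 - (16/9) * t + (64/27) * t * t
    for _ in range(2):
        r = r * (2 - x * r)
    return r

xs = [0.5 + k / 2000 for k in range(1001)]          # grid over [0.5, 1]
worst = max(abs(recip16(x) * x - 1) for x in xs)    # worst relative error
```

With this start value the worst case after two iterations lands well below 2^-15, consistent with the kernel's error figure; the fixed point kernel is additionally limited by its 16 bit arithmetic.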
5.7 Inverse square root
We use two different methods to deal with this. We can either precalculate
2^{e/2} and use e as an offset for addressing a multiplicand, or multiply the
result of the polynomial calculation with √2 for odd e (and 1 for even e), then
right shift the exponent (equivalent to floor division by two) and return the
result as a soft floating point number.
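A float-level sketch of the second method, under our own conventions (the exact 1/√m stands in for the kernel's polynomial result): odd exponents get an extra factor √2, and the exponent is floor-halved by the shift.

```python
import math

def invsqrt_soft(m, e):
    """1/sqrt(x) for x = m * 2**e with m in [0.5, 1).
    The kernel's polynomial result is modelled by the exact 1/sqrt(m)."""
    r = 1.0 / math.sqrt(m)
    if e & 1:                              # odd exponent: extra factor sqrt(2)
        r *= math.sqrt(2.0)
    return r * 2.0 ** (-((e + 1) // 2))    # floor division of the exponent

print(invsqrt_soft(0.5, -3))   # x = 0.0625, so the result is about 4.0
```

The choice of √2 versus 2^-0.5 depends on whether the exponent is rounded down or up; the convention here is one consistent choice for illustration.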
5.7.2 Kernels
We implemented several kernels that calculate the inverse square root. We tried
various ways of returning the result; we used both Newton-Raphson and
Goldschmidt iterations after a 16 bit polynomial evaluation, and we also
calculated a high degree polynomial with 32 bit precision to get 32 bit results
with acceptable precision.
We also tried both methods discussed in section 5.7.1.
Kernel 1
Input format: Q1.15, in range [0.5, 1)
Output format: unsigned Q1.15
Error: max error is 2^-13.68
Cycles, one input value: 20 cycles
Cycles, multiple input values: 1.125 cycles per value
This kernel uses only a subtraction and a polynomial evaluation. It can only
calculate the inverse square root for values in the range [0.5, 1).
The polynomial's coefficients are stored in Q2.14, which means that we can
expect a fixed point polynomial error of approximately 2^-14, which is close to
the maximum error in our simulations.
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x9000

.m1
0

// constants in Q2.14 invsqrt(x+0.875)
.cm
0x446C 0xD8E8 0x2177 0xDF73 0x2159 0xFD25 0x75B5 0
Kernel 2
This kernel takes any positive Q1.15 number, calculates 1/√x and returns the
result in soft floating point format. The input is converted to soft floating point,
a polynomial is evaluated and the result is multiplied with √2 or 1, depending
on whether the exponent is odd or even, and then the exponent is right shifted
by one (floor division by 2).
.main
stofloatw <sout> vr0.0d m1[0].sw
scopyw vr0.3 m0[2].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr0.0 m0[0].sw
slsrw vr0.4 vr0.1 m0[3].sw // shift exponent
sandw vr0.2 vr0.1 m0[3].sw // set flag if exponent is odd number
2*nop
scopyw.ne vr0.3 m0[1].sw
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
smulww <scale=15, uu, sat> vr3.0 vr2.0 vr0.3 // multiply result with 1 or sqrt(2)
5*nop
stop

.m0
0x9000 0xB505 0x8000 0x0001 0 0 0 0
Kernel 2b
Input format: Q1.15, in range [2^-15, 1)
Output format: unsigned Q8.24
Error: maximum relative error is 2^-13.91
Cycles, one input value: 29 cycles
Cycles, multiple input values: 3.375 cycles per value
This kernel converts to soft floating point, evaluates the polynomial and uses
the exponent of the floating point representation as an offset to address a
multiplicand in memory, to convert the result of the polynomial evaluation to
unsigned Q8.24.
.main
stofloatw <sout> vr0.0d m1[0].sw
2*nop
saddw vr1.0 vr0.0 m0[0].sw
3*nop
slslw vsr1.4 vr0.1 m0[1].sw // shift exponent and save to address register ar1
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
9*nop

smulwd <scale=31, uu> vr3.0d vr2.0 m0[ar1+8].sd
8*nop
stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0x0001 0 0 0 0 0 0
0x0100 0x0000 0x016A 0x09E6 0x0200 0x0000 0x02D4 0x13CD
0x0400 0x0000 0x05A8 0x279A 0x0800 0x0000 0x0B50 0x4F33
0x1000 0x0000 0x16A0 0x9E66 0x2000 0x0000 0x2D41 0x3CCD
0x4000 0x0000 0x5A82 0x799A 0x8000 0x0000 0xB504 0xF334
Kernel 3a
This kernel takes 32 bit inputs, converts to soft floating point and evaluates the
polynomial for the 16 MSBs; then one Newton-Raphson iteration is used to
increase the precision. The result is then multiplied with √2 if the exponent is
odd, the exponent is shifted to the right, and the result is returned in soft
floating point format.
.main
stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// multiply with 1 if exponent is even or sqrt(2) if it is odd
smuldd <scale=31, uu, rnd> vr3.0d vr4.3d vr0.2d
5*nop

stop

// scalars are in Q8.24 in m0[8].sd - m0[38].sd
.m0
0x9000 0xB505 0x8000 0x0001 0xc000 0x0000 0 0
0xB504 0xF334 0x8000 0x0000
Kernel 3b
Kernel 3c
Same as Kernel 3b but with two Goldschmidt iterations. Note the differences in
cycle cost for one input value and in the errors; these differences are explained
in sections 4.1.4 and 4.1.6.
...
// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R1*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R1*R1*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop

// iteration 2, Goldschmidt
smuldd <scale=31, uu, rnd> vr6.0d vr4.2d vr4.2d // T2*T2

// this next instruction is part of the previous iteration
// but placed here due to data dependencies
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R2 = R1*T2
4*nop
smuldd <scale=31, uu, rnd> vr6.1d vr6.0d vr4.1d // TT1 = T2*T2*T1
2*nop
ssubd vr5.2d m0[4].sd vr6.1d // TT2 = 1.5 - TT1
5*nop
smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R3 = R2*TT2 = R1*T2*TT2
5*nop

// multiply with 1 if exponent is even, 2^-0.5 if odd
smuldd <scale=31, uu, rnd> vr3.1d vr5.3d vr0.2d
5*nop
stop
Kernel 4
Since e/2 is not an integer when e is odd, we cannot use only shifting to
perform the multiplication √m · 2^{e/2}.
We use two different methods to solve this. In kernel 2 we shift √m by ⌊e/2⌋,
and then multiply with the constant √(1/2) if e is odd.
In kernel 3 we store the constants 2^{-e/2} for every possible value of e in
memory. The constant e is then copied to an address register and used as an
offset to fetch an operand from memory (using the offset addressing mode),
which is then multiplied with √m.
5.8.5 Kernels
Six kernels were implemented, three with 16 bit input and output, and three with
32 bit input and output.
Kernel 1
Input format: Q1.15 in the range [0.5, 1]
Output format: unsigned Q0.16
Error: max error 1 ULP (2^-15.6)
Cycles, one input value: 20 cycles
Cycles, multiple input values: 1.125 cycles per value
This kernel calculates √x for x ∈ [0.5, 1]. It uses a 6th degree polynomial for
√(x + 0.75). First the constant −0.75 is added to the input, then the polynomial
is calculated.
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0xa000 // -0.75

.m1
0 // input value

.cm // constants in Q1.15 sqrt(x+0.75)
0x6EDB 0x49E7 0xE75D 0x105F 0xF26D 0x0E65 0xF0E8 0
Kernel 2
Input format: Q1.15, all positive values
Output format: unsigned Q0.16
Error: max error 1 ULP
Cycles, one input value: 31
Cycles, multiple input values: 2.125 cycles per value
This kernel can calculate √x for all positive values. The input is first converted
to soft floating point format and the polynomial is calculated in the same way
as in kernel 1. Then the first method described in section 5.8.2 is used to
convert from soft floating point to fixed point.
.main
stofloatw <sout> vr0.0d m1[0].sw
scopyw vr0.3 m0[2].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr0.0 m0[0].sw // add constant before polynomial evaluation
slsrw vr0.4 vr0.1 m0[3].sw // shift exponent
sandw vr0.2 vr0.1 m0[3].sw // set flag if exponent is odd number
2*nop
scopyw.ne vr0.3 m0[1].sw // overwrite the value 1 with the value sqrt(0.5)
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
10*nop
slsrw vr1.1 vr0.3 vr0.4 // shift result by floor(exponent / 2)
2*nop

// multiply result with 1 (even exponent) or sqrt(0.5) (odd exponent)
smulww <scale=15, uu, rnd> vr3.0 vr1.1 vr2.0

5*nop
stop
Kernel 3
Input format: Q1.15, all positive values
Output format: unsigned Q0.16
Error: max error 1 ULP
Cycles, one input value: 28
Cycles, multiple input values: 3.625 cycles per value
Same as Kernel 2 except that the other method described in section 5.8.2 is
used. Note that Kernel 3 is faster for one input value but slower for multiple
input values.
.main
stofloatw <sout> vr0.0d m1[0].sw // convert to soft float
2*nop
saddw vr1.0 vr0.0 m0[0].sw // add -0.75
3*nop
scopyw vsr1.4 vr0.1 // copy exponent to address register to use as offset

// evaluate polynomial
polyw <start=0, scale=15, scale2=15, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
9*nop
smulww <scale=15, uu, rnd> vr3.0 vr2.0 m1[ar1+8].sw // scale output
8*nop
stop

.m0
0x8400

.m1
0 0 0 0 0 0 0 0
0x8000 0x5a82 0x4000 0x2d41 0x2000 0x16a1 0x1000 0x0b50
0x0800 0x05a8 0x0400 0x02d4 0x0200 0x016a 0x0100 0x00b5
// m1[8] - m1[23] contains a table of sqrt(0.5)^e for various e.
Kernel 4a
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: max error is approximately 2^-26
Cycles, one input value: 57
Cycles, multiple input values: 4.25 cycles per value
This kernel calculates the square root with 32 bit input and output. It uses a
polynomial to calculate 1/√x, and then one Newton-Raphson iteration to
increase the precision.
As in the reciprocal implementation, the polynomial is evaluated for the 16
MSBs using 16 bit precision for the calculations. Then the Newton-Raphson
iteration is done with 32 bit operations.
.main
stofloat32d <sout> vr1h m1[0].sd
scopyw vr0.4 m0[10].sw // copy value 1 in Q1.15
nop
saddw vr1.0 vr1.4 m0[0].sw // subtract constant
slsrw vr1.7 vr1.6 m0[3].sw // shift exponent
sandw vr0.2 vr1.6 m0[3].sw // set flag if exponent is odd number
2*nop
scopyd.ne vr0.2d m0[8].sd
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop

// iteration
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // R*R*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d
5*nop

// convert inverse square root to square root
smuldd <scale=31, uu, rnd> vr3.0d vr4.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even or sqrt(2) if it is odd
smuldd <scale=30, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop
Kernel 4b
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: relative error 2^-29.6
Cycles, one input value: 72
Cycles, multiple input values: 5.25 cycles per value
Kernel 4c
Input format: Q1.31, all positive values
Output format: soft floating point, unsigned Q0.32 mantissa
Error: relative error 2^-30.1
Cycles, one input value: 78
Cycles, multiple input values: 5.25 cycles per value
Same as kernel 4b except that it uses two Newton-Raphson iterations rather
than two Goldschmidt iterations. The difference is more pipeline penalties but
slightly better precision.
The difference in precision is due to the fact that when we start the second
Newton-Raphson iteration we calculate r1 · r1 · x/2, where x is our input value
and r1 is the result of the previous iteration. The value x has no rounding error.
When we use a Goldschmidt iteration, we do a similar multiplication of three
values, but all three can have rounding errors. Therefore the worst case error is
about half a bit smaller when we use Newton-Raphson iterations.
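The two schemes can be compared with a float-level sketch (an exact start value perturbed to roughly 12 correct bits; the kernels' fixed point rounding is not modelled, so here the two variants agree almost exactly):

```python
import math

x = 0.7
r1 = (1 / math.sqrt(x)) * (1 + 2**-12)   # start value, ~12 correct bits

# two Newton-Raphson iterations: r <- r * (1.5 - x*r*r/2)
rn = r1
for _ in range(2):
    rn = rn * (1.5 - x * rn * rn / 2)

# Goldschmidt variant, as in the kernel: the second iteration computes
# TT1 = T2*T2*T1 instead of multiplying by x again
t1 = r1 * r1 * x / 2
t2 = 1.5 - t1
r2 = r1 * t2                 # identical to one Newton-Raphson step
tt1 = t2 * t2 * t1           # equals x*r2*r2/2 without touching x
tt2 = 1.5 - tt1
r3 = r2 * tt2
```

In exact arithmetic the two are identical (T2²·T1 = x·R2²/2); the precision gap in the kernels comes purely from where the fixed point rounding errors enter.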
...
// iteration 1
smulwd <scale=31, uu, rnd> vr4.0d vr2.0 vr1.2d // R1*X
5*nop
smulwd <scale=32, uu, rnd> vr4.1d vr2.0 vr4.0d // T1 = R1*R1*X/2
2*nop
ssubd vr4.2d m0[4].sd vr4.1d // T2 = 1.5 - T1
5*nop
5*nop
smulwd <scale=31, uu, rnd> vr4.3d vr2.0 vr4.2d // R2 = R1*T2
5*nop

// iteration 2
smuldd <scale=31, uu, rnd> vr5.0d vr4.3d vr1.2d // R2*X
5*nop
smuldd <scale=32, uu, rnd> vr5.1d vr5.0d vr4.3d // TT1 = R2*R2*X/2
2*nop
ssubd vr5.2d m0[4].sd vr5.1d // TT2 = 1.5 - TT1
5*nop
smuldd <scale=31, uu, rnd> vr5.3d vr4.3d vr5.2d // R3 = R2*TT2
5*nop

smuldd <scale=30, uu, rnd> vr3.0d vr5.3d vr1.2d // x * x^-0.5 = x^0.5
5*nop
// multiply with 1 if exponent is even, 2^-0.5 if odd
smuldd <scale=31, uu, rnd> vr3.1d vr3.0d vr0.2d
5*nop
stop
Kernel 5
Input format: Q1.31, all positive values
Output format: unsigned Q0.32
Error: 2 ULP
Cycles, one input value: 81
Cycles, multiple input values: 5.5 cycles per value
Same as kernel 4b but with a shift instruction at the end, to return the result in
unsigned Q0.32 rather than in soft floating point format.
5.9 Logarithms
Four different kernels to calculate logarithms have been written. As before, we
convert to soft floating point format and then use polynomial evaluation. The
Newton-Raphson method is not as convenient for improving the precision of
logarithms as it is in the special cases of the reciprocal and the inverse square
root, where we could use only multiplication and subtraction.
x = m · 2^e
If we only need the integer part of the logarithm, we do not need to evaluate
p(m).
Example 5.2
For example, if we want to calculate log2(x) and the input value x is in Q3.13,
we use a kernel for log2 and let the kernel treat the input value as if it were in
Q1.15. To compensate, we add the constant 2 to the result. The general rule is
that we add (i − 1)/log2(b) to the result, where i is the number of integer bits
in the fixed point format of the input and b is the base of the logarithm we are
calculating.
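Example 5.2 in code (a float sketch, with math.log2 standing in for the kernel):

```python
import math

raw = 3 * 2**13                   # the value 3.0 stored as a Q3.13 integer
as_q115 = raw * 2**-15            # the kernel reads the same bits as Q1.15: 0.75
result = math.log2(as_q115) + 2   # compensate: (i-1)/log2(b) = 2 for i = 3, b = 2
print(result)                     # log2(3), about 1.585
```

The compensation works because reinterpreting Q3.13 bits as Q1.15 divides the value by 2^(i−1), which subtracts (i − 1)/log2(b) from the logarithm.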
5.9.3 Kernels
Kernel 1
Input format: Q1.15 in the interval [0.5, 1)
Output format: signed Q1.15
Error: max relative error is 2^-10
Cycles, one input value: 17
Cycles, multiple input values: 1 cycle per value
This kernel is just one POLYW instruction and is only intended to work for
inputs in the range [0.5, 1). It can be used if the programmer wants to do all
the required scaling.
.main
polyw <start=0, scale=15, scale2=11, sign=us, rnd1, rnd2> vr2.0 m1[0].sw cm[0]
14*nop
stop

// polynomial constants in signed Q5.11
.cm
0xE199 0x5178 0x8E60 0x6865 0xCAAE 0x0B7D 0 0
Kernel 1b
Input format: Q1.15 in the interval [0.5, 1)
Output format: signed Q1.15
Error: max relative error is 2^-13.78
Cycles, one input value: 20
Cycles, multiple input values: 1.125 cycles per value
Like Kernel 1, this kernel also only works for inputs in the range [0.5, 1). A
constant is subtracted from the input before the polynomial evaluation. This
results in better precision but slightly longer computation time.
.main
saddw vr1.0 m1[0].sw m0[0].sw
4*nop
polyw <start=0, scale=15, scale2=14, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
stop

.m0
0x8000
Kernel 2
Input format: Q1.15, all positive values
Output format: signed Q5.11
Error: max relative error is 2^-11.7
Cycles, one input value: 29
Cycles, multiple input values: 1.625 cycles per value
This kernel can take all positive Q1.15 inputs (but not zero). For the domain
x ∈ [2^-15, 1], log2(x) has the range [−15, 0]. For that range 4 integer bits and
a sign bit are needed, which means that we have to use signed Q5.11 for the
result value.
If the input is in some other fixed point format, an integer can be added to the
result to compensate for it.
This kernel converts the input to soft floating point format before the polynomial
evaluation, and then adds the exponent to the result (actually the exponent is
subtracted, because it is negative). The exponent needs to be shifted to the
correct position so that it can be added to the result (the shifting converts the
exponent from Q16.0 to Q5.11).
.main
stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
slslw vr1.4 vr1.3 m0[1].sw
3*nop
// vr2.0 is Q5.11, vr1.0 is Q1.15, cm[0] is Q2.14 (15 + 14 - 11 = 18)
polyw <start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b
A simple way to calculate logarithms with other bases is to multiply the result
with the constant 1/log2(b), where b is the new base, since
logb(x) = log2(x)/log2(b).
It is also possible to modify the log2(x) kernel so that the polynomial constants
are scaled by 1/log2(b). The exponent must also be scaled.

logb(m · 2^e) = log2(m · 2^e)/log2(b) = p(m)/log2(b) + e/log2(b)    (5.9)

Equation 5.9 is a modification of equation 5.8. We can calculate p(m)/log2(b)
by scaling the polynomial constants (e is here the exponent and must not be
confused with Euler's number, the base of the natural logarithm). In kernel 2 we
needed a shift instruction to scale the exponent. We can replace that shift
instruction with a multiplication instruction, which means that we can calculate
any other logarithm in the same amount of time.
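A float sketch of equation 5.9, where math.frexp provides the soft floating point split and math.log2(m) stands in for the polynomial p(m):

```python
import math

def log_b(x, b):
    """logb(x) via equation 5.9: scale both p(m) and the exponent
    by the constant 1/log2(b)."""
    m, e = math.frexp(x)            # x = m * 2**e with m in [0.5, 1)
    inv = 1.0 / math.log2(b)
    return math.log2(m) * inv + e * inv

print(log_b(0.01, 10))   # about -2.0
```

In the kernel the factor 1/log2(b) is folded into the polynomial constants, so only the exponent needs an explicit multiplication.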
It is also possible to use a polynomial for logb(x − a), x ∈ [0.5, 1), and use that
instead. e/log2(b) can be calculated in some of the empty slots where NOPs are
issued while waiting for the results of a previous instruction before issuing the
next instruction. In that way such an implementation should take an equally
long time.
The polynomial constants are the same as in the log2(x) implementation, except
scaled with 1/log2(e), where e is the base of the natural logarithm (Euler's
number).
Similar changes can be made to the 32 bit version of log2(x), and the same
method can be used to implement log10(x) or a logarithm of some other base.
.main
stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww <scale=4, ss, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(e)
3*nop
polyw <start=0, scale=15, scale2=18, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x58B9
Implemented in the same way as the natural logarithm. For the input domain
[2^-15, 1] the output range is [−4.515, 0]. This means that the result can be
stored in Q4.12. A polynomial can give log10(m) with a maximum error of 2^-15,
which means that the maximum error of the result is a rounding error, because
we have to use Q4.12 to be able to fit all possible results.
.main
stofloatw <sout> vr1.1d m1[0].sw
2*nop
saddw vr1.0 vr1.2 m0[0].sw
smulww <scale=5, su, rnd> vr1.4 vr1.3 m0[2].sw // multiply exponent with 1/log_2(10)
3*nop
polyw <start=0, scale=15, scale2=19, sign=ss, rnd1, rnd2> vr2.0 vr1.0 cm[0]
12*nop
ssubw vr3.0 vr2.0 vr1.4
5*nop
stop

.m0
0x8000 0x000b 0x9A21
A 32 bit version of log2 was also implemented. It uses an 11th degree polynomial,
which is evaluated by first using one POWERSD instruction to produce
[x, x^2, x^3, x^4]; then two double-scalar double-vector multiplications are used
to multiply first x^4 and then x^8 with [x, x^2, x^3, x^4]. Then three triangular
MAC instructions are used to multiply the powers of x with the polynomial
constants and accumulate the products. The result is returned in soft floating
point format, so the user can decide how to scale the result. The mantissa is in
vr4.3d, which is the last scalar double in vector register 4. The exponent is in
vr3.6. It is an integer (Q16.0), but vr3.7 contains zero, so vr3.3d contains the
exponent in Q16.16; it can therefore be shifted to any 32 bit format and added
to the mantissa after the mantissa has been shifted to the same fixed point
format.
This method gives the mantissa with a maximum error of 2^-28.26, which is
consistent with the fact that the polynomial constants are stored in Q3.29,
which has a ULP weight of 2^-29.
The cycle count of this kernel depends on what support there will be for 32 bit
polynomial evaluations. The kernel uses a double TMAC instruction which was
added to the simulator, but will perhaps be implemented as two instructions
(see section 6.4).
6 Instruction proposals
6.1 POWERSW
This instruction calculates powers two through eight of a scalar word.
The idea is that this instruction can be used to calculate powers of a scalar value
and then the result can be used to evaluate a polynomial. Since the datapath
of the SIMD core already has 16 multipliers, it is possible to create a pipelined
special instruction that calculates powers two through eight for a scalar value, and
if several consecutive POWERSW instructions are issued, it is possible to complete
one instruction per cycle.
6.1.1 Implementation
The instruction will have 3 multiplication pipeline stages and 4 ALU pipeline
stages. Table 6.1 shows what is performed in each stage. sat() function performs
saturation if the saturation flag is set, the >> operator is a right shift operator,
scale is the value of the scale flag, round() function performs rounding if the
rounding flag is set.
MUL stage 1 x2 := x1 · x1
x2 := sat(round(x2 >> scale)),
ALU stage 1 zero/sign extend, and select operands for next mul
stage
MUL stage 2 x3 := x1 · x2 , x4 := x2 · x2
x3 := sat(round(x3 >> scale))
x4 := sat(round(x4 >> scale))
ALU stage 2
zero/sign extend, and select operands for next mul
stage
x5 := x4 · x1 , x6 := x4 · x2
MUL stage 3 x7 := x4 · x3 , x8 := x4 · x4
ALU stage 3 xi : sat(round(xi >> scale)), ∀i ∈ {5, 6, 7, 8}
ALU stage 4 choose output, set flags, etc
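The staged computation of table 6.1 can be modelled as follows. This is a sketch only: saturation and rounding are omitted, and the shift is a plain truncation.

```python
def powersw(x1, scale=15):
    """Powers 2..8 of a Q1.15 integer, computed in the three MUL/ALU
    stage pairs of table 6.1 (sat/round omitted)."""
    sr = lambda v: v >> scale                  # ALU scaling stage
    x2 = sr(x1 * x1)                           # stage 1
    x3, x4 = sr(x1 * x2), sr(x2 * x2)          # stage 2
    x5, x6 = sr(x4 * x1), sr(x4 * x2)          # stage 3
    x7, x8 = sr(x4 * x3), sr(x4 * x4)
    return [x2, x3, x4, x5, x6, x7, x8]

print(powersw(16384))   # 0.5 in Q1.15 -> [8192, 4096, 2048, 1024, 512, 256, 128]
```

Note how every power above x^4 is formed from x4 and an earlier power, which is what keeps each stage to at most four simultaneous multiplications.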
6.1.2 Flags
The flags of the instruction are shown in table 6.3
Scale flag
If the scale flag is supposed to be able to take any value in the range 0–16, there
would need to be a barrel shifter in the ALU stage that performs that scaling.
There will probably only be a barrel shifter in the latter of the two ALU stages
in the datapath. For all the implementations in this thesis where this instruction
was used, the value of the scale flag was 15 or 16.¹ This flexibility is perhaps not
needed, and it would be enough to always shift the result by 15. The additional
option to also be able to shift by 16 could give better precision in some cases,
but that flexibility can be sacrificed to make the implementation simpler.
Return value
We can also consider whether we want to sacrifice the flexibility of choosing to
return either powers 1–8 or powers 0–7. It depends on how adaptable the
multiplexing hardware in the last ALU stage is. The reason for wanting to
return powers 0–7 is that we can then directly use one TMAC instruction to
multiply powers 0–7 with the polynomial constants and accumulate the
products. If we only have the option to return powers 1–8, it will be necessary
to add the first polynomial constant (the one for the zero power) separately
(unless the constant's value is zero). When we want to evaluate a polynomial
for multiple input values, we can use one addition instruction to do this
addition for eight values, which results in an additional 1/8 clock cycle per
value. When we have one or very few input values, we can in most cases issue
an instruction to preload this constant into the accumulator register in a slot
where a NOP would otherwise be issued (due to a data dependency).

¹ If we multiply two numbers in Q1.15 and want the result in Q1.15, the scale
flag is set to 15; similarly, if we multiply two Q0.16 values, we set the scale flag
to 16 to get a result in Q0.16.

Table 6.4: The value to return as the zero power in the POWERSW instruction
Power zero
How do we return power 0 for unsigned Q0.16 or signed Q1.15? We can use 0xffff and 0x7fff (0xffff in Q0.16 equals 1 − 2−16, and 0x7fff in signed Q1.15 equals 1 − 2−15). Table 6.4 shows what values should be returned as the zero power, based on the scale flag and the sign flag. Note that if it is decided to only return powers 1-8, it is not necessary to consider this. Also, if we only use the scaling value 15, or only the values 15 and 16, there are only 2 or 3 options.
Selection of multipliers
The most important (and obvious) thing we need to consider when we select the multipliers used for each multiplication stage is to never use the same multiplier more than once per instruction. Another thing to consider, if we extend the POWERSW instruction to a POLYW instruction (discussed below), is that we might want the last multiplication stage of the POLYW instruction to be like a TMACO instruction, so it may be possible to reuse some of the control of the TMACO to implement the POLYW. This means that the set of multipliers we use for POWERSW is the set of multipliers we do not use for the TMACO instruction. Another advantage of using the set of multipliers not used for TMACO is that we can issue a TMACO instruction after a POWERSW instruction such that it enters the write back stage immediately one cycle after the POWERSW instruction enters the write back stage (but there must not be a data dependency between the last POWERSW and the first TMACO if we issue them like that).
64 Instruction proposals
6.2 POWERSD
This instruction calculates powers 1-4 of a scalar double. It is also a special instruction, similar to POWERSW. Since we only calculate up to power 4, we only need two multiplication datapath stages. The operands, flags and pipeline stages are shown in tables 6.5, 6.6 and 6.7.
• The only function implementation in this thesis that depends on this instruction is the 32 bit version of the logarithm.2 If there is not a need to evaluate
2 We also tried to use it to implement inverse square root but, as is mentioned in chapter 5, we can
6.3 POLYW
This instruction evaluates a polynomial for a scalar value. It is a special instruc-
tion that works like a POWERSW instruction followed by a TMAC instruction.
Seven multipliers are used to calculate the powers and eight multipliers are used
to calculate the triangular mac. It means that this instruction needs fifteen mul-
tipliers. The datapath has sixteen multipliers as well as enough ALU hardware to
be able to evaluate one polynomial per cycle.
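The POWERSW-then-TMAC structure amounts to evaluating the polynomial as a dot product of a power vector with a coefficient vector. A plain-Python sketch of the same idea (ignoring fixed-point scaling for brevity, which the hardware does not):

```python
def polyw(x, coeffs):
    # Evaluate c0 + c1*x + ... + c7*x^7: first build powers x^0..x^7
    # (the POWERSW part), then multiply-accumulate over all eight
    # coefficient/power products (the TMAC part).
    powers = [x ** i for i in range(8)]
    return sum(c * p for c, p in zip(coeffs, powers))

# 1 + 2x + 3x^2 at x = 2
print(polyw(2, [1, 2, 3, 0, 0, 0, 0, 0]))  # 17
```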
Implementing the control of this instruction would be a challenge, though. The POWERSW will most likely be implemented by adding a special instruction decoder. Adding support for this instruction to the special decoder would not require much further work, because the POLYW instruction is identical to the POWERSW instruction until after the third multiplication pipeline stage.
The pipeline stages are shown in table 6.8.
The operands and flags are in tables 6.9 and 6.10.
Table 6.8: The pipeline stages of the POLYW instruction. The same notation is used as in table 6.1; ci are the polynomial constants.
dst 16 bit
src0 16 bit
src1 128 bit (8x16 bit)
Opcode size
If we had two scale flags that can both take values between 0 and 16, each would need 5 bits. This could be pushing the limits of the size of the opcode, but the opcode size has not yet been decided.
6.4 TMACDO
This instruction is a triangular MAC of two double vectors: it accumulates the result to an accumulation register and outputs the result to the destination operand.
The return value is: src0[0 : 31]·src1[0 : 31] + src0[32 : 63]·src1[32 : 63] + src0[64 : 95]·src1[64 : 95] + src0[96 : 127]·src1[96 : 127] + accregval, where accregval is the value already in the accumulator register that is used. We could also use an instruction called TMACD which is identical except that it does not output the result to a destination operand (it only updates the accumulator register).
The main usage in this thesis was to use this instruction along with POWERSD to evaluate polynomials, but it could of course also be used to calculate dot products, convolutions, and other DSP tasks which use a triangular MAC.
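The return value above is a four-element dot product plus the running accumulator. In scalar pseudocode (a sketch of the arithmetic, not the exact bit-level semantics):

```python
def tmacdo(src0, src1, accregval):
    # Triangular MAC of two 4-element double vectors: the sum of the
    # pairwise products is added to the current accumulator value.
    return accregval + sum(a * b for a, b in zip(src0, src1))

print(tmacdo([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 5+12+21+32+10 = 80
```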
dst 32 bit
src0 128 bit, 4 · 32bit
src1 128 bit, 4 · 32bit
6.5 SSUMD
This instruction calculates the sum of a double vector. It is intended to be used after multiplying two double vectors, if we are not able to implement the TMACDO instruction.
dst 32 bit
src0 128 bit (4 · 32bit)
6.6.1 STOFLOATW
dst 32 bit
src 16 bit
The format flag is the fixed point format of the mantissa. It can be unsigned Q0.16, unsigned Q1.15 (which gives the absolute value if the input is negative), or signed Q1.15.3 For the implementation of the unsigned Q1.15 format, the sign bit can be stored in the same word as the exponent, because the exponent only needs 5 bits (values between 0 and 31).
The unsigned Q0.16 format always has the MSB at 1, and the unsigned Q1.15 format always has the two MSBs valued 01. This is because the mantissa is always a value between 0.5 and 1.
When the sign flag is ’s’, the exponent value in the output will be equal to the number of leading sign bits minus one. When the sign flag is ’u’, the value of the exponent is equal to the number of leading zeros.
Another way of viewing this is that sign=’s’ assumes Q1.15 input and sign=’u’ assumes Q0.16 input, and both have exponent 0 when the input’s absolute value is between 0.5 and 1.
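The normalization rule above can be modeled directly; a sketch for the unsigned case only (sign=’u’, nonzero Q0.16 input), where the mantissa always ends up in [0.5, 1):

```python
def stofloatw_u(x):
    # Convert a nonzero unsigned Q0.16 value to soft float:
    # exponent = number of leading zeros in the 16-bit word,
    # mantissa = input shifted left so its MSB is 1.
    assert 0 < x < (1 << 16)
    e = 16 - x.bit_length()   # count of leading zeros
    m = x << e                # normalized mantissa, MSB set
    return m, e               # value represented is m * 2**-16 * 2**-e

# 0.25 in Q0.16 is 0x4000: one leading zero, mantissa 0x8000 (= 0.5)
print(stofloatw_u(0x4000))
```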
Design decisions
• Perhaps a better mnemonic could be chosen for the instruction.
• Perhaps a better name can be found for the format flag as well as its values.
3 The simulator implementation of STOFLOATW had a flag called sout (short for signed output) instead of the format flag. If the flag is set, the result is the same as for format=15s; when it is not set, the result is the same as for format=16u.
• Even though format=15s was used in all the implementations (where the flag is named sout in the implementations’ source code), the 15u format is probably more suitable for most applications. It is probably enough to only allow the format values 16u and 15u.
• It is also possible to sacrifice the flexibility of the option to use 16u and always
use the 15u output format (and hence no flag is needed). The tradeoff is that
one bit is lost from the input if the input format is unsigned Q0.16.
• A sign bit can be stored in the same word as the exponent. The shift instruction ignores all the bits of the shift operand except the relevant LSBs, and therefore this would not affect the conversion from soft floating point format to fixed point format, which is often done with a shift instruction.
6.6.2 STOFLOATD
Converts a double scalar to soft floating point. The first two words of the output
contain a 32 bit mantissa, the third word contains the exponent and the last word
is not used. Operands and flags are in tables 6.15 and 6.16.
dst 64 bit
src0 32 bit
6.6.3 TOFLOATADD
These are several instructions that convert to soft floating point and then add the second source operand to the mantissa. Almost all implementations in this thesis where polynomial evaluation was used included adding a constant to the mantissa before the polynomial evaluation, so this instruction would reduce pipeline delay. It would, however, perhaps need too many hardware adjustments to make it worth adding: the instruction requires first counting leading zeros, then shifting, and then an addition. Another possibility would be to do this as a long datapath instruction, but since the instruction starts by counting leading zeros, there would need to be zero counting hardware in the first ALU stage.
The method of using one instruction to get mantissas and another instruction to get exponents is more convenient, because it is more convenient to have all the mantissas in the same vector word and all the exponents in the same vector word. It is then easier to use them as operands to other vector instructions.
The only advantage of the half vector over the full vector version is in the case
where we only need to convert one half vector. In all other cases we would need
to issue two instructions per vector anyway.
These instructions are useful when we need to calculate polynomials for a series of values. They require that we have more than one instance of count leading zeros hardware.
Note that x and res can be vectors (or any numerical datatype, for that matter). If e is a constant known at compile time, we do not need the AND and RSHIFT instructions; we only need the MUL x x x and MUL res res x when appropriate. That is, it is not necessary to check the condition, since we know at compile time which multiplications will be needed.
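The MUL/AND/RSHIFT loop described above is binary exponentiation (square-and-multiply); a direct Python version of the same instruction sequence, where the LSB of the exponent is the select bit:

```python
def ipow(x, e):
    # Integer power by repeated squaring, mirroring the
    # MUL/AND/RSHIFT sequence in the text.
    res = 1
    while e:
        if e & 1:          # the AND: test the LSB of the exponent
            res = res * x  # MUL res res x
        x = x * x          # MUL x x x
        e >>= 1            # RSHIFT
    return res

print(ipow(3, 5))  # 243
```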
There are several ways to implement a special instruction that uses the currently existing hardware in the datapath to speed up these calculations. The two multiplications and the right shift can all be performed in parallel, and the AND instruction can be removed by using the LSB of the exponent (and its shifted versions) directly as a select bit to a multiplexer. It would, however, be relatively complicated to implement the control for such a special instruction, so a simpler and slower approach might be chosen instead.
Fractional exponents
The method described here calculates integer powers, but could be used in com-
bination with polynomial approximation to calculate powers where the exponent
is a fractional number.
The polynomial would be used to approximate the power of the fractional part
of the exponent. It is however only possible to use polynomial approximations
for the fractional part when the base is a constant known at compile time, since
different bases of the power need a different set of polynomial constants.
The calculation would be done as:
k^e = k^(i+f) = k^i · k^f (6.2)
where e is a fractional variable with integer part i and fractional part f such that e = i + f. k^i would be calculated with the method described above for calculating powers with integer exponents, and k^f is calculated with polynomial approximation.
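Equation (6.2) can be checked numerically. In this sketch, math.pow stands in for the polynomial approximation of k^f (an assumption made only to keep the example runnable; the real kernel would use fixed-point polynomial evaluation):

```python
import math

def fracpow(k, e):
    # k**e split as k**i * k**f, as in equation (6.2).
    # math.pow is a stand-in for the polynomial approximation of k**f.
    i = math.floor(e)
    f = e - i
    return (k ** i) * math.pow(k, f)

print(abs(fracpow(2.0, 3.5) - 2.0 ** 3.5) < 1e-9)
```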
Variable bases
Another similar method can be used to calculate powers with variable fractional
exponents and variable bases. It is however challenging to use in a fixed point
system.
We factorize the exponent into an integer and a fractional constant. The fractional constant is the value of the LSB. For example, if we have a variable exponent with four fractional bits, we can factorize 10.1011 as 101011 · 0.0001. If we want to calculate b^10.1011 we can calculate b^101011 as a power with an integer exponent, and then use polynomial approximation to calculate (b^101011)^0.0001, with a polynomial which approximates the function f(x) ≡ x^0.0001.
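The factorization can also be verified numerically. Here the outer power (which the text approximates with a polynomial) is computed with Python's ** operator, purely to keep the sketch self-contained:

```python
def varbase_pow(b, exp_bits, lsb_weight):
    # Variable-base power: interpret the exponent bits as an integer
    # times the LSB weight, e.g. 10.1011 (binary) = 101011 * 0.0001.
    # The outer ** stands in for the polynomial for x**lsb_weight.
    return (b ** exp_bits) ** lsb_weight

# exponent 10.1011b = 2.6875; its bits 101011b = 43; LSB weight 2^-4
print(abs(varbase_pow(1.5, 43, 2 ** -4) - 1.5 ** 2.6875) < 1e-9)
```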
6.8 Other
6.8.1 Opposite signed products after multiplication instructions
This feature was described in section 4.1.2. It is used to speed up Newton-Raphson
division by returning the product with its opposite sign after a multiplication
instruction.
This feature was implemented in the simulator by adding a flag to multiplication
instructions to let them return the product with the opposite sign. It can also be
implemented as a separate instruction, depending on which is easier to implement
in the instruction decoder or depending on which approach is more consistent with
the rest of the instruction set.
Since a multiplication is already a long datapath instruction, it would cost little to use either one of the ALU stages to return the negative of the product.
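The iteration being accelerated is the Newton-Raphson reciprocal step r' = r·(2 − x·r); with a negated-product multiply, the (2 − x·r) term can be formed without a separate subtraction (in a wrapping unsigned fixed-point format with values in (0, 2), −(x·r) mod 2 equals 2 − x·r). A floating-point sketch of the iteration itself, assuming a first estimate r:

```python
def nr_recip(x, r, iters=4):
    # Newton-Raphson reciprocal refinement: r' = r * (2 - x*r).
    # With a negated-product multiply, the (2 - x*r) factor is what the
    # hardware can return directly in wrapping fixed point.
    for _ in range(iters):
        r = r * (2.0 - x * r)
    return r

print(abs(nr_recip(0.75, 1.2) - 1.0 / 0.75) < 1e-12)
```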
Results
In this chapter we summarize our results. The results are divided into two sections: a summary of the kernels we have implemented, and a summary of the features we propose to add, or to consider adding, to the instruction set architecture.
Kernel name  Input        Domain      Output     Error         Cycles (one input)  Cycles (multiple inputs)
Kernel 1     Q1.15        [0.5, 1)    uQ1.15     abs 2^-11.5   20                  1.125
Kernel 2     Q1.15        [2^-15, 1)  sf uQ1.15  rel 2^-11.85  23                  1.375
Kernel 3     Q1.15        [2^-15, 1)  sf uQ1.15  rel 2^-15.0   35                  1.625
Kernel 4     Q1.15        [2^-15, 1)  Q16.16     rel 2^-16     38                  2.375
Kernel 5     Q1.31        [2^-31, 1)  sf uQ1.31  rel 2^-24     35                  2.125
Kernel 5b    Q1.31        [2^-31, 1)  sf uQ1.31  rel 2^-31     47                  2.625
Kernel 6     64 x uQ0.16  [2^-16, 1)  sf uQ1.15  rel 2^-15     N/A                 1.875
Kernel name  Input  Domain      Output     Error         Cycles (one input)  Cycles (multiple inputs)
Kernel 1     Q1.15  [0.5, 1)    uQ1.15     abs 2^-13.68  20                  1.125
Kernel 2     Q1.15  [2^-15, 1)  sf uQ1.15  rel 2^-13.91  30                  2.0
Kernel 2b    Q1.15  [2^-15, 1)  sf uQ1.15  rel 2^-13.91  29                  3.375
Kernel 3a    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-26.58  51                  3.5
Kernel 3b    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-30.6   72                  4.5
Kernel 3c    Q1.31  [2^-31, 1)  sf uQ1.31  rel 2^-29.85  66                  4.5
Kernel 4     Q1.31  [2^-31, 1)  sf uQ2.30  rel 2^-27     37                  8
Table 7.3: A summary of the kernels that implement inverse square root
Kernel name  Input  Domain      Output     Error        Cycles (one input)  Cycles (multiple inputs)
Kernel 1     Q1.15  [0.5, 1)    uQ0.16     abs 2^-15.6  20                  1.125
Kernel 2     Q1.15  [2^-15, 1)  uQ0.16     abs 1 ULP    31                  2.125
Kernel 3     Q1.15  [2^-15, 1)  uQ0.16     abs 1 ULP    28                  3.625
Kernel 4a    Q1.31  [2^-31, 1)  sf uQ0.32  abs 2^-26    57                  4.25
Kernel 4b    Q1.31  [2^-31, 1)  sf uQ0.32  rel 2^-29.6  72                  5.25
Kernel 4c    Q1.31  [2^-31, 1)  sf uQ0.32  rel 2^-30.1  78                  5.25
Kernel 5     Q1.31  [2^-31, 1)  uQ0.32     abs 2 ULP    81                  5.5
Kernel name  Base  Input  Domain      Output     Error         Cycles (one input)  Cycles (multiple inputs)
Kernel 1     2     Q1.15  [0.5, 1)    sQ1.15     rel 2^-10     17                  1
Kernel 1b    2     Q1.15  [2^-15, 1)  sQ1.15     rel 2^-13.78  20                  1.125
Kernel 2     2     Q1.15  [2^-15, 1)  sQ5.11     rel 2^-11.7   29                  1.625
Kernel 3     e     Q1.15  [2^-15, 1)  sQ5.11     abs 2^-10.96  29                  1.625
Kernel 4     10    Q1.15  [2^-15, 1)  sQ4.12     abs 2^-12.6   29                  1.625
Kernel 5     2     Q1.31  [2^-31, 1)  sf sQ1.31  abs 2^-28.12  37                  7.75
POWERSW
Special instruction that calculates powers 1-8 of a 16 bit scalar.
POWERSD
Special instruction that calculates powers 1-4 of a 32 bit scalar.
POLYW
Special instruction that calculates a polynomial of degree 8 or less, for a 16-bit
scalar.
TMACDO
Triangular multiplication and accumulation of two vector double words.
SSUMD
Adds all the four scalars in a vector double word and returns the sum.
STOFLOATW
Converts a scalar word to soft floating point format.
STOFLOATD
Converts a scalar double word to soft floating point format.
VMANTW
Returns the mantissas of 8 scalars in a vector word.
VMANTD
Returns the mantissas of 4 scalars in a vector double word.
TOFLOATADD
Four instructions, listed in table 6.17, that convert inputs to soft floating point
format and add a constant to the mantissa.
Conclusions
It soon became clear that Taylor polynomials could not be used alone to implement the functions, because they only give good precision in a limited interval of the function’s domain (when we use a finite number of terms), and very often the polynomial coefficients, as well as the powers of the input, are of various magnitudes that require a big dynamic range, which is difficult to deal with when we use fixed point arithmetic.
The Remez algorithm was used to calculate the polynomial coefficients for min-max polynomials. We found a method to roughly estimate the error a given min-max polynomial would give if it were calculated in a fixed point system. We could then tweak the input parameters of the Remez algorithm (the polynomial degree, and the constant we have called a) to find the polynomial that would give us the best precision.
Newton-Raphson and Goldschmidt iterations were also used to improve the precision of the results of 1/x and 1/√x.
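For reference, the inverse-square-root refinement used in the kernels (the T1/T2 sequence in appendix A.2) is the Newton-Raphson step r' = r·(1.5 − 0.5·x·r²). A floating-point sketch of that step, assuming a reasonable starting estimate r:

```python
def nr_invsqrt(x, r, iters=5):
    # Newton-Raphson step for 1/sqrt(x): r' = r * (1.5 - 0.5*x*r*r),
    # the same T1 = x*r*r/2, T2 = 1.5 - T1, r' = r*T2 sequence the
    # kernels compute in fixed point.
    for _ in range(iters):
        r = r * (1.5 - 0.5 * x * r * r)
    return r

print(abs(nr_invsqrt(0.64, 1.0) - 1.25) < 1e-12)
```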
Our goal was to add as little hardware as possible, but we conclude that in order to efficiently implement conversion to soft floating point format, it is necessary to have hardware that counts leading zeros and leading ones (leading sign bits). This type of hardware is relatively small and inexpensive to add to the ALU.
Several kernel source codes were written for each of the four functions. The kernels differ in precision and execution time, and both 16 bit and 32 bit kernels were implemented. The precision and execution times of the kernels were reasonable.
Future work
The development of the ePUMA platform will continue. When it has been decided
which features discussed in chapter 6 will be implemented and how, it will become
necessary to modify the kernel source codes accordingly. The methods discussed
in this thesis can also be applied to implement other mathematical functions.
Bibliography
[5] Andreas Karlsson. Algorithm adaptation and optimization of a novel DSP vector co-processor, 2010. Master’s thesis, Linköping University, Computer Engineering, The Institute of Technology.
[6] Dake Liu. Embedded DSP Processor Design. Morgan Kaufmann Publishers
Inc., 2008.
[7] Peter Markstein. Software division and square root using Goldschmidt’s algorithms. In 6th Conference on Real Numbers and Computers, pages 146–157, 2004.
[11] Jian Wang, Joar Sohl, Olof Kraigher, and Dake Liu. ePUMA: A novel embedded parallel DSP platform for predictable computing. In Education Technology and Computer (ICETC), 2010 2nd International Conference on, volume 5, pages V5-32–V5-35. Institute of Electrical and Electronics Engineers, Inc., June 2010.
Appendix A
Kernel source codes
A.1 Reciprocal
A.1.1 Kernel 1
1 // input m1 [0]. sw signed Q1 .15 range 0.5 -1
2 // output vr2 .0 unsigned Q1 .15
3 // max error 11.1 * ulp
4 // 20 cycles
5
6 . main
7 saddw vr1 .0 m1 [0]. sw m0 [0]. sw
8 4* nop
9 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
10 12* nop
11 stop
12
13 . m0
14 0 x9800
15
16 . m1
17 0
18
19 // constants in Q4 .12 for 1/( x +0.8125)
20 . cm
21 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0
A.1.2 Kernel 2
1 // input m1 [0]. sw signed Q1 .15
2 // output : softfloat
3 // mantissa : vr2 .0 unsigned Q1 .15
4 // exponent : vr1 .3 ( integer )
5 //
6 // max relative error size = 2^ -11.855880
7 // 23 cycles
8
9
10 . main
11
12 stofloatw < sout > vr1 .1 d m1 [0]. sw
13 2* nop
14 saddw vr1 .0 vr1 .2 m0 [0]. sw
15 4* nop
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
17 12* nop
18 stop
19
20 . m0
21 0 x9800
22
23 . m1
24 0
25
26 . cm
27 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0
A.1.3 Kernel 3
A.1.4 Kernel 4
1 // input in signed Q1 .15 m1 [0]. sw
2 // output unsigned Q16 .16 soft float mantissa vr0 .1 exp vr1 .3
3 // max relative error size = 2^ -16 ( -15.9997) , max abs error 2.28 ulp
4 // 38 cycles
5
6
7 . main
8
9 stofloatw < sout > vr1 .1 d m1 [0]. sw
10 2* nop
11 saddw vr1 .0 vr1 .2 m0 [0]. sw
12 ssubw vr1 .6 m0 [1]. sw vr1 .3
13 2* nop
14 scopyw vr1 .4 vr1 .2
15
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
17 12* nop
18
19 // Newton-Raphson iteration
20 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
21 5* nop
22 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
23 5* nop
24
25 // scale result
26 slsrd vr0 .3 d vr0 .1 d vr1 .6
27 2* nop
28 stop
29
30
31 . m0
32 0 x9800 15
33
34
35 . m1
36 0
37
38 . cm
39 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0
A.1.5 Kernel 5
1 // input in signed Q1 .31 m1 [0]. sd
2 // output soft float mantissa vr0 .1 d exp vr1 .6
3 // max relative error size = 2^ -16 ( -15.9997) , max abs error 2.28 ulp
4 // 35 cycles
5
6
7 . main
8
9 stofloat32d < sout > vr1h m1 [0]. sd
10 2* nop
11
12 saddw vr1 .0 vr1 .4 m0 [0]. sw // subtract
13
14 4* nop
15
16 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
17 12* nop
18
19 // Newton-Raphson iteration
20 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
21 5* nop
22 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
23 5* nop
24
25
26
27 stop
28
29
30 . m0
31 0 x9800 15
32
33
34 . m1
35 0
36
37 . cm
38 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0
A.1.6 Kernel 5b
1 // input in signed Q1 .31 m1 [0]. sd
2 // output soft float mantissa vr0 .1 d exp vr1 .6
3
4 // max relative error size = 2^ -31
5 // 47 cycles
6 // 5 degree polynomial + two Newton-Raphson iterations
7
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 2* nop
13
14 saddw vr1 .0 vr1 .4 m0 [0]. sw // subtract
15
16 4* nop
17
18 polyw < start =0 , scale =15 , scale2 =12 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
19 12* nop
20
21 // Newton-Raphson iteration
22 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr1 .2 d vr2 .0 d
23 5* nop
24 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr2 .0 d
25 5* nop
26
27
28 // second iteration
29 smuldd < scale =31 , uu , rnd , negres > vr0 .0 d vr0 .1 d vr1 .2 d
30 5* nop
31 smuldd < scale =31 , uu , sat , rnd > vr0 .1 d vr0 .0 d vr0 .1 d
32 5* nop
33
34
35 // convert from soft float to Q16 .16
36 // 2* nop
37 stop
38
39
40 . m0
41 0 x9800 15
42
43
44 . m1
45 0
46
47 . cm
48 0 x13B1 0 xE7BC 0 x1DE8 0 xDE01 0 x2CD2 0 x8E6B 0 0 0
A.1.7 Kernel 6
1 // vector reciprocal
2 // calculates reciprocal of 8 vector words stored in m0 [0]. vw - m0 [56]. vw in 1 -15 FP
3 // method works only in the range [0.5 ,1]
4 // reciprocal is returned in 2 -14 FP format to vr0 - vr7
5
6
7 . main
8
9
10 // get mantissa
11 vmantw <u > vr0 m0 [0]. vw
12 vmantw <u > vr1 m0 [8]. vw
13 vmantw <u > vr2 m0 [16]. vw
14 vmantw <u > vr3 m0 [24]. vw
15 vmantw <u > vr4 m0 [32]. vw
16 vmantw <u > vr5 m0 [40]. vw
17 vmantw <u > vr6 m0 [48]. vw
18 vmantw <u > vr7 m0 [56]. vw
19
20 vexpow <u > m1 [64]. vw m0 [0]. vw
21 vexpow <u > m1 [72]. vw m0 [8]. vw
22 vexpow <u > m1 [80]. vw m0 [16]. vw
23 vexpow <u > m1 [88]. vw m0 [24]. vw
24 vexpow <u > m1 [96]. vw m0 [32]. vw
25 vexpow <u > m1 [104]. vw m0 [40]. vw
26 vexpow <u > m1 [112]. vw m0 [48]. vw
27 vexpow <u > m1 [120]. vw m0 [56]. vw
28
29
30
31 vcopy m0 [128]. vw vr0
32 vcopy m0 [136]. vw vr1
33 vcopy m0 [144]. vw vr2
34 vcopy m0 [152]. vw vr3
35 vcopy m0 [160]. vw vr4
36 vcopy m0 [168]. vw vr5
37 vcopy m0 [176]. vw vr6
103 // c = -x * i
104 // c : unsigned Q1 .15 i : signed Q4 .12 x : unsigned Q0 .16
105 vmulww < scale =13 , us , negres > m1 [0]. vw m0 [128]. vw vr0
106 vmulww < scale =13 , us , negres > m1 [8]. vw m0 [136]. vw vr1
107 vmulww < scale =13 , us , negres > m1 [16]. vw m0 [144]. vw vr2
108 vmulww < scale =13 , us , negres > m1 [24]. vw m0 [152]. vw vr3
109 vmulww < scale =13 , us , negres > m0 [96]. vw m0 [160]. vw vr4
110 vmulww < scale =13 , us , negres > m0 [104]. vw m0 [168]. vw vr5
111 vmulww < scale =13 , us , negres > m0 [112]. vw m0 [176]. vw vr6
112 vmulww < scale =13 , us , negres > m0 [120]. vw m0 [184]. vw vr7
113
114 // i = i * c or i_2 = i_1 * c
115 // i2 : unsigned Q1 .15
116 3* nop
117 vmulww < scale =12 , su > vr0 vr0 m1 [0]. vw
118 vmulww < scale =12 , su > vr1 vr1 m1 [8]. vw
119 vmulww < scale =12 , su > vr2 vr2 m1 [16]. vw
120 vmulww < scale =12 , su > vr3 vr3 m1 [24]. vw
121 vmulww < scale =12 , su > vr4 vr4 m0 [96]. vw
122 vmulww < scale =12 , su > vr5 vr5 m0 [104]. vw
123 vmulww < scale =12 , su > vr6 vr6 m0 [112]. vw
124 vmulww < scale =12 , su > vr7 vr7 m0 [120]. vw
125
126
127
128 // iteration 2
129 // c = -( x * i )
130 // c : Q1 .15 u x : Q0 .16 u i ; Q1 .15 u
131 vmulww < scale =16 , uu , negres , rnd > m1 [0]. vw m0 [128]. vw vr0
132 vmulww < scale =16 , uu , negres , rnd > m1 [8]. vw m0 [136]. vw vr1
133 vmulww < scale =16 , uu , negres , rnd > m1 [16]. vw m0 [144]. vw vr2
134 vmulww < scale =16 , uu , negres , rnd > m1 [24]. vw m0 [152]. vw vr3
135 vmulww < scale =16 , uu , negres , rnd > m0 [96]. vw m0 [160]. vw vr4
136 vmulww < scale =16 , uu , negres , rnd > m0 [104]. vw m0 [168]. vw vr5
137 vmulww < scale =16 , uu , negres , rnd > m0 [112]. vw m0 [176]. vw vr6
138 vmulww < scale =16 , uu , negres , rnd > m0 [120]. vw m0 [184]. vw vr7
139
140 // i = i * c
141 //15+15 -15
142 3* nop
143 vmulww < scale =15 , uu , rnd , sat > vr0 vr0 m1 [0]. vw
144 vmulww < scale =15 , uu , rnd , sat > vr1 vr1 m1 [8]. vw
145 vmulww < scale =15 , uu , rnd , sat > vr2 vr2 m1 [16]. vw
146 vmulww < scale =15 , uu , rnd , sat > vr3 vr3 m1 [24]. vw
147 vmulww < scale =15 , uu , rnd , sat > vr4 vr4 m0 [96]. vw
148 vmulww < scale =15 , uu , rnd , sat > vr5 vr5 m0 [104]. vw
149 vmulww < scale =15 , uu , rnd , sat > vr6 vr6 m0 [112]. vw
150 vmulww < scale =15 , uu , rnd , sat > vr7 vr7 m0 [120]. vw
151
152 5* nop
153
154
155 stop
156
157
158
159
160 . m0
161 0 x4000 0 x4100 0 x4200 0 x4300 0 x4400 0 x4500 0 x4600 0 x4700
162 0 x4800 0 x4900 0 x4a00 0 x4b00 0 x4c00 0 x4d00 0 x4e00 0 x4f00
163 0 x5000 0 x5100 0 x5200 0 x5300 0 x5400 0 x5500 0 x5600 0 x5700
164 0 x5800 0 x5900 0 x5a00 0 x5b00 0 x5c00 0 x5d00 0 x5e00 0 x5f00
165 0 x6000 0 x6100 0 x6200 0 x6300 0 x6400 0 x6500 0 x6600 0 x6700
166 0 x6800 0 x6900 0 x6a00 0 x6b00 0 x6c00 0 x6d00 0 x6e00 0 x6f00
167 0 x7000 0 x7100 0 x7200 0 x7300 0 x7400 0 x7500 0 x7600 0 x7700
A.2 Inverse square root
A.2.2 Kernel 2
A.2.3 Kernel 2b
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x0001 0 x8000 in Q1 .31 x
3 // returns result in unsigned Q8 .24 to vr3 .0 d
4 // max relative error is 2^ -13.91
5 // 29 cycles
6
7 . main
8
9 stofloatw < sout > vr0 .0 d m1 [0]. sw
10 2* nop
11 saddw vr1 .0 vr0 .0 m0 [0]. sw
12 3* nop
13 slslw vsr1 .4 vr0 .1 m0 [1]. sw
14 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
15 9* nop
16
17 smulwd < scale =31 , uu > vr3 .0 d vr2 .0 m0 [ ar1 +8]. sd
18 8* nop
19 stop
20
A.2.4 Kernel 3a
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floating point , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -26.58
6 // ( tested with approx 29 thousand values in the entire range )
7 // 51 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18 scopyd . ne vr0 .2 d m0 [8]. sd
19 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
20 12* nop
21
22 // iteration
23 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
24 5* nop
25 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
26 2* nop
27 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
28 5* nop
29 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
30 5* nop
31
32
33 // multiply with 1 if exponent is even or sqrt (2) if it is odd
34 smuldd < scale =31 , uu , rnd > vr3 .0 d vr4 .3 d vr0 .2 d
35 5* nop
36
37 stop
38
39 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
40 . m0
41 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
42 0 xB504 0 xF334 0 x8000 0 x0000
43
44
45 . m1
46 0
47
48
49 // constants in Q2 .14 invsqrt ( x +0.875)
50 . cm
51 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
A.2.5 Kernel 3b
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31 x
3 // returns result in soft floating point , mantissa in vr3 .0
4 // exponent in vr1 .6
5 // max relative error is 2^ -30.6 tested with 29000 values
6 // 72 cycles
7
8 . main
9
10 stofloat32d < sout > vr1h m1 [0]. sd
11 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
12 nop
13 saddw vr1 .0 vr1 .4 m0 [0]. sw
14 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
15 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
16 2* nop
17 scopyd . ne vr0 .2 d m0 [8]. sd
18 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
19 12* nop
20
21 // iteration
22 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
23 5* nop
24 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
25 2* nop
26 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
27 5* nop
28 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
29 5* nop
30
31 // iteration 2
32 smuldd < scale =31 , uu , rnd > vr5 .0 d vr4 .3 d vr1 .2 d // R * X
33 5* nop
34 smuldd < scale =32 , uu , rnd > vr5 .1 d vr4 .3 d vr5 .0 d // R * R * X /2
35 2* nop
36 ssubd vr5 .2 d m0 [4]. sd vr5 .1 d
37 5* nop
38 smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d
39 5* nop
40
41 // multiply with 1 if exponent is even or sqrt (2) if it is odd
A.2.6 Kernel 3c
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floating point , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -29.85
6 // ( tested with approx 30 thousand values in the entire range )
7 // 66 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18
19 scopyd . ne vr0 .2 d m0 [8]. sd
20 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
21 12* nop
22
23 // iteration
24 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R1 * X
25 5* nop
26 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // T1 = R1 * R1 * X /2
27 2* nop
28 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d // T2 = 1.5 - T1
29 5* nop
30
31 // iteration 2 Goldschmidt
32 smuldd < scale =31 , uu , rnd > vr6 .0 d vr4 .2 d vr4 .2 d // T2 * T2
33
A.2.7 Kernel 4
1 // calculates invsqrt ( x ) of the value in m1 [0]. sd
2 // for x in range 0 x00000001 to 0 x7fffffff in Q1 .31
3 // uses 11 degree polynomial
4 // returns result in soft float format
5 // mantissa in vr4 .3 d and exponent in vr3 .6
6 // max log2 ( relative error ) is -27
7 // 37 cycles
8
9 . main
10
11 stofloat32d < sout > vr3h m1 [0]. sd
12 2* nop
13 saddd vr3 .0 d vr3 .2 d m0 [0]. sd
14 tmacdo < scale =30 , rnd , ss > vr5 .0 d m0 [8]. vd cm [3] // dc term to accumulator
15 4* nop
16 powersd < scale =31 , start =1 , s , rnd > vr0 vr3 .0 d
17 7* nop
18 svmuldd < scale =31 , rnd , ss > vr1 vr0 .3 d vr0
19 tmacdo < scale =32 , rnd , ss > vr4 .0 d vr0 cm [4]
20 4* nop
21 svmuldd < scale =31 , rnd , ss > vr2 vr1 .3 d vr0
22 tmacdo < scale =32 , rnd , ss > vr4 .1 d vr1 cm [5]
23 3* nop
24 tmacdo < scale =29 , rnd , ss > vr4 .3 d vr2 cm [2]
25 6* nop
26
27 stop
28
29
30 . m0
31 0 x9000 0 x0000 0 x111A 0 xCEE5 0 x107b 0 x2e11 0 0
32 0 x4000 0 0 0 0 0 0 0
33
34 . m1
35 0 0 0 x107b 0 x2e11 0 x21f 0 x420f
36
37 // constants in invsqrt ( x +0.875)
38 . cm
39 0 xF639 0 xD2EC 0 x0860 0 xB952 0 xF805 0 x6766 0 x07FA 0 x36B6
40 0 xF7CA 0 x23F3 0 x08B6 0 x97D1 0 xF78C 0 xA1AD 0 x081D 0 xADE6
41 0 xDCCA 0 x9833 0 xBC42 0 xF6B5 0 x94E7 0 x8F5B 0 0
42 0 x446B 0 x3B95 0 0 0 0 0 0
43 0 xB1CE 0 x975D 0 x4305 0 xCA90 0 xC02B 0 x3B30 0 x3FD1 0 xB5B0
44 0 xBE51 0 x1F98 0 x45B4 0 xBE8A 0 xBC65 0 x0D66 0 x40ED 0 x6F2D
A.3 Square root
A.3.2 Kernel 2
1 // calculates sqrt ( x ) of the value in m1 [0]
2 // for x in range 0 x0001 0 x8000 in Q1 .15
3 // returns result in unsigned Q0 .16 to vr3 .0
4
5 . main
6
7 stofloatw < sout > vr0 .0 d m1 [0]. sw
8 scopyw vr0 .3 m0 [2]. sw // copy value 1 in Q1 .15
9 nop
10 saddw vr1 .0 vr0 .0 m0 [0]. sw // add constant before polynomial evaluation
11 slsrw vr0 .4 vr0 .1 m0 [3]. sw // shift exponent
12 sandw vr0 .2 vr0 .1 m0 [3]. sw // set flag if exponent is odd number
13 2* nop
14 scopyw . ne vr0 .3 m0 [1]. sw // overwrite the value 1 with the value sqrt (0.5)
15 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
16 10* nop
17 slsrw vr1 .1 vr0 .3 vr0 .4 // shift result by floor ( exponent / 2)
18 2* nop
19
20 // multiply result with 1 ( even exponent ) or sqrt (0.5) ( odd exponent )
21 smulww < scale =15 , uu , rnd > vr3 .0 vr1 .1 vr2 .0
22
23 5* nop
24 stop
25
26 . m0
27 0 xa000 0 x5a82 0 x8000 0 x0001
28
29 . m1
30 0
31
32
33
34 // constants in Q1 .15 sqrt ( x +0.75)
35 . cm
36 0 x6EDB 0 x49E7 0 xE75D 0 x105F 0 xF26D 0 x0E65 0 xF0E8 0
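Kernel 2 handles the exponent in two halves: with x = m * 2^-e and m in [0.5, 1), sqrt(x) = sqrt(m) * 2^(-e/2), so the result is shifted right by floor(e/2) and, when e is odd, additionally multiplied by sqrt(0.5) (the constant 0x5a82 in m0[1]). A floating-point sketch of the same decomposition (a stand-in for stofloatw and the polynomial, not the Q1.15 arithmetic):

```python
import math

def sqrt_via_parity(x):
    # Normalize x = m * 2**-e with m in [0.5, 1), as stofloatw would.
    e = 0
    m = x
    while m < 0.5:
        m *= 2.0
        e += 1
    s = math.sqrt(m)               # the kernel uses a polynomial here
    if e & 1:                      # odd exponent: fold in sqrt(0.5)
        s *= math.sqrt(0.5)
    return s * 2.0 ** (-(e >> 1))  # shift right by floor(e / 2)

for x in (0.8125, 0.25, 0.0103, 0.5):
    assert abs(sqrt_via_parity(x) - math.sqrt(x)) < 1e-12
```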
A.3.3 Kernel 3
1 // calculates sqrt ( x ) of the value in m1 [0]
2 // for x in range 0 x0001 0 x8000 in Q1 .15
3 // returns result in Q2 .14 to vr2 .0
4
5 . main
6
7 stofloatw < sout > vr0 .0 d m1 [0]. sw // convert to soft float
8 2* nop
9 saddw vr1 .0 vr0 .0 m0 [0]. sw // add -0.75
10 3* nop
11 scopyw vsr1 .4 vr0 .1 // copy exponent to address register to use as offset
12
13 // evaluate polynomial
14 polyw < start =0 , scale =15 , scale2 =15 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
15 9* nop
16 smulww < scale =15 , uu , rnd > vr3 .0 vr2 .0 m1 [ ar1 +8]. sw // scale output
17 8* nop
18 stop
19
20
21 . m0
22 0 x8400
23
24 . m1
25 0 0 0 0 0 0 0 0
26 0 x8000 0 x5a82 0 x4000 0 x2d41 0 x2000 0 x16a1 0 x1000 0 x0b50
27 0 x0800 0 x05a8 0 x0400 0 x02d4 0 x0200 0 x016a 0 x0100 0 x00b5
28 // m1 [8] - m1 [23] contains a table of sqrt (0.5) ^ e for various e .
29
30
31 // degree 6
32 0 x6EDA 0 x49E7 0 xE75D 0 x105F 0 xF26D 0 x0E65 0 xF0E8
33
34
35 // constants in Q1 .14 sqrt ( x +0.75)
36 . cm
37 0 x7DFD 0 x4106 0 xEF5B 0 x0A3E 0 x0107 0 x0EA6 0 0
38 0 x6EDB 0 x49E7 0 xE75D 0 x105F 0 xF26D 0 x0E65 0 xF0E8 0
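Kernel 3 replaces the parity trick with a direct table lookup: m1[8]..m1[23] hold sqrt(0.5)^e for e = 0..15, and the exponent is copied into the address register to pick the scaling factor in a single multiply. The table is reproducible under the assumption that the entries are unsigned with 15 fractional bits (0x8000 = 1.0):

```python
# Regenerate the sqrt(0.5)^e scaling table stored in m1[8]..m1[23],
# assuming unsigned entries with 15 fractional bits (0x8000 = 1.0).
table = [round(0.5 ** (e / 2) * 2 ** 15) for e in range(16)]
assert table[0] == 0x8000   # sqrt(0.5)^0  = 1.0
assert table[1] == 0x5A82   # sqrt(0.5)^1  ~ 0.70711
assert table[7] == 0x0B50   # sqrt(0.5)^7
assert table[15] == 0x00B5  # sqrt(0.5)^15
```

Spending 16 table words removes the conditional sqrt(0.5) multiply of kernel 2.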
A.3.4 Kernel 4a
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q0 .32
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is approx 2^ -26.58
6 // ( tested with approx 29 thousand values in the entire range )
7 // 57 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18 scopyd . ne vr0 .2 d m0 [8]. sd
19 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
20 12* nop
21
22 // iteration
23 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
24 5* nop
25 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
26 2* nop
27 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
28 5* nop
29 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
30 5* nop
31
32 // iteration 2
33 // smuldd < scale =31 , uu , rnd > vr5 .0 d vr4 .3 d vr1 .2 d // R * X
34 // 5* nop
35 // smuldd < scale =32 , uu , rnd > vr5 .1 d vr4 .3 d vr5 .0 d // R * R * X /2
36 // 2* nop
37 // ssubd vr5 .2 d m0 [4]. sd vr5 .1 d
38 // 5* nop
39 // smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d
40 // 5* nop
41
42 // multiply with 1 if exponent is even or sqrt (2) if it is odd
43 // smuldd < scale =31 , uu , rnd > vr3 .0 d vr4 .3 d vr0 .2 d
44 // 5* nop
45 smuldd < scale =31 , uu , rnd > vr3 .0 d vr4 .3 d vr1 .2 d // x * x ^ -0.5 = x ^0.5
46 5* nop
47 smuldd < scale =30 , uu , rnd > vr3 .1 d vr3 .0 d vr0 .2 d
48 5* nop
49 stop
50
51 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
52 . m0
53 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
54 0 x5A82 0 x799A 0 x8000 0 x0000
55
56
57 . m1
58 0
59
60
61 // degree 6
62 // 0 x446B 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
63
64
65 // constants in Q2 .14 invsqrt ( x +0.875)
66 . cm
67 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
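The iteration in kernel 4a is one Newton-Raphson step for 1/sqrt(x): with residual T1 = X*R*R/2 and correction T2 = 1.5 - T1, the update R' = R*T2 roughly doubles the number of correct bits of the polynomial seed. A floating-point sketch (the seed value is chosen for illustration):

```python
def nr_invsqrt_step(r, x):
    # One Newton-Raphson step for 1/sqrt(x), matching the kernel comments:
    # T1 = R*R*X/2,  T2 = 1.5 - T1,  R' = R*T2
    t1 = 0.5 * x * r * r
    t2 = 1.5 - t1
    return r * t2

# Illustrative seed: 1/sqrt(0.875) ~ 1.06904; start ~2 digits off.
r = nr_invsqrt_step(1.07, 0.875)
assert abs(r - 0.875 ** -0.5) < 1e-4
```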
A.3.5 Kernel 4b
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -29.68
6 // ( tested with approx 29 thousand values in the entire range )
7 // 72 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
15 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
16 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
17 2* nop
18 scopyd . ne vr0 .2 d m0 [8]. sd
19 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
20 12* nop
21
22 // iteration
23 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R0 * X
24 5* nop
25 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // T1 = R0 * R0 * X /2
26 2* nop
27 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d // T2 = 1.5 - T1
28 5* nop
29 // iteration 2
30 smuldd < scale =31 , uu , rnd > vr6 .0 d vr4 .2 d vr4 .2 d // TT1 = T2 * T2 * T1
31
32 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d // R1 = R0 * T2
33 4* nop
34
35 smuldd < scale =31 , uu , rnd > vr6 .1 d vr6 .0 d vr4 .1 d // TT1 * T1
36
37
38 2* nop
39 ssubd vr5 .2 d m0 [4]. sd vr6 .1 d // TT2 = 1.5 - TT1
40 5* nop
41 smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d // R2 = R1 * TT2
42 5* nop
43
44 smuldd < scale =30 , uu , rnd > vr3 .0 d vr5 .3 d vr1 .2 d // x * x ^ -0.5 = x ^0.5
45 5* nop
46 // multiply with 1 if exponent is even , 2^ -0.5 if odd .
47 smuldd < scale =31 , uu , rnd > vr3 .1 d vr3 .0 d vr0 .2 d
48 5* nop
49 stop
50
51 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
52 . m0
53 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
54 0 x5A82 0 x799A 0 x8000 0 x0000
55
56
57
58 . m1
59 0
60
61 // degree 6
62 // 0 x446B 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
63
64
65 // constants in Q2 .14 invsqrt ( x +0.875)
66 . cm
67 0 x446C 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
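Kernel 4b's second iteration reuses the intermediates of the first instead of multiplying by X again: since R1 = R0*T2, the next residual is X*R1*R1/2 = T1*T2^2, which is the TT1 = T2*T2*T1 product in the code. A sketch checking that this shortcut agrees with the direct second step (input values chosen for illustration):

```python
def two_step_invsqrt(r0, x):
    # First Newton-Raphson step (as in kernel 4a):
    t1 = 0.5 * x * r0 * r0   # T1 = R0*R0*X/2
    t2 = 1.5 - t1            # T2 = 1.5 - T1
    r1 = r0 * t2             # R1 = R0*T2
    # Second step, reusing T1 and T2 instead of multiplying by X again:
    tt1 = t2 * t2 * t1       # TT1 = X*R1*R1/2, because R1 = R0*T2
    tt2 = 1.5 - tt1          # TT2 = 1.5 - TT1
    return r1 * tt2          # R2 = R1*TT2

x, r0 = 0.9, 1.05
r2 = two_step_invsqrt(r0, x)
# Same (up to rounding) as recomputing the residual from X directly:
r1 = r0 * (1.5 - 0.5 * x * r0 * r0)
r2_direct = r1 * (1.5 - 0.5 * x * r1 * r1)
assert abs(r2 - r2_direct) < 1e-12
assert abs(r2 - x ** -0.5) < 1e-6
```

Skipping the multiply by X is what lets the kernel start the second iteration before the first has fully finished.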
A.3.6 Kernel 4c
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max relative error is 2^ -30.1
6 // ( tested with approx 29 thousand values in the entire range )
7 // 78 cycles
8
9 . main
10
11 stofloat32d < sout > vr1h m1 [0]. sd
12 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
13 nop
14 saddw vr1 .0 vr1 .4 m0 [0]. sw
A.3.7 Kernel 5
1 // calculates invsqrt ( x ) of the value in m1 [0]. sw
2 // for x in range 0 x00000001 0 x80000000 in Q1 .31
3 // returns result in soft floatingpoint , mantissa in vr3 .0
4 // exponent in vr1 .7
5 // max error 2 ULP (2.47 ULP )
6 // 51 cycles
7
8 . main
9
10 stofloat32d < sout > vr1h m1 [0]. sd
11 scopyw vr0 .4 m0 [10]. sw // copy value 1 in Q1 .15
12 nop
13 saddw vr1 .0 vr1 .4 m0 [0]. sw
14 slsrw vr1 .7 vr1 .6 m0 [3]. sw // shift exponent
15 sandw vr0 .2 vr1 .6 m0 [3]. sw // set flag if exponent is odd number
16 2* nop
17 scopyd . ne vr0 .2 d m0 [8]. sd
18 polyw < start =0 , scale =15 , scale2 =14 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
19 12* nop
20
21 // iteration
22 smulwd < scale =31 , uu , rnd > vr4 .0 d vr2 .0 vr1 .2 d // R * X
23 5* nop
24 smulwd < scale =32 , uu , rnd > vr4 .1 d vr2 .0 vr4 .0 d // R * R * X /2
25 2* nop
26 ssubd vr4 .2 d m0 [4]. sd vr4 .1 d
27 5* nop
28 smulwd < scale =31 , uu , rnd > vr4 .3 d vr2 .0 vr4 .2 d
29 5* nop
30
31 // iteration 2
32 smuldd < scale =31 , uu , rnd > vr5 .0 d vr4 .3 d vr1 .2 d // R * X
33 5* nop
34 smuldd < scale =32 , uu , rnd > vr5 .1 d vr4 .3 d vr5 .0 d // R * R * X /2
35 2* nop
36 ssubd vr5 .2 d m0 [4]. sd vr5 .1 d
37 5* nop
38 smuldd < scale =31 , uu , rnd > vr5 .3 d vr4 .3 d vr5 .2 d
39 5* nop
40
41 // multiply with 1 if exponent is even or sqrt (2) if it is odd
42 // smuldd < scale =31 , uu , rnd > vr3 .0 d vr4 .3 d vr0 .2 d
43 // 5* nop
44 smuldd < scale =31 , uu , rnd > vr3 .0 d vr5 .3 d vr1 .2 d // x * x ^ -0.5 = x ^0.5
45 5* nop
46 // multiply with 1 if exponent is even , 2^ -0.5 if odd .
47 smuldd < scale =30 , uu , rnd > vr3 .1 d vr3 .0 d vr0 .2 d
48 5* nop
49 slsrd < rnd > vr3 .2 d vr3 .1 d vr1 .7
50 2* nop
51 stop
52
53 // scalars are in Q8 .24 in m0 [8]. sd - m0 [38]. sd
54 . m0
55 0 x9000 0 xB505 0 x8000 0 x0001 0 xc000 0 x0000 0 0
56 0 x5A82 0 x799A 0 x8000 0 x0000
57
58
59 . m1
60 0
61
62
63 // degree 6
64 // 0 x446B 0 xD8E8 0 x2177 0 xDF73 0 x2159 0 xFD25 0 x75B5 0
65
66
A.4 Logarithms
A.4.1 Kernel 1
A.4.2 Kernel 1b
A.4.3 Kernel 2
1 // calculates log_2 ( x ) of the value in m1 [0]
2 // for x in range 0 x0000 0 x8000 in Q1 .15
3 // returns result in Q5 .11 to vr3 .0
4
5 . main
6
7 stofloatw < sout > vr1 .1 d m1 [0]. sw
8 2* nop
9 saddw vr1 .0 vr1 .2 m0 [0]. sw
10 slslw vr1 .4 vr1 .3 m0 [1]. sw
11 3* nop
12 // vr2 .0 is Q5 .11 , vr1 .0 is Q1 .15 , cm [0] is Q2 .14 (15+14 -11 = 18)
13 polyw < start =0 , scale =15 , scale2 =18 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
14 12* nop
15 ssubw vr3 .0 vr2 .0 vr1 .4
16 5* nop
17 stop
18
19 . m0
20 0 x8000 0 x000b
21
22 . m1
23 0
24
25 // constants in Q2 .14 log2 ( x +1)
26
27 . cm
28 0 x0000 0 x37A7 0 xE590 0 x1CD9 0 x13D6 0 x3754 0 0
29 0 x0000 0 x5C70 0 xD416 0 x2FEA 0 x20F2 0 x5BE7 0 0
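Kernel 2 rests on the identity log2(x) = log2(m) - e for x = m * 2^-e with m in [0.5, 1): the polynomial approximates log2 of the mantissa (after the offset added from m0[0]) and the shifted exponent is subtracted at the end. A floating-point sketch of the decomposition (math.log2 stands in for the polynomial):

```python
import math

def log2_via_softfloat(x):
    # Normalize x = m * 2**-e with m in [0.5, 1), as stofloatw would,
    # then log2(x) = log2(m) - e.  The kernel evaluates log2(m) with a
    # polynomial instead of math.log2.
    e = 0
    m = x
    while m < 0.5:
        m *= 2.0
        e += 1
    return math.log2(m) - e

for x in (0.75, 0.001, 0.5):
    assert abs(log2_via_softfloat(x) - math.log2(x)) < 1e-12
```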
A.4.4 Kernel 3
1 // calculates log ( x ) ( natural log ) of the value in m1 [0]
2 // for x in range 0 x0000 0 x8000 in Q1 .15
3 // returns result in Q5 .11 to vr3 .0
4
5 . main
6
7 stofloatw < sout > vr1 .1 d m1 [0]. sw
8 2* nop
9 saddw vr1 .0 vr1 .2 m0 [0]. sw
10 smulww < scale =4 , ss , rnd > vr1 .4 vr1 .3 m0 [2]. sw // multiply exponent with 1/ log_2 ( e )
11 3* nop
12 polyw < start =0 , scale =15 , scale2 =18 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
13 12* nop
14 ssubw vr3 .0 vr2 .0 vr1 .4
15 5* nop
16 stop
17
18 . m0
19 0 x8000 0 x000b 0 x58B9
20
21 . m1
22 0
23
24
25
26 // constants in Q2 .14 log ( x +1)
27
28 . cm
29 0 x0000 0 x4013 0 xE190 0 x2136 0 x16D6 0 x3FB3 0 0
30 0 x8000
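Kernel 3 is the log2 pipeline rescaled to the natural logarithm: ln(x) = log2(x) * ln(2), so the exponent is multiplied by ln(2) (equivalently 1/log2(e), as the comment says) and the polynomial constants are refitted to log(x + 1). The constant in m0[2] is consistent with ln(2) in a 15-fractional-bit format:

```python
import math

# 1/log2(e) is the same number as ln(2), as the kernel comment notes,
# and ln(2) rounded to 15 fractional bits matches the constant 0x58B9.
assert abs(1 / math.log2(math.e) - math.log(2)) < 1e-15
assert round(math.log(2) * 2 ** 15) == 0x58B9
```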
A.4.5 Kernel 4
1 // calculates log10 ( x ) of the value in m1 [0]
2 // for x in range 0 x0001 0 x8000 in Q1 .15
3 // returns result in Q4 .12 to vr3 .0
4
5 . main
6
7 stofloatw < sout > vr1 .1 d m1 [0]. sw
8 2* nop
9 saddw vr1 .0 vr1 .2 m0 [0]. sw
10 smulww < scale =5 , su , rnd > vr1 .4 vr1 .3 m0 [2]. sw // multiply exponent with 1/ log_2 (10)
11 3* nop
12 polyw < start =0 , scale =15 , scale2 =19 , sign = ss , rnd1 , rnd2 > vr2 .0 vr1 .0 cm [0]
13 12* nop
14 ssubw vr3 .0 vr2 .0 vr1 .4
15 5* nop
16 stop
17
18 . m0
19 0 x8000 0 x000b 0 x9A21
20
21 . m1
22 0
23
24
25 // constants in signed Q0 .16 log10 ( x +1)
26 . cm
27 0 x0000 0 x6F4F 0 xCB1F 0 x39B2 0 x27AB 0 x6EA9 0 0
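Kernel 4 applies the same change of base to base 10: log10(x) = log10(m) - e * log10(2), with log10(2) = 1/log2(10) ~ 0.30103 as the exponent scale factor and the polynomial refitted to log10(x + 1). A sketch of the decomposition (math.log10 stands in for the polynomial):

```python
import math

def log10_via_softfloat(x):
    # x = m * 2**-e with m in [0.5, 1):
    # log10(x) = log10(m) - e * log10(2)
    e = 0
    m = x
    while m < 0.5:
        m *= 2.0
        e += 1
    return math.log10(m) - e * math.log10(2)

for x in (0.6, 0.01, 0.5):
    assert abs(log10_via_softfloat(x) - math.log10(x)) < 1e-12
```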
A.4.6 Kernel 5
1 // calculates log_2 ( x ) of the value in m1 [0]. sd
2 // for x in range 0 x00000001 to 0 x7fffffff in Q1 .31
3 // uses an 11th - degree polynomial
4 // returns result in soft float format
5 // mantissa in vr4 .3 d and exponent in vr3 .6 in signed Q1 .31
6 //
7 // 37 cycles
8
9
10
11
12 . main
13
14 stofloat32d < sout > vr3h m1 [0]. sd
15 2* nop
16 saddd vr3 .0 d vr3 .2 d m0 [0]. sd
17 tmacdo < scale =28 , rnd , ss > vr5 .0 d m0 [8]. vd cm [3] // dc term to accumulator
18
19
20 4* nop
21 powersd < scale =31 , start =1 , s , rnd > vr0 vr3 .0 d
22 7* nop
23 svmuldd < scale =31 , rnd , ss > vr1 vr0 .3 d vr0
24 tmacdo < scale =29 , rnd , ss > vr4 .0 d vr0 cm [0]
25 4* nop
26
27 svmuldd < scale =31 , rnd , ss > vr2 vr1 .3 d vr0
28 tmacdo < scale =29 , rnd , ss > vr4 .1 d vr1 cm [1]
29 3* nop
30 tmacdo < scale =29 , rnd , ss > vr4 .3 d vr2 cm [2]
31 6* nop
32
33 stop
34
35
36 . m0
37 0 x9800 0 x0 0 0 0 0 0 0
38 0 x4000 0
39
40
41 . cm
42 0 x38d1 0 xead0 0 xdd08 0 xaae6 0 x1cb0 0 xb0f0 0 xe584 0 xb35
43 0 x1a12 0 x7262 0 xe552 0 x2a11 0 x1c75 0 x8219 0 xdfd5 0 x82fa
44 0 x19b7 0 x7a30 0 xea6c 0 xbe87 0 x7fff 0 xffff 0 x0 0 x0
45 0 xf66a 0 x8e 0 0 0 0 0 0
The following table lists the instructions that were supported by the simulator used for the implementations in this thesis. The list is included here to aid the reading of the kernel source codes.
Since both the simulator and the instruction set architecture are still under development, not all instructions of the architecture have been implemented, and some of the instructions in the simulator are outdated or have been changed. The instruction set architecture is documented in the Sleipnir Instruction Set Manual [2].
Some of the instructions were added to the simulator for the purposes of this thesis work, and some are experimental and will not be part of the final instruction set. The table might contain errors.
Simulator instruction set