Hardware Acceleration of ECC

Magnus Hirth
Hardware Acceleration of Asymmetric Elliptic Curve Cryptography
Master’s thesis in Electronics Systems Design and Innovation

Supervisor: Per Gunnar Kjeldsberg
July 2019
NTNU
Norwegian University of Science and Technology
Faculty of Information Technology and Electrical
Engineering
Department of Electronic Systems
Asymmetric cryptography, also known as public-key cryptography, provides
algorithms for encryption and decryption of data, digital signatures and
authentication. Compared with traditional asymmetric techniques, e.g. the RSA
algorithm, elliptic curve cryptography (ECC) achieves an equivalent level of
security with smaller key sizes, resulting in memory as well as bandwidth savings.
Computationally intensive operations like scalar multiplication on elliptic curves
are required during the processing of ECC protocols. Using dedicated hardware
units for these operations improves execution time in an energy-efficient manner.
Most implementations are based on high-end CPUs and GPUs, and their use in
mobile devices with limited power resources, such as smartcards, is untested.
This assignment is a continuation of an autumn project focusing on a theoretical
and practical study of ECC, including experiments and profiling using Python and
C-based code versions. Based on the results from these profiling experiments, this
master's thesis work will test the hypothesis that a hardware-accelerated ECC
implementation, where the entire scalar multiplication operation is optimized to
minimize memory transfers, leads to a more energy-efficient yet generic
implementation.
Abstract
With the great number of mobile, battery-powered devices and IoT devices
being developed, there is a need for efficient, low-energy cryptography.
Elliptic curve cryptography (ECC) provides high security with small key
sizes, and seems very well suited for use in embedded, low-power systems.
The mathematics of ECC is based on set theory, performing operations
on elliptic curves, usually over finite prime fields or binary fields. The
security of these mathematical operations is based on the Elliptic Curve
Discrete Logarithm Problem.
This thesis has explored how to design a coprocessor for accelerating
elliptic curve cryptography, based on the results from a pre-study. The
coprocessor designed in the thesis, ECCo, was designed for use with the ARM
CM33 processor. The CM33 provides a coprocessor interface for tight
integration of coprocessors, which allows instructions to be issued to connected
coprocessors from software. This motivated the design of an instruction set
for the coprocessor.
For the design in this thesis the operations of modular addition, modular
multiplication and integer division were implemented. The design used
for testing consisted of a controller, a register bank and an arithmetic module.
A pure software implementation of elliptic curve cryptography, libecc, was
compared to the ECCo. Results showed that the hardware-accelerated design
performed 3.8x to 27x better than the pure software implementation.
Area estimates of the design were acquired through synthesis, using
Questasim. The ECCo accounted for 45% of the area when synthesizing ECCo+CM33.
The estimates showed that the ECCo area consumption was largely dominated
by the divider (73.18% of the total ECCo area), which was implemented
using the SystemVerilog division operator, "/", with no optimization in
synthesis. In contrast, the atomic operations of ECC, modular multiplication and
modular addition, only occupied 1.97% and 1.92%, respectively.
Preface
This thesis is a continuation of an autumn project which explored how a
hardware accelerator for elliptic curve cryptography should be implemented
in order to address the shortcomings of elliptic curve cryptography in
software. Part of the theory is reused from the project. The project will from
now on be referred to as the pre-study.
Contents
Abstract
Preface
1 Introduction
1.1 Asymmetric Cryptography
1.2 Objective and Approach
1.3 Main Contributions
1.4 Structure
2 Background
2.1 Set theory
2.1.1 Finite Field Arithmetic
2.2 Elliptic Curves
2.2.1 EC over $\mathbb{F}_p$
2.2.2 EC over $\mathbb{F}_{2^k}$
2.2.3 Point Arithmetics
2.3 Scalar Multiplication
2.4 Coordinate Systems
2.5 ECC Algorithms
2.6 Tools
2.7 ARM Cortex M33
2.8 Hardware Acceleration
2.9 libecc
2.10 Python
3 Previous Work
3.1 Modular Addition Implementation
3.2 Modular Multiplication Implementation
3.3 FPGA Elliptic Curve Coprocessor
3.4 Pre-Study
4 Methodology and Architecture Design
4.1 ECCo Design
4.2 Choice of Algorithms
4.3 Interpretation of Algorithms
4.4 Test Data
4.5 Verification
4.6 Internal Interfaces
4.7 Area Measurement
5 Implementation
5.1 ECCo Instruction Set
5.2 ECCo Architecture
5.3 Internal Interfaces
5.4 Register Bank
5.5 Arithmetic Module
5.5.1 Negation
5.5.2 Integer Division
5.5.3 Modular Addition
5.5.4 Modular Multiplication
5.5.5 Test Data
5.5.6 Verification - Arithmetic Module
5.6 Controller Module
5.6.1 Verification - Controller Module
5.7 Verification - ECCo
5.8 Software
5.8.1 ECCo Wrapper
5.8.2 Big Number library
5.8.3 Benchmark Software
6 Results
6.1 Speed
6.2 Area
7 Future Work
7.1 Instruction Set Architecture
7.2 Security
7.3 Algorithms
8 Conclusion
A Test Data Python script
B Internal Interfaces SV Code
C Test Data
D ECCo C Wrapper
E ECCo Big Number library
F Benchmark & Test program
List of Abbreviations
CM33 ARM Cortex M33

CP CoProcessor
EC Elliptic Curve
ECC Elliptic Curve Cryptography
ECCo Elliptic curve Cryptography Coprocessor
DMA Direct Memory Access
DSP Digital Signal Processor
DUT Design Under Test
FPU Floating Point Unit
FSM Finite State Machine
ISA Instruction Set Architecture
LSB Least Significant Bit
MA Modular Addition
MM Modular Multiplication
MSB Most Significant Bit
OOP Object Oriented Programming
SIMD Single Instruction Multiple Data
SM Scalar Multiplication
SV SystemVerilog
SVA SystemVerilog Assertions
TLS Transport Layer Security protocol
Chapter 1
Introduction
Today, many mobile and embedded devices are used daily, and the
number of such devices is ever increasing. Embedded devices are used in
many applications where security is a concern, be it for a company or for
personal privacy: in hospitals, smart cards (banking, SIM, access control),
mobile phones, Wi-Fi routers, etc. Many of these applications use
battery-powered devices, which in addition to security require low-power
solutions. This motivates the exploration of low-power implementations of
cryptographic algorithms. A field of cryptography which seems suited for
low-power applications is Elliptic Curve Cryptography (ECC), which was
introduced in the 1980s by Neal Koblitz [1] and Victor Miller [2]. It has gained
popularity for desktop and server use, and many of the algorithms in the
Transport Layer Security protocol 1.3 (TLS 1.3) are elliptic curve (EC) algorithms.
In this thesis a coprocessor for the ARM Cortex-M33 (CM33), intended for
accelerating ECC, is designed and tested. The work is a continuation of the
autumn project on hardware acceleration of ECC, which concluded that the
optimal use of a hardware accelerator was to perform the entire operation of
scalar multiplication (SM) in hardware. The implementation in this thesis aims
at accelerating the entire SM in hardware, taking advantage of the features
that the coprocessor interface of the CM33 provides.
In this thesis the term cryptosystem is used as defined in [3]: “A
cryptosystem is a general term referring to a set of cryptographic primitives
used to provide information security services. Most often the term is used in
conjunction with primitives providing confidentiality, i.e., encryption.”
Also, the term big numbers is used to refer to numbers with a bit length
longer than a processor's word length.
1.1 Asymmetric Cryptography

Asymmetric cryptography, also known as public key cryptography, comprises
cryptosystems which use key pairs: a public key and a private key. The private
key is only known to the owner, while the public key can be obtained by
anyone without compromising the security of the system. The private key may
be used to create a digital signature of a message, which allows anyone who
has both the public key and the message to verify that the message has not
been corrupted, or the private key may be used to decrypt a message which
has been encrypted using the public key.
The security of public key cryptosystems relies on the private key
being infeasible for an attacker to compute, although not impossible given
infinite time and resources. That is, public key cryptosystems are
computationally secure: it is considered infeasible for an attacker to compute
the private key if doing so requires ≈ $10^{100}$ instructions [4].
Another very common type of cryptosystem is symmetric cryptography,
which uses a single shared key. These systems usually require smaller
key sizes and have lower power consumption compared to public key
systems [5][6]. Because of this, symmetric cryptosystems are preferred when
encrypting large amounts of data, but since they require the key to be
shared over a secure channel it is usually not sufficient to rely solely on
symmetric key cryptography. As a possible solution to this, a public key
cryptosystem was introduced in 1976 by Whitfield Diffie and Martin E. Hellman
[4] which enables two parties to securely share a key over an insecure
channel, thus allowing secure communication through a combination of
asymmetric and symmetric cryptosystems.
This combination of symmetric and asymmetric cryptosystems is now
standard, and the TLS 1.3 [7] standard describes a set of cryptosystems to
use for secure communication over insecure channels. A number of these
systems are public key systems, and with the increasing demand for high
security without reducing the efficiency of low-power devices such as IoT
[8][9] and mobile devices [10], there is a good incentive to explore the
possibilities of accelerating public key cryptosystems.
Furthermore, TLS defines a number of elliptic curve (EC) cryptosystems
to use. EC cryptosystems are systems that use mathematics based on elliptic
curves and have traits that make them suited for use in resource-limited
environments, such as IoT devices. ECC algorithms are often considered
safer than their non-EC counterparts [1], and this safety is provided with
smaller key sizes. The benefit of smaller key sizes is that less storage is
required for the variables of the algorithm and less data needs to be
transferred between devices. An efficient implementation of ECC algorithms
could potentially benefit IoT devices by reducing power consumption while
still maintaining high security.
1.2 Objective and Approach

The objective of this thesis is to explore how to design a coprocessor for
accelerating elliptic curve cryptography, based on the conclusion of the
pre-study [11]. This thesis describes how such a coprocessor could be
implemented, and implements as much of the proposed design as possible. The
implemented design should be benchmarked and compared to the performance
of a pure software implementation, to show what benefits a coprocessor
could provide.
The design approach is to consider multiple possible designs before
choosing one that is appropriate for the setup used in this thesis. All modules
should be tested separately during the development process, using test data
generated by software scripts so that the test data is reliable.
1.3 Main Contributions

The main contribution of this thesis is the design of a flexible coprocessor
aimed at accelerating elliptic curve cryptography, with the possibility of
extending its use to non-EC asymmetric cryptography. Both the design and the
design process are detailed.
Also, for this thesis a generic modular addition algorithm was designed.
A C library for big numbers was implemented. The library was designed
for use with the elliptic curve coprocessor, supporting conversion to and
from string representation and loading/storing to/from coprocessor
registers.
1.4 Structure
Chapter 2 presents mathematical and other related background information
necessary for the rest of the thesis. Chapter 3 presents previous work
relevant for this thesis. Chapter 4 details the methodology and design
choices of the coprocessor. Chapter 5 describes the implementation details
of the design, and Chapter 6 presents the results of the thesis. Finally,
Chapter 7 discusses thoughts on future work on the coprocessor, and Chapter 8
concludes the report.
Chapter 2
Background
This thesis is mainly concerned with elliptic curve cryptography, which
comprises cryptosystems that use mathematical operations on elliptic curves
over finite fields. To give the reader a better understanding of these subjects,
this chapter gives a brief introduction to the mathematical field of set
theory, focusing on finite fields, and explains the fundamentals of elliptic
curves and the related arithmetic operations on elliptic curves.
Further, this chapter describes algorithms for implementation of modular
arithmetic and elliptic curve operations in hardware, which are used later
in the implementation of the coprocessor. Lastly, this chapter briefly
describes the tools used.
2.1 Set theory

A set is (informally) a collection of objects (or elements). Sets are classified
according to their mathematical properties. In this report the sets of interest
are the finite fields, also called Galois fields, denoted by $GF(q)$ or
$\mathbb{F}_q$. Finite fields are, without going into details, sets with a finite
number, $q$, of elements, where $q = p^k$ ($p$ is prime and $k > 0$), on which the
multiplication, addition, subtraction and division operations are defined
[12, p.310]. In this thesis we are only interested in finite fields of integers,
and, in particular, finite fields $\mathbb{F}_q$ containing all integers from 0 up
to, but not including, $q$. For the rest of the thesis all fields will be assumed
to be of this kind. These fields can be constructed with the modulo operator:
for $x = y \bmod q$, where $y$ can be any integer, $x$ will always be in the range
$0 \le x < q$. A simple example of such a finite field is $\mathbb{F}_7$, shown in
Equation 2.1. It is a field with 7 elements, and can be constructed with modulo 7.
$$\mathbb{F}_7 = \{0, 1, 2, 3, 4, 5, 6\} \tag{2.1}$$
If there exists a positive integer $n$ such that $n \cdot a = 0$ for all
$a \in \mathbb{F}$, then the smallest such number is called the characteristic of
$\mathbb{F}$. If no such number exists, the characteristic of $\mathbb{F}$ is said
to be zero [12, p.170]. In our example of $\mathbb{F}_7$ the characteristic is 7,
since $7 \cdot a \equiv 0 \pmod 7$ for all $a \in \mathbb{F}_7$. The characteristic
of any finite field $GF(p^k)$ is $p$ [12, p.311]. The size of a field, $q$, is also
called the order of the field.
Of particular interest when working with elliptic curves are finite fields
where $q = p^1$, prime fields, and finite fields where $q = 2^k$, binary fields.
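As a minimal sketch of the construction above, assuming $q$ is prime so that the integers modulo $q$ form a field, the following Python snippet builds $\mathbb{F}_7$ with the modulo operator and finds its characteristic by brute force. The function names are illustrative and are not taken from the thesis code.

```python
def field_elements(q):
    """Return the elements of the integer field F_q (q is assumed prime)."""
    # Reducing any range of integers modulo q always lands in 0..q-1.
    return sorted({y % q for y in range(-3 * q, 3 * q)})


def characteristic(q):
    """Smallest n > 0 such that n * a == 0 (mod q) for every element a."""
    n = 1
    while any((n * a) % q != 0 for a in range(q)):
        n += 1
    return n


print(field_elements(7))   # [0, 1, 2, 3, 4, 5, 6]
print(characteristic(7))   # 7
```

The brute-force search is only meant to mirror the definition; for a prime field the characteristic is simply $p$.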
2.1.1 Finite Field Arithmetic

For this report we are only concerned with finite fields, which implies that
all arithmetic operations on field elements are, in fact, modular arithmetic
operations.
The reader is assumed to have basic knowledge of modular arithmetic,
but examples of the basic operations on $\mathbb{F}_7$ are given in Equations
2.2-2.5.
$$4 + 6 = 3 \tag{2.2}$$
$$1 - 5 = 3 \tag{2.3}$$
$$2 \cdot 5 = 3 \tag{2.4}$$
$$5 \cdot 4^{-1} = 3 \tag{2.5}$$
Equations 2.2, 2.4 and 2.5 evaluate to 3 since $10 \equiv 3 \pmod 7$, and
Equation 2.3 evaluates to 3 since $-4 \equiv 3 \pmod 7$. Equation 2.5 is an
example of modular division, which is the most complicated of the four
operations. In order to perform modular division one needs to find the modular
inverse of the divisor, which is why modular division is often written as in
Equation 2.5, avoiding the division operator, "/", to prevent confusion with
integer division [13].
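The four operations in Equations 2.2-2.5 can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the inverse is found by exhaustive search, which is practical only for a toy field such as $\mathbb{F}_7$.

```python
Q = 7  # field size of the toy field F_7


def f_add(a, b):
    return (a + b) % Q


def f_sub(a, b):
    return (a - b) % Q


def f_mul(a, b):
    return (a * b) % Q


def f_inv(a):
    """Modular inverse by exhaustive search (fine for a 7-element field)."""
    for x in range(1, Q):
        if (a * x) % Q == 1:
            return x
    raise ZeroDivisionError("element has no inverse")


def f_div(a, b):
    # Division is multiplication by the modular inverse of the divisor.
    return f_mul(a, f_inv(b))


print(f_add(4, 6), f_sub(1, 5), f_mul(2, 5), f_div(5, 4))  # 3 3 3 3
```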
To find the modular inverse of a field element the Extended Euclidean
Algorithm is used [14]. It is an extension of the Euclidean Algorithm, which
is an algorithm for finding the greatest common divisor of two numbers, $a$
and $b$ [15]. The extended algorithm can further be used to find two numbers,
$x$ and $y$, such that:
$$ax + by = \gcd(a, b) \tag{2.6}$$
For the level of detail needed in this report we can simply say that
$a$ and $b$ have to be coprime ($\gcd(a, b) = 1$) and assign $b = q$, the field
size. Reducing Equation 2.6 modulo $q$ removes the term $by$, which leads to
Equation 2.7.
$$ax \equiv 1 \pmod q \tag{2.7}$$
This allows us to find the inverse $x$ of element $a$ by solving for $x$
($x \in \mathbb{F}_q$). In Equation 2.5, $a = 4$ and $q = 7$, so we can find the
inverse of 4 by solving for $x$ in Equation 2.7:
$$4x \equiv 1 \pmod 7 \;\Rightarrow\; x = 2$$
Equation 2.5 can then be explained by replacing $4^{-1}$ with the modular
inverse of 4:
$$5 \cdot 2 \equiv 3 \pmod 7$$
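The procedure above can be sketched in Python as follows. This is an illustrative, iterative formulation of the Extended Euclidean Algorithm; the function names are assumptions made for this example and it is not the implementation used in the thesis.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_x, x = x, old_x - quotient * x
        old_y, y = y, old_y - quotient * y
    return old_r, old_x, old_y


def mod_inverse(a, q):
    """Solve a*x = 1 (mod q) for x (Equation 2.7); needs gcd(a, q) == 1."""
    g, x, _ = extended_gcd(a % q, q)
    if g != 1:
        raise ValueError("a and q are not coprime, so no inverse exists")
    return x % q


print(extended_gcd(4, 7))         # (1, 2, -1), i.e. 4*2 + 7*(-1) = 1
print(mod_inverse(4, 7))          # 2
print(5 * mod_inverse(4, 7) % 7)  # 3, as in Equation 2.5
```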
2.2 Elliptic Curves

Only elliptic curves over $\mathbb{F}_p$ and $\mathbb{F}_{2^m}$ are presented, as
these are the most common in ECC. Only the required conditions and a brief
explanation of arithmetic on the curves are provided; a more detailed
explanation can be found in [16]. The goal of this section is to give an
intuitive understanding of what elliptic curves are, and of the difference
between continuous and discrete elliptic curves.
2.2.1 EC over $\mathbb{F}_p$
“Let $\mathbb{F}_p$ be a prime finite field so that $p$ is an odd prime number, and
let $a, b \in \mathbb{F}_p$ satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod p$. Then an
elliptic curve $E(\mathbb{F}_p)$ over $\mathbb{F}_p$ defined by the parameters
$a, b \in \mathbb{F}_p$ consists of the set of solutions or points $P = (x, y)$ for
$x, y \in \mathbb{F}_p$ to the equation:
$$y^2 \equiv x^3 + ax + b \pmod p \tag{2.8}$$
together with an extra point $O$ called the point at infinity.” [16]
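[Figure: continuous curve with the discrete solutions (0,1), (0,6), (1,0), (3,1), (3,6), (4,1), (4,6), (5,2), (5,5), (6,3) and (6,4) marked; axis label x.]
FIGURE 2.1: Illustration of $y^2 = x^3 - 2x + 1$ with the solutions to Equation 2.8 in $\mathbb{F}_7$ plotted.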
Figure 2.1 illustrates the elliptic curve $y^2 = x^3 - 2x + 1$ for $x \in [-7, 7]$.
The continuous curve is the common way to illustrate an elliptic curve, over an
infinite field. However, in cryptography finite fields are used, in which case
there only exist discrete solutions to the elliptic curve equation, and for all
of the solutions the $x$ and $y$ values must be in $\mathbb{F}_p$.
The discrete solutions to the elliptic curve (Equation 2.8) are plotted in
Figure 2.1, and it is apparent that only the solutions (0, 1) and (1, 0) lie on
the continuous curve itself. This is because, for the other solutions, the
left-hand side or right-hand side of Equation 2.8 evaluates to a value ≥ 7, so
the two sides are only equal after reduction modulo 7.
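The discrete solutions plotted in Figure 2.1 can be reproduced by testing every pair $(x, y)$ in $\mathbb{F}_7$ against Equation 2.8. The following Python sketch does this by exhaustive search, which is only feasible for toy field sizes; the function names are illustrative assumptions.

```python
def is_nonsingular(a, b, p):
    """Check the condition 4a^3 + 27b^2 != 0 (mod p) from the definition."""
    return (4 * a**3 + 27 * b**2) % p != 0


def curve_points(a, b, p):
    """All affine points (x, y) in F_p with y^2 = x^3 + a*x + b (mod p)."""
    return [(x, y)
            for x in range(p)
            for y in range(p)
            if (y * y - (x**3 + a * x + b)) % p == 0]


# The curve y^2 = x^3 - 2x + 1 over F_7 used in Figure 2.1.
print(is_nonsingular(-2, 1, 7))  # True
print(curve_points(-2, 1, 7))
# [(0, 1), (0, 6), (1, 0), (3, 1), (3, 6), (4, 1), (4, 6),
#  (5, 2), (5, 5), (6, 3), (6, 4)]
```

Together with the point at infinity, which has no affine coordinates and is therefore not listed, this curve has 12 points.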
2.2.2 EC over $\mathbb{F}_{2^k}$

“Let $\mathbb{F}_{2^m}$ be a characteristic 2 finite field, and let
$a, b \in \mathbb{F}_{2^m}$ satisfy $b \neq 0$ in $\mathbb{F}_{2^m}$. Then an elliptic
curve $E(\mathbb{F}_{2^m})$ over $\mathbb{F}_{2^m}$ defined by the parameters
$a, b \in \mathbb{F}_{2^m}$ consists of the set of solutions or points $P = (x, y)$
for $x, y \in \mathbb{F}_{2^m}$ to the equation:
$$y^2 + xy \equiv x^3 + ax^2 + b \pmod p \tag{2.9}$$
together with an extra point $O$ called the point at infinity.” [16]
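[Figure: continuous curve with the discrete solutions (0,1), (0,6), (1,0), (1,6), (2,2), (2,3), (3,2), (4,1), (4,2), (5,1) and (6,4) marked; axis label x.]
FIGURE 2.2: Illustration of $y^2 + xy = x^3 - 2x^2 + 1$ with the solutions to Equation 2.9 in $\mathbb{F}_7$ plotted.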
Figure 2.2 illustrates the elliptic curve $y^2 + xy = x^3 - 2x^2 + 1$ for
$x \in [-7, 7]$. Here, too, the continuous curve over an infinite field is
plotted along with the discrete solutions to the elliptic curve equation.
2.2.3 Point Arithmetics

In this report the arithmetic operations of interest on elliptic curves
are point addition and point doubling. An intuitive geometric understanding
of these operations was provided by Neal Koblitz [1], as illustrated in
Figure 2.3.
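[Figure: two panels with axes x and y, one showing points P1 and P2 with their sum P3 = P1 + P2, and one showing a point P1 with its double P3 = 2P1.]
FIGURE 2.3: Illustration of elliptic curve point addition and doubling.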
Let $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$ and $P_3 = (x_3, y_3)$ be points on an
elliptic curve, where $P_3 = P_1 + P_2$. Draw a line through $P_1$ and $P_2$; their
sum $P_3$ is then the negative of the point where this line intersects the curve.
The following equations follow from the observations in Figure 2.3,
although not enough information is provided here to prove them. For a detailed
explanation see [1].
$$x_3 \equiv -x_1 - x_2 + \alpha^2 \pmod p \tag{2.10}$$
$$y_3 \equiv -y_1 + \alpha(x_1 - x_3) \pmod p \tag{2.11}$$
where
$$\alpha = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P_1 \neq P_2 \\[2ex] \dfrac{3x_1^2 + a}{2y_1} & \text{if } P_1 = P_2 \end{cases} \tag{2.12}$$
In the case of elliptic curves over $\mathbb{F}_{2^m}$, when $P_1 \neq P_2$:
$$x_3 \equiv \alpha^2 + \alpha + x_1 + x_2 + a \pmod p \tag{2.13}$$
$$y_3 \equiv \alpha(x_1 + x_3) + x_3 + y_1 \pmod p \tag{2.14}$$
$$\alpha = \frac{y_1 + y_2}{x_1 + x_2} \tag{2.15}$$
and when $P_1 = P_2$:
$$x_3 \equiv \alpha^2 + \alpha + a \pmod p \tag{2.16}$$
$$y_3 \equiv x_1^2 + (\alpha + 1) x_3 \pmod p \tag{2.17}$$
$$\alpha = x_1 + \frac{y_1}{x_1} \tag{2.18}$$
Note that all of these operations require modular inversion for the divi-
sion in the calculation of α, which is an expensive operation.
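For the prime field case, Equations 2.10-2.12 translate almost directly into code. The sketch below is an illustration on the toy curve $y^2 = x^3 - 2x + 1$ over $\mathbb{F}_7$ from Figure 2.1, not the hardware algorithm of the thesis. Two conventions are assumptions of this example: the point at infinity is represented by `None`, and the inverse is computed with Python's built-in `pow(x, -1, p)` (available from Python 3.8).

```python
P_MOD = 7          # field prime p
A = -2 % P_MOD     # curve coefficient a of y^2 = x^3 + a*x + b


def point_add(P, Q):
    """Add two affine points on the curve; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    # P + (-P) = infinity; this also covers doubling a point with y = 0.
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        # Doubling case of Equation 2.12: alpha = (3*x1^2 + a) / (2*y1)
        alpha = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        # Addition case of Equation 2.12: alpha = (y2 - y1) / (x2 - x1)
        alpha = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    alpha %= P_MOD
    x3 = (alpha * alpha - x1 - x2) % P_MOD   # Equation 2.10
    y3 = (alpha * (x1 - x3) - y1) % P_MOD    # Equation 2.11
    return (x3, y3)


print(point_add((0, 1), (1, 0)))  # (0, 6)
print(point_add((0, 1), (0, 1)))  # (1, 0)
print(point_add((0, 1), (0, 6)))  # None (point at infinity)
```

Each call performs exactly one modular inversion, which is the expensive step referred to above.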
2.3 Scalar Multiplication

The central mathematical operation in all EC cryptosystems is the scalar
multiplication, which is the multiplication of a point on an elliptic curve by
a scalar. There are multiple different algorithms for performing a scalar
multiplication. Most of these are based on the observation that any
multiplication of a point and a scalar can be expressed as a combination of
point additions and doublings, e.g. $11P = P + 2(P + 2(2P))$. There are many
optimized algorithms for this, and in many applications it is desirable to use
algorithms that have a constant execution time, for security reasons. However,
in this thesis a basic algorithm, with varying execution time, is presented.
Algorithm 1 displays the pseudocode for this algorithm, called
Double-and-add (left-to-right).
Algorithm 1 Double-and-add (left-to-right) [17]

INPUT: Base point $P \in E(\mathbb{F})$, scalar $k = (k_{t-1}, \ldots, k_0)_2$
OUTPUT: Point $Q = k \cdot P$
1: $R_0 \leftarrow \infty$; $R_1 \leftarrow P$
2: for $i$ from $t - 1$ downto 0 do
3:     $R_0 \leftarrow 2R_0$
4:     if $k_i = 1$ then
5:         $R_0 \leftarrow R_0 + R_1$
6:     end if
7: end for
8: $Q \leftarrow R_0$
In this algorithm $P$ is the base point on the curve, which is multiplied
with the scalar $k$, and $Q$ is the resulting point on the curve; $t$ is the bit
length of $k$. Algorithm 1 iterates through all the bits of $k$, starting from
the left (most significant bit). First $R_0$ is set to the point at infinity, and
$R_1$ to the base point $P$. Each iteration performs a point doubling of $R_0$
(doubling the point at infinity returns the point at infinity), and if the
current bit $k_i$ is 1, the sum of $R_0$ and $R_1$ is stored in $R_0$ (adding the
point at infinity and a point $P$ returns $P$).
This algorithm performs $t$ point doublings and, in the worst case, $t$ point
additions.
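Algorithm 1 can be sketched in Python as shown below. The example is self-contained and therefore repeats a small affine point addition routine for the toy curve $y^2 = x^3 - 2x + 1$ over $\mathbb{F}_7$; the curve, the use of `None` for the point at infinity and the function names are assumptions made for illustration only. Like Algorithm 1, the sketch does not run in constant time.

```python
P_MOD = 7          # field prime p
A = -2 % P_MOD     # curve coefficient a


def point_add(P, Q):
    """Affine point addition/doubling; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        alpha = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        alpha = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    alpha %= P_MOD
    x3 = (alpha * alpha - x1 - x2) % P_MOD
    y3 = (alpha * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def double_and_add(k, P):
    """Left-to-right double-and-add: returns Q = k * P (Algorithm 1)."""
    r0 = None                                    # R0 <- infinity
    r1 = P                                       # R1 <- P
    for i in reversed(range(k.bit_length())):    # i = t-1 downto 0
        r0 = point_add(r0, r0)                   # R0 <- 2*R0
        if (k >> i) & 1:                         # if k_i = 1
            r0 = point_add(r0, r1)               # R0 <- R0 + R1
    return r0


base = (3, 1)
eleven_p = double_and_add(11, base)
# Same result as the expansion 11P = P + 2(P + 2(2P)) from the text.
two_p = point_add(base, base)
inner = point_add(base, point_add(two_p, two_p))
print(eleven_p == point_add(base, point_add(inner, inner)))  # True
```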
2.4 Coordinate Systems

Points on elliptic curves are often represented using affine coordinates,
$(x, y)$, as we have done so far, but there are several different coordinate
systems with different attributes available. The purpose of using a different
coordinate system is usually to increase performance. Computation time is
compared between coordinate systems by counting how many inversions (I),
multiplications (M) and squarings (S) an addition or doubling operation
requires. From Equations 2.10, 2.11 and 2.12 we see that in affine coordinates
($\mathcal{A}$) the computation times are $t(\mathcal{A} + \mathcal{A}) = I + 2M + S$
and $t(2\mathcal{A}) = I + 2M + 2S$ [18].
An alternative coordinate representation often used in practice is projective
coordinates ($\mathcal{P}$). Here a point $P$ is represented by a triple
$(X, Y, Z)$, where $x = X/Z$ and $y = Y/Z$. Using projective coordinates the
computation times are $t(\mathcal{P} + \mathcal{P}) = 12M + 2S$ and
$t(2\mathcal{P}) = 7M + 5S$ [18]. The main motivation for using projective
coordinates is reduced computation time: point addition and doubling in
projective coordinates require no inversion, which is an expensive operation,
as noted in Chapter 2.2.
There are other common alternatives for coordinates, as described in [3,
p.86] and [18], but they will not be discussed here.
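The relation $x = X/Z$, $y = Y/Z$ can be illustrated with a short Python sketch over the toy field $\mathbb{F}_7$. The function names and the `scale` parameter are assumptions introduced for this example. The point of the sketch is that many triples represent the same affine point, and that converting back costs a single inversion.

```python
P_MOD = 7  # toy prime field F_7


def to_projective(point, scale=1):
    """Map affine (x, y) to projective (X, Y, Z) = (x*s, y*s, s) mod p."""
    x, y = point
    return (x * scale % P_MOD, y * scale % P_MOD, scale % P_MOD)


def to_affine(point):
    """Map projective (X, Y, Z) to affine (X/Z, Y/Z) with one inversion."""
    X, Y, Z = point
    z_inv = pow(Z, -1, P_MOD)  # the only inversion (Python 3.8+)
    return (X * z_inv % P_MOD, Y * z_inv % P_MOD)


print(to_projective((3, 1), scale=4))             # (5, 4, 4)
print(to_affine(to_projective((3, 1), scale=4)))  # (3, 1)
```

With the operation counts quoted above, a projective addition is cheaper than an affine one whenever $I + 2M + S > 12M + 2S$, that is, when one inversion costs more than $10M + S$. The corresponding condition for doubling is $I > 5M + 3S$.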
2.5 ECC Algorithms

Elliptic curve cryptography is commonly used for handshakes and digital
signatures, such as in the Transport Layer Security (TLS) protocol 1.3 [7]. To
add some perspective as to how the scalar multiplication is used in ECC, this
section outlines the Elliptic Curve Digital Signature Algorithm (ECDSA)
[19].
The two parties involved will be referred to as Alice and Bob [20], where
Alice's private and public keys are $d_A$ and $Q_A$, respectively, and Bob's are
$d_B$ and $Q_B$. For all ECC algorithms Alice and Bob have to agree on a set of
domain parameters, $D$. In the case of $\mathbb{F}_p$ these parameters are
$D = (q, a, b, G, n, h)$, where:
$q$ is the field order (the number of elements in the field, see Chapter 2.1)
$a, b$ are the elliptic curve coefficients (see Equation 2.8)
$G$ is the base point on the curve
$n$ is the order of $G$: the smallest positive number such that $n \cdot G = O$
$h$ is the cofactor, $h = \#E(\mathbb{F}_q)/n$, where $\#E(\mathbb{F}_q)$ is the number of points on the curve
For $\mathbb{F}_{2^m}$ the parameters are $D = (m, f(x), a, b, G, n, h)$, where $f(x)$
is an irreducible binary polynomial of degree $m$ specifying the representation
of $\mathbb{F}_{2^m}$.
Algorithm 2 ECDSA signature generation [19]

INPUT: Domain parameters $D$, private key $d$ and message $m$
OUTPUT: Signature $(r, s)$
1: Select $k \in [1, n - 1]$
2: Compute $kG = (x, y)$
3: Compute $r = x \bmod n$. If $r = 0$ then go to step 1
4: Compute $e = H(m)$
5: Compute $s = k^{-1}(e + dr) \bmod n$. If $s = 0$ then go to step 1
6: Return $(r, s)$
If Alice wants to send a message to Bob with a digital signature that lets
Bob verify that the message has not been corrupted during sending, she can use
ECDSA, as shown in Algorithm 2. First, a random number $k$ is multiplied with
the base point $G$, and the resulting $x$ value is used to compute $r$, one of the
two parts of the signature. Then, a hash function $H$ is used to produce a hash
of the message $m$. A hash function is a one-way function: the message is very
difficult to find for anyone who only knows the hash value. The hash, $r$ and
Alice's private key are used to produce the second part of the signature, $s$.
Algorithm 3 ECDSA signature verification [19]

INPUT: Domain parameters $D$, public key $Q$, message $m$ and signature $(r, s)$
OUTPUT: Acceptance or rejection of the signature
1: Verify that $r$ and $s$ are integers in the interval $[1, n - 1]$. If any
   verification fails then return (“Reject the signature”)
2: Compute $e = H(m)$
3: Compute $w = s^{-1} \bmod n$
4: Compute $u_1 = ew \bmod n$ and $u_2 = rw \bmod n$
5: Compute $X = u_1 G + u_2 Q$
6: If $X = \infty$ then reject the signature
7: Convert the x-coordinate of $X$ to an integer $x$ and compute $v = x \bmod n$
8: If $v = r$ then accept the signature, otherwise reject it
When Bob receives the message and the signature from Alice he can
use Algorithm 3 to verify that the message has not been corrupted during
sending, and be sure that it is exactly the message that Alice sent. The proof
of the verification is out of scope for this thesis, but note that the
verification requires two scalar multiplications.
Relating to the TLS 1.3 [7] standard: ECDH [4][21] is often used to share
a symmetric key between Alice and Bob, along with an ECDSA signature
which verifies that the symmetric key has not been corrupted during
transmission.
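To make Algorithms 2 and 3 concrete, the following Python sketch signs and verifies on a toy curve. Everything specific in it is an assumption for illustration and is not taken from the thesis: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point $G = (5, 1)$ of prime order $n = 19$ (a common textbook example), SHA-256 reduced modulo $n$ as the hash $H$, and a deterministic search for $k$ instead of a random choice. Such parameters offer no security whatsoever.

```python
import hashlib

# Toy domain parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
P_MOD, A = 17, 2
G, N = (5, 1), 19          # base point and its (prime) order


def point_add(P, Q):
    """Affine point addition; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        alpha = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        alpha = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    alpha %= P_MOD
    x3 = (alpha * alpha - x1 - x2) % P_MOD
    return (x3, (alpha * (x1 - x3) - y1) % P_MOD)


def scalar_mult(k, P):
    """Double-and-add scalar multiplication (Algorithm 1)."""
    r0 = None
    for i in reversed(range(k.bit_length())):
        r0 = point_add(r0, r0)
        if (k >> i) & 1:
            r0 = point_add(r0, P)
    return r0


def hash_to_int(message):
    # Simplification: real ECDSA truncates the hash to the bit length of n.
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(e, d, k):
    """Algorithm 2 for a given hash e and nonce k; None means 'pick a new k'."""
    x, _ = scalar_mult(k, G)
    r = x % N
    if r == 0:
        return None
    s = pow(k, -1, N) * (e + d * r) % N
    return None if s == 0 else (r, s)


def verify(e, Q, signature):
    """Algorithm 3: True if the signature is accepted."""
    r, s = signature
    if not (1 <= r < N and 1 <= s < N):
        return False
    w = pow(s, -1, N)
    u1, u2 = e * w % N, r * w % N
    X = point_add(scalar_mult(u1, G), scalar_mult(u2, Q))  # two scalar mults
    return X is not None and X[0] % N == r


d = 7                                  # Alice's private key (toy value)
Q = scalar_mult(d, G)                  # Alice's public key
e = hash_to_int(b"hello Bob")
# A real signer draws k at random; a fixed search keeps the sketch repeatable.
sig = next(s for s in (sign(e, d, k) for k in range(1, N)) if s is not None)
print(verify(e, Q, sig))               # True
print(sign(5, 7, 10))                  # (7, 13)
```

Signing needs one scalar multiplication ($kG$), while verification needs two ($u_1 G$ and $u_2 Q$), which is why scalar multiplication dominates the cost of both operations.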
2.6 Tools
For simulation and synthesis the tool Questasim [22] is used. Questasim is
developed by Mentor [23]. It is a high-performance tool supporting
simulation, debugging and functional coverage for hardware description
languages such as VHDL [24], Verilog [25] and SystemVerilog [26], including
SystemVerilog's object-oriented features and SVA.
2.7 ARM Cortex M33

The Cortex-M33 [27] (CM33) is a processor developed by ARM [28]. It uses
the ARMv8-M [29] instruction set architecture and is developed for
embedded applications, allowing low power consumption while still providing
efficient security and debug capabilities. It contains features such as an FPU
and DSP with SIMD instructions.
The CM33 also features a coprocessor interface, which allows for tight
integration of coprocessors and accelerators with the CM33. The
coprocessors are accessible from software using assembly instructions provided
in the ARMv8-M instruction set [29]:
CDP, CDP2      Coprocessor data processing instructions.
MCR, MCR2      32-bit data transfer to the coprocessor.
MRC, MRC2      32-bit data transfer to the CM33.
MCRR, MCRR2    64-bit data transfer to the coprocessor.
MRRC, MRRC2    64-bit data transfer to the CM33.
2.8 Hardware Acceleration

Hardware acceleration is commonly known as a method to speed up
calculations by using specialized hardware, designed for a specific task, which
often supplements a general-purpose CPU [30]. A very common application
of hardware acceleration is graphics processing units (GPUs), which are
used in virtually every desktop computer. Other areas where hardware
acceleration is common are AI and neural networks and, relevant to this
thesis, cryptography. The security of cryptosystems is based on mathematics
which often requires heavy computations, which usually can benefit greatly
from dedicated hardware.
2.9 libecc
libecc [31] is a library implementing EC mathematics hierarchically, as
illustrated in Figure 2.4. The library provides separate modules for natural
number arithmetic, field arithmetic (Chapter 2.1), elliptic curve operations
(Chapter 2.2), hardcoded values for curves, and an implementation of the
ECDSA algorithm (Chapter 2.5). Also, as seen in Figure 2.4, it provides
implementations of some required hash functions, self tests and some utilities,
which will not be described here (see [31] for details).
+-------------------------+
|EC*DSA signature         |
|algorithms               | <------------------+
|(ISO 14888-3)            |                    |
+-----------+-------------+                    |
            ^                                  |
            |                                  |
+-----------+-------------+         +----------+------------+
|Curves (SECP, Brainpool, |         |         Hash          |
|FRP, ...)                |         |       functions       |
|                         |         |                       |
+-----------+-------------+         +-----------------------+
            ^                       @@@@@@@@@@@@@@@@@@@@@@@@@@@@
            |                       @{Useful auxiliary modules}@
+-----------+-------------+         @+------------------------+@
| Elliptic curves         |         @|         Utils          |@
| core (scalar mul, ...)  |         @+------------------------+@
+-----------+-------------+         @|     Sig Self tests     |@
            ^                       @|    Arith Self tests    |@
            |                       @|     User Examples      |@
            |                       @+------------------------+@
            |                       @|     External deps      |@
+-----------+-------------+         @+------------------------+@
| Fp finite fields        |         @|   LibECC conf files    |@
| arithmetic              |         @+------------------------+@
+-----------+-------------+         @|        Scripts         |@
            ^                       @+------------------------+@
            |                       @@@@@@@@@@@@@@@@@@@@@@@@@@@@
+-----------+-------------+         +------------------------+
| NN natural              | <-------+ Machine related        |
| numbers arithmetic      |         | (words, ...)           |
+-------------------------+         +------------------------+
FIGURE 2.4: libecc architecture [31]
libecc does not implement arbitrary (multiple) precision arithmetic. Instead
it implements finite field and point arithmetic on big numbers up to a maximum
integer width, which is determined at compile time. It uses projective
coordinates, performs no dynamic memory allocation, and is written without any
dependencies, not even the standard C library (libc).
2.10 Python
Python [32] is an interpreted, general-purpose programming language with
dynamic type checking. Python has several interesting features which make
it flexible and easy to use, e.g. Python integers have an unlimited range [33]
which makes handling of big numbers trivial. Internally Python represents
big numbers as an array of fixed-size integers, but this is hidden when
working with Python. Python also supports object-oriented programming.
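As a small illustration of the unlimited integer range, the sketch below handles a 256-bit value as an ordinary Python integer and splits it into 32-bit words, the granularity of the MCR/MRC transfers listed in Chapter 2.7. It is not part of the thesis code: the helper names to_words and from_words and the constant WORD_BITS are hypothetical and chosen only for this example.

```python
from typing import List

WORD_BITS = 32  # assumed transfer granularity, for illustration only


def to_words(value: int, width: int) -> List[int]:
    """Split a non-negative integer into little-endian WORD_BITS-bit words."""
    mask = (1 << WORD_BITS) - 1
    return [(value >> shift) & mask for shift in range(0, width, WORD_BITS)]


def from_words(words: List[int]) -> int:
    """Reassemble an integer from little-endian WORD_BITS-bit words."""
    return sum(word << (i * WORD_BITS) for i, word in enumerate(words))


# A 256-bit number needs no special type or library in Python.
big = 2**256 - 189
words = to_words(big, 256)
```

The round trip from_words(to_words(x, 256)) returns x for any non-negative 256-bit x, which is the kind of bookkeeping a C implementation has to do explicitly.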
Chapter 3
Previous Work
In this chapter, existing algorithms for hardware implementations of modular
addition and modular multiplication are presented. A thorough explanation
and proof of correctness for these algorithms are not provided; see their
respective references for more details.
An FPGA implementation of ECC coprocessors is then presented, and finally
the results from the pre-study are presented.
3.1 Modular Addition Implementation

Modular addition (MA) is the operation of calculating S = A + B (mod n),
and is in effect the same operation for both addition and subtraction, if using
2’s complement to represent signed numbers.
A straightforward way of implementing MA is to assume that 0 ≤ A, B < n
and apply Algorithm 4 [34]. This algorithm may be performed in a single
cycle with minimal control logic, depending on the timing constraints and
the critical path through the addition and subtraction on lines 1 and 2.
Algorithm 4 Modular Addition Algorithm

INPUT: Addends A & B, modulo n
OUTPUT: Sum S
1: Compute S' = A + B
2: Compute S'' = S' − n
3: if S'' ≥ 0 then
4:   S = S''
5: else
6:   S = S'
7: end if
The operations on lines 1 and 2 are normal addition and subtraction, and
the subtraction will require the 2’s complement of n to either be calculated
during operation or precomputed and be an input to the HW module. Algo-
rithm 4 is restricted to positive numbers smaller than n.
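A minimal Python sketch of Algorithm 4 is given below, assuming the precondition 0 ≤ A, B < n stated above. The function name modular_add is chosen for this example and does not come from the thesis code. It only mirrors the arithmetic, not the single-cycle hardware timing.

```python
def modular_add(a: int, b: int, n: int) -> int:
    """Algorithm 4: one addition, one subtraction and one selection."""
    # The algorithm is only valid for operands already reduced modulo n.
    assert 0 <= a < n and 0 <= b < n, "Algorithm 4 assumes 0 <= A, B < n"
    s1 = a + b   # line 1: S' = A + B
    s2 = s1 - n  # line 2: S'' = S' - n
    # lines 3-7: exactly one of S' and S'' lies in [0, n)
    return s2 if s2 >= 0 else s1
```

Because A + B < 2n under the precondition, a single subtraction is always enough, which is why no loop is needed.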
Another method was proposed in [35]. Let n < 2^k and m = 2^k − n, where
k may be the word size of the system. It is assumed that A, B < 2^k. Modular
addition can then be computed as in Algorithm 5.
Algorithm 5 Omura’s Method, Modular Addition Algorithm

INPUT: Addends A & B
OUTPUT: Sum S
1: Compute S' = A + B
2: if there is a carry then
3:   S = S' + m
4: else
5:   S = S'
6: end if
The value of m will need to either be computed during operation or
precomputed and be an input to the HW module. Here the additions in lines 1
and 3 are normal additions. If there is no carry the result is A + B, which
may be larger than n, in which case it will be reduced later. However, if there
is a carry it will be ignored, which implies that S' = A + B − 2^k. The
correctness of the algorithm is then given by:
S = S' + m
  = (A + B − 2^k) + (2^k − n)
  = A + B − n
Omura’s algorithm is still restricted to positive numbers, but accepts
addends greater than the modulo.
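The carry handling of Algorithm 5 can be sketched in Python as follows. This is an illustration under the stated assumptions (n < 2^k and A, B < 2^k), with the k-bit adder emulated by masking; the function name omura_add is hypothetical. As noted above, the result is only guaranteed to be congruent to A + B modulo n, not fully reduced.

```python
def omura_add(a: int, b: int, n: int, k: int) -> int:
    """Algorithm 5 (Omura's method) with a k-bit adder emulated by masking."""
    limit = 1 << k
    assert 0 < n < limit and 0 <= a < limit and 0 <= b < limit
    m = limit - n            # correction value m = 2^k - n (may be precomputed)
    total = a + b
    carry = total >> k       # carry out of the k-bit addition
    s1 = total & (limit - 1)  # S' = A + B with the carry dropped
    # With a carry, S' = A + B - 2^k, so adding m gives A + B - n.
    return s1 + m if carry else s1
```

For example, with k = 8 and n = 251, adding 200 and 100 produces a carry, S' = 44 and m = 5, so the function returns 49, which equals 300 − 251.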
3.2 Modular Multiplication Implementation

Modular multiplication (MM) is the operation of calculating P = A · B (mod n).
There are many algorithms for performing MM, many of which rely on
alternative number representations for higher efficiency, such as the
Montgomery modular multiplication [34].
An intuitive way of calculating MM is the multiply-and-divide method
[34], illustrated in Algorithm 6.
Algorithm 6 Multiply and Divide Algorithm

INPUT: Multiplicand A, multiplier B, modulo n
OUTPUT: Product P
1: P0 = A · B
2: P = P0 % n
3: return P
This is, however, not an efficient implementation. The word size of P0
will have to be twice that of A and B in order to avoid overflow, and the need
to optimize the modulo reduction % will introduce unnecessary complexity
to the design. Unless the product P0 is needed an interleaving algorithm is
usually to be preferred.
A basic interleaving algorithm is presented in Algorithm 7, where A and
B are k-bit numbers with 0 ≤ A, B < n, and A_i and B_i denote their
ith bits.
Algorithm 7 Modular Multiplication Interleaving Algorithm

INPUT: Multiplicand A, multiplier B, modulo n
OUTPUT: Product P
1: P = 0
2: for i = 0 to k − 1 do
3:   P = 2 · P + A · B_{k−1−i}
4:   P = P % n
5: end for
6: return P
Since A, B, P < n it follows that
2P + A · B_j ≤ 2(n − 1) + (n − 1) = 3n − 3
Thus, at most two subtractions are needed to reduce P to 0 ≤ P < n,
which means the modulo operation in line 4 may be implemented as
conditional subtractions.
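The following Python sketch illustrates Algorithm 7 with the modulo operation on line 4 replaced by two conditional subtractions, as argued above. It is an illustration only; the function name interleaved_modmul is chosen for this example, and the bit width k is passed explicitly.

```python
def interleaved_modmul(a: int, b: int, n: int, k: int) -> int:
    """Algorithm 7: scan B from the most significant bit, reducing each step."""
    assert 0 <= a < n and 0 <= b < n and n < (1 << k)
    p = 0
    for i in range(k):
        bit = (b >> (k - 1 - i)) & 1  # B_{k-1-i}
        p = 2 * p + a * bit           # line 3
        # line 4: P <= 3n - 3 here, so two conditional subtractions suffice
        if p >= n:
            p -= n
        if p >= n:
            p -= n
    return p
```

For A = 7, B = 9, n = 11 and k = 4 the intermediate values of P are 7, 3, 6 and 8, and the result 8 equals 63 mod 11.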
Another efficient modular multiplication algorithm was proposed by Peter
Montgomery in [36]. The result from the Montgomery algorithm is
P = A · B · r^{−1} (mod n)
where A, B < n and gcd(n, r) = 1. This adds overhead by requiring
conversion of the result. The number of bits in A or B is less than k, and we
take r = 2^k [34]. The multiplication is shown in Algorithm 8.
Algorithm 8 Montgomery Modular Multiplication Algorithm

INPUT: Multiplicand A, multiplier B, modulo n
OUTPUT: Product P = A · B · r^{−1} (mod n)
1: P = 0
2: for i = 0 to k − 1 do
3:   P = P + A_i · B
4:   if P is odd then
5:     P = P + n
6:   end if
7:   P = P/2
8: end for
9: return P
Here, the division on line 7 is just a right shift, and the operations on lines
3 and 5 can be combined: the LSB of P can be calculated before computing
the sum on line 3.
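A Python sketch of Algorithm 8 is shown below for illustration. It assumes an odd modulo n, so that gcd(n, 2^k) = 1, and operands already reduced modulo n. The final conditional subtraction is an addition made in this sketch so that the returned value lies in [0, n); it is not part of Algorithm 8 as listed. The function name montgomery_modmul is hypothetical.

```python
def montgomery_modmul(a: int, b: int, n: int, k: int) -> int:
    """Algorithm 8: bit-serial Montgomery product A * B * 2^(-k) mod n."""
    assert n % 2 == 1 and 0 <= a < n and 0 <= b < n and n < (1 << k)
    p = 0
    for i in range(k):
        a_i = (a >> i) & 1  # A_i, least significant bit first
        p = p + a_i * b     # line 3
        if p & 1:           # line 4: make P even so the shift is exact
            p = p + n       # line 5
        p >>= 1             # line 7: division by 2 as a right shift
    # P < 2n after the loop; this last step is added by the sketch.
    return p - n if p >= n else p
```

For A = 3, B = 5, n = 13 and k = 4 the function returns 5, which matches 15 · 16^{−1} mod 13 since 16^{−1} ≡ 9 (mod 13).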
Coprocessor  Modular         Modular   Modular      Point     Point     Scalar
             Multiplication  Addition  Subtraction  Doubling  Addition  Multiplication
CP 1         100             -         -            -         -         -
CP 2         100             99        99           -         -         -
CP 3         147             146       146          899       801       -
CP 4         147             146       146          899       801       240000
TABLE 3.2: Execution times of coprocessors, in clock cycles.
3.3 FPGA Elliptic Curve Coprocessor

In [17] four different EC coprocessors were implemented and tested on an
FPGA, each one implementing different arithmetic operations: CP 1 imple-
mented modular multiplication (Chapter 2.1.1); CP 2 implemented modu-
lar multiplication, addition and subtraction (Chapter 2.1.1); CP 3 also imple-
mented point doubling and addition (Chapter 2.2.3); and CP 4 implemented
SM in addition to the arithmetic operations (Chapter 2.3).
The execution time of the implemented operations in each CP is listed in
Table 3.2. The execution time is displayed in clock cycles.
The tests were performed using 256-bit values. The connected microcontroller
used an 8-bit word width, and the coprocessors were connected to RAM, from
which they read the operands. Execution times include reading operands
and writing results.
3.4 Pre-Study
In the pre-study [11] possible partitionings between hardware and software
for an ECC accelerator were explored. Profiling results from a pure software
implementation of ECC were analyzed, to determine which parts of
the software implementation could benefit the most from hardware
acceleration.
The results showed that roughly 18.8% of execution time during testing
was spent on managing the software implementation of big numbers: ini-
tialization, checking correct behavior, and handling number meta data. The
conclusion was that as much as possible of an EC cryptosystem, in particular
the scalar multiplication, should be performed by a coprocessor to reduce the
overhead of dealing with big numbers in software.
Chapter 4
Methodology and Architecture Design
The main goal for this thesis is to implement an Elliptic Curve Cryptography
Coprocessor (ECCo) whose primary purpose is to accelerate the scalar
multiplication in EC cryptosystems, as was the conclusion of the pre-study [11].
To perform the scalar multiplication the fundamental mathematical operations
needed are modular multiplication and modular addition (Chapter 2.1.1),
and, when using affine coordinates, integer division (Chapter 2.4). These
operations are enough to perform point doubling and point addition (Chapter
2.2.3), which allows implementation of an entire scalar multiplication (SM).
The primary goal when designing the ECCo is therefore to implement the
modular arithmetic operations.
The design of a coprocessor is potentially a complex and lengthy process.
To simplify the design process of the ECCo, reusable design patterns were
actively used: communication between sub-modules in the ECCo was generalized
with clearly defined protocols; test data for all arithmetic operations was
generated with a single Python script, utilizing Python's OOP features; and a
common testbench setup was used for all modules. These design patterns are
further explained in their respective methodology and implementation chapters.
This chapter discusses which choices were made during the design and
testing of the ECCo, and why these choices were made. Further, it highlights
important aspects of the design process, specifically where and why reusable
design patterns were used.
4.1 ECCo Design

The goal of the ECCo is to be able to perform scalar multiplication. With-
out any restrictions from any specific systems this allows for a number of
different implementations.
1. It may be designed as a SM module which only performs the SM, similar
to familiar division and multiplication modules. This module could
be integrated in a processor, or connected to a bus, possibly using
DMA to fetch operands.
2. It may be designed as a collection of modules, each implementing an
atomic operation (i.e. modular addition or modular multiplication, see
Chapters 3.1 - 3.2), similar to an FPU. This would be particularly suited
for tight integration with a processor, and provide a flexible design
which could be used for non-EC cryptosystems which also rely on finite
field arithmetic, like RSA.
3. It may be designed as a combination of the previous solutions,
providing both the atomic operations and the SM operation. This could
provide both a flexible design and an optimized SM, and would also be
very well suited for tight integration with a processor.
The ECCo design in this thesis will interface with the ARM Cortex M33
(Chapter 2.7) for use from software. The CM33 provides a coprocessor
interface which allows for tight integration of coprocessors and issuing
opcodes to the coprocessor from software. Because of this, Solutions 2 and 3
are good choices. Ideally, Solution 3 would be chosen, but due to time
limitations Solution 2 is the choice for this thesis. This choice allows the SM
speedup with and without the coprocessor to be estimated by comparing the speed
of atomic operations in hardware and software. This minimal implementation
will also be able to give an indication of how the size of the coprocessor
compares to that of the CM33 core itself.
Since the ECCo will be controlled from software through the coprocessor
interface, an instruction set has to be defined for the ECCo. The instruction
set proposed in this thesis is presented in Chapter 5.1. The proposed
instruction set includes more than the atomic operations and data transfer; it
also includes logical, comparison, and shift operations. The pre-study concluded
that an entire SM should be performed in the coprocessor in order to
maximize the benefit of the coprocessor. By including these flow-control and
common operations the ECCo will be able to perform an entire SM without
data transfer between the ECCo and CM33 during execution, even though it
is being controlled from SW.
4.2 Choice of Algorithms

The two essential atomic operations are modular addition and modular
multiplication, both of which can be implemented with multiple different
algorithms (as described in Chapters 3.1 - 3.2). When choosing which algorithms
to implement, this thesis chose the simplest algorithms in order to reduce
time spent on implementation. Optimizations of the algorithms will be left
for future work.
The modular multiplication algorithm implemented is the modular multi-
plication interleaving algorithm (Algorithm 7), which is described in Chapter
3.2. This algorithm requires no overhead or added complexity from number
conversion, but is not the most efficient algorithm and is not designed for
security.
For the modular addition, Algorithm 4 is the simplest presented algorithm,
but it does not support negative numbers (i.e. no subtraction) nor
intermediate sums greater than 2n. To address these limitations an improved,
generic version of the algorithm was designed. The new algorithm is
described in Algorithm 9.
Algorithm 9 Generic Modular Addition Algorithm

INPUT: Addends A & B, modulo n
OUTPUT: Sum S
1: S' = A + B
2: while S' ≥ n do
3:   S' = S' − n
4: end while
5: while S' < 0 do
6:   S' = S' + n
7: end while
8: S = S'
This algorithm can handle both positive and negative numbers, and
intermediate sums larger than 2n. Notice that the while loops are mutually
exclusive: after the intermediate sum, S' = A + B, has been calculated, S'
will either be reduced or increased. The while loops are not directly
synthesizable. Details on the interpretation of this algorithm are presented in
Chapter 5.
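For reference, Algorithm 9 can be written directly in Python as sketched below. This is a software illustration of the algorithm only, not of its hardware interpretation; the function name generic_modular_add is chosen for this example.

```python
def generic_modular_add(a: int, b: int, n: int) -> int:
    """Algorithm 9: add, then reduce or increase until 0 <= S < n."""
    assert n > 0
    s = a + b       # line 1: S' = A + B
    while s >= n:   # lines 2-4: reduce sums that are too large
        s -= n
    while s < 0:    # lines 5-7: increase negative sums
        s += n
    return s        # line 8
```

Only one of the two loops ever executes for a given pair of addends: once the first loop exits, S' is already below n, and if S' started negative the first loop is skipped entirely.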
4.3 Interpretation of Algorithms

The mathematical foundation of ECC requires several abstract concepts and
algorithms to be "translated" into hardware, e.g. the modulo operator,
multiplication over a finite field (see Chapters 2.1.1 and 3), and EC point
addition (Chapters 2.2.3 and 3). There are often many ways of doing this,
depending on the algorithm being implemented and system requirements. A
significant decision when designing the implementation is the choice between a
sequential and a combinatorial design. Combinatorial designs are much more
restricted by the clock frequency of the system, and can make it harder to meet
timing requirements. For this thesis the sequential approach is preferred, and
state machines have been designed to implement the chosen algorithms. The
reason is that a sequential implementation is more similar to a state machine
representation of the system, which makes it easier to reason about the
behavior of the system.
4.4 Test Data

In order to verify the results from the implementations of arithmetic
operations a set of known test data is required. In the pre-study [11] test data
for the scalar multiplication and point arithmetic from reliable sources was
used. This test data will be reused in this thesis. Test data for simpler
operations (e.g. modular addition, division, etc.) is easy to generate using a
Python script. Using a Python script will also allow generating more test data
for SM and point arithmetic, since a Python implementation of these operations
was written for the pre-study. The details of this script are described in
Chapter 5, and full source code is listed in Appendix A.
Generation of test data follows a repeating pattern, regardless of what
data is being generated: reading data from file, and writing properly
formatted data to file. This can be handled by Python's OOP features (see
Chapters 5.5.5 and 2.10).
4.5 Verification
In order to both verify correct behavior and to speed up the development
process, the entire ECCo and each sub-module are separately tested with a
testbench verifying correct behavior. In the case of the arithmetic operations
this includes checking results with test data, previously mentioned in Chap-
ter 4.4.
Design of testbenches is a repetitive process, which can be simplified
by following a design pattern. During the development of ECCo the chosen
pattern was:
• Each testbench consisted of a module, for instantiating and connecting
the design under test (DUT); an interface connected to the DUT; a
package with module specific parameters; and a test program.
• All signals in the DUT's interface are connected to, and controlled by,
the testbench, allowing independent testing of all sub-modules.
• The testbench uses drivers and dummy implementations of modules to
control the DUT. These dummies and drivers can be reused between
testbenches, and can utilize SystemVerilog's OOP features.
4.6 Internal Interfaces

During design of the ECCo a recurring design question is how to
communicate between sub-modules. The sub-modules of the system are primarily
modules implementing the operations defined by the instruction set, all of
which may share a common communication protocol. Because of this all
communication between sub-modules has been clearly defined using two
interfaces: one for all communication with the register bank, another for all
communication with the ECCo controller module. See Chapter 5.3 for further
details.
4.7 Area Measurement

To acquire the results for area measurement the design was synthesized. The
results presented are relative values, comparing synthesis of the CM33+ECCo
with synthesis of the CM33 only.
The speed results were measured during simulation, counting clock cycles
used to execute benchmarking code of modular addition and modular
multiplication, for both software and hardware implementations of those
operations. Further details are given in Chapter 5.8.3.
Chapter 5
Implementation
This chapter describes implementation details about the work done for this
thesis: proposed instruction set for the ECCo; the implementation of the
ECCo and its integration with the CM33; testbench architecture and verifi-
cation of the ECCo and its sub-modules; test data generation using a Python
script; C implementation of the big numbers library, and the ECCo software
wrapper; benchmarking of modular arithmetic operations, using the ECCo
and a pure software implementation.
The logical, shift and comparison operations mentioned are not implemented
in the ECCo for this thesis. The proposed instruction set includes
these instructions, and this chapter discusses why they should be included in
a future implementation of an elliptic curve coprocessor.
5.1 ECCo Instruction Set

The ECCo instruction set is aimed at allowing software controlled
implementations of SM, while reducing data transfer between the CM33 and the
ECCo. The instruction set designed in this thesis is listed in Table 5.2.
The connection between these instructions and the coprocessor instructions
of the ARMv8-M instruction set (Chapter 2.7) is: the MCRR and MRRC
instructions are used for the Load and Store instructions; the CPD and CPD2
instructions are used for all other instructions, where the opc1 and opc2
arguments are opcodes for the issued operation (see [29] for a description of
the assembly instructions).
In the instruction set the conditional operations are not explicitly listed,
because all operations have a conditional counterpart, using the
CPD2 instruction.
While further evaluation of the necessity of all instructions is required,
the instruction set proposed in this thesis is based on the following
reasoning:
• The arithmetic instructions are fundamental for the SM (as discussed in
Chapter 4).
• The logical instructions allow functionality like masking and setting
registers to zero.
• Shift instructions allow efficient division and multiplication by 2, as
required in algorithms like Montgomery (Algorithm 8).
Operation               Parameter 1   Parameter 2   Parameter 3
                        (register)    (register)    (register)
Modular Multiplication  Multiplicand  Multiplier    Product
Modular Addition        Addend        Addend        Sum
Integer division        Dividend      Divisor       Quotient
Negate 2’s complement   Operand       Result
or                      Operand 1     Operand 2     Result
and                     Operand 1     Operand 2     Result
xor                     Operand 1     Operand 2     Result
not                     Operand       Result
Left shift              Operand       Shift size    Result
Logic right shift       Operand       Shift size    Result
Arithmetic right shift  Operand       Shift size    Result
Is zero                 Operand
Is equal                Operand 1     Operand 2
Less than               Operand 1     Operand 2
Greater than            Operand 1     Operand 2
Load                    Offset        Index
Store                   Offset        Index
Increment               Operand       Result
Decrement               Operand       Result
Invert comparison
Set signed bit          Index
Unset signed bit        Index
TABLE 5.2: Instruction set for elliptic curve coprocessor.

• Comparison and conditional instructions allow control flow.
• Increment and decrement are common operations. Since immediate
values are not available for the coprocessor instructions this avoids the
need to use a register for the increment/decrement value.
• Inverting comparison allows for comparisons like greater than or equal to,
by inverting Less than.
• Set/Unset are required because the signed bit is not accessible through
the data transfer instructions (see Chapter 5.4 for details).
An implementation of this instruction set will therefore allow an entire
scalar multiplication to be performed in the ECCo, without data transfer
during execution, while still being controlled by the CM33.
5.2 ECCo Architecture

The architecture of the ECCo was based on Solution 2 in Chapter 4.1. The
architecture is illustrated in Figure 5.1.
FIGURE 5.1: Architecture of ECCo, connected to the CM33 processor through
the coprocessor interface.
The ECCo is connected to the CM33 through the coprocessor interface. In-
ternally the sub-modules are connected through two interfaces, as discussed
in Chapter 4.6. These interfaces are described in Chapter 5.3.
5.3 Internal Interfaces

Two internal interfaces were used in the design: in_OpModule, which
defines the protocol for issuing an operation to one of the operation-modules
(a sub-module implementing one or more of the operations in the instruction
set), and in_Registers, which defines the protocol for reading from and writing
to the register bank of the ECCo.
The in_OpModule interface uses a valid-ready protocol: when the sub-
module is ready to accept a new operation a ready signal is asserted. An
operation is issued by raising the valid signal, and it is accepted on the first
clock cycle where valid and ready are both asserted. As long as valid is asserted
all parameter values of the interface must be valid and stable. The interface
also defines an error signal, which is asserted whenever an operation fails.
The parameters of in_OpModule are:
op1Reg Register index of operand 1

op2Reg Register index of operand 2
resReg Register index of result
opcode Opcode for the requested operation
Figure 5.2 illustrates the protocol of the in_OpModule interface. At t3 an
operation is accepted. The controller issues another operation at t6, and has
to wait, while keeping the parameters valid, until the previous operation
has completed. At t9 the first operation has completed successfully, and the
second operation is accepted. The second operation fails, as indicated by the
error signal at t11. When the following, third, operation is accepted at t13,
both the ready and error signals are deasserted. The SV interface
implementation of in_OpModule is listed in Appendix B.
FIGURE 5.2: Illustration of in_OpModule communication protocol.
Because of this generalization of communication with all operation
sub-modules, a common state machine is implemented as the controller in all of
them, which is illustrated in Figure 5.3.
FIGURE 5.3: Illustration of FSM implementing the in_OpModule
communication protocol.
In the state machine in Figure 5.3 StartT, ReadyT, and WaitT are names
of possible transitions. The transitions are named because the output of the
state machine is determined by both state and input. In IDLE the ready signal
is asserted, and the value of error may be either 0 or 1. In WAIT both ready
and error are always 0.
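The handshake and the two states described above can be modelled at cycle level as in the following Python sketch. It is a behavioral illustration only and is not derived from the SV source in Appendix B: the class name OpModuleModel, the fixed latency parameter and the fail flag are assumptions made for this example.

```python
class OpModuleModel:
    """Cycle-level model of a module behind the in_OpModule interface."""

    def __init__(self, latency: int):
        assert latency >= 1
        self.latency = latency  # cycles spent in WAIT per operation (assumed)
        self.state = "IDLE"
        self.error = 0
        self._remaining = 0
        self._fail = False

    @property
    def ready(self) -> int:
        # ready is asserted in IDLE and deasserted in WAIT
        return 1 if self.state == "IDLE" else 0

    def tick(self, valid: int, fail: bool = False) -> bool:
        """Advance one clock cycle; return True if an operation was accepted."""
        if self.state == "IDLE":
            if valid:  # valid and ready are both asserted: accept
                self.state = "WAIT"
                self.error = 0  # error is always 0 in WAIT
                self._remaining = self.latency
                self._fail = fail  # hypothetical hook to model a failing operation
                return True
            return False
        # WAIT: the operation is in progress and new requests are not accepted
        self._remaining -= 1
        if self._remaining == 0:
            self.state = "IDLE"
            self.error = 1 if self._fail else 0
        return False
```

A controller model would hold valid high and keep its parameters stable until tick returns True, which corresponds to the waiting period between t6 and t9 in Figure 5.2.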
The in_Registers interface exposes all the registers directly, for reading. To
write, the signals enable, register, and data are used, indicating when to enable
writing, which register to write to, and the write data, respectively. The SV
interface implementation of in_Registers is listed in Appendix B.
5.4 Register Bank

The register bank is a module containing 16 registers, which may be read
from and written to. The choice of 16 registers was made based on a limitation
from the CM33, which requires registers to be indexed using no more
than 4 bits. However, this many registers may not be necessary
to perform the SM. An evaluation of the necessary number of registers is left
for future work, considering both the area usage of the register bank and the
required number of registers for the SM implementation. All 16 registers
are exposed for reading through the in_Registers interface. Writing is
implemented following the in_Registers protocol.
The registers are of width WORD_WIDTH + 1, e.g. if the ECCo is
instantiated with a word width of 256 bits the registers will be
257 bits wide. The reason for this is that parameter values from standards such
as [37] and [38] require WORD_WIDTH bits to represent positive values, so
one additional bit is needed for the sign. The signed bit of the registers is
manipulated through dedicated instructions, to avoid using a 64-bit data
transfer to access the signed bit.
Register Name    Register Index  Writable  Readable
CR0              0               X         X
CR1              1               X         X
...              ...             ...       ...
CR13             13              X         X
Modulo Register  14              X         X
Status Register  15                        X
TABLE 5.4: List of ECCo registers.
Table 5.4 lists all registers in the register bank. There are only two
non-general registers: the modulo register and the status register. The modulo
register is used for storing the modulo during modular arithmetic operations.
The status register is read-only (all writing to it is done inside the register
bank) and contains information about the current status of the ECCo:
Bit 0 Comparison result bit.
Bit 1-15 Active bits. These are reserved for future use in an asynchronous de-
sign, for indicating which operation modules are currently working
and which are idle.
Bit 16-30 Signed bits. The signed bits of register 0-14, respectively.
Bit 31- Unused.
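The bit layout above can be illustrated with a short Python sketch that unpacks a status register value read back by software. The function name decode_status and the dictionary keys are hypothetical; only the bit positions are taken from the list above.

```python
def decode_status(status: int) -> dict:
    """Unpack a status register value according to the listed bit layout."""
    return {
        # Bit 0: comparison result
        "comparison": status & 1,
        # Bits 1-15: active bits (reserved for a future asynchronous design)
        "active": [(status >> bit) & 1 for bit in range(1, 16)],
        # Bits 16-30: signed bits of registers 0-14, in register order
        "signed": [(status >> bit) & 1 for bit in range(16, 31)],
    }


# Example: comparison bit set, registers 0 and 14 marked as negative.
example = decode_status(1 | (1 << 16) | (1 << 30))
```

In this example, example["signed"][0] and example["signed"][14] are 1 while all active bits are 0.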
5.5 Arithmetic Module

The arithmetic operations sub-module is implemented as a controller
implementing the in_OpModule protocol and wrapping the modules implementing
each individual arithmetic operation: negation, integer division, modular
addition, and modular multiplication. The block diagram of
the arithmetic module is shown in Figure 5.4. The arithmetic controller
implements the in_OpModule FSM, as illustrated in Figure 5.3.
FIGURE 5.4: Block diagram of arithmetic module.

5.5.1 Negation
The negation operation is a single cycle operation which is straightforward
to implement, and performs a 2’s complement negation of the operand. It is
continually calculated:
    assign res = ~(operand) + 1;
5.5.2 Integer Division

The integer division is a necessary operation when using affine coordinates,
but its implementation is not very interesting with regard to the ECCo.
Therefore, it was initially implemented using an open source design from
OpenCores [39]. However, this design did not function properly and instead
integer division was implemented using the SystemVerilog division operator,
"/".
It is also a single cycle operation, but requires divide-by-zero detection
and handling of negative numbers: if the divisor and/or dividend is negative
its positive 2’s complement is used in the division and the sign of the result
is calculated using basic algebra rules, as shown in Listing 5.1.
    // MSB of dividend (op1) and divisor (op2)
    logic msbOp1, msbOp2;
    // Internal signals
    logic [WORD_WIDTH:0] intOp1;
    logic [WORD_WIDTH:0] intOp2;
    logic [WORD_WIDTH:0] intRes;

    // The division is continuously calculated.
    assign divideByZero = (op2 == 0);
    assign intRes = intOp1 / intOp2;
    assign msbOp1 = op1[WORD_WIDTH];
    assign msbOp2 = op2[WORD_WIDTH];

    always_comb begin
        intOp1 = op1;
        intOp2 = op2;
        if (msbOp1 && msbOp2) begin
            intOp1 = (~op1) + 1;
            intOp2 = (~op2) + 1;
        end
        else if (msbOp1)
            intOp1 = (~op1) + 1;
        else if (msbOp2)
            intOp2 = (~op2) + 1;
    end

    always_ff @(posedge ck)
        res <= (msbOp1 ^ msbOp2) ? (~intRes) + 1 : intRes;
LISTING 5.1: Division SV implementation.
5.5.3 Modular Addition

The modular addition is implemented using Algorithm 9, designed for this
thesis, as discussed in Chapter 4.2. This algorithm is interpreted as illustrated
by the FSM in Figure 5.5, and the datapath in Figure 5.6. The transitions in
the illustration are referred to by name.
FIGURE 5.5: FSM interpretation of Generic Modular Addition Algorithm.
DoneT Transition to IDLE when an addition has finished, asserting done for
one cycle.
WaitT Transition in IDLE when not performing an operation.
ReduceT Transition to REDUCE when the intermediate sum is greater than
or equal to the modulo, and needs to be reduced to 0 ≤ Sum < Modulo.
IncreaseT Transition to INCREASE when the intermediate sum is less than
0, and needs to be increased to 0 ≤ Sum < Modulo.
If initially 0 ≤ op1 + op2 < mod, the calculation only takes one cycle to
complete. Otherwise, op1 mux selects the intermediate result as operand 1 and
op2 mux selects either mod or −mod as operand 2, depending on whether the
state is INCREASE or REDUCE, respectively. In the worst case the addition could
take 2^WORD_WIDTH − 1 cycles to perform, when calculating
((2^WORD_WIDTH − 1) + 0) % 1.
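The worst case can be made concrete with a small Python model of the state-dependent operand selection. This is a sketch under assumptions: it counts correction steps (one per pass through REDUCE or INCREASE) rather than exact clock cycles, since the mapping to cycles depends on the implementation, and the function name modular_add_steps is hypothetical.

```python
def modular_add_steps(op1: int, op2: int, mod: int):
    """Return (sum, correction steps) for a state-machine reading of Algorithm 9."""
    assert mod > 0
    acc = op1 + op2  # intermediate sum
    steps = 0
    while not 0 <= acc < mod:
        state = "REDUCE" if acc >= mod else "INCREASE"
        # op2 mux: -mod in REDUCE, +mod in INCREASE; op1 mux feeds acc back
        operand2 = -mod if state == "REDUCE" else mod
        acc += operand2
        steps += 1
    return acc, steps


WORD_WIDTH = 4  # small width so the worst case is easy to inspect
worst = modular_add_steps(2**WORD_WIDTH - 1, 0, 1)
```

With WORD_WIDTH = 4 the worst case (15 + 0) % 1 needs 15 correction steps, i.e. 2^WORD_WIDTH − 1, while an already reduced sum needs none.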
5.5.4 Modular Multiplication

The modular multiplication is implemented using Algorithm 7, as discussed
in Chapter 4.2. This algorithm is interpreted as illustrated by the FSM
in Figure 5.7, and the datapath in Figure 5.8. The transitions in the illustration
are referred to by name.
DoneT Transition to IDLE when a multiplication has finished, asserting
done for one cycle.
FIGURE 5.6: Block diagram of modular addition module.
WaitT Transition in IDLE when not performing an operation.
AddT Transition to ADD when calculating the sum 2 · P + A · B_{k−1−i} (as
described in Chapter 3.2).
ReduceT Transition to REDUCE when the intermediate sum is greater than
or equal to the modulo, and needs to be reduced to 0 ≤ Sum < Modulo.
ReduceDoneT Transition to REDUCE_DONE when the intermediate sum
is greater than or equal to the modulo, and needs to be reduced to
0 ≤ Sum < Modulo, before finishing the operation.
The modular multiplication always has an execution time of at least
WORD_WIDTH cycles since it has to iterate through all bits of op2, except the
signed bit. None of op1, op2, or mod are allowed to be negative. The partial
product mux selects the current value of A · B_{k−1−i}. op1 mux and op2 mux
select whether to calculate 2 · P + A · B_{k−1−i} or to reduce the intermediate
result.
5.5.5 Test Data

Test data was generated using a Python script, which was written with an
architecture as illustrated in Figure 5.9. The test data solutions are created by
Python operators, as shown in Listing 5.2.
FIGURE 5.7: FSM interpretation of Multiply and Divide Algorithm.
def modular_addition(op1: int, op2: int, mod: int) -> int:
    return (op1 + op2) % mod

def modular_multiplication(op1: int, op2: int, mod: int) -> int:
    return (op1 * op2) % mod

def integer_division(op1: int, op2: int) -> int:
    if op1 < 0 and op2 < 0:
        res = abs(op1) // abs(op2)
    elif op1 < 0:
        res = -(abs(op1) // op2)
    elif op2 < 0:
        res = -(op1 // abs(op2))
    else:
        res = op1 // op2
    return res
LISTING 5.2: Test data solution calculations.
Note that the Python integer division operator // rounds toward negative
infinity, so for operands of opposite sign it does not give the quotient
expected here (e.g. 3 // −4 evaluates to −1 rather than 0). Instead, any
negative numbers are negated, and basic algebra rules are used to determine
the sign of the result, just as it is implemented in hardware.
The script source code is listed in Appendix A. Test data values used for
verification are listed in Appendix C.
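The difference between Python's floor division and the sign-and-magnitude approach used for the test data can be seen in a few lines. The helper name below is illustrative only; it condenses the four branches of Listing 5.2 into a single expression and assumes a non-zero divisor.

```python
def truncating_division(op1: int, op2: int) -> int:
    """Divide the magnitudes, then apply the sign (rounds toward zero)."""
    magnitude = abs(op1) // abs(op2)
    return -magnitude if (op1 < 0) != (op2 < 0) else magnitude


if __name__ == "__main__":
    print("op1 op2 floor truncated")
    for op1, op2 in [(3, -4), (-7, 2), (567, -895), (-9, -4)]:
        print(op1, op2, op1 // op2, truncating_division(op1, op2))
```

The two results differ only when the operands have opposite signs and the division is not exact.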
The script source code is listed in Appendix A. Test data values used for
verification are listed in Appendix C.
5.5.6 Verification - Arithmetic Module

The arithmetic module was tested using a TB design as illustrated in Figure
5.10. The test program communicates with the arithmetic module through
an in_OpModule driver, and controls and verifies the register content during
testing through a dummy register bank, connected to the arithmetic module.
During testing the values listed in Appendix C were used to verify correct
results from arithmetic operations.
FIGURE 5.8: Block diagram of modular multiplication module.
5.6 Controller Module

The controller's primary purpose is to handle communication with the CM33
using the coprocessor interface. The FSM in Figure 5.11 illustrates the
implemented state machine which does this. This is a synchronous design: the
controller will wait for any multicycle operation to finish before signaling to
the CM33 that it is ready to accept further instructions.
The outputs of the FSM are the coprocessor interface signals valid and
error, and an internal valid, which is used in the in_OpModule interface. The
transitions in the illustration are referred to by name. The output signals of
the FSM are determined by both state and input, and are most easily described
as the set of all possible transitions:
RyT - ready transition Transition to READY, with ready asserted and error
deasserted, waiting for an instruction to be issued.
ET - error transition Transition to READY, with both ready and error asserted.
May be caused by a write error, read error, data processing error or an
invalid instruction being issued.
WaT - wait transition Transition to WAIT when valid is asserted and a data
processing operation is issued.
WaWT - wait wait transition Transition to WAIT, from WAIT, while the current
data processing operation is not yet finished.
WaRT - wait ready transition Transition to WAIT, from WAIT, when a data
processing operation finished successfully and valid is asserted, requesting
a new data processing operation immediately.
WaET - wait error transition Transition to WAIT, from WAIT, when a data
processing operation finished with error and valid is asserted, requesting
a new data processing operation immediately.
ReT - read transition Transition to READ, when the processor wants to read
from a coprocessor register.
ReRT - read ready transition Transition to READ, from WAIT, when a data
processing operation finished successfully and valid is asserted, requesting
a data transfer operation (read) immediately.
ReET - read error transition Transition to READ, from WAIT, when a data
processing operation finished with error and valid is asserted, requesting
a data transfer operation (read) immediately.
WrT - write transition Transition to WRITE, when the processor wants to
write to a coprocessor register.
WrRT - write ready transition Transition to WRITE, from WAIT, when a data
processing operation finished successfully and valid is asserted, requesting
a data transfer operation (write) immediately.
WrET - write error transition Transition to WRITE, from WAIT, when a data
processing operation finished with error and valid is asserted, requesting
a data transfer operation (write) immediately.
FIGURE 5.9: Class diagram of Python script generating test data.
5.6.1 Verification - Controller Module

The testbench setup for the verification of the controller module is illustrated
in Figure 5.12.
Operation module dummies for the arithmetic, logical, comparison and
shift modules are connected to the controller, and controlled by the test
program. A dummy register bank is connected to the controller, and the
controller is tested using a coprocessor interface driver for communication.
FIGURE 5.10: Block diagram of Arithmetic Module TB.
5.7 Verification - ECCo

The testbench setup for verification of the entire ECCo is illustrated in Figure
5.13.
A coprocessor interface driver is used to communicate with the ECCo,
and the test values from Appendix C are used to check for correct behavior
of the implemented operations.
5.8 Software
For this thesis three software components were implemented: a wrapper for
the coprocessor interface instructions; a big number library for use with the
ECCo; and a benchmarking program.
The big number library and ECCo wrapper were used to verify that
communication with the ECCo using the coprocessor interface was working as
expected. To verify correct behavior of the ECCo controller and the
implemented operations, the test data from Appendix C was used. The source
code of the test programs used for verification is listed in Appendix F.
FIGURE 5.11: FSM of ECCo controller module.
5.8.1 ECCo Wrapper

The ECCo wrapper was implemented to simplify calling the ECCo from C
using the coprocessor interface. The coprocessor instructions of the ARMv8-
M instruction set have to be called from assembly, using string literals to
refer to coprocessor registers and opcodes. Therefore a series of macros were
created for all the instructions in the proposed instruction set (Table 5.2). The
code for the wrapper is listed in Appendix D.
5.8.2 Big Number library

FIGURE 5.12: Testbench setup for verification of the controller module.
When using the ECCo some minor handling of big numbers in software is
still required. For this a big number library was implemented for use with
the ECCo. The functionality it provides is:
• Converting to and from number strings in hexadecimal format.
• Comparing two numbers.
• Loading a number to an ECCo register.
• Storing a number from an ECCo register.
• Some other convenient functionality.
The source code for the big number library is listed in Appendix E.
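To illustrate the kind of conversion such a library performs, the following Python sketch splits a hexadecimal number string into fixed-size words and back. It is not the library from Appendix E: the function names, the 32-bit word size and the least-significant-word-first order are all assumptions made for this example.

```python
WORD_BITS = 32  # assumed transfer width; not taken from the thesis


def hex_to_words(text: str, num_words: int = 8) -> list:
    """Split a non-negative hex string into `num_words` words, LSW first."""
    value = int(text, 16)
    if value < 0 or value >> (WORD_BITS * num_words):
        raise ValueError("value does not fit in the requested number of words")
    mask = (1 << WORD_BITS) - 1
    return [(value >> (WORD_BITS * i)) & mask for i in range(num_words)]


def words_to_hex(words) -> str:
    """Reassemble a list of words (LSW first) into a hex string."""
    value = 0
    for i, word in enumerate(words):
        value |= word << (WORD_BITS * i)
    return f"0x{value:x}"


if __name__ == "__main__":
    words = hex_to_words("0x100000002", num_words=2)
    print(words, words_to_hex(words))
```

With eight 32-bit words a 256-bit operand is covered, which is the operand size used in the benchmarks.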
5.8.3 Benchmark Software

FIGURE 5.13: Testbench setup for verification of ECCo.
To benchmark the ECCo against a pure software implementation of ECC, ANSSI
libecc (Chapter 2.9) was used for comparison. The benchmarked operations
were the modular multiplication and modular addition. As these are the
fundamental operations of SM, their execution time gives an indication of
the possible speedup. The benchmarking was performed by doing the setup
of parameters once, instantiating operand 1 (OP1), operand 2
(OP2), and modulo (MOD) to large 256-bit values. The same values were
used for the libecc and ECCo benchmarks. Then the operation OP1 = (OP1 +
OP2) % MOD was performed for the modular addition benchmark, and
OP1 = (OP1 ∗ OP2) % MOD for the modular multiplication benchmark.
The benchmarks were performed in runs of 10 and 100 iterations, i.e.
performing the operation 10 or 100 times, updating the OP1 value each time.
The test values were large 256-bit values, making them similar to values used
during 256-bit SM. These benchmarks do not, however, include tests of
edge cases, such as when MOD << OP1 + OP2, in which case the ECCo will
have a very long execution time, nor do they guarantee coverage of the case
when MOD > OP1 + OP2 or MOD > OP1 ∗ OP2.
The source code for the benchmarking programs is listed in Appendix
F.
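The structure of the benchmark loop, in which the result is fed back into OP1 on every iteration, can be sketched in Python as below. The actual benchmark programs are the C sources in Appendix F. The function names and the operand values here are placeholders chosen for illustration, not the values used in the thesis.

```python
def mod_add(op1: int, op2: int, mod: int) -> int:
    return (op1 + op2) % mod


def mod_mul(op1: int, op2: int, mod: int) -> int:
    return (op1 * op2) % mod


def run_chain(operation, op1: int, op2: int, mod: int, iterations: int) -> int:
    """Apply OP1 = operation(OP1, OP2, MOD) `iterations` times."""
    for _ in range(iterations):
        op1 = operation(op1, op2, mod)
    return op1


if __name__ == "__main__":
    # Placeholder values of roughly 256 bits (hypothetical, for illustration)
    MOD = 2**255 - 19
    OP1 = 2**254 + 12345
    OP2 = 2**253 + 67890
    for iterations in (10, 100):
        print(iterations, hex(run_chain(mod_add, OP1, OP2, MOD, iterations)))
        print(iterations, hex(run_chain(mod_mul, OP1, OP2, MOD, iterations)))
```

Because only OP1 changes between iterations, OP2 and MOD need to be set up once, which is what lets the coprocessor run many operations without further operand transfers.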
Chapter 6
Results
The simulation tests described in Chapter 5, verifying correct behavior of all

sub-modules and correct results from implemented arithmetic operations, all
succeeded.
This chapter presents the results from the benchmark, comparing the ex-
ecution time between the modular arithmetic software implementation by
libecc and the ECCo implementation. Lastly, the area estimates from synthe-
sis are presented.
6.1 Speed
The execution time of modular addition and modular multiplication is com-
pared between benchmark code running the operations on ECCo and using
the software implementation from libecc. Table 6.2 summarizes the bench-
marking results. The execution time is measured in clock cycles. As a ref-
erence, a simulation run without any operation was performed in order to
measure the setup time of the system. This empty run had an execution time
of 36,790 cycles (this is included in the results presented in Table 6.2).
The results show that the ECCo performed 3.8 times faster for modular
addition at 10 iterations, and 8 times faster at 100 iterations. As for the mod-
ular multiplication the ECCo performed 7.8 times faster at 10 and 27 times
faster at 100 iterations.
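The speedup figures follow directly from the cycle counts in Table 6.2. The short script below recomputes them, together with the growth in execution time from 10 to 100 iterations for each implementation. The dictionary layout and function names are only a convenience for this example.

```python
# Cycle counts from Table 6.2: (10 iterations, 100 iterations)
CYCLES = {
    ("addition", "ECCo"): (42_818, 43_294),
    ("addition", "libecc"): (164_906, 347_966),
    ("multiplication", "ECCo"): (46_840, 87_864),
    ("multiplication", "libecc"): (367_664, 2_375_744),
}


def speedup(operation: str, column: int) -> float:
    """libecc cycles divided by ECCo cycles for one column of the table."""
    return CYCLES[(operation, "libecc")][column] / CYCLES[(operation, "ECCo")][column]


def growth(operation: str, implementation: str) -> float:
    """Execution time at 100 iterations relative to 10 iterations."""
    at_10, at_100 = CYCLES[(operation, implementation)]
    return at_100 / at_10


if __name__ == "__main__":
    for operation in ("addition", "multiplication"):
        for column, iterations in enumerate((10, 100)):
            print(f"{operation} speedup at {iterations} iterations: "
                  f"{speedup(operation, column):.2f}x")
        for implementation in ("ECCo", "libecc"):
            print(f"{operation} growth for {implementation}: "
                  f"{growth(operation, implementation):.2f}x")
```

Note that all counts include the 36,790-cycle setup time, so the ratios describe whole benchmark runs rather than the isolated operations.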
While the ECCo is significantly faster than the compared software
implementation, another notable result is how the ECCo and the software
implementation scale differently: from 10 to 100 iterations the ECCo had an
increase in execution time of 1.01x (addition) and 1.8x (multiplication), while
the software implementation had an increase of 2.1x (addition) and 6.5x
(multiplication). This gives an indication of the benefit of having a coprocessor
which allows a large number of operations to be performed without the need
for data transfer between processor and coprocessor.
Operation                        Exec. Time - 10 Iterations  Exec. Time - 100 Iterations
Modular Addition - ECCo          42,818                      43,294
Modular Addition - libecc        164,906                     347,966
Modular Multiplication - ECCo    46,840                      87,864
Modular Multiplication - libecc  367,664                     2,375,744
TABLE 6.2: Execution time of atomic operations. Measured in clock cycles.
6.2 Area
The design of the CM33 with the ECCo was synthesizable, and did not have
any negative slack. It was synthesized without any optimization, at a
frequency of 128MHz. The area results are presented as a comparison between
synthesis estimates of the design with and without the ECCo included
(Table 6.4), and an area distribution between the sub-modules of the ECCo
(Table 6.6).
Measurement            Increase
Combinational Area     3.12x
Noncombinational Area  1.36x
Total Area             1.83x
TABLE 6.4: Area increase for design when adding ECCo.
Module         Sub-Module       ECCo Acc. Area  Comb. Area  Noncomb. Area
Arithmetic                      84.63%          5.80%       13.61%
               Multiplication*  1.97%           1.56%       4.81%
               Addition*        1.92%           1.53%       4.61%
               Negation         0.78%           0.25%       4.49%
               Division         73.18%          83.02%      4.50%
Controller                      4.65%           5.25%       0.42%
Register Bank                   10.72%          2.59%       67.31%
TABLE 6.6: Area distribution of ECCo modules. (*modular)
The values shown in Table 6.4 give the area of the CM33+ECCo design
relative to the CM33 alone. Clearly, the ECCo contains a great deal of
combinatorial logic: the combinatorial cell area grows to 3.12 times that of
the CM33 alone. In total the ECCo's area equals 83% of the existing design.
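The share of the combined design taken up by the ECCo can be worked out from the total area factor in Table 6.4, assuming that the factor is the ratio of the CM33+ECCo area to the CM33 area:

```latex
\frac{A_{\mathrm{ECCo}}}{A_{\mathrm{CM33}}} = 1.83 - 1 = 0.83,
\qquad
\frac{A_{\mathrm{ECCo}}}{A_{\mathrm{CM33}} + A_{\mathrm{ECCo}}}
  = \frac{0.83}{1.83} \approx 0.45
```

Under that assumption the ECCo makes up roughly 45% of the synthesized CM33+ECCo design.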
The values shown in Table 6.6 are the area distribution of the ECCo
sub-modules.
ECCo Accumulative Area The area percentage of the ECCo occupied by this
module, including its sub-modules. The percentages of the Arithmetic,
Controller, and Register Bank modules add up to 100%, these being all the
sub-modules of the ECCo. The percentages of Multiplication, Addition,
Negation, and Division are included in the Arithmetic percentage, but they do
not sum up to 84.63% since the Arithmetic module contains some logic
of its own.
Combinatorial Area The area percentage of combinatorial cells for only this
module, not including any of its sub-modules. E.g. the Arithmetic
module uses 5.8% of the total area of combinatorial cells in the ECCo,
excluding its sub-modules, and the Division module uses 83.02% of the
total combinatorial area of the ECCo.
Noncombinatorial Area Same as for combinatorial.
Not surprisingly, a majority of the noncombinational area is occupied by
the register bank. However, most of the area of the ECCo is occupied by the
divider, which was synthesized using the SV division operator "/" without
any optimization from the synthesizer.
The implementations of the most essential modules, Modular Multiplication
and Modular Addition, only occupied 1.97% and 1.92%, respectively.
Combined with the benchmark results, this gives an indication of the advantages
of using the ECCo: significant speedup, with only a small area increase,
assuming the divider can be more efficiently implemented. With a more
efficient divider implementation, the register bank may be the module
occupying the largest area, currently being about 5x the size of the Modular
Multiplication and Modular Addition modules, and about 2x the size of the
controller.
Chapter 7
Future Work
The ECCo implementation in this thesis has only included a small subset
of necessary operations and features for the suggested design of a complete
elliptic curve coprocessor. This chapter discusses possible changes and con-
siderations for future work on the coprocessor proposed in this thesis.
7.1 Instruction Set Architecture

The instruction set proposed in Table 5.2 is intended for a design aimed at
solution 2 in Chapter 4.1. The desired solution, however, is solution 3, which
requires some additional, higher level operations to be included in the
instruction set, more specifically point arithmetic (Chapter 2.2.3) and/or scalar
multiplication (Chapter 2.3).
Another desirable functionality would be a way of generating
random numbers of the coprocessor's word size, because random
numbers are used in many cryptography algorithms, like ECDSA (Chapter 2.5).
The currently implemented arithmetic operations of modular addition
and modular multiplication are also the fundamental operations of common,
non-EC crypto systems, like RSA [20] and Diffie-Hellman [4]. Adding
instructions for these common algorithms could be useful, but would require
the possibility of working with numbers of bit sizes up to 4096-bit to provide
acceptable security.
7.2 Security
An issue which has not been addressed in this thesis, but which must be
considered for future work, is security of the implementation against attacks
such as side-channel attacks. A way of trying to defend against side-channel
attacks is by using constant time algorithms for calculations, which should
be considered both for the finite-field arithmetic, point operations and the
scalar multiplication algorithm.
7.3 Algorithms
While the implemented algorithms for modular addition and modular
multiplication are simple, with more complex and efficient methods available
(Chapters 3.1 and 3.2), the current implementation already provides
significant speedup over pure software implementation. A future change in choice
of algorithms is necessary for further development, a decision in which a
compromise between security and efficiency surely is needed.
The integer division will, however, need a more area efficient
implementation. Reducing the area consumption of the divider module could
potentially reduce the total area of the ECCo significantly.
Chapter 8
Conclusion
This thesis has explored how to design a coprocessor for accelerating elliptic
curve cryptography, based on the results from the prestudy [11]. The co-
processor designed in the thesis, ECCo, was designed for use with the ARM
CM33 processor. The CM33 provides a coprocessor interface for tight integra-
tion of coprocessors, which allows the instructions to be issued to connected
coprocessors from software.
This led to the ECCo being designed with an instruction set providing
the atomic mathematical operations for ECC, with the possibility of adding
implementations of scalar multiplication to the instruction set in future work.
As time did not allow for the entire proposed instruction set to be
implemented, only the atomic arithmetic operations were implemented, and an
ECCo design with a controller, register bank and arithmetic module was
used to compare execution time with an ECC software implementation, and
to estimate area usage by synthesis. The ECCo accounted for 45% of the
area when synthesizing ECCo+CM33. The estimates showed that the ECCo
area consumption was largely dominated by the divider (73.18% of the total
ECCo area), which was implemented using the SystemVerilog division
operator, "/", and no optimization in synthesis. However, the atomic operations
of ECC, Modular Multiplication and Modular Addition, only occupied 1.97%
and 1.92%, respectively. These modules also performed 3.8x - 27x faster than
a pure software implementation of ECC.
While the implemented algorithms for modular addition and modular
multiplication are simple, with more complex and efficient methods
available (Chapters 3.1 and 3.2), the current implementation already provides
significant speedup over pure software implementation. It provides a
complete system which allows efficiency to be achieved through several
methods: reducing data transfers, optimizing the implementation of
mathematical operations, and flexibility and ease-of-use.
Appendix A
Test Data Python script
import argparse
import csv
import io
import re
import shutil
import sys
from abc import ABC, abstractmethod
from typing import Dict, Generator, List


# Exception class used to differentiate between known and unknown errors.
class DataError(Exception):
    pass


################################################################################
#                                                                              #
#                                  Baseclass                                   #
#                                                                              #
################################################################################

class DataABC(ABC):
    """DataABC is the base class for all calculations. It handles reading from
    and writing to csv data files, writing to C files, and number formatting
    (decimal, hex & binary).
    """

    def __init__(self, headers, file: io.IOBase, numBase: int) -> None:
        self.headers = headers
        # Keep the data per instance; a class-level list would be shared
        self.data = []
        rd = csv.reader(file)
        # First line of the file must be the headers
        fileHeaders = next(rd)
        if self.headers != fileHeaders:
            raise DataError(f'[!!] DataABC, __init__: Invalid headers! '
                            f'Want {self.headers} - got {fileHeaders}')

        # Read all data
        for j, cols in enumerate(rd):
            # Report and skip empty lines
            if not cols:
                print(f'[  ] DataABC, __init__: Reading {file}: '
                      f'Found empty line ({j + 2}). Ignoring...')
                continue
            # Represent the data as a dict, indexed by header names
            tmp = dict()
            for i, h in enumerate(self.headers):
                # Sanity checks to avoid decimal interpreted as hex etc.
                if not re.match(r'^-?\d+$', cols[i]) and numBase == 10:
                    raise DataError(f'DataABC, __init__: Reading {file}: Tried interpreting '
                                    f'non-decimal number as decimal: "{cols[i]}"')
                elif not re.match(r'^-?0x[0-9a-fA-F]+$', cols[i]) and numBase == 16:
                    raise DataError(f'DataABC, __init__: Reading {file}: Tried interpreting '
                                    f'non-hex number as hexadecimal: "{cols[i]}"')
                elif not re.match(r'^-?0b[01]+$', cols[i]) and numBase == 2:
                    raise DataError(f'DataABC, __init__: Reading {file}: Tried interpreting '
                                    f'non-binary number as binary: "{cols[i]}"')
                tmp[h] = int(cols[i], numBase)
            self.data.append(tmp)

    # Subclasses fill in the 'result' field of every entry
    @abstractmethod
    def calculate(self) -> None:
        pass

    @staticmethod
    def _formatNumber(num: int, numFormat: int) -> str:
        # Determine number format string
        if numFormat == 16:
            return f'0x{num:x}' if num >= 0 else f'-0x{abs(num):x}'
        elif numFormat == 2:
            return f'0b{num:b}' if num >= 0 else f'-0b{abs(num):b}'
        else:
            return f'{num}'

    def _formatDataCsv(self, numFormat: int) -> Generator[Dict[str, str], None, None]:
        # Iterate through data values, yield dictionaries with strings of formatted numbers
        for d in self.data:
            tmp = dict()
            for k, v in d.items():
                tmp[k] = self._formatNumber(v, numFormat)
            yield tmp

    def writeCsv(self, file: io.IOBase, numFormat: int) -> None:
        wr = csv.DictWriter(file, fieldnames=self.headers)
        # First write the header line
        wr.writeheader()
        # Write all data to the file
        for d in self._formatDataCsv(numFormat):
            wr.writerow(d)

    def _formatDataC(self, numFormat: int) -> Generator[List[str], None, None]:
        # Yield one list of formatted number strings per data entry
        for d in self.data:
            tmp = list()
            for v in d.values():
                tmp.append(self._formatNumber(v, numFormat))
            yield tmp

    def writeC(self, file: io.IOBase, numFormat: int, fileName: str, arrayName: str) -> None:
        # Need to know size of all the arrays dimensions
        numEntries = len(self.data) + 1  # Zero terminated
        numHeaders = len(self.headers)
        numChars = 0
        # Iterate through all values and find the longest string
        for d in self.data:
            for v in d.values():
                l = len(self._formatNumber(v, numFormat))
                if l > numChars:
                    numChars = l
        numChars += 1  # One extra, for terminating zero

        # Print some general information comments
        print(f'// Created by {sys.argv[0]} with data from {fileName}\n// Number base: {numFormat}',
              file=file, end='\n\n')
        # Print some macros with meta data
        print(f'#define {arrayName.upper()}_NUM_ENTRIES {numEntries - 1}', file=file)
        print(f'#define {arrayName.upper()}_NUM_HEADERS {numHeaders}', file=file)
        print(f'#define {arrayName.upper()}_NUM_CHARS {numChars - 1}', file=file, end='\n\n')
        # Print a comment with the headers
        print(f'// [{", ".join(self.headers)}]', file=file)
        # Write the actual data
        print(f'char {arrayName}[{numEntries}][{numHeaders}][{numChars}] = {{', file=file)
        for data in self._formatDataC(numFormat):
            print(f"""    {{"{'", "'.join(data)}"}},""", file=file)
        # End with zero termination
        print('    {0}\n};', file=file)


################################################################################
#                                                                              #
#                                   Addition                                   #
#                                                                              #
################################################################################

class ModAddData(DataABC):
    def __init__(self, file: io.IOBase, numBase: int):
        super().__init__(['modulo', 'operand1', 'operand2', 'result'], file, numBase)

    def calculate(self):
        # For each entry calculate op1+op2 % mod
        for i, d in enumerate(self.data):
            self.data[i]['result'] = (d['operand1'] + d['operand2']) % d['modulo']


################################################################################
#                                                                              #
#                                Multiplication                                #
#                                                                              #
################################################################################

class ModMulData(DataABC):
    def __init__(self, file: io.IOBase, numBase: int):
        super().__init__(['modulo', 'operand1', 'operand2', 'result'], file, numBase)

    def calculate(self):
        # For each entry calculate op1*op2 % mod
        for i, d in enumerate(self.data):
            self.data[i]['result'] = (d['operand1'] * d['operand2']) % d['modulo']


################################################################################
#                                                                              #
#                                   Division                                   #
#                                                                              #
################################################################################

class DivData(DataABC):
    def __init__(self, file: io.IOBase, numBase: int):
        super().__init__(['operand1', 'operand2', 'result'], file, numBase)

    def calculate(self):
        # For each entry calculate op1/op2, integer division
        for i, d in enumerate(self.data):
            op1 = d['operand1']
            op2 = d['operand2']
            # Python's // rounds toward negative infinity (e.g. 3 // -4 == -1),
            # which is not the result wanted here, so divide the magnitudes
            # instead and use basic arithmetic rules for determining the
            # result sign.
            if op1 < 0 and op2 < 0:
                self.data[i]['result'] = abs(op1) // abs(op2)
            elif op1 < 0:
                self.data[i]['result'] = -(abs(op1) // op2)
            elif op2 < 0:
                self.data[i]['result'] = -(op1 // abs(op2))
            else:
                self.data[i]['result'] = op1 // op2


################################################################################
#                                                                              #
#                                  Main code                                   #
#                                                                              #
################################################################################

if __name__ == "__main__":
    # Setup argparse
    par = argparse.ArgumentParser()
    par.add_argument('FILE', type=str, help='data file on either hexa, binary or decimal format.')
    par.add_argument('-o', metavar="FILE", type=str, help='optional output file')
    par.add_argument('-c', action='store_true', help='output the data as C-array instead of CSV')
    par.add_argument('-b', action='store_true', help='create a backup file')
    # Use a mutually exclusive group for selecting number format
    formatGroup = par.add_mutually_exclusive_group(required=True)
    formatGroup.add_argument('--dec', action='store_true', help='input data is on decimal format.')
    formatGroup.add_argument('--hex', action='store_true', help='input data is on hexadecimal format.')
    formatGroup.add_argument('--bin', action='store_true', help='input data is on binary format.')
    # Use a mutually exclusive group for selecting output number format
    formatGroup = par.add_mutually_exclusive_group(required=False)
    formatGroup.add_argument('--outDec', action='store_true', help='output data is on decimal format.')
    formatGroup.add_argument('--outHex', action='store_true', help='output data is on hexadecimal format.')
    formatGroup.add_argument('--outBin', action='store_true', help='output data is on binary format.')
    # Use a mutually exclusive group for selecting operation
    operationGroup = par.add_mutually_exclusive_group(required=True)
    operationGroup.add_argument('--add', action='store_true', help='calculate data for modular addition.')
    operationGroup.add_argument('--mul', action='store_true', help='calculate data for modular multiplication.')
    operationGroup.add_argument('--div', action='store_true', help='calculate data for integer division.')

    args = vars(par.parse_args())
    dataFile = args['FILE']
    bkupFile = f'{dataFile}.backup'
    outFile = args['o'] if args['o'] else dataFile
    csvOut = not args['c']

    # Select data operation
    if args['add']:
        dataClass = ModAddData
        cArrayName = 'dataAdd'
    elif args['mul']:
        dataClass = ModMulData
        cArrayName = 'dataMul'
    elif args['div']:
        dataClass = DivData
        cArrayName = 'dataDiv'

    # Select input number base
    if args['dec']:
        inBase = 10
    elif args['hex']:
        inBase = 16
    elif args['bin']:
        inBase = 2
    # Select output number base
    if args['outDec']:
        outBase = 10
    elif args['outHex']:
        outBase = 16
    elif args['outBin']:
        outBase = 2
    else:
        outBase = inBase
    cArrayName = f'{cArrayName}{outBase}'

    # Perform calculation
    try:
        with open(dataFile, 'r', newline='') as fin:
            data = dataClass(fin, inBase)
        data.calculate()
        if args['b']:
            shutil.copy(dataFile, bkupFile)
        with open(outFile, 'w', newline='') as fout:
            if csvOut:
                data.writeCsv(fout, outBase)
            else:
                data.writeC(fout, outBase, outFile, cArrayName)
    except DataError as e:
        print(e, file=sys.stderr)
LISTING A.1: Python script for generating test data
Appendix B
Internal Interfaces SV Code
interface in_Registers;
    logic [NUM_REGS-1:0][WORD_WIDTH:0] registers;
    logic [WORD_WIDTH:0]               wData;
    logic [3:0]                        wReg;
    logic                              wEnable;

    modport slave (
        output registers,
        input  wData,
        input  wReg,
        input  wEnable
    );
    modport master (
        input  registers,
        output wData,
        output wReg,
        output wEnable
    );
endinterface

interface in_OpModule;
    logic       ready;
    logic       error;
    logic       valid;
    logic [3:0] opcode;
    logic [3:0] op1Reg;
    logic [3:0] op2Reg;
    logic [3:0] resReg;

    modport slave (
        output ready,
        output error,
        input  valid,
        input  opcode,
        input  op1Reg,
        input  op2Reg,
        input  resReg
    );
    modport master (
        input  ready,
        input  error,
        output valid,
        output opcode,
        output op1Reg,
        output op2Reg,
        output resReg
    );
endinterface
LISTING B.1: SystemVerilog code for the internal interfaces of ECCo.
Appendix C
Test Data
modulo , operand1 , operand2 , r e s u l t

7 ,15 ,1 ,2
11 ,3 ,2 ,5
11 ,3 , − 4 ,10
233 ,75 ,77 ,152
233 ,567 ,895 ,64
233 ,567 , − 895 ,138
28657 ,16578 ,19504 ,7425
514229 ,546500 ,357980 ,390251
99194853094755497 ,98275954794755497 ,12457956214 ,98275967252711711
99194853094755497 ,98275954794755497 , − 12457956214 ,98275942336799283
92567853094755497 ,98275954794755497 ,92657924597654697 ,5798173202899200
92567853094755497 , − 98275954794755497 , − 92657924597654697 ,86769679891856297
75356465794755497 ,65245765798756497 ,70253759756423697 ,60143059760424697
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,55228977
55228977394393414412853003502097247104908965897402951232160234933662925082798 ,45228977
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,55228977
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,65228977
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,35289773
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,95289773
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,85289773
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,45228977
Listing C.1: Modular addition test data.
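The rows of Listing C.1 can be cross-checked against a software reference. The following is a minimal sketch, not the thesis's test bench: the names `mod_add` and `check_rows` are hypothetical, and only a few complete rows from the listing are copied in. It assumes the expected result is always the non-negative residue, which is what the negative-operand rows suggest.

```python
# Rows copied from Listing C.1: (modulo, operand1, operand2, result).
ROWS = [
    (7, 15, 1, 2),
    (11, 3, -4, 10),
    (233, 567, -895, 138),
    (92567853094755497, -98275954794755497, -92657924597654697,
     86769679891856297),
]


def mod_add(modulo, op1, op2):
    """Reference modular addition; the result lies in [0, modulo)."""
    return (op1 + op2) % modulo


def check_rows(rows, op):
    """Return the rows whose expected result differs from op(modulo, op1, op2)."""
    return [row for row in rows if op(*row[:3]) != row[3]]


# An empty list means every copied row agrees with the reference.
print(check_rows(ROWS, mod_add))
```

Python's `%` operator returns a non-negative value for a positive modulus, so no extra correction step is needed for negative operands.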
modulo,operand1,operand2,result
7,15,1,1
11,3,2,6
233,75,77,183
233,567,895,224
28657,16578,19504,381
514229,546500,357980,218095
99194853094755497,98275954794755497,12457956214,31017271154744113
92567853094755497,98275954794755497,92657924597654697,48036520782282743
75356465794755497,65245765798756497,70253759756423697,65782237743603078
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,5522
55228977394393414412853003502097247104908965897402951232160234933662925082798 ,4522
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,5522
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,6522
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,3528
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,9528
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,8528
74225698149877013133163669918490695756676765155849109751738796007550114900164 ,4522
Listing C.2: Modular multiplication test data.
operand1,operand2,result
5,1,5
3,2,1
3,-4,0
75,77,0
567,895,0
567,-895,0
16578,19504,0
546500,357980,1
98275954794755497,12457956214,7888609
98275954794755497,-12457956214,-7888609
98275954794755497,92657924597654697,1
98275954794755497,97,1013154173141809
98275954794755497,-97,-1013154173141809
65245765798756497,70256423697,928680
55228977394654679572853003502097247104908965897402951232160234933662925082798 ,4128
65228977394654679572853003502097247104908965897402951232160234933662925082798 ,4128
3528977394654679572853003502097247104908965897402951232160234933662925082798 ,41285
9528977394654679572853003502097247104908965897402951232160234933662925082798 ,91285
8528977394654679572853003502097247104908965897402951232160234933662925082798 ,91285
45228977394393414412853003502097247104908965897402951232160234933662925082798 ,1329
Listing C.3: Integer division test data.
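The negative rows in Listing C.3 appear consistent with a quotient rounded toward zero, which differs from Python's floor division when the operands have opposite signs. The sketch below illustrates that reading; `trunc_div` is a hypothetical helper introduced here, not a function from the thesis's test script.

```python
def trunc_div(op1, op2):
    """Integer division rounding toward zero (Python's // rounds toward -inf)."""
    quotient = abs(op1) // abs(op2)
    return -quotient if (op1 < 0) != (op2 < 0) else quotient


# Rows copied from Listing C.3: (operand1, operand2, result).
ROWS = [
    (3, -4, 0),
    (567, -895, 0),
    (98275954794755497, -12457956214, -7888609),
    (98275954794755497, -97, -1013154173141809),
]

# Rows that disagree with truncating division (empty if all agree).
mismatches = [row for row in ROWS if trunc_div(row[0], row[1]) != row[2]]
print(mismatches)

# Floor division gives a different quotient for the same operands.
print(98275954794755497 // -97)
```

For operands of the same sign the two conventions agree, so only the mixed-sign rows distinguish them.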

Appendix D
ECCo C Wrapper
#ifndef ECC_H
#define ECC_H

/*************************************************************
 *                                                           *
 *                  Internal ecc.h macros                    *
 *                                                           *
 *************************************************************/

// Coprocessor number of the ECCo
#define __ECC_COPROC "p0"

/*************
 *  Opcodes  *
 *************/

// Arithmetic
#define __ECC_OPC1_MUL "0x0"
#define __ECC_OPC1_ADD "0x1"
#define __ECC_OPC1_DIV "0x2"
#define __ECC_OPC1_NEG "0x3"
// Logical
#define __ECC_OPC1_LOG "0xd"
#define __ECC_OPC2_OR "0x0"
#define __ECC_OPC2_AND "0x1"
#define __ECC_OPC2_XOR "0x2"
#define __ECC_OPC2_NOT "0x3"
// Shift
#define __ECC_OPC1_SFT "0xe"
#define __ECC_OPC2_LSL "0x0"
#define __ECC_OPC2_LSR "0x1"
#define __ECC_OPC2_ASR "0x2"
// Comparison
#define __ECC_OPC1_CMP "0xf"
#define __ECC_OPC2_ZR "0x0"
#define __ECC_OPC2_NZR "0x1"
#define __ECC_OPC2_EQ "0x2"
#define __ECC_OPC2_NEQ "0x3"
#define __ECC_OPC2_LT "0x4"
#define __ECC_OPC2_GT "0x5"
// Miscellaneous
#define __ECC_OPC1_INC "0xa"
#define __ECC_OPC1_DEC "0xb"
#define __ECC_OPC1_SSB "0xc"
#define __ECC_OPC2_SSB "0x0"
#define __ECC_OPC1_USB "0xc"
#define __ECC_OPC2_USB "0x1"


/*************************************************************
 *                                                           *
 *                  Exported ecc.h macros                    *
 *                                                           *
 *************************************************************/

#ifndef NULL
#define NULL ((void *) 0)
#endif

/********************************
 *  Coprocessor interface meta  *
 ********************************/

#define ECC_OP1_WIDTH 4
#define ECC_OP1_MAX 15
#define ECC_OP2_WIDTH 3
#define ECC_OP2_MAX 7
#define ECC_REG_IDX_WIDTH 4
#define ECC_REG_IDX_MAX 15
#define ECC_WORD_WIDTH 256
#define ECC_WORD_WIDTH_BYTE (ECC_WORD_WIDTH/8)
#define ECC_MODULO_REG "14"
#define ECC_STATUS_REG "15"

/***************************
 *  Arithmetic operations  *
 ***************************/

// All arguments are coprocessor register indexes, which must be integers in double quotes.
#define ECC_MUL(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_MUL", cr"op2Reg", cr"op1Reg", cr"resReg", #0")
#define ECC_ADD(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_ADD", cr"op2Reg", cr"op1Reg", cr"resReg", #0")
#define ECC_DIV(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_DIV", cr"op2Reg", cr"op1Reg", cr"resReg", #0")
#define ECC_NEG(opReg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_NEG", cr0, cr"opReg", cr"resReg", #0")


/************************
 *  Logical operations  *
 ************************/

// All arguments are coprocessor register indexes, which must be integers in double quotes.
#define ECC_OR(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_LOG", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_OR)
#define ECC_AND(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_LOG", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_AND)
#define ECC_XOR(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_LOG", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_XOR)
#define ECC_NOT(opReg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_LOG", cr0, cr"opReg", cr"resReg", #"__ECC_OPC2_NOT)


/**********************
 *  Shift operations  *
 **********************/

// All arguments are coprocessor register indexes, which must be integers in double quotes.
#define ECC_LSL(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_SFT", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_LSL)
#define ECC_LSR(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_SFT", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_LSR)
#define ECC_ASR(op1Reg, op2Reg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_SFT", cr"op2Reg", cr"op1Reg", cr"resReg", #"__ECC_OPC2_ASR)


/***************************
 *  Comparison operations  *
 ***************************/

// All arguments are coprocessor register indexes, which must be integers in double quotes.
#define ECC_ZR(reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr0, cr"reg", cr0, #"__ECC_OPC2_ZR)
#define ECC_NZR(reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr0, cr"reg", cr0, #"__ECC_OPC2_NZR)
#define ECC_EQ(op1Reg, op2Reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr"op2Reg", cr"op1Reg", cr0, #"__ECC_OPC2_EQ)
#define ECC_NEQ(op1Reg, op2Reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr"op2Reg", cr"op1Reg", cr0, #"__ECC_OPC2_NEQ)
#define ECC_LT(op1Reg, op2Reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr"op2Reg", cr"op1Reg", cr0, #"__ECC_OPC2_LT)
#define ECC_GT(op1Reg, op2Reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_CMP", cr"op2Reg", cr"op1Reg", cr0, #"__ECC_OPC2_GT)


/******************************
 *  Miscellaneous operations  *
 ******************************/

// All arguments are coprocessor register indexes, which must be integers in double quotes.
#define ECC_INC(opReg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_INC", cr0, cr"opReg", cr"resReg", #0")
#define ECC_DEC(opReg, resReg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_DEC", cr0, cr"opReg", cr"resReg", #0")
#define ECC_SSB(reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_SSB", cr0, cr"reg", cr0, #"__ECC_OPC2_SSB)
#define ECC_USB(reg) asm volatile("cdp "__ECC_COPROC", #"__ECC_OPC1_USB", cr0, cr"reg", cr0, #"__ECC_OPC2_USB)


/**************************
 *  Data transfer macros  *
 **************************/

/* Load coprocessor register macros. Offset is in hexadecimal. 'reg' is a coprocessor
   register index and must be a decimal integer in double quotes. 'Rt' and 'Rt2' are
   32-bit input variables. */
#define ECC_LOAD_0(Rt, Rt2, reg) asm volatile("mcrr "__ECC_COPROC", #0x0, %0, %1, cr"reg :: "rm"(Rt), "rm"(Rt2))
139 # i f ECC_WORD_WIDTH > 64
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
141 # else
142 # d e f i n e ECC_LOAD_1( Rt , Rt2 , reg )
143 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
146 # else
148 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
151 # else
153 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
156 # else
158 # endif

%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
161 # else
163 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
166 # else
168 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
171 # else
173 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
176 # else
178 # endif
%0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
181 # else
183 # endif
185 # d e f i n e ECC_LOAD_10 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0 xa
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
186 # else
187 # d e f i n e ECC_LOAD_10 ( Rt , Rt2 , reg )
188 # endif
190 # d e f i n e ECC_LOAD_11 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0xb
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
191 # else
193 # endif
195 # d e f i n e ECC_LOAD_12 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0 xc
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
196 # else
198 # endif
200 # d e f i n e ECC_LOAD_13 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0xd
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
201 # else
203 # endif
205 # d e f i n e ECC_LOAD_14 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0 xe
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
206 # else
208 # endif
210 # d e f i n e ECC_LOAD_15 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mcrr "__ECC_COPROC" , #0 xf
, %0, %1, c r " reg : : "rm" ( Rt ) , "rm" ( Rt2 ) )
211 # else
213 # endif

/* Store coprocessor register macros. Offset is in hexadecimal. 'reg' is a coprocessor
   register index and must be a decimal integer in double quotes. 'Rt' and 'Rt2' are
   32-bit output variables. */
#define ECC_STORE_0(Rt, Rt2, reg) asm volatile("mrrc "__ECC_COPROC", #0x0, %0, %1, cr"reg : "=rm"(Rt), "=rm"(Rt2))
220 # d e f i n e ECC_STORE_1 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mrrc "__ECC_COPROC" , #0 x1
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
221 # else
222 # d e f i n e ECC_STORE_1 ( Rt , Rt2 , reg )
223 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
226 # else
228 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
231 # else
233 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
236 # else
238 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
241 # else
243 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
246 # else
248 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
251 # else
253 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
256 # else
258 # endif
, %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
261 # else
263 # endif
265 # d e f i n e ECC_STORE_10 ( Rt , Rt2 , reg ) asm v o l a t i l e ( " mrrc "__ECC_COPROC" , #0
xa , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
266 # else
268 # endif
xb , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
271 # else
273 # endif
xc , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
276 # else

278 # endif
xd , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
281 # else
283 # endif
xe , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
286 # else
288 # endif
xf , %0, %1, c r " reg : " =rm" ( Rt ) , " =rm" ( Rt2 ) )
291 # else
293 # endif

#endif // ECC_H
Listing D.1: ECCo C wrapper source.
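The operation macros in Listing D.1 build one assembly string per instruction by placing string literals next to each other, which the C compiler concatenates. The sketch below mimics that concatenation for the three-operand arithmetic macros so the resulting instruction text can be inspected. The function `cdp_three_operand` and the dictionary `OPC1` are hypothetical names introduced for illustration; the string values are taken from the listing.

```python
# String values as defined by __ECC_COPROC and __ECC_OPC1_* in Listing D.1.
ECC_COPROC = "p0"
OPC1 = {"MUL": "0x0", "ADD": "0x1", "DIV": "0x2", "NEG": "0x3"}


def cdp_three_operand(op, op1_reg, op2_reg, res_reg):
    """Build the instruction text that ECC_MUL, ECC_ADD and ECC_DIV concatenate.

    Register indexes are passed as strings, as the macros require.
    """
    return ("cdp " + ECC_COPROC + ", #" + OPC1[op]
            + ", cr" + op2_reg + ", cr" + op1_reg + ", cr" + res_reg + ", #0")


# Text corresponding to ECC_MUL("1", "2", "3").
print(cdp_three_operand("MUL", "1", "2", "3"))
```

Note that the macros place `op2Reg` before `op1Reg` in the operand list, which the sketch reproduces; this is also why the register indexes must be string literals rather than integer expressions.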

Appendix E
ECCo Big Number library
#ifndef ECC_WORD_H
#define ECC_WORD_H

#include <stdbool.h>

#include "ecc.h"

/* Length of array in word struct. Defined here instead of ecc.h since it depends
   on array type. */
#define EW_LENGTH (ECC_WORD_WIDTH_BYTE/sizeof(int))

/* +4 to fit terminating '\0', leading '0b' and optional '-' sign. */
#define EW_STR_LENGTH (ECC_WORD_WIDTH+4)

/* ecc_word is the datatype to work with big numbers with the same width as
   the ECC coprocessor's word size. */
typedef struct {
    int word[EW_LENGTH];
    bool is_zero;
    bool is_negative;
} ecc_word_t;

/* String type big enough to represent any number in either
   binary, decimal or hexadecimal format. */
typedef char ew_str_t[EW_STR_LENGTH];

/* Initializes an ecc_word. Returns a pointer to the given word. */
ecc_word_t *ew_init(ecc_word_t *);

/* Creates a new copy of an ecc_word. Returns a pointer to dst. */
ecc_word_t *ew_copy(const ecc_word_t *restrict src, ecc_word_t *restrict dst);


/**************************************************************
 *                                                            *
 *                     Content handlers                       *
 *                                                            *
 **************************************************************/

/* Sets the content of an ecc_word to 0. Returns a pointer to the given word. */
ecc_word_t *ew_zero(ecc_word_t *);

/* Set the value to an integer value. */
ecc_word_t *ew_set_int(ecc_word_t *, int);

/* Set the value of a word to a number represented by a string in hexadecimal
   (0x prefix) format. Return a pointer to the word, or NULL on failure. */
ecc_word_t *ew_set_str(ecc_word_t *, const char[]);

/* Set parts of the content of a word, based on the given offset. */
ecc_word_t *ew_set_offs(ecc_word_t *w, int offs, int r1, int r2);

/* Return a pointer to the hexadecimal formatted string of the number. */
char *ew_to_str(const ecc_word_t *, char[], int);


/**************************************************************
 *                                                            *
 *                        Comparison                          *
 *                                                            *
 **************************************************************/

/* Check if two words are equal. */
bool ew_eq(const ecc_word_t *, const ecc_word_t *);


/**************************************************************
 *                                                            *
 *                  Coprocessor interaction                   *
 *                                                            *
 **************************************************************/

/* Load the given word into a coprocessor register. */
void ew_load_cr0(const ecc_word_t *);
/* CP register 15 is status register and unwriteable */

/* Store the value of a coprocessor's register in the given word. Takes
   coprocessor register index as second argument. */
void ew_store_cr0(ecc_word_t *);
void ew_store_cr10(ecc_word_t *);

/* Convenience macros */
#define EW_LOAD_MOD(WORD) ew_load_cr14(WORD)
#define EW_STORE_MOD(WORD) ew_store_cr14(WORD)
#define EW_STORE_STATUS(WORD) ew_store_cr15(WORD)


/**************************
 *  Offset select macros  *
 **************************/

#define EW_GET_0(Rt, Rt2, W) Rt = W->word[0]; Rt2 = W->word[1]

123 # else
124 # d e f i n e EW_GET_1 ( Rt , Rt2 , W)
125 # endif
128 # else
130 # endif
133 # else
135 # endif
138 # else
140 # endif
142 # d e f i n e EW_GET_5 ( Rt , Rt2 , W) Rt = W−>word [ 1 0 ] ; Rt2 = W−>word [ 1 1 ]
143 # else
145 # endif
148 # else
150 # endif
153 # else
155 # endif
158 # else
160 # endif
163 # else
165 # endif
168 # else
170 # endif
173 # else
175 # endif
178 # else
180 # endif
183 # else
185 # endif
188 # else
190 # endif
193 # else

195 # endif
196
197 # d e f i n e EW_SET_0 ( Rt , Rt2 , W) e w _ s e t _ o f f s (W, 0 , Rt , Rt2 )
200 # else
201 # d e f i n e EW_SET_1 ( Rt , Rt2 , W)
202 # endif
205 # else
207 # endif
210 # else
212 # endif
215 # else
217 # endif
220 # else
222 # endif
225 # else
227 # endif
230 # else
232 # endif
235 # else
237 # endif
240 # else
242 # endif
244 # d e f i n e EW_SET_10 ( Rt , Rt2 , W) e w _ s e t _ o f f s (W, 1 0 , Rt , Rt2 )
245 # else
247 # endif
250 # else
252 # endif
255 # else
257 # endif
260 # else
262 # endif
265 # else

267 # endif
270 # else
272 # endif

#endif // ECC_WORD_H
Listing E.1: Header file for big number implementation of an ECCo word.
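The offset select macros in Listing E.1 pick two adjacent array elements per transfer: `EW_GET_0` reads `word[0]` and `word[1]`, and `EW_GET_5` reads `word[10]` and `word[11]`. The sketch below models that indexing. It assumes a 32-bit `int` and that `word[0]` holds the least significant bits, which is how the string parser in the accompanying source file fills the array; `to_words` and `offset_pair` are hypothetical helper names.

```python
WORD_BITS = 32          # assumption: C int is 32 bits wide
ECC_WORD_WIDTH = 256    # ECC_WORD_WIDTH from the C wrapper
EW_LENGTH = ECC_WORD_WIDTH // WORD_BITS


def to_words(value):
    """Split a non-negative integer into EW_LENGTH words, least significant first."""
    mask = (1 << WORD_BITS) - 1
    return [(value >> (WORD_BITS * i)) & mask for i in range(EW_LENGTH)]


def offset_pair(words, offset):
    """Return the (Rt, Rt2) pair for a 64-bit offset, i.e. words 2n and 2n+1."""
    return words[2 * offset], words[2 * offset + 1]


words = to_words(0x112233445566778899AABBCCDDEEFF00)
print([hex(x) for x in offset_pair(words, 0)])
print([hex(x) for x in offset_pair(words, 1)])
```

With a 256-bit word and these assumptions, only offsets 0 to 3 carry data, which matches the conditional compilation of the higher offset macros on `ECC_WORD_WIDTH`.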
#include "ecc_word.h"

#include <ee_printf.h>


ecc_word_t *
ew_init(ecc_word_t *w)
{
    for (int i = 0; i < EW_LENGTH; i++)
        w->word[i] = 0;
    w->is_zero = true;
    w->is_negative = false;
    return w;
}

ecc_word_t *
ew_copy(const ecc_word_t *restrict src, ecc_word_t *restrict dst)
{
    if (!src->is_zero)
        for (int i = 0; i < EW_LENGTH; i++)
            dst->word[i] = src->word[i];
    else
        for (int i = 0; i < EW_LENGTH; i++)
            dst->word[i] = 0;

    dst->is_zero = src->is_zero;
    dst->is_negative = src->is_negative;
    return dst;
}


/**************************************************************
 *                                                            *
 *                     Content handlers                       *
 *                                                            *
 **************************************************************/

ecc_word_t *
ew_zero(ecc_word_t *w)
{
    if (!w->is_zero) {
        for (int i = 0; i < EW_LENGTH; i++)
            w->word[i] = 0;
        w->is_zero = 1;
    }
    return w;
}

ecc_word_t *
ew_set_int(ecc_word_t *w, int val)
{
    ew_zero(w);
    w->word[0] = val;
    w->is_zero = false;
    return w;
}

ecc_word_t *
ew_set_str(ecc_word_t *w, const char str[])
{
    int shift, tmp;
    int *num = w->word;
    const char *c;

    /* Advance c to the terminating '\0' */
    for (c = str; *c != '\0'; c++)
        ;

    /* Check sign */
    if (*str == '-') {
        w->is_negative = true;
        str++;
    }
    else
        w->is_negative = false;

    /* Sanity checks */
    if (*str++ != '0') {
        MSG(("ew_set_str: badly formatted string, must start with '0x' or '-0x'\n"));
        return NULL;
    }
    if (*str != 'x') {
        MSG(("ew_set_str: badly formatted string, must start with '0x' or '-0x'\n"));
        return NULL;
    }

    /* Set word to zero if non-zero */
    if (!w->is_zero) {
        do
            *num = 0;
        while (++num != w->word+EW_LENGTH);
        w->is_zero = true;
        num = w->word;
    }

    /* Parse hex digits from the end of the string, eight digits per word */
    do {
        tmp = 0;
        for (shift = 0; shift < 32 && --c != str; shift += 4) {
            switch (*c) {
            case 'f': case 'F':
                tmp ^= 0xf << shift;
                break;
            case 'e': case 'E':
                tmp ^= 0xe << shift;
                break;
            case 'd': case 'D':
                tmp ^= 0xd << shift;
                break;
            case 'c': case 'C':
                tmp ^= 0xc << shift;
                break;
            case 'b': case 'B':
                tmp ^= 0xb << shift;
                break;
            case 'a': case 'A':
                tmp ^= 0xa << shift;
                break;
            default:
                /* Anything outside '0'..'9' is not a valid digit here */
                if (*c < '0' || *c > '9') {
                    MSG(("ew_set_str: invalid character in string: %c", *c));
                    return NULL;
                }
                tmp ^= (*c - '0') << shift;
            }
        }
        if (tmp && w->is_zero)
            w->is_zero = false;
        *num = tmp;
    } while (c != str && ++num != w->word+EW_LENGTH);

    return w;
}

ecc_word_t *
ew_set_offs(ecc_word_t *w, int offs, int r1, int r2)
{
    if (w->is_zero)
        if (r1 || r2)
            w->is_zero = false;
    offs *= 2;
    w->word[offs] = r1;
    w->word[offs+1] = r2;
    return w;
}

char *
ew_to_str(const ecc_word_t *w, char s[], int sz)
{
    int i = 0, shift;
    const int *num = w->word+EW_LENGTH;
    unsigned char tmp;

    if (sz < 4) {
        MSG(("ew_to_str: too small string: sz = %d\n", sz));
        return NULL;
    }
    if (w->is_negative)
        s[i++] = '-';
    s[i++] = '0';
    s[i++] = 'x';

    /* Emit eight hex digits per word, most significant word first */
    while (i < sz && num-- != w->word)
        for (shift = 28; shift >= 0 && i < sz; shift -= 4, i++)
            switch ((tmp = (*num >> shift) & 0xf)) {
            case 0xf:
                s[i] = 'f';
                break;
            case 0xe:
                s[i] = 'e';
                break;
            case 0xd:
                s[i] = 'd';
                break;
            case 0xc:
                s[i] = 'c';
                break;
            case 0xb:
                s[i] = 'b';
                break;
            case 0xa:
                s[i] = 'a';
                break;
            default:
                s[i] = (tmp > 9) ? 'X' : tmp + '0';
            }

    if (i < sz)
        s[i] = '\0';
    else {
        MSG(("ew_to_str: too small string: sz = %d\n", sz));
        return NULL;
    }
    return s;
}


/**************************************************************
 *                                                            *
 *                        Comparison                          *
 *                                                            *
 **************************************************************/

bool
ew_eq(const ecc_word_t *lhs, const ecc_word_t *rhs)
{
    const int *lw = lhs->word+EW_LENGTH;
    const int *rw = rhs->word+EW_LENGTH;

    if (lhs->is_zero && rhs->is_zero)
        return true;
    while (*--lw == *--rw)
        if (lw == lhs->word)
            return true;
    return false;
}

/**************************************************************
 *                                                            *
 *                     Coprocessor load                       *
 *                                                            *
 **************************************************************/

#define _EW_LOAD_CR(N) void ew_load_cr##N(const ecc_word_t *w) { \
    volatile register int r1, r2; \
    /* Offset 0 */ \
    EW_GET_0(r1, r2, w); \
    ECC_LOAD_0(r1, r2, #N); \
    /* Offset 1 */ \
    EW_GET_1(r1, r2, w); \
    ECC_LOAD_1(r1, r2, #N); \
    /* Offset 2 */ \
    EW_GET_2(r1, r2, w); \
    ECC_LOAD_2(r1, r2, #N); \
    /* Offset 3 */ \
    EW_GET_3(r1, r2, w); \
    ECC_LOAD_3(r1, r2, #N); \
    /* Offset 4 */ \
    EW_GET_4(r1, r2, w); \
    ECC_LOAD_4(r1, r2, #N); \
    /* Offset 5 */ \
    EW_GET_5(r1, r2, w); \
    ECC_LOAD_5(r1, r2, #N); \
    /* Offset 6 */ \
    EW_GET_6(r1, r2, w); \
    ECC_LOAD_6(r1, r2, #N); \
    /* Offset 7 */ \
    EW_GET_7(r1, r2, w); \
    ECC_LOAD_7(r1, r2, #N); \
    /* Offset 8 */ \
    EW_GET_8(r1, r2, w); \
    ECC_LOAD_8(r1, r2, #N); \
    /* Offset 9 */ \
    EW_GET_9(r1, r2, w); \
    ECC_LOAD_9(r1, r2, #N); \
    /* Offset a */ \
    EW_GET_10(r1, r2, w); \
    ECC_LOAD_10(r1, r2, #N); \
    /* Offset b */ \
    EW_GET_11(r1, r2, w); \
    ECC_LOAD_11(r1, r2, #N); \
    /* Offset c */ \
    EW_GET_12(r1, r2, w); \
    ECC_LOAD_12(r1, r2, #N); \
    /* Offset d */ \
    EW_GET_13(r1, r2, w); \
    ECC_LOAD_13(r1, r2, #N); \
    /* Offset e */ \
    EW_GET_14(r1, r2, w); \
    ECC_LOAD_14(r1, r2, #N); \
    /* Offset f */ \
    EW_GET_15(r1, r2, w); \
    ECC_LOAD_15(r1, r2, #N); \
    \
    if (w->is_negative) /* Set signed bit if negative */ \
        ECC_NEG(#N, #N); \
    else /* Else make sure it's unset */ \
        ECC_USB(#N); \
}

_EW_LOAD_CR(0)
_EW_LOAD_CR(10)
_EW_LOAD_CR(11)
_EW_LOAD_CR(12)
_EW_LOAD_CR(13)
_EW_LOAD_CR(14)


/**************************************************************
 *                                                            *
 *                     Coprocessor store                      *
 *                                                            *
 **************************************************************/

#define _EW_STORE_CR(N) void ew_store_cr##N(ecc_word_t *w) { \
    register int r1, r2; \
    unsigned mask; \
    \
    /* Check sign */ \
    ECC_STORE_0(r1, r2, ECC_STATUS_REG); \
    mask = 1 << (0x10 + N); \
    if (r1 & mask) { \
        w->is_negative = true; \
        ECC_NEG(#N, #N); \
    } \
    else \
        w->is_negative = false; \
    \
    w->is_zero = true; \
    /* Offset 0 */ \
    ECC_STORE_0(r1, r2, #N); \
    EW_SET_0(r1, r2, w); \
    /* Offset 1 */ \
    ECC_STORE_1(r1, r2, #N); \
    EW_SET_1(r1, r2, w); \
    /* Offset 2 */ \
    ECC_STORE_2(r1, r2, #N); \
    EW_SET_2(r1, r2, w); \
    /* Offset 3 */ \
    ECC_STORE_3(r1, r2, #N); \
    EW_SET_3(r1, r2, w); \
    /* Offset 4 */ \
    ECC_STORE_4(r1, r2, #N); \
    EW_SET_4(r1, r2, w); \
    /* Offset 5 */ \
    ECC_STORE_5(r1, r2, #N); \
    EW_SET_5(r1, r2, w); \
    /* Offset 6 */ \
    ECC_STORE_6(r1, r2, #N); \
    EW_SET_6(r1, r2, w); \
    /* Offset 7 */ \
    ECC_STORE_7(r1, r2, #N); \
    EW_SET_7(r1, r2, w); \
    /* Offset 8 */ \
    ECC_STORE_8(r1, r2, #N); \
    EW_SET_8(r1, r2, w); \
    /* Offset 9 */ \
    ECC_STORE_9(r1, r2, #N); \
    EW_SET_9(r1, r2, w); \
    /* Offset 10 */ \
    ECC_STORE_10(r1, r2, #N); \
    EW_SET_10(r1, r2, w); \
    /* Offset 11 */ \
    ECC_STORE_11(r1, r2, #N); \
    EW_SET_11(r1, r2, w); \
    /* Offset 12 */ \
    ECC_STORE_12(r1, r2, #N); \
    EW_SET_12(r1, r2, w); \
    /* Offset 13 */ \
    ECC_STORE_13(r1, r2, #N); \
    EW_SET_13(r1, r2, w); \
    /* Offset 14 */ \
    ECC_STORE_14(r1, r2, #N); \
    EW_SET_14(r1, r2, w); \
    /* Offset 15 */ \
    ECC_STORE_15(r1, r2, #N); \
    EW_SET_15(r1, r2, w); \
    \
    if (w->is_negative) \
        ECC_NEG(#N, #N); \
}

_EW_STORE_CR(0)
_EW_STORE_CR(10)
_EW_STORE_CR(11)
_EW_STORE_CR(12)
_EW_STORE_CR(13)
_EW_STORE_CR(14)

/* Store word from CP register 15. Does not care about sign since it's
   the status register */
void
ew_store_cr15(ecc_word_t *w)
{
    register int r1, r2;
    w->is_zero = true;
    /* Offset 0 */
    ECC_STORE_0(r1, r2, "15");
    EW_SET_0(r1, r2, w);
    /* Offset 1 */
    ECC_STORE_1(r1, r2, "15");
    EW_SET_1(r1, r2, w);
    /* Offset 2 */
    ECC_STORE_2(r1, r2, "15");
    EW_SET_2(r1, r2, w);
    /* Offset 3 */
    ECC_STORE_3(r1, r2, "15");
    EW_SET_3(r1, r2, w);
    /* Offset 4 */
    ECC_STORE_4(r1, r2, "15");
    EW_SET_4(r1, r2, w);
    /* Offset 5 */
    ECC_STORE_5(r1, r2, "15");
    EW_SET_5(r1, r2, w);
    /* Offset 6 */
    ECC_STORE_6(r1, r2, "15");
    EW_SET_6(r1, r2, w);
    /* Offset 7 */
    ECC_STORE_7(r1, r2, "15");
    EW_SET_7(r1, r2, w);
    /* Offset 8 */
    ECC_STORE_8(r1, r2, "15");
    EW_SET_8(r1, r2, w);
    /* Offset 9 */
    ECC_STORE_9(r1, r2, "15");
    EW_SET_9(r1, r2, w);
    /* Offset 10 */
    ECC_STORE_10(r1, r2, "15");
    EW_SET_10(r1, r2, w);
    /* Offset 11 */
    ECC_STORE_11(r1, r2, "15");
    EW_SET_11(r1, r2, w);
    /* Offset 12 */
    ECC_STORE_12(r1, r2, "15");
    EW_SET_12(r1, r2, w);
    /* Offset 13 */
    ECC_STORE_13(r1, r2, "15");
    EW_SET_13(r1, r2, w);
    /* Offset 14 */
    ECC_STORE_14(r1, r2, "15");
    EW_SET_14(r1, r2, w);
    /* Offset 15 */
    ECC_STORE_15(r1, r2, "15");
    EW_SET_15(r1, r2, w);
}
L ISTING E.2: Source file for big number

implementation of an ECCo word.
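The sixteen store/set pairs in ew_store_cr15 differ only in the offset, which is part of each macro name. As a rough illustration, the same unrolled sequence could be generated with an X-macro. Everything in the sketch below is a stand-in and not part of the ECCo library: sketch_word_t, fake_cp_reg, SKETCH_STORE and SKETCH_SET are hypothetical, the coprocessor transfer is replaced by a plain array read so that the code runs on any host, and the assumption that the set step clears is_zero when it sees a non-zero value is a guess at what EW_SET_n does.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_OFFSETS 16 /* offsets 0..15, as in the listing */

/* Hypothetical word layout: two 32-bit halves per offset plus a zero flag. */
typedef struct {
    uint32_t lo[SKETCH_NUM_OFFSETS];
    uint32_t hi[SKETCH_NUM_OFFSETS];
    bool is_zero;
} sketch_word_t;

/* Stand-in for one coprocessor register; real code would issue a
   coprocessor transfer instruction instead of reading an array. */
static uint32_t fake_cp_reg[SKETCH_NUM_OFFSETS][2];

/* Stand-ins for ECC_STORE_n and EW_SET_n, taking the offset as a literal. */
#define SKETCH_STORE(n, r1, r2) \
    do { (r1) = fake_cp_reg[n][0]; (r2) = fake_cp_reg[n][1]; } while (0)
#define SKETCH_SET(n, r1, r2, w) \
    do { \
        (w)->lo[n] = (r1); \
        (w)->hi[n] = (r2); \
        if ((r1) | (r2)) (w)->is_zero = false; /* assumed behaviour */ \
    } while (0)

/* X-macro: lists every offset exactly once. */
#define SKETCH_OFFSETS(X) \
    X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) \
    X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15)

/* One store/set pair; r1, r2 and w come from the enclosing function. */
#define SKETCH_STEP(n) SKETCH_STORE(n, r1, r2); SKETCH_SET(n, r1, r2, w);

void sketch_store_word(sketch_word_t *w)
{
    uint32_t r1, r2;
    w->is_zero = true;
    SKETCH_OFFSETS(SKETCH_STEP) /* expands to the 16 unrolled pairs */
}
```

The expansion is still fully unrolled at compile time, so each offset remains a literal token, which is what a macro family named by offset requires. Whether this is preferable to the handwritten form is a readability trade-off.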
Appendix F
Benchmark & Test program

/****************************************************************
 *                                                              *
 *                       Control macros                         *
 *                                                              *
 ****************************************************************/

// #define ONLY_HELLOW /* Only run a simple hello world */

/* Testing control macros */
// #define TEST_ARI       /* Test arithmetic module */
// #define TEST_ARI_NOADD /* Skip addition during arithmetic testing */
// #define TEST_ARI_NOMUL /* Skip multiplication during arithmetic testing */
// #define TEST_ARI_NODIV /* Skip division during arithmetic testing */
// #define TEST_ARI_NONEG /* Skip negation during arithmetic testing */
// #define TEST_REGS      /* Test register bank reading/writing */

/* Benchmarking control macros */
#define BENCHMARK /* Disable anything but the benchmarking code */
// #define BENCHMARK_ECC_ADDITION /* Perform additions with ECCo with minimal extra code */
// #define BENCHMARK_ANSSI_ADDITION /* Perform additions with ANSSI lib with minimal extra code */
// #define BENCHMARK_ECC_MULTIPLICATION /* Perform multiplication with ECCo with minimal extra code */
#define BENCHMARK_ANSSI_MULTIPLICATION /* Perform multiplication with ANSSI lib with minimal extra code */
// #define BENCHMARK_ITERATIONS 1 /* Number of iterations during benchmarking */
// #define BENCHMARK_ITERATIONS 10 /* Number of iterations during benchmarking */
#define BENCHMARK_ITERATIONS 100 /* Number of iterations during benchmarking */

/* ANSSI libecc control macros */
#define ANSSI_LIBECC

/* Sanity checks of macros */
#if (defined(BENCHMARK_ECC_ADDITION) && (defined(BENCHMARK_ANSSI_ADDITION) || \
        defined(BENCHMARK_ECC_MULTIPLICATION) || defined(BENCHMARK_ANSSI_MULTIPLICATION))) || \
    (defined(BENCHMARK_ANSSI_ADDITION) && (defined(BENCHMARK_ECC_ADDITION) || \
        defined(BENCHMARK_ECC_MULTIPLICATION) || defined(BENCHMARK_ANSSI_MULTIPLICATION))) || \
    (defined(BENCHMARK_ECC_MULTIPLICATION) && (defined(BENCHMARK_ANSSI_ADDITION) || \
        defined(BENCHMARK_ECC_ADDITION) || defined(BENCHMARK_ANSSI_MULTIPLICATION))) || \
    (defined(BENCHMARK_ANSSI_MULTIPLICATION) && (defined(BENCHMARK_ANSSI_ADDITION) || \
        defined(BENCHMARK_ECC_MULTIPLICATION) || defined(BENCHMARK_ECC_ADDITION)))
#error ("Only one BENCHMARK_ macro can be defined at a time")
#endif

#if (defined(BENCHMARK_ANSSI_ADDITION) || defined(BENCHMARK_ANSSI_MULTIPLICATION)) && \
    !defined(ANSSI_LIBECC)
#error ("ANSSI_LIBECC must be defined for ANSSI benchmarks")
#endif


/****************************************************************
 *                                                              *
 *                          Includes                            *
 *                                                              *
 ****************************************************************/

/* ARM CM33 */
#include <arm_cmse.h>
#include <cm4ss.h>
#include <ee_printf.h>
#include <cm33/secure/trustzone_util.h>

/* stdlib */
#include <string.h>

/* Coprocessor */
#include "ecc_word.h"
#include "division_data.h"
#include "modular_addition_data.h"
#include "modular_multiplication_data.h"

/* ANSSI libecc */
#ifdef ANSSI_LIBECC
#include "libarith.h"
#endif


/****************************************************************
 *                                                              *
 *                       Globals/Macros                         *
 *                                                              *
 ****************************************************************/

/* TZ_START_NS: Start address of non-secure application */
#ifndef TZ_START_NS
#define TZ_START_NS (0x80000U)
#endif

#define CPACR_ADDR ((unsigned *) 0xE000ED88U)


/****************************************************************
 *                                                              *
 *                         Test setup                           *
 *                                                              *
 ****************************************************************/

/* Arithmetic test functions */
bool test_ari_multiplication(char (*)[DATAMUL16_NUM_HEADERS][DATAMUL16_NUM_CHARS+1]);
bool test_ari_addition(char (*)[DATAADD16_NUM_HEADERS][DATAADD16_NUM_CHARS+1]);
bool test_ari_division(char (*)[DATADIV16_NUM_HEADERS][DATADIV16_NUM_CHARS+1]);

/* ANSSI libecc helpers */
#ifdef ANSSI_LIBECC
static void nn_import_from_hexbuf(nn_t out_nn, const char *hbuf, u32 hbuflen);
#endif

/* Benchmark value strings */
char add_op1_str[] = "0x63feb1ab67e6b315a2dea87e6547ba17e0daa6009366d19f14dbb427faee50ae";
char add_op1_buf[] = {
    0x63, 0xfe, 0xb1, 0xab, 0x67, 0xe6, 0xb3, 0x15, 0xa2, 0xde, 0xa8, 0x7e, 0x65, 0x47, 0xba, 0x17,
    0xe0, 0xda, 0xa6, 0x00, 0x93, 0x66, 0xd1, 0x9f, 0x14, 0xdb, 0xb4, 0x27, 0xfa, 0xee, 0x50, 0xae };
char add_op2_str[] = "0x2f08337b7ae05e16b4fada1ebbb4c7bb56009e5c141dc5b487db427faee50ae0";
char add_op2_buf[] = {
    0x2f, 0x08, 0x33, 0x7b, 0x7a, 0xe0, 0x5e, 0x16, 0xb4, 0xfa, 0xda, 0x1e, 0xbb, 0xb4, 0xc7, 0xbb,
    0x56, 0x00, 0x9e, 0x5c, 0x14, 0x1d, 0xc5, 0xb4, 0x87, 0xdb, 0x42, 0x7f, 0xae, 0xe5, 0x0a, 0xe0 };
char add_mod_str[] = "0xa41a41a12a799548211c410c65d8133afde34d28bdd542e4b680cf2899c8a8c4";
char add_mod_buf[] = {
    0xa4, 0x1a, 0x41, 0xa1, 0x2a, 0x79, 0x95, 0x48, 0x21, 0x1c, 0x41, 0x0c, 0x65, 0xd8, 0x13, 0x3a,
    0xfd, 0xe3, 0x4d, 0x28, 0xbd, 0xd5, 0x42, 0xe4, 0xb6, 0x80, 0xcf, 0x28, 0x99, 0xc8, 0xa8, 0xc4 };
char mul_op1_str[] = "0x63feb1ab67e6b315a2dea87e6547ba17e0daa6009366d19f14dbb427faee50ae";
char mul_op1_buf[] = {
    0x63, 0xfe, 0xb1, 0xab, 0x67, 0xe6, 0xb3, 0x15, 0xa2, 0xde, 0xa8, 0x7e, 0x65, 0x47, 0xba, 0x17,
    0xe0, 0xda, 0xa6, 0x00, 0x93, 0x66, 0xd1, 0x9f, 0x14, 0xdb, 0xb4, 0x27, 0xfa, 0xee, 0x50, 0xae };
char mul_op2_str[] = "0x02f08337b7ae05e16b4fada1ebbb4c7bb56009e5c141dc5b487db427faee50ae";
char mul_op2_buf[] = {
    0x02, 0xf0, 0x83, 0x37, 0xb7, 0xae, 0x05, 0xe1, 0x6b, 0x4f, 0xad, 0xa1, 0xeb, 0xbb, 0x4c, 0x7b,
    0xb5, 0x60, 0x09, 0xe5, 0xc1, 0x41, 0xdc, 0x5b, 0x48, 0x7d, 0xb4, 0x27, 0xfa, 0xee, 0x50, 0xae };
char mul_mod_str[] = "0xa41a41a12a799548211c410c65d8133afde34d28bdd542e4b680cf2899c8a8c4";
char mul_mod_buf[] = {
    0xa4, 0x1a, 0x41, 0xa1, 0x2a, 0x79, 0x95, 0x48, 0x21, 0x1c, 0x41, 0x0c, 0x65, 0xd8, 0x13, 0x3a,
    0xfd, 0xe3, 0x4d, 0x28, 0xbd, 0xd5, 0x42, 0xe4, 0xb6, 0x80, 0xcf, 0x28, 0x99, 0xc8, 0xa8, 0xc4 };

#define BM_STR_LEN 67
#define BM_BUF_LEN 32
#define BM_NN_LEN ((BM_STR_LEN / 2) / WORD_BYTES)


/****************************************************************
 *                                                              *
 *                         Secure main                          *
 *                                                              *
 ****************************************************************/

int
main(void)
{
#ifndef BENCHMARK
    MSG(("C-code: Secure firmware booting\n"));
    MSG((">>>>>>>> Running ECC firmware test.\n"));
#endif

    /* Enable coprocessor */
    *CPACR_ADDR ^= 0x01;

#ifdef ONLY_HELLOW

    MSG(("HELLO EC WORLD!\n"));

#else

    /****************************
     * Test arithmetic module   *
     ****************************/

#ifdef TEST_ARI
    /* Modular addition */
#ifndef TEST_ARI_NOADD
    MSG((">>>> Testing addition\n"));
    if (test_ari_addition(dataAdd16))
        MSG(("Success!\n"));
#endif
    /* Modular multiplication */
#ifndef TEST_ARI_NOMUL
    MSG((">>>> Testing multiplication\n"));
    if (test_ari_multiplication(dataMul16))
        MSG(("Success!\n"));
#endif
    /* Division */
#ifndef TEST_ARI_NODIV
    MSG((">>>> Testing division\n"));
    if (test_ari_division(dataDiv16))
        MSG(("Success!\n"));
#endif
#endif

    /*************************************
     * Benchmark modular addition w/CP   *
     *************************************/

#ifdef BENCHMARK_ECC_ADDITION
    ecc_word_t op1, op2, mod;
    /* Set parameter values */
    ew_set_str(&op1, add_op1_str);
    ew_set_str(&op2, add_op2_str);
    ew_set_str(&mod, add_mod_str);
    /* Load parameters to CP */
    ew_load_cr0(&op1);
    EW_LOAD_MOD(&mod);
    /* Perform N number of additions */
    for (int i = 0; i < BENCHMARK_ITERATIONS; ++i)
        ECC_ADD("0", "1", "0");
#endif

    /********************************************
     * Benchmark modular addition in software   *
     ********************************************/

#ifdef BENCHMARK_ANSSI_ADDITION
    nn nn_op1, nn_op2, nn_mod;
    fp fp_op1, fp_op2;
    fp_ctx fp_ctx; /* Finite field context - size of field etc. */
    /* Initialize and set parameter values */
    nn_init_from_buf(&nn_op1, add_op1_buf, BM_BUF_LEN);
    nn_init_from_buf(&nn_op2, add_op2_buf, BM_BUF_LEN);
    nn_init_from_buf(&nn_mod, add_mod_buf, BM_BUF_LEN);
    fp_ctx_init_from_p(&fp_ctx, &nn_mod);
    fp_init(&fp_op1, &fp_ctx);
    fp_op1.fp_val = nn_op1;
    fp_op2.fp_val = nn_op2;
    fp_add(&fp_op1, &fp_op1, &fp_op2);
#endif

    /*******************************************
     * Benchmark modular multiplication w/CP   *
     *******************************************/

#ifdef BENCHMARK_ECC_MULTIPLICATION
    ecc_word_t op1, op2, mod;
    /* Set parameter values */
    ew_set_str(&op1, mul_op1_str);
    ew_set_str(&op2, mul_op2_str);
    ew_set_str(&mod, mul_mod_str);
    /* Load parameters to CP */
    ECC_MUL("0", "1", "0");
#endif

    /**************************************************
     * Benchmark modular multiplication in software   *
     **************************************************/

#ifdef BENCHMARK_ANSSI_MULTIPLICATION
    nn nn_op1, nn_op2, nn_mod;
    fp fp_op1, fp_op2;
    fp_ctx fp_ctx; /* Finite field context - size of field etc. */
    /* Initialize and set parameter values */
    nn_init_from_buf(&nn_op1, mul_op1_buf, BM_BUF_LEN);
    nn_init_from_buf(&nn_op2, mul_op2_buf, BM_BUF_LEN);
    nn_init_from_buf(&nn_mod, mul_mod_buf, BM_BUF_LEN);
    fp_ctx_init_from_p(&fp_ctx, &nn_mod);
    fp_op1.fp_val = nn_op1;
    fp_op2.fp_val = nn_op2;
    fp_mul(&fp_op1, &fp_op1, &fp_op2);
#endif

#endif

#ifndef BENCHMARK
    MSG((">>>>>>>> Finished ECC firmware test.\n\n"));
#endif

    finish_test(TEST_PASS);
    return 0; // This line will never execute as boot_nonsec_program never returns
}


/****************************************************************
 *                                                              *
 *                        Test functions                        *
 *                                                              *
 ****************************************************************/

/************************
 * Arithmetic module    *
 ************************/

bool
test_ari_addition(char (*data)[DATAADD16_NUM_HEADERS][DATAADD16_NUM_CHARS+1])
{
    int i = 0;
    char (*entry)[DATAADD16_NUM_CHARS+1];
    ew_str_t mod_s, op1_s, op2_s, sol_s, res_s;
    ecc_word_t mod, op1, op2, sol, res;

    while (i++ < DATAADD16_NUM_ENTRIES) {
        entry = *data++;
        /* Set parameter values from data strings */
        if (!ew_set_str(&mod, entry[0])) goto error;
        if (!ew_set_str(&op1, entry[1])) goto error;
        if (!ew_set_str(&sol, entry[3])) goto error;
        /* Load parameters into CP registers */
        /* Perform addition */
        ECC_ADD("0", "1", "2");
        /* Verify result */
        ew_store_cr2(&res);
        if (!ew_eq(&res, &sol))
            goto wrong;
        MSG(("Test entry %d passed.\n", i));
    }
    return true;

wrong:
    ew_to_str(&mod, mod_s, EW_STR_LENGTH);
    ew_to_str(&op1, op1_s, EW_STR_LENGTH);
    ew_to_str(&res, res_s, EW_STR_LENGTH);
    ew_to_str(&sol, sol_s, EW_STR_LENGTH);
    MSG((" %s\n"
         "+ %s\n"
         "(mod %s)\n"
         "= %s\n"
         "got %s\n",
         op1_s, op2_s, mod_s, res_s, sol_s));
error:
    MSG(("Failed...\n"));
}

bool
test_ari_multiplication(char (*data)[DATAMUL16_NUM_HEADERS][DATAMUL16_NUM_CHARS+1])
{
    int i = 0;
    char (*entry)[DATAMUL16_NUM_CHARS+1];
    ew_str_t mod_s, op1_s, op2_s, sol_s, res_s;
    ecc_word_t mod, op1, op2, sol, res;

    while (i++ < DATAMUL16_NUM_ENTRIES) {
        entry = *data++;
        if (!ew_set_str(&mod, entry[0])) goto error;
        ECC_MUL("0", "1", "2");
        /* Verify result */
        ew_store_cr2(&res);
        if (!ew_eq(&res, &sol))
            goto wrong;
    }
    return true;

wrong:
    ew_to_str(&mod, mod_s, EW_STR_LENGTH);
    MSG((" %s\n"
         "* %s\n"
         "(mod %s)\n"
         "= %s\n"
         "got %s\n",
         op1_s, op2_s, mod_s, res_s, sol_s));
error:
    MSG(("Failed...\n"));
}

bool
test_ari_division(char (*data)[DATADIV16_NUM_HEADERS][DATADIV16_NUM_CHARS+1])
{
    int i = 0;
    char (*entry)[DATADIV16_NUM_CHARS+1];
    ew_str_t op1_s, op2_s, sol_s, res_s;
    ecc_word_t op1, op2, sol, res;

    while (i++ < DATADIV16_NUM_ENTRIES) {
        entry = *data++;
        ECC_DIV("0", "1", "2");
        /* Verify result */
        ew_store_cr2(&res);
        if (!ew_eq(&res, &sol))
            goto wrong;
    }
    return true;

wrong:
    MSG((" %s\n"
         "/ %s\n"
         "= %s\n"
         "got %s\n",
         op1_s, op2_s, res_s, sol_s));
error:
    MSG(("Failed...\n"));
}
LISTING F.1: C main of test and benchmark program.
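Each benchmark operand in Listing F.1 is given twice: as a hexadecimal string (used on the coprocessor path through ew_set_str) and as a 32-byte buffer (used on the libecc path through nn_init_from_buf). The string form has 2 prefix characters, 64 hexadecimal digits and a terminating NUL, which gives the 67 of BM_STR_LEN, while the buffer form has the 32 bytes of BM_BUF_LEN. The sketch below shows one way the two forms relate. It is only an illustration: sketch_hex_to_buf and hex_digit are hypothetical helpers, not functions from ECCo or libecc, and the big-endian byte order is assumed from the way the listed buffers mirror their strings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if c is not a hexadecimal digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "0x" followed by exactly 2*out_len hexadecimal digits into out,
   most significant byte first. Returns 0 on success, -1 on malformed input. */
int sketch_hex_to_buf(const char *str, uint8_t *out, size_t out_len)
{
    if (strlen(str) != 2 + 2 * out_len)
        return -1;
    if (str[0] != '0' || (str[1] != 'x' && str[1] != 'X'))
        return -1;
    for (size_t i = 0; i < out_len; ++i) {
        int hi = hex_digit(str[2 + 2 * i]); /* upper nibble */
        int lo = hex_digit(str[3 + 2 * i]); /* lower nibble */
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return 0;
}
```

Called with add_op1_str and a 32-byte output array, this would reproduce the contents of add_op1_buf. Deriving one form from the other at start-up would avoid keeping the two copies in sync by hand, at the cost of a little extra code outside the timed region.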

References
[1] N. Koblitz, “Elliptic curve cryptosystems”, Math. Comp., vol. 48, pp. 203–209, 1987, ISSN: 0025-5718. DOI: 10.1090/S0025-5718-1987-0866109-5.
[2] V. S. Miller, “Use of elliptic curves in cryptography”, in Advances in Cryptology - CRYPTO ’85 Proceedings, H. C. Williams, Ed., Berlin, Heidelberg: Springer Berlin Heidelberg, 1986, pp. 417–426, ISBN: 978-3-540-39799-1.
[3] A. J. Menezes, S. A. Vanstone, and P. C. V. Oorschot, Handbook of Applied Cryptography, 1st. Boca Raton, FL, USA: CRC Press, Inc., 1996, ISBN: 0849385237.
[4] W. Diffie and M. Hellman, “New directions in cryptography”, IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, Nov. 1976, ISSN: 0018-9448. DOI: 10.1109/TIT.1976.1055638.
[5] Y. Kumar, R. Munjal, and H. Sharma, “Comparison of symmetric and asymmetric cryptography with existing vulnerabilities and countermeasures”, International Journal of Computer Science and Management Studies, vol. 11, no. 03, 2011.
[6] R. Tripathi and S. Agrawal, “Comparative study of symmetric and asymmetric cryptography techniques”, International Journal of Advance Foundation and Research in Computer (IJAFRC), vol. 1, no. 6, pp. 68–76, 2014.
[7] E. Rescorla. (2018). The transport layer security (TLS) protocol version 1.3, [Online]. Available: https://tools.ietf.org/html/rfc8446 (visited on 11/09/2018).
[8] IEEE. (2017). Why we need low-power, low-latency devices, [Online]. Available: https://innovationatwork.ieee.org/why-we-need-low-power-low-latency-devices/ (visited on 06/26/2019).
[9] M. Guerra. (2017). The power of IoT devices, [Online]. Available: https://www.electronicdesign.com/power/power-iot-devices (visited on 06/26/2019).
[10] N. Shields. (2017). Here’s how 5G will revolutionize the internet of things, [Online]. Available: https://www.businessinsider.com/how-5g-will-revolutionize-the-internet-of-things-2017-6?r=US&IR=T (visited on 06/26/2019).
[11] M. Hirth, Hardware acceleration of asymmetric elliptic curve cryptography, 2018.
[12] P. B. Bhattacharya, S. K. Jain, and S. Nagpaul, Basic abstract algebra, 2nd. Cambridge University Press, 1994, ISBN: 0521460816.
[13] B. Lynn. (). Modular arithmetic, [Online]. Available: https://crypto.stanford.edu/pbc/notes/numbertheory/arith.html (visited on 11/14/2018).
[14] Wikipedia. (2018). Extended Euclidean algorithm, [Online]. Available: https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm (visited on 11/14/2018).
[15] Wikipedia. (2018). Euclidean algorithm, [Online]. Available: https://en.wikipedia.org/wiki/Euclidean_algorithm (visited on 11/14/2018).
[16] Standards for Efficient Cryptography. (2009). SEC 1: Elliptic curve cryptography, [Online]. Available: http://www.secg.org/sec1-v2.pdf (visited on 12/19/2018).
[17] J. Balasch, B. Gierlichs, K. Järvinen, and I. Verbauwhede, “Hardware/software co-design flavors of elliptic curve scalar multiplication”, in 2014 IEEE International Symposium on Electromagnetic Compatibility (EMC), Aug. 2014, pp. 758–763. DOI: 10.1109/ISEMC.2014.6899070.
[18] H. Cohen, A. Miyaji, and T. Ono, “Efficient elliptic curve exponentiation using mixed coordinates”, in Advances in Cryptology - ASIACRYPT’98, K. Ohta and D. Pei, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 51–65, ISBN: 978-3-540-49649-6.
[19] D. Hankerson, A. J. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Berlin, Heidelberg: Springer-Verlag, 2003, ISBN: 038795273X.
[20] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems”, Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978, ISSN: 0001-0782. DOI: 10.1145/359340.359342. [Online]. Available: http://doi.acm.org/10.1145/359340.359342.
[21] A. P. Fournaris, I. Zafeirakis, C. Koulamas, N. Sklavos, and O. Koufopavlou, “Designing efficient elliptic curve Diffie-Hellman accelerators for embedded systems”, in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), May 2015, pp. 2025–2028. DOI: 10.1109/ISCAS.2015.7169074.
[22] Mentor. (2019). Questa® advanced simulator, [Online]. Available: https://www.mentor.com/products/fv/questa/ (visited on 06/19/2019).
[23] Mentor. (2019). Mentor, [Online]. Available: https://www.mentor.com/ (visited on 06/19/2019).
[24] “IEEE standard VHDL language reference manual”, IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), pp. c1–626, Jan. 2009. DOI: 10.1109/IEEESTD.2009.4772740.
[25] “IEEE standard for Verilog hardware description language”, IEEE Std 1364-2005 (Revision of IEEE Std 1364-2001), pp. 1–590, Apr. 2006. DOI: 10.1109/IEEESTD.2006.99495.
[26] “IEEE standard for SystemVerilog: unified hardware design, specification, and verification language”, IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012), pp. 1–1315, Feb. 2018. DOI: 10.1109/IEEESTD.2018.8299595.
[27] ARM. (2019). Cortex-M33, [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-m/cortex-m33 (visited on 06/19/2019).
[28] ARM. (2019). Arm, [Online]. Available: https://www.arm.com/ (visited on 06/19/2019).
[29] ARM. (2016). Armv8-M architecture reference manual, [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0553a.d/index.html (visited on 06/26/2019).
[30] Wikipedia. (2019). Hardware acceleration, [Online]. Available: https://en.wikipedia.org/wiki/Hardware_acceleration (visited on 06/19/2019).
[31] R. Benadjila, A. Ebalard, and J.-P. Flori. (2017). Libecc project, [Online]. Available: https://github.com/ANSSI-FR/libecc (visited on 10/11/2018).
[32] Python Software Foundation. (2018). Python, [Online]. Available: https://www.python.org/ (visited on 11/21/2018).
[33] Python Docs. (2018). Python data model, [Online]. Available: https://docs.python.org/3/reference/datamodel.html#the-standard-type-hierarchy (visited on 11/20/2018).
[34] C. Koc, RSA hardware implementation, RSA Laboratories, RSA Data Security, Inc., August 1995.
[35] J. K. Omura, “A public key cell design for smart card chips”, ISITA’90, pp. 983–985, 1990.
[36] P. L. Montgomery, “Modular multiplication without trial division”, Math. Comp., vol. 44, pp. 519–521, 1985. DOI: 10.1090/S0025-5718-1985-0777282-X.
[37] National Institute of Standards and Technology. (2013). Digital signature standards, [Online]. Available: https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf (visited on 09/09/2018).
[38] Standards for Efficient Cryptography. (2010). SEC 2: Recommended elliptic curve domain parameters, [Online]. Available: http://www.secg.org/sec2-v2.pdf (visited on 09/09/2018).
[39] OpenCores. (2019). OpenCores, [Online]. Available: https://opencores.org/ (visited on 07/01/2019).
