Number Theoretic Transform (NTT) FPGA Accelerator
A. Hartshorn, et al.
1 Introduction
Classical cryptographic algorithms protect data against malicious actors. Files
can be encrypted prior to transmission, then decrypted once received. These
encryption and decryption schemes can rely on either symmetric or asymmetric keys. Symmetric-key encryption requires both parties to hold the same secret key. Asymmetric-key encryption, on the other hand, relies on computationally hard functions: a public key can be generated easily, but knowing it does not allow an attacker to recover the user's private key and, as a result, their data. Quantum algorithms can solve some of these computationally hard problems dramatically faster, which has led to major research into post-quantum cryptography: algorithms that are secure against quantum computers.
Research into post-quantum algorithms has produced several families of schemes. We chose to focus on lattice-based schemes, most of which see significant speedups when using hardware accelerators. This is because many lattice-based cryptographic primitives use polynomial multiplication as a basic operation, and the Number Theoretic Transform (NTT) is used to perform it: both polynomials are transformed, multiplied coefficient-wise, and the result is transformed back.
back. Hardware accelerators, such as GPUs or FPGAs, can be used to perform
polynomial multiplication faster because they can have multiple functional units
compute different parts of the Number Theoretic Transform in parallel.
2 Background
2.1 Cryptography
Quantum computers are machines that use quantum mechanics: they leverage the counterintuitive physical properties of matter at the atomic scale to
perform computations. Classical computers encode data in binary digits (bits)
that are represented as “1” or “0”. Quantum computers generate and manipu-
late quantum bits or qubits that can encode more than two states. Qubits are
typically subatomic particles such as electrons or photons that are isolated in a
controlled quantum state.
Superposition is an important qubit property: the ability to exist in multiple states at once. A quantum computer with several qubits in superposition can therefore represent an exponential number of potential outcomes simultaneously.
Entanglement is another important qubit property. This allows the genera-
tion of pairs of qubits where two members of a pair exist in a single quantum
state. When the state of one of the qubits is changed, the state of the other one
changes instantaneously in a predictable way.
– Lattice-based cryptography
– Multivariate cryptography
– Hash-based cryptography
– Code-based cryptography
– Super-singular elliptic curve isogeny cryptography
– Symmetric key quantum resistance
The cyclic convolution of two length-n integer sequences over a finite field can be computed by applying the NTT to both sequences, multiplying the resulting length-n NTT sequences coefficient-wise, and transforming the result back via an inverse NTT. Because the transform produces a cyclic convolution, computing c = a · b mod (X^n + 1) for two polynomials a and b requires an NTT of length 2n, with n zeros appended to each input. This effectively doubles the length of the inputs and requires an explicit reduction modulo X^n + 1 afterwards.
The product of two degree-(n−1) polynomials a and b has degree 2n−2, so it requires evaluations at at least 2n−1 distinct points to be uniquely identified. To obtain these, we use the primitive 2n-th roots of unity, meaning that coefficient vectors of at least length 2n are needed for our NTT algorithm. We therefore pad the coefficient vectors of polynomials a and b to at least length 2n with zeros. More precisely, since the transform length must be a power of 2, the coefficient vectors are padded to length 2^k, where k is the lowest integer such that 2^k ≥ 2n. The NTT algorithm is then applied to
get evaluations of the polynomials a and b at the same 2n distinct inputs. If we
then multiply the 2n evaluations of a with the respective 2n evaluations of b, we
calculate 2n products in total that together represent the polynomial product
of the two original polynomials. The INTT of this product is then computed to
transform the vector of polynomial evaluations into the vector of its coefficients.
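The pad, transform, multiply, and invert procedure above can be sketched as a naive O(n²) Python reference model. The modulus p = 2^64 − 2^32 + 1 and the root ω = 8 are the parameters used later in our design; a production implementation would use a fast butterfly-based NTT instead:

```python
P = (1 << 64) - (1 << 32) + 1   # modulus used throughout this design
OMEGA = 8                        # a primitive 64th root of unity mod P

def ntt(vec, root):
    # Naive O(m^2) evaluation of the polynomial at the powers of `root`.
    m = len(vec)
    return [sum(vec[i] * pow(root, i * j, P) for i in range(m)) % P
            for j in range(m)]

def poly_mult(a, b, m=64):
    # Pad both coefficient vectors with zeros to the transform length m,
    # transform, multiply coefficient-wise, then apply the inverse NTT.
    fa = ntt(a + [0] * (m - len(a)), OMEGA)
    fb = ntt(b + [0] * (m - len(b)), OMEGA)
    fc = [x * y % P for x, y in zip(fa, fb)]
    inv_root = pow(OMEGA, P - 2, P)   # omega^-1 via Fermat's little theorem
    inv_m = pow(m, P - 2, P)          # 1/m mod P, the INTT scaling factor
    return [c * inv_m % P for c in ntt(fc, inv_root)]
```

For example, poly_mult([1, 2], [3, 4]) yields the coefficients of (1 + 2x)(3 + 4x) = 3 + 10x + 8x² in its first three entries, with the remaining padded entries equal to zero.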
3 Methodology
Unlike the 128-bit fast modulus reduction, there are many more scenarios for z that we need to account for: we found 6 different cases for z, such that −3p < z < 2p. In Equation 6, we create groupings of additions to prevent underflow. Since we are using unsigned datatypes, we must be careful when performing subtractions to ensure we never go below zero. Using the groupings, we can easily check for the cases in which we must add p or 2p to prevent underflow. As the last step, we subtract p or 2p as necessary. Even with the additional additions, shifts, subtractions, and comparisons, we are able to pipeline the 256-bit fast modulus in three stages. Figure 4 shows the pipeline stages.
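A behavioral Python model of this 256-bit reduction illustrates where the cases come from. The limb weights below follow from 2^96 ≡ −1 (mod p), repeating with period six; the final `% P` stands in for the grouped additions and conditional subtractions of p and 2p that the hardware performs, so this is a reference model rather than the pipelined implementation:

```python
P = (1 << 64) - (1 << 32) + 1    # p = 0xffffffff00000001
M32 = (1 << 32) - 1

def fast_mod_256(x):
    # Split the (up to) 256-bit input into eight 32-bit limbs, low to high.
    h = [(x >> (32 * k)) & 0xFFFFFFFF for k in range(8)]
    # Weights of 2^(32k) mod p repeat with period 6:
    #   1, 2^32, 2^32 - 1, -1, -2^32, -(2^32 - 1), 1, 2^32, ...
    z = ((h[0] + h[6])
         + ((h[1] + h[7]) << 32)
         + h[2] * M32
         - h[3]
         - (h[4] << 32)
         - h[5] * M32)
    # z is a small signed multiple of p away from the answer; the hardware
    # fixes this with a few conditional +/- p and 2p steps, modeled by % P.
    return z % P
```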
The 64-point NTT we implemented relies on a few specific properties. The first is the use of 64-bit input numbers, which allows us to use bit shifting since 8 is a 64th root of unity modulo p. The second is that the shift amounts follow a set pattern, which allows us to pre-calculate them and store them ahead of time in BRAM blocks. The input to the NTT is 64 64-bit numbers; the corresponding output is another set of 64 64-bit numbers.
Normally when calculating the 64-point NTT, many multiplication opera-
tions are needed which are extremely slow on FPGA hardware. Essentially, a
matrix of 64 by 64 multiplication operations is needed for the calculations. As
mentioned before, using 64-bit input values allows us to shift instead of multiply, which greatly improves performance. These shift values are known as omegas.
The first step of the NTT involves computing rows of shifts. Each row is calculated by taking the summation of the input numbers shifted by the 64 corresponding omegas. After the summation, a modulus must be taken. To calculate ω, the indexes of the input and output are needed. The following equation calculates the correct shifts:
We use Equation 7 to precalculate omega and load those values into BRAM
blocks. A Python script was used to calculate the 64 by 64 omega values.
Below is the calculation of the 64 output numbers, y, for the input numbers x and the ω values, where i and j are the indexes of x and y respectively:
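Since ω = 8 = 2³ and 8⁶⁴ ≡ 1 (mod p), the shift amount for input index i and output index j works out to 3·(i·j mod 64). The row computation can then be modeled in Python (our behavioral reconstruction, not the hardware pipeline):

```python
P = (1 << 64) - (1 << 32) + 1
N = 64   # transform size; omega = 8 is a 64th root of unity mod P

# Precompute the 64x64 table of shift amounts, as done by the Python
# script that fills the BRAM blocks: 8^(i*j) = 2^(3*(i*j mod 64)).
SHIFTS = [[3 * ((i * j) % N) for i in range(N)] for j in range(N)]

def ntt_64(x):
    # Each output y_j is the modular sum of the inputs shifted by the
    # corresponding omega exponents -- no multiplications required.
    y = []
    for j in range(N):
        acc = 0
        for i in range(N):
            acc = (acc + (x[i] << SHIFTS[j][i])) % P
        y.append(acc)
    return y
```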
Differing Order of Operations There are two methods that we can use to compute the 64-point NTT. The first is to shift by ω, perform a 256-bit fast modulus, then add the previous result; a final 128-bit fast modulus reduction is performed at the very end in case there is overflow. We call this the Mod First Method. The second is to shift by ω, add the previous result, then perform a single 256-bit fast modulus at the very end. The second method requires a 256-bit fast modulus because the additions can cause the number to reach a maximum of 254 bits. We call this the Add First Method. Figure 5 shows the first method of operations. Variations between the two methods are slight, but there are a few clock cycles of difference, described in more detail in Section 4.2.
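The two orderings can be contrasted in a short Python model, where each `% P` stands in for the corresponding fast modulus block (a 256-bit reduction per term plus a final 128-bit reduction in the first case, one 256-bit reduction in the second):

```python
P = (1 << 64) - (1 << 32) + 1

def mod_first(terms):
    # Reduce each shifted term with a 256-bit fast modulus as it arrives,
    # then run one final 128-bit reduction to absorb the accumulated carries.
    acc = 0
    for x, shift in terms:
        acc += (x << shift) % P   # per-term 256-bit fast modulus
    return acc % P                # final 128-bit fast modulus

def add_first(terms):
    # Accumulate the raw shifted terms (up to ~254 bits wide in hardware),
    # then apply a single 256-bit fast modulus at the very end.
    acc = 0
    for x, shift in terms:
        acc += x << shift
    return acc % P
```

Both orderings return the same result; the trade-off is purely in hardware cost and routing, as discussed in Section 4.2.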
The top level implementation of the iterative NTT module is shown in Figure 8.
Note that the control signals between the modules are not shown. Instead, the
datapath is shown through the different modules. Each of the modules is fully pipelined. Additionally, the algorithms used in these modules are intended to be scalable. The scaling parameters are the input vector size (n) and the number
of BRAMs / functional units (b). This section discusses the function of each of
the blocks in the top level module. The blocks are:
• Index Calculator
• BRAM Router
• Write Back Controller
• ALU Router
• BRAM Controller
• ALU Cores
15 if b = n/2 then
16     ∆g = 1
17 else if ∆y = n/b then
18     ∆g = 2 × ∆y
19 else
20     ∆g = n/b
21 for cycle ← 0 to cc do
22     if cycle (mod ∆y) = 0 and ∆y ≠ 2 and cycle ≠ 0 then
23         ∆x′ = ∆x′ + ∆y;
24     brams = b/groups;
25     for index_pair ← 0 to brams do
26         if ∆g (mod ∆y) = 0 and index_pair ≠ 0 and ∆g < power_of_two then
27             ∆g′ = ∆g′ + ∆y;
Algorithm Remarks In line 4, the change in the y index from the x index is simple: it is the current power_of_two shifted right by 1.
Next consider the logic starting at line 5. A variable, groups, is set to indicate whether the pairs are grouped together in a single clock cycle, as opposed to indexes that contain a jump.
The difference in x coordinates between clock cycles is defined as ∆x. Whether this jump is 2 or 1 depends on the groups variable. The change in the x coordinate is not constant; to handle the occasional non-constant change in x, the variable ∆x′ (line 22) is used. The offset is incremented whenever the current clock cycle is a multiple of ∆y.
Similar to ∆x, ∆g is used to denote the jump between pairs in the same clock cycle. This is calculated in line 15. There is also a non-constant change in ∆g; ∆g′ is used to compensate for it, as shown in line 26.
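Setting aside the BRAM grouping deltas, the pair sequence the index calculator must reproduce is the standard iterative-NTT butterfly pattern. A simplified software model follows (the generator recomputes each stage directly, whereas the hardware derives the same pairs incrementally from the Δ variables):

```python
def butterfly_pairs(n):
    # For each stage, the y index is the x index plus half the current
    # power_of_two (power_of_two shifted right by 1, as noted for line 4).
    power_of_two = n
    while power_of_two >= 2:
        half = power_of_two >> 1
        for start in range(0, n, power_of_two):
            for x in range(start, start + half):
                yield (x, x + half)
        power_of_two = half
```

For n = 8 this yields (0,4), (1,5), (2,6), (3,7) in the first stage and ends with the adjacent pairs (0,1), (2,3), (4,5), (6,7), for n/2 · log₂ n pairs in total.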
Router Modules The BRAM router, write back controller, and ALU router have similar functions. They are responsible for separating the index / ALU pairs into BRAM pairs. The values are multiplexed to the appropriate port based on the BRAM port tag in the datagram. The write back controller also issues a signal to stall the pipeline for a cycle to write back into the BRAMs. The ALU router acts like the BRAM router except that it uses a different datagram tag: recall the datagram contains a separate tag to send the pair to an ALU. Figure 9 shows how values can be routed using a butterfly circuit. Although values from different BRAMs are needed, the BRAMs can still be fully utilized and their values routed to an ALU. The ALU performs the addition / subtraction and uses a similar process to store back into the BRAM.
BRAM Controller Users can specify the number of BRAMs to include in the
module. The BRAM number must be a power of two. The default Xilinx dual
port BRAM module was slightly modified to include our NTT parameters. The
address and data port widths are automatically calculated based on the vector
size and number of BRAMs. The BRAM modules are wrapped in a BRAM top
module that contains the bus assignments.
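As an illustration of how those widths follow from the parameters, here is a small helper (the function name and the one-64-bit-word-per-port assumption are ours, not taken from the design files):

```python
import math

def bram_geometry(n, b, word_bits=64):
    # Each of the b BRAMs stores n/b words of the input vector, so the
    # address port needs log2(n/b) bits; the data port is one word wide.
    assert b > 0 and b & (b - 1) == 0, "BRAM count must be a power of two"
    depth = n // b
    return {"depth": depth,
            "addr_width": int(math.log2(depth)),
            "data_width": word_bits}
```

For example, a 1024-element vector split across 4 BRAMs gives each BRAM a depth of 256 words and therefore an 8-bit address port.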
ALU Cores The final module used is the ALU core. The number of ALU cores matches the number of BRAMs. The ALU cores are wrapped in an ALU top level module to assign the busses. The figure below shows the pipeline stages for the ALU core; the pipeline fills in six clock cycles. Note that the wb values in the multiplication stage are not dynamically computed. Because our implementation considers a fixed omega value, the omega values are pre-calculated prior to execution and stored in a ROM.
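A plausible software model of one ALU core operation, assuming a Cooley-Tukey style butterfly in which the second operand is multiplied by the twiddle value wb before the modular add/subtract (this stage ordering is our reading of the pipeline figure, not a statement of the exact RTL):

```python
P = (1 << 64) - (1 << 32) + 1

def alu_butterfly(op1, op2, wb):
    # Twiddle multiply (3 cc in hardware), then the modular sum and
    # difference pair that is written back into the BRAMs (2 cc each,
    # performed in parallel by the two fast_mod instances).
    t = (op2 * wb) % P
    return (op1 + t) % P, (op1 - t) % P
```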
4 Results
In this section, we discuss timing and utilization of each element of our design.
Our experiments for timing and overall utilization were performed on a Xilinx Virtex-7 FPGA. The device has 693,120 logic cells and 3,600 DSP slices, with 433,200 LUTs, 865,400 FFs, and 1,470 BRAMs available for use, and a system clock that runs at 200 MHz. The device also has a PCIe connector, allowing it to be used for high performance applications such as hardware acceleration.
The metrics we focused on to assess our two fast modulus reductions were clock cycles to completion, timing, and hardware utilization. Each of these metrics tells a different story, and each needs to be taken into account when the blocks are used in our final, larger design. The primary goal for each design is to meet timing, which is a 5 ns clock period based on the 200 MHz clock speed of our FPGA. We are then most concerned about utilization and how much space the designs require. For the fast modulus especially, space is a major concern because we need to instantiate multiple copies of each block.
Our final results can be seen in Table 2. For both the 128-bit and 256-bit fast modulus, we went through many iterations to decrease the clock cycle count and to meet timing requirements. We were able to meet the 200 MHz frequency with both designs, though the 256-bit fast modulus takes longer to complete and requires more resources: the 128-bit modulus reduction completes in 6.500 ns while the 256-bit modulus reduction completes in 13.059 ns. We expected these results because the 256-bit fast modulus works with much more data and much larger numbers. The hardware utilization for each module is also very low, which is important given that we instantiate multiple copies.
The results of the two methods are very similar, as seen in Table 3. Unfortunately, only the Mod First method, which ends with the 128-bit fast modulus, meets timing, even though it takes 3 more clock cycles to complete. The Mod First method takes a total of 352.8 ns to complete, whereas the Add First method takes 372.6 ns. Both methods have similar hardware utilization numbers, which surprised us.
The Mod First method uses 64 256-bit fast modulus blocks and a single 128-bit fast modulus block, while the Add First method only requires a single 256-bit fast modulus block. Based on the results of our standalone fast modulus reductions, we can see that the 256-bit method is much more complex. With slightly more LUT and FF utilization but far fewer fast modulus blocks, the cost of the Add First method lies primarily in routing many 256-bit numbers. The Mod First method routes 64-bit numbers much more easily and is thus able to meet our timing requirements.
Vector Length (n) | BRAMs (b) | Timing (clk cycles) | Frequency (MHz) | LUT Util. | FF Util. | DSP Util.
one using a single 64-point NTT and one using 4 64-point NTT blocks. As the results show, the combined NTT using 4 64-point blocks is almost 50% faster in terms of clock cycles. On the other hand, using just a single 64-point NTT does not make a significant difference in performance; only a few clock cycles are saved. The 1024-point Iterative NTT takes 9680 ns to complete while the combined NTT takes 5315 ns. Unfortunately, the cost of the speedup is a significant increase in hardware utilization: using 4 64-point NTTs increases LUT and FF utilization by over 30 times. Using a single 64-point block is highly inefficient, because the added hardware only saves a few clock cycles.
We also estimated the performance gains of the combined NTT over the Iterative NTT for different BRAM usage. In Table 6, we see that the combined NTT using a single 64-point block is faster than the standard Iterative NTT only when using 4 BRAMs. Using 4 64-point NTTs is faster up to 16 BRAMs; at 32 BRAMs, the Iterative NTT is more efficient. This means that the Iterative NTT actually scales well in terms of BRAM usage. At 16 BRAMs, the combined NTT using 4 64-point blocks is just over 500 ns faster in execution time, whereas at 64 BRAMs, the iterative NTT is twice as fast.
Hardware utilization is heavily sacrificed for the increase in speed, since the 64-point NTT requires substantial resources. For a low-BRAM iterative NTT module, speed can be gained by using more 64-point NTT blocks. However, as the number of BRAMs increases, the number of 64-point NTT modules must decrease as the hardware limit becomes an issue. Note that utilizing more BRAMs is more efficient even in terms of speed: using 64 BRAMs in the iterative NTT is more than 5 times faster than using 4 BRAMs with 4 64-point NTT blocks. These results assume that the hardware routing can be done; the number of DSP slices available allows the iterative NTT to scale well.
The first row of each section is the Iterative NTT performance. The second row of each section is the combined NTT performance using 1 64-point NTT. The third row of each section is the combined NTT performance using 4 64-point NTTs.
Full NTT Multiplication A complete NTT multiplication requires two forward NTT operations. Their results are multiplied together, and then a single inverse NTT is performed. The total number of clock cycles is therefore the sum of two forward NTT operations and an inverse NTT operation. It is important to note that the forward NTT operations can use a combined NTT, while the inverse NTT operation must be a standard iterative NTT. In Table 7, we estimate the execution time of the full NTT multiplication. Again, the results show that the iterative NTT scales much better than using multiple 64-point NTTs.
The first row of each section uses only Iterative NTT performance. The second row of
each section is the combined NTT performance using 1 64-point NTT. The third row
of each section is the combined NTT performance using 4 64-point NTTs.
We have not added comparisons to previous works because we were not able to find any fair ones. Some of the previous research we read reported results based on GPU or software performance. Additionally, the hardware used in previous research was dated at the time of our work. For the hardware results we did find, the design choices differed from ours: some research focused heavily on reducing hardware utilization rather than increasing performance, while other research drastically increased hardware utilization for performance gains. Furthermore, differences in the vector size and bit width of the NTT can change results heavily.
This project was a good start toward creating a scalable NTT module, and there is room for improvement in our design. For example, we could save clock cycles in the 64-point module by adjusting the way it is integrated with the iterative module. Also, the iterative module still exhibits errors in some computations; we would have to hunt for these edge cases and determine what causes the module to produce inaccurate results.
In post-quantum cryptography, the NTT is a commonly used yet expensive operation: NTT operations are used over 600 times in each of the post-quantum algorithms in the NIST competition [2][3]. FPGAs can be used to implement the operation and speed up execution. We exploited the parallelism of the NTT to create a hardware accelerator in which a dedicated 64-point module is combined with a generic iterative module. The unit is capable of a complete polynomial multiplication on the order of 20,000 nanoseconds. Our project focused on a generic model that can be scaled according to desired specifications, which allows the NTT to be implemented in a variety of applications, from simple microcontrollers to high-end servers.
References
1. Aysu, Aydin & Patterson, Cameron & Schaumont, Patrick. (2013). Low-
cost and area-efficient FPGA implementations of lattice-based cryptogra-
phy. Proceedings of the 2013 IEEE International Symposium on Hardware-
Oriented Security and Trust, HOST 2013. 81-86. 10.1109/HST.2013.6581570.
https://rijndael.ece.vt.edu/schaum//pdf/papers/2013hostb.pdf
2. Peter Schwabe, Roberto Avanzi, Joppe Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, John M. Schanck, Gregor Seiler, and Damien Stehlé. CRYSTALS-KYBER. (2019). National Institute of Standards and Technology. https://csrc.nist.gov/projects/post-quantum-cryptography/round-2-submissions
3. Bernstein, D.J., Chuengsatiansup, C., Lange, T., van Vredendaal, C.: Ntru prime:
reducing attack surface at low cost. Cryptology ePrint Archive, Report 2016/461
(2016). http://eprint.iacr.org/2016/461
4. Nayuki. “Number-Theoretic Transform (Integer DFT).” Project Nayuki, 7 June
2017, www.nayuki.io/page/number-theoretic-transform-integer-dft.
5. A. Emerencia. (2007). Multiplying huge integers using Fourier trans-
forms. http://www.cs.rug.nl/ ando/pdfs/Ando_Emerencia_multiplying_
huge_integers_using_fourier_transforms_paper.pdf.
6. Emmart, Niall & Weems, Charles. (2011). High Precision Integer Multiplication
with a GPU Using Strassen’s Algorithm with Multiple FFT Sizes. Parallel Pro-
cessing Letters. 21. 359-375. 10.1142/S0129626411000266.
7. Longa, P., Naehrig, M.: Speeding up the Number Theoretic Transform for Faster
Ideal Lattice-Based Cryptography. Cryptology ePrint Archive, Report 2016/504
(2016). https://eprint.iacr.org/2016/504.pdf
8. Chen, D.D., Mentens, N., Vercauteren, F., Roy, S.S., Cheung, R.C.C., Pao,
D., Verbauwhede, I.: High-speed polynomial multiplication architecture for ring-
lwe and she cryptosystems. Cryptology ePrint Archive, Report 2014/646 (2014).
https://eprint.iacr.org/2014/646.pdf
9. Mert, A.C., Ozturk, E., Savas, E.: Design and Implementation of a Fast and Scal-
able NTT-Based Polynomial Multiplier Architecture. Cryptology ePrint Archive,
Report 2019/109 (2019). https://eprint.iacr.org/2019/109.pdf
10. T. Poppelmann, T. Oder, and T. Guneysu. High-performance ideal lattice-based
cryptography on 8-bit ATxmega microcontrollers. Cryptology ePrint Archive, Re-
port 2015/382 (2015). https://eprint.iacr.org/2015/382.pdf
11. J. W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex
Fourier series. Mathematics of Computation, 19(90):297–301, 1965.
12. Slade, George. (2013). The Fast Fourier Transform in Hard-
ware: A Tutorial Based on an FPGA Implementation.
https://www.researchgate.net/publication/235995761_The_Fast_Fourier
_Transform_in_Hardware_A_Tutorial_Based_on_an_FPGA_Implementation
13. Giles, Martin. (2019). Explainer: What is a quantum computer? How it works,
why it’s so powerful, and where it’s likely to be most useful first. MIT Tech-
nology Review. https://www.technologyreview.com/2019/01/29/66141/what-is-
quantum-computing/
14. Lyubashevsky, Vadim. (2016). Preparing for the Next Era of Computing With Quantum-Safe Cryptography. SecurityIntelligence. https://securityintelligence.com/preparing-next-era-computing-quantum-safe-cryptography/
15. Boutin, Chad. (30 January 2019). NIST Reveals 26 Algorithms Advancing to the
Post-Quantum Crypto ‘Semifinals’. National Institute of Standards and Technol-
ogy. https://www.nist.gov/news-events/news/2019/01/nist-reveals-26-algorithms-
advancing-post-quantum-crypto-semifinals
16. Public-key cryptography. (n.d). Wikipedia, the free encyclopedia. Retrieved
from https://en.wikipedia.org/wiki/Public-key_cryptography ([Online; accessed
10-March-2020])
17. Post-quantum cryptography. (n.d). Wikipedia, the free encyclopedia. Retrieved
from https://en.wikipedia.org/wiki/Post-quantum_cryptography ([Online; ac-
cessed 14-February-2020])
parameter p = 64'hffffffff00000001;
endmodule
module fast_mod_256(
input [31:0] a,
input [31:0] b,
input [31:0] c,
input [31:0] d,
input [31:0] e,
input [31:0] f,
input [31:0] g,
input [31:0] h,
input clk,
output reg [63:0] z
);
parameter p = 64'hffffffff00000001;
end
endmodule
module rowcalc(
input clk,
input [63:0] a,
input [7:0] w,
output [63:0] out
);
endmodule
if __name__ == "__main__":
#Constants
BRAMS = 2
VECTOR_LEN = 32
trans_size_array = [2, 4, 8, 16, 32]
#Variables
groups = 0
pairs_per_cycle = 0
delta_x = 0
delta_y = 0
delta_g = 0
delta_g_bonus_offset = 0
delta_x_bonus_offset = 0
x_pos = 0
y_pos = 0
pairs_per_cycle = int(BRAMS / groups)
if(groups == 1):
print("(" + str(x_pos) + ", " + str(y_pos) + ")")
elif(groups == 2):
print("(" + str(x_pos) + ", " + str(y_pos) + ")")
print("(" + str(x_pos + 1) + ", " + str(y_pos + 1) + ")")
module index_calc_datapath(
input clk,
input [2:0] current_state,
input [9:0] trans_size,
output reg [6:0] i,
output reg [1:0] j,
output [9:0] x_pos1,
output [9:0] y_pos1,
output [9:0] x_pos2,
output [9:0] y_pos2,
output [9:0] x_pos3,
output [9:0] y_pos3,
output [9:0] x_pos4,
index_calculator CALC1(
.delta_g(delta_g),
.j(0),
.i(i),
.delta_x(delta_x),
.delta_y(delta_y),
.delta_g_bonus_offset(delta_g_bonus_offset),
.delta_x_bonus_offset(delta_x_bonus_offset),
.x_pos(x_pos1),
.y_pos(y_pos1)
);
index_calculator CALC2(
.delta_g(delta_g),
.j(1),
.i(i),
.delta_x(delta_x),
.delta_y(delta_y),
.delta_g_bonus_offset(delta_g_bonus_offset),
.delta_x_bonus_offset(delta_x_bonus_offset),
.x_pos(x_pos2_bus),
.y_pos(y_pos2_bus)
);
index_calculator CALC3(
.delta_g(delta_g),
.j(j3_bus),
.i(i),
.delta_x(delta_x),
.delta_y(delta_y),
.delta_g_bonus_offset(delta_g_bonus_offset),
.delta_x_bonus_offset(delta_x_bonus_offset),
.x_pos(x_pos3),
.y_pos(y_pos3)
);
index_calculator CALC4(
.delta_g(delta_g),
.j(3),
.i(i),
.delta_x(delta_x),
.delta_y(delta_y),
.delta_g_bonus_offset(delta_g_bonus_offset),
.delta_x_bonus_offset(delta_x_bonus_offset),
.x_pos(x_pos4_bus),
.y_pos(y_pos4_bus)
);
endmodule
module alu_core(
input clk,
input [63:0] wb,
input [73:0] op1,
input [73:0] op2,
output [73:0] add_pair,
output [73:0] sub_pair
);
// 2 cc to complete
fast_mod ADD_MOD (
.a(add_pair_to_mod[127:96]),
.b(add_pair_to_mod[95:64]),
.c(add_pair_to_mod[63:32]),
.d(add_pair_to_mod[31:0]),
.clk(clk),
.z(add_pair[63:0])
);
// 2 cc to complete (in parallel with other mod)
fast_mod SUB_MOD (
.a(sub_pair_to_mod[127:96]),
.b(sub_pair_to_mod[95:64]),
.c(sub_pair_to_mod[63:32]),
.d(sub_pair_to_mod[31:0]),
.clk(clk),
.z(sub_pair[63:0])
);
// 3 cc to complete
mul_64bit M1(
.a(op2[63:0]),
.b(wb),
.clk(clk),
.result(mul_result)
);
endmodule