A High-Speed Hardware Algorithm for Modulus Operation and its Application in Prime Number Calculation
Abstract
This paper presents a novel high-speed hardware algorithm for the modulus operation for FPGA implementation. The proposed algorithm uses only addition, subtraction, logical, and bit shift operations, avoiding the complexities and hardware costs associated with multiplication and division. It demonstrates consistent performance across operand sizes ranging from 32-bit to 2048-bit, addressing scalability challenges in cryptographic applications. Implemented in Verilog HDL and tested on a Xilinx Zynq-7000 family FPGA, the algorithm shows a predictable linear scaling of cycle count with the bit length difference (BLD) of the operands. The application of this algorithm in prime number calculation up to 500,000 shows its practical utility and performance advantages. Comprehensive evaluations reveal efficient resource utilization, robust timing performance, and effective power management, making it suitable for high-performance and resource-constrained platforms. The results indicate that the proposed algorithm significantly improves the efficiency of modular arithmetic operations, with potential implications for cryptographic protocols and secure computing.
1 Introduction
Residue Number Systems (RNS) continue to play a crucial role in hardware design for various applications, including cryptography, computer science, digital signal processing, error correction, and random number generation [1]. The efficiency of RNS implementations relies heavily on modular arithmetic operations. While extensive research has been conducted on modular addition, subtraction, multiplication, division, and exponentiation [2], [3], [4], [5], [6], [7], there is a notable gap in the literature regarding efficient hardware implementations of the fundamental modulus operation itself, that is, computing the remainder of an integer division.
The modulus operation, often overlooked in favor of its composite operations, is in fact the cornerstone of all modular arithmetic. It is essential in cryptographic algorithms, hash functions, random number generation, and error correction codes [8][9]. Despite its importance, hardware implementations of the modulus operation have received comparatively little attention, with most systems relying on software implementations or treating it as a byproduct of division [10].
This paper addresses this critical gap by presenting a novel hardware algorithm specifically designed for the modulus operation. This focus on the fundamental operation itself distinguishes our work from the majority of studies in the field and offers several key advantages. By optimizing the core modulus operation, our work potentially improves the efficiency of all derived modular operations. This cascading effect could lead to significant performance improvements in systems relying on modular arithmetic.
The primary contributions of our research are as follows:
1. We present a novel algorithm for modulus calculation optimized for hardware implementation, addressing a fundamental operation often overlooked in hardware design literature.
2. The algorithm is implemented in Verilog HDL and tested on a Xilinx Zynq-7000 family FPGA, demonstrating its practical applicability in real-world hardware environments.
3. Our algorithm can be easily implemented using a finite state machine (FSM) on FPGA hardware. By adjusting parameter values, the design can be scaled to any operand size, eliminating the need for complex designs.
4. Our comprehensive evaluation reveals that the cycle count for the operation scales linearly with the bit length difference (BLD) of the operands.
5. As a practical demonstration, we apply our hardware modulus operation to find prime numbers up to 500,000, showcasing its potential in cryptographic applications.
2 Related Work
This section provides an overview of relevant research in hardware implementations of modular operations, highlighting the gap in literature regarding efficient hardware designs for the fundamental modulus operation.
Modular arithmetic operations, including addition, subtraction, multiplication, and exponentiation, have been the focus of numerous hardware implementation studies. Vollala et al. provided a comprehensive survey of hardware architectures for modular multiplication, emphasizing its importance in public-key cryptosystems [11]. Their work highlighted various reduction techniques, which are widely used but often involve complex hardware designs.
Hossain et al. presented hardware architectures for modular arithmetic operations over a prime field, optimized for elliptic curve cryptography (ECC) [12]. Their architectures focus on modular addition, subtraction, and multiplication, implemented separately to reduce circuit latency and area, and achieve significant improvements in computational time and area utilization compared to related designs.
Langhammer et al. explored an efficient implementation of modular multiplication on FPGAs using Barrett’s algorithm [13]. Their method reduced resource count and latency of modular multiplication by employing aggressive truncation strategies for multipliers and introducing a new reduction method, demonstrating efficiency for 1024-bit modular multipliers. However, the design introduced complexity in managing truncation errors and required significant DSP blocks.
The growing importance of IoT and edge computing has sparked interest in efficient implementations of cryptographic operations on resource-constrained devices. Ibrahim et al. focused on the resource- and energy-efficient hardware implementation of the Montgomery modular multiplication algorithm, targeting compact IoT edge devices [14]. Their design achieved significant savings in area, delay, and energy consumption but involved complex scheduling and projection functions.
The modulus operation, despite its fundamental nature, has received relatively little attention in hardware design literature [15]. Sivakumar et al. explored VLSI architectures for computing the integer modulo operation for specific values of the modulus [8]. Because their designs were optimized for specific modulus values, their general applicability is limited.
Butler et al. presented a high-speed hardware implementation of the modulus operation optimized for FPGA deployment [15]. They introduced two versions of the algorithm to calculate x mod z: one with a fixed modulus and another where the modulus can vary. Their design demonstrated efficient pipelining and resource utilization but faced scalability issues for very large operand sizes.
Will et al. presented an efficient algorithm for modular reduction using a variable-sized lookup table, supporting large operands and relying on simple processor instructions, making it hardware friendly [10]. However, the authors did not provide a hardware implementation, and the complexity of managing the lookup tables could be a limitation.
Alia et al. introduced a method for efficiently computing x mod m using an approximation and correction approach that avoids direct division [16]. Their VLSI structure leveraged fast binary multipliers to handle 32-bit numbers. However, the complexity and resource requirements increased significantly for larger operands, limiting scalability.
While the aforementioned studies have made significant contributions to the field of hardware-based modular arithmetic, there remains a notable gap in the literature regarding efficient, scalable hardware implementations of the fundamental modulus operation. The majority of existing research focuses on composite modular operations or overall cryptographic algorithms, often overlooking the potential performance gains that could be achieved by optimizing this core operation.
3 Methodology
3.1 Hardware Algorithm for Modulus Operation
The proposed algorithm computes the modulus operation using addition, subtraction, logical, and bit shift operations, without the general multiplication or division operations that are expensive in hardware. The pseudocode of the proposed algorithm is given in Figure 1; it calculates the remainder of a dividend divided by a divisor.
The algorithm computes the modulus using a state machine approach, assuming unsigned operands and a non-zero divisor. It starts by initializing the state to IDLE and setting the variables dividend, divisor, shift, and done to zero. The main loop continuously checks the state. If the state is IDLE and the start signal is received, the algorithm loads the two input operands into dividend and divisor and transitions to the ALIGN state, where the divisor is left-shifted until it aligns with the dividend, incrementing the shift counter at each step.
Once alignment is achieved, the algorithm changes to the SUBTRACT state, where it subtracts the divisor from the dividend whenever the dividend is greater than or equal to the divisor. The divisor is then right-shifted, and the shift counter is decremented. This process repeats until the dividend is smaller than the original divisor or the shift counter is exhausted. Finally, the algorithm changes to the FINISH state, where the result is set to the current dividend, and the state returns to IDLE, indicating the operation is complete. This state machine approach ensures efficient and systematic computation of the modulus operation, making it suitable for FPGA implementation.
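The ALIGN and SUBTRACT phases described above can be sketched in software. The following is a minimal Python model under stated assumptions, not the Verilog of Figure 1: the function name `shift_subtract_mod`, the exact alignment test, and the early-exit condition are illustrative choices, and Python's unbounded integers stand in for the fixed-width registers.

```python
def shift_subtract_mod(a: int, b: int) -> int:
    """Return a mod b using only comparisons, shifts, and subtraction.

    Assumes a >= 0 and b > 0 (an assumption of this sketch).
    """
    if a < 0 or b <= 0:
        raise ValueError("expects a non-negative dividend and a positive divisor")

    dividend, divisor, shift = a, b, 0

    # ALIGN: shift the divisor left while doubling it still fits under the dividend.
    while (divisor << 1) <= dividend:
        divisor <<= 1
        shift += 1

    # SUBTRACT: conditionally subtract, then move the divisor back one position.
    while True:
        if dividend >= divisor:
            dividend -= divisor
        # Stop once all shifts are undone or the remainder is already below b.
        if shift == 0 or dividend < b:
            break
        divisor >>= 1
        shift -= 1

    # FINISH: the remaining dividend is the result.
    return dividend


print(shift_subtract_mod(17, 5))  # prints 2
```

In this model the ALIGN loop and the SUBTRACT loop each run roughly once per bit of length difference between the operands, which gives an intuition for why the cycle count depends on that difference rather than on the operand width.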
3.1.1 Hardware Implementation of the Modulus Operator
The algorithm was implemented using Verilog Hardware Description Language (HDL). The Xilinx Vivado design tool was employed for simulation and synthesis. Initially, the algorithm was tested with a variety of input values to verify its accuracy in the simulation. Following the initial verification, the Verilog implementation was modified to include the number of cycles consumed for each calculation.
For further analysis, a separate Vivado project was created that instantiates the modulus algorithm and records data directly to the computer. Figure 2 shows the block diagram of the system. The integer values, dividend and divisor, are encoded on the computer and transferred to the First-In-First-Out (FIFO) buffer in the FPGA through an 8-bit UART interface. The FPGA system decodes these two integers and sends them to the modulus instance. On the start signal, the modulus block begins its operation, and a timer counter begins to count the clock cycles consumed by the modulus block. As soon as the modulus operation completes, the done signal is asserted, which stops the timer counter. Finally, the modulus result and the cycle counter value are encoded and sent to the computer through the FIFO and UART interface. A Python program running on the computer sends the inputs to the FPGA system and receives its outputs. The recorded data included the dividend, divisor, result, and the number of cycles per calculation. This comprehensive dataset was subsequently analyzed to evaluate the performance and efficiency of the modulus algorithm.
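As an illustration of the host-side encoding step, the sketch below packs two operands into a byte stream suitable for an 8-bit link and unpacks them again. The framing is an assumption: the fixed-width, most-significant-byte-first layout and the helper names `encode_operand`, `decode_operand`, and `build_request` are hypothetical, since the actual byte format used between the Python program and the FPGA is not specified here.

```python
def encode_operand(value: int, bits: int) -> bytes:
    """Pack a non-negative integer into bits // 8 bytes, most significant byte first."""
    if value < 0 or value.bit_length() > bits:
        raise ValueError("operand does not fit in the requested width")
    return value.to_bytes(bits // 8, "big")


def decode_operand(payload: bytes) -> int:
    """Rebuild an integer from bytes received over the 8-bit link."""
    return int.from_bytes(payload, "big")


def build_request(dividend: int, divisor: int, bits: int) -> bytes:
    """Concatenate both operands into one byte stream (assumed framing)."""
    return encode_operand(dividend, bits) + encode_operand(divisor, bits)


request = build_request(1000, 7, 32)
print(request.hex())  # prints 000003e800000007
```

A fixed width per operand keeps the decoder on the FPGA side simple, because it can count bytes instead of parsing a length field.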
To measure the performance of the algorithm in FPGA hardware, various operand sizes were used. The algorithm can be extended to any bit length of operands by simply changing the integer value in the algorithm in Figure 1. For this study, operand sizes of 32-bit, 64-bit, 128-bit, 256-bit, 1024-bit, and 2048-bit were considered.
3.2 Application to Prime Number Calculation
The best way to measure the performance of the hardware algorithm for modulus operation is to use it in a practical application. In this study, we chose prime number calculation, as it involves repeated modulus operations and is computationally intensive. The application calculates all prime numbers up to a given integer. For example, if the given integer is 10, the application finds 2, 3, 5, and 7 as prime numbers. As the given integer increases, the number of computations required grows significantly. Therefore, this application provides an effective means to measure the performance of the new hardware algorithm for modulus operations.
The algorithm in Figure 3 was constructed to identify all prime numbers up to a given integer A, using only logical and addition operations alongside the modulus operation. Although this prime number finding algorithm may not be the most optimized solution and there may be better alternatives, it is sufficient for measuring the performance of the new modulus finding algorithm.
The algorithm begins by initializing several variables: an outer index (set to 1) that iterates through the candidate numbers, pCount to count the number of primes found, two auxiliary variables that hold the remainder and a selection flag, and an empty list primes to store the identified prime numbers. The algorithm enters a while loop that continues as long as the outer index is less than A. When the outer index equals 2, the algorithm identifies it as a prime number, appends it to the primes list, and increments pCount. For numbers greater than 2, the algorithm initializes an inner index to 3 and enters a nested while loop that continues as long as the inner index is less than or equal to the outer index.
Within the nested loop, if the inner index equals the outer index, the number is identified as a prime, appended to the primes list, and pCount is incremented before breaking out of the loop. Otherwise, the algorithm checks whether the selection flag is 0, in which case it calculates the remainder of the outer index modulo 2 and sets the flag to 1. If the flag is already set, it calculates the remainder of the outer index modulo the inner index. If the remainder is not 0, the inner index is incremented by 2, and the loop continues. After exiting the nested loop, the outer index is incremented by 1, and the outer loop continues until it reaches A. The function finally returns a list containing the number of primes found and the list of prime numbers.
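The prime search described above can be sketched as follows. This is an interpretation rather than a transcription of Figure 3: the variable names, the reset of the selection flag for each candidate, the exit on a zero remainder, and the choice to advance the trial divisor only after it has actually been tested are assumptions made so that the sketch is complete and correct. Python's built-in `%` stands in for the hardware modulus block.

```python
def find_primes(limit: int):
    """Return [count, primes] for the candidates below limit (trial division)."""
    primes = []
    p_count = 0
    candidate = 1
    while candidate < limit:
        if candidate == 2:
            primes.append(candidate)
            p_count += 1
        elif candidate > 2:
            trial = 3
            tested_two = False  # selection flag: has divisibility by 2 been checked?
            while trial <= candidate:
                if trial == candidate:
                    # No smaller divisor was found, so the candidate is prime.
                    primes.append(candidate)
                    p_count += 1
                    break
                if not tested_two:
                    remainder = candidate % 2
                    tested_two = True
                else:
                    remainder = candidate % trial
                    if remainder != 0:
                        trial += 2  # only odd trial divisors are needed
                if remainder == 0:
                    break  # a divisor was found, so the candidate is composite
        candidate += 1
    return [p_count, primes]


print(find_primes(10))  # prints [4, [2, 3, 5, 7]]
```

Every composite candidate exits the inner loop at its smallest divisor, while every prime candidate is tested against all smaller odd numbers, so the number of modulus operations grows quickly with the limit.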
3.2.1 Hardware Implementation of Prime Number Calculation
Figure 4 shows the datapath diagram of the prime calculation system. The system employs four registers to store the input integer, the two indexes, and the prime number. Comparison signals derived from these registers are passed to the controller; they include a signal that is asserted when a register value is equal to 2 and signals that are asserted when two register values are equal. The block mod represents the hardware implementation of the modulus algorithm.
A further signal is asserted if the result of the modulo operation equals zero. The signals start_mod and done_mod indicate the initiation and completion of the modulo calculation process, respectively.
Figure 5 shows the finite state machine (FSM) diagram that generates the control signals to manage the datapath. The FSM comprises a total of 13 states, and the process begins with the start signal. The control signals generated in each state are listed in Table 1. The datapath and the control FSM are implemented as separate Verilog modules and interfaced as depicted in Figure 6. Additionally, a counter module uses the start, prime_found, and done signals to determine the number of primes found and the total clock cycles utilized for the entire process.
Both hardware implementations, the modulus operation and the prime calculation, were configured using the Xilinx Vivado 2018.3 design suite, targeting a Digilent Zybo Zynq-7000 SoC Trainer FPGA board containing a Xilinx XC7Z010-1CLG400C. Logic resource utilization, power analysis, and timing analysis were conducted using the Vivado design suite for both the hardware modulus implementation and the prime number calculation separately.
State | Control Signals |
---|---|
WAIT | clr_regs = 1 |
= 1 | |
= 1 | |
= 1 | |
= 0 | |
if( == 0) | |
= 1 | |
start_mod = 1 | |
= 1 | |
= 1 | |
PRIME | prime_found = 1 |
= 1 | |
REPEAT | if( == 1) |
= 0 | |
else | |
= 1 | |
DONE | done = 1 |
The Verilog description of the datapath included several submodules, such as registers and multiplexers, in addition to the modulus module. These modules were instantiated in the data path module along with other logical statements, following the datapath diagram shown in Figure 4. The FSM was implemented using the two-always-block method in Verilog, with one block for state logic and the other for next-state logic. The top-level Verilog module integrates the FSM module and the datapath module. The complete system was then simulated, and the outputs were compared with software calculations to verify accuracy.
3.3 Measurements
In this study, two main experiments were conducted to measure the performance of the proposed hardware algorithm for modulus operation and its application in finding prime numbers up to a given integer value.
To measure the accuracy and performance of the hardware algorithm for modulus operation, operand sizes of 32-bit, 64-bit, 128-bit, 256-bit, 1024-bit, and 2048-bit were used. Uniform random integers were generated through a Python program and encoded to send to the FPGA board via a UART interface for processing. At the end of each operation, results were sent from the FPGA board to the computer and recorded for further analysis. For each operand size, 10,000 samples were recorded.
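The sampling step can be illustrated with a short host-side sketch. It assumes that the bit length difference is the bit length of the dividend minus the bit length of the divisor, and that the divisor is drawn from the non-zero values; the helper names `random_operands` and `bit_length_difference` are hypothetical.

```python
import random


def bit_length_difference(dividend: int, divisor: int) -> int:
    """BLD, taken here as bit length of the dividend minus that of the divisor."""
    return dividend.bit_length() - divisor.bit_length()


def random_operands(bits: int, rng: random.Random):
    """Draw one uniformly random operand pair of at most `bits` bits each."""
    dividend = rng.getrandbits(bits)
    divisor = rng.randrange(1, 1 << bits)  # exclude zero, which has no modulus
    return dividend, divisor


rng = random.Random(0)  # fixed seed so a run can be repeated
samples = [random_operands(2048, rng) for _ in range(5)]
for dividend, divisor in samples:
    print(bit_length_difference(dividend, divisor))
```

With uniformly distributed operands of the same width, most pairs have nearly equal bit lengths, so small differences dominate a sample drawn this way.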
In the second experiment, integer values of 10, 100, 1,000, 10,000, 100,000, 200,000, 300,000, 400,000, and 500,000 were considered. The prime numbers calculated up to each integer and the number of clock cycles used for each calculation were recorded. These data were analyzed to measure the performance.
To compare the performance of the prime number calculations in FPGA hardware, software implementations of the prime number calculation algorithm (Figure 3) were considered. In the software implementations, the built-in modulus operator (%) was used. Both Python and C were used to run the same integer set on a Windows 11 PC with 8 GB of RAM and a 12th Gen Intel Core i5 processor (1300 MHz, 10 cores, 12 logical processors, maximum turbo frequency of 4.9 GHz).
4 Results and Discussion
4.1 Modulus Algorithm Implementation
Our hardware-implemented modulus algorithm achieved 100% accuracy for all operand sizes when compared against software-based calculations. Figures 7(a), 7(b), and 7(c) show the performance of the hardware algorithm for calculating the modulus operation with 32-bit, 64-bit, and 128-bit operands, respectively.
The fitted equations of the three graphs are approximately equivalent. The slopes of all three equations are nearly identical, around 2. This indicates that the number of cycles required for the modulus operation increases linearly with the bit length difference (BLD) at a consistent rate. The near-constant slope shows that the algorithm's performance scales predictably regardless of the operand size, making it robust and reliable.
The intercepts are relatively low for all three cases. These values represent the base cycle count when the BLD is zero, indicating minimal overhead. The low intercepts suggest that the algorithm starts with a small number of cycles and adds cycles linearly as the BLD increases, highlighting the efficiency of the algorithm. Therefore, the cycle count and the BLD are related linearly, as expressed in Equation 1.
(1) |
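A linear relationship of this kind is typically obtained with an ordinary least-squares fit of cycle count against BLD. The sketch below shows the calculation on synthetic points only: the function name `fit_line` is hypothetical, and the sample values (slope 2 and intercept 3) are made up for illustration and are not the measured coefficients.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example data, NOT measurements: points lying exactly on y = 2x + 3.
bld_values = [0, 4, 8, 16, 31]
cycle_counts = [2 * x + 3 for x in bld_values]

slope, intercept = fit_line(bld_values, cycle_counts)
print(round(slope, 6), round(intercept, 6))
```

Applied to recorded (BLD, cycle count) pairs, the same function would return the slope and intercept of the fitted line for each operand size.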
Figure 7(d) illustrates the relationship between the cycle count and the BLD for an experimental data set in which the operand size is 2048 bits. The blue dots represent the experimental data points, while the red line represents the predicted cycle count based on Equation 1.
The red line closely follows the distribution of blue dots, indicating that the experimental data align well with the predicted cycle count. This shows that the prediction equation is highly accurate.
When the BLD is small, regardless of the operand sizes, the calculation completes in very few clock cycles. This is evident from the cluster of data points near the origin of the graph, where both BLD and cycle count are low. This indicates that the proposed hardware algorithm performs very efficiently when the BLD is small.
The hardware implementation of the modulus algorithm on the Zybo FPGA board, as analyzed using the Xilinx Vivado design environment, demonstrates efficient resource utilization, power consumption, and timing performance.
Table 2 summarizes the resource utilization of the proposed hardware algorithm implemented on the Zybo FPGA board for different operand sizes (32-bit, 256-bit, 1024-bit, and 2048-bit). The key resources measured include Look-Up Tables (LUTs), Flip-Flops (FFs), and Global Clock Buffers (BUFGs). The Zybo FPGA has a total of 17,600 LUTs and 35,200 FFs.
As the operand size increases, the utilization of LUTs and FFs increases. This is expected, as larger operand sizes require wider registers and more logic. The utilization of BUFG remains constant at 6% across all operand sizes. This indicates that the clock distribution requirements do not change with operand size.
Resource Utilization | 32-bit (%) | 256-bit (%) | 1024-bit (%) | 2048-bit (%) |
---|---|---|---|---|
Look-Up Tables (LUT) | 6 | 15 | 24 | 45 |
Flip-Flops (FF) | 5 | 9 | 11 | 15 |
Global Clock Buffers (BUFG) | 6 | 6 | 6 | 6 |
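The percentages in the table can be turned into approximate absolute resource counts using the device totals of 17,600 LUTs and 35,200 FFs. The short sketch below does this arithmetic; because the reported percentages are rounded, the results are estimates, and the names used are illustrative.

```python
TOTAL_LUTS = 17_600
TOTAL_FFS = 35_200

# Reported utilization in percent, keyed by operand size in bits.
LUT_PERCENT = {32: 6, 256: 15, 1024: 24, 2048: 45}
FF_PERCENT = {32: 5, 256: 9, 1024: 11, 2048: 15}


def approx_count(percent: int, total: int) -> int:
    """Convert a rounded utilization percentage into an approximate count."""
    return total * percent // 100


for bits in LUT_PERCENT:
    luts = approx_count(LUT_PERCENT[bits], TOTAL_LUTS)
    ffs = approx_count(FF_PERCENT[bits], TOTAL_FFS)
    print(f"{bits}-bit: about {luts} LUTs, {ffs} FFs")
```

For the 2048-bit case this gives about 7,920 LUTs, so a 64-fold increase in operand width from 32 bits costs roughly 7.5 times as many LUTs according to these rounded figures.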
The timing performance metrics of the 2048-bit implementation on the Zybo FPGA board are summarized in Table 3. The Worst Negative Slack (WNS) is 3.384 ns; because this value is positive, it is the margin by which the design meets the setup time requirements on its most critical path. Other metrics also indicate that the design meets all user-specified timing constraints. Based on the WNS, the maximum operating frequency is about 295.5 MHz.
Metric | Value |
---|---|
Worst Negative Slack (WNS) | 3.384 ns |
Total Negative Slack (TNS) | 0.000 ns |
Number of Failing Endpoints | 0 |
Total Number of Endpoints (Setup) | 1038 |
Worst Hold Slack (WHS) | 0.049 ns |
Total Hold Slack (THS) | 0.000 ns |
Number of Failing Endpoints (Hold) | 0 |
Total Number of Endpoints (Hold) | 1030 |
Worst Pulse Width Slack (WPWS) | 3.500 ns |
Total Pulse Width Negative Slack (TPWS) | 0.000 ns |
Number of Failing Endpoints (Pulse Width) | 0 |
Total Number of Endpoints (Pulse Width) | 485 |
The power performance metrics are detailed in Table 4. The total on-chip power consumption is 0.174 W, divided into dynamic and static components. The dynamic power is 0.083 W, which is the power consumed by the switching activity of the circuit. The static power of 0.091 W is the power consumed by leakage currents. The junction temperature is maintained at 27.0 °C, with a thermal margin of 58.0 °C, indicating effective thermal management.
The 2048-bit implementation on the Zybo FPGA board demonstrates robust timing performance, with no violations in setup, hold, or pulse width requirements, and can operate at a maximum frequency of approximately 295.5 MHz. The power consumption is efficiently managed, with a balanced distribution between dynamic and static power components. These results validate the efficacy of the proposed hardware modulus algorithm for high-performance applications on FPGA platforms.
Parameter | Value |
---|---|
Total On-Chip Power | 0.174 W |
Dynamic Power | 0.083 W (48%) |
Clocks | 0.001 W (1%) |
Signals | 0.038 W (46%) |
Logic | 0.044 W (53%) |
Device Static Power | 0.091 W (52%) |
Junction Temperature | 27.0 °C |
Thermal Margin | 58.0 °C (4.9 W) |
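The percentages in the power table can be cross-checked from the reported wattages. The sketch below assumes that the dynamic and static shares are taken relative to the total on-chip power and that the clock, signal, and logic shares are taken relative to the dynamic power; the variable and function names are illustrative.

```python
TOTAL_W = 0.174
DYNAMIC_W = 0.083
STATIC_W = 0.091
DYNAMIC_PARTS_W = {"clocks": 0.001, "signals": 0.038, "logic": 0.044}


def share(part: float, whole: float) -> int:
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


print("dynamic:", share(DYNAMIC_W, TOTAL_W), "% of total")
print("static:", share(STATIC_W, TOTAL_W), "% of total")
for name, watts in DYNAMIC_PARTS_W.items():
    print(f"{name}: {share(watts, DYNAMIC_W)} % of dynamic")
```

Under these assumptions the computed shares match the table, and the three dynamic components add up to the reported dynamic power.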
4.2 Limitations of the Algorithm
One limitation of our algorithm is that the cycle count required to complete modulus operations is not fixed, as it depends on the bit lengths of the dividend and the divisor. This variability can create challenges in determining the latency and throughput of designs. Another limitation is that the cycle count becomes significantly large when there is a considerable difference in the bit lengths of the operands. Addressing these limitations will be the focus of future improvements to the algorithm.
4.3 Prime Number Calculation
Figure 8 illustrates the time taken to compute prime numbers up to a given integer A using three different approaches: a hardware implementation on the Zybo FPGA running at 125 MHz, and software implementations in Python and C on a high-performance Windows 11 PC. The y-axis is set to a logarithmic scale to highlight the differences in performance across several orders of magnitude, while the x-axis represents the integer value up to which primes are calculated.
From Figure 8 we can observe that the Python implementation shows significantly longer computation times than both the hardware and C implementations, while the C implementation performs better than the hardware implementation. Notably, the hardware algorithm on the Zybo FPGA currently operates at 125 MHz, but it has the potential to operate at a maximum frequency of 295.5 MHz. If the hardware were run at this higher frequency, its performance could potentially surpass that of the C implementation on the high-performance PC, particularly for larger integer values.
Integer A | Primes found up to A | Zybo FPGA @ 125 MHz, Time (s) | Win 11 PC (Python), Time (s) | Win 11 PC (C), Time (s) |
---|---|---|---|---|
10 | 4 | 0.000000680 | 0.000020943 | |
100 | 25 | 0.000064580 | 0.000998400 | 0.000040000 |
1,000 | 168 | 0.000468280 | 0.007677970 | 0.000176000 |
10,000 | 1,229 | 0.034511140 | 0.487490000 | 0.018000000 |
100,000 | 9,592 | 2.710378531 | 58.110090000 | 0.965000000 |
200,000 | 17,984 | 10.184829460 | 231.585353000 | 3.374000000 |
300,000 | 25,997 | 22.087260698 | 503.831983000 | 8.135000000 |
400,000 | 33,380 | 38.444923792 | 870.262950000 | 18.968000000 |
500,000 | 41,538 | 58.998911240 | 1112.491425000 | 29.179000000 |
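The timings in the table can be related to each other with some simple arithmetic. The sketch below converts an FPGA time into clock cycles at 125 MHz, computes speed ratios for A = 500,000, and projects the FPGA time at the estimated 295.5 MHz. The projection is an assumption: it supposes that the cycle count stays the same and that the whole prime calculation system could actually be clocked at that frequency.

```python
FPGA_CLOCK_HZ = 125e6
ESTIMATED_MAX_HZ = 295.5e6

# Times in seconds for A = 500,000, copied from the table.
FPGA_S = 58.998911240
PYTHON_S = 1112.491425000
C_S = 29.179000000


def to_cycles(seconds: float, clock_hz: float = FPGA_CLOCK_HZ) -> int:
    """Number of clock cycles that elapse in `seconds` at `clock_hz`."""
    return round(seconds * clock_hz)


def projected_time(seconds: float, new_clock_hz: float) -> float:
    """Time for the same cycle count at a different clock frequency."""
    return seconds * FPGA_CLOCK_HZ / new_clock_hz


print("cycles for A = 10:", to_cycles(0.000000680))
print("Python / FPGA:", round(PYTHON_S / FPGA_S, 2))
print("FPGA / C:", round(FPGA_S / C_S, 2))
print("projected FPGA time (s):", round(projected_time(FPGA_S, ESTIMATED_MAX_HZ), 2))
```

On these figures the FPGA is between 18 and 19 times faster than Python and about twice as slow as C at 125 MHz, and the projected time at the higher clock falls below the measured C time.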
4.4 Comparison
Butler et al. [15] presented results for computing x mod z in two scenarios: with z fixed at 3 and with z as a variable. For both cases, using 256-bit numbers for x, their system achieved an operating frequency of 143.8 MHz. The FPGA resource utilization was similar in both scenarios, consuming approximately 55,001 and 55,255 3-input LUTs for fixed and variable z, respectively.
In an earlier study, Alia et al. [16] proposed a method using multipliers for calculating x mod m. Their implementation achieved a response time of 200 ns for 32-bit numbers, equivalent to an operating frequency of 5 MHz.
Our implementation of the modulus algorithm shows significant improvements over these previous works. For 2048-bit operand sizes, our design utilized only 45% of the available LUTs (17,600 total) on a Zybo FPGA board. Compared to Butler et al. [15], our implementation uses substantially fewer FPGA resources while achieving a higher operating frequency of 295.5 MHz. This represents a notable advancement in both resource efficiency and performance for modular arithmetic operations on FPGAs.
4.5 Discussion
The linear relationship between the bit length difference (BLD) and the cycle count, as demonstrated in Figure 7, highlights the efficiency of the proposed hardware algorithm for modulus operation. The algorithm scales predictably, maintaining efficiency even as operand sizes increase. This linear scalability is crucial for applications requiring high-speed computations with varying operand sizes.
Figure 7(d) provides additional validation of the algorithm's performance by comparing experimental data with the predicted cycle count. The close alignment of the experimental data points with the cycle count predicted by Equation 1 highlights the accuracy and robustness of the prediction model. The experimental results confirm that the hardware algorithm performs efficiently and predictably across different BLD values, with the cycle count growing only linearly as the BLD increases.
Analysis of FPGA resource utilization (Table 2) demonstrates that our algorithm efficiently uses hardware resources even for large operand sizes, with the 2048-bit implementation using only 7,920 LUTs (45% of those available). Timing performance results (Table 3) confirm that this 2048-bit implementation meets all constraints without violations in setup, hold, or pulse width requirements, achieving a maximum frequency of 295.5 MHz. Furthermore, power consumption data (Table 4) indicate a balanced distribution between dynamic and static components, with the total on-chip power maintained at an efficient 0.174 W.
The performance comparison of prime number calculations, illustrated in Figure 8, shows that the hardware implementation on the Zybo FPGA outperforms the Python implementation and approaches the performance of the C implementation on a high-performance PC. Notably, the hardware implementation operates at 125 MHz, but with potential for higher performance at the maximum frequency of 295.5 MHz. This suggests that, with further optimization, the hardware implementation could surpass the software implementations, particularly for larger integer values.
The results validate the efficacy of the proposed hardware algorithm for modulus operations and its application in prime number calculations. The linear scalability, efficient resource utilization, robust timing performance, and effective power management make the algorithm well-suited for a wide range of hardware applications. Future work could explore further optimization to increase the operating frequency and reduce power consumption, as well as adaptations for other computationally intensive tasks.
5 Conclusion
This study introduces a novel hardware algorithm for the modulus operation, optimized for FPGA implementation. The algorithm's linear scalability and low overhead, demonstrated through extensive testing with operand sizes from 32-bit to 2048-bit, make it a robust solution for high-speed computations. The implementation on the Zybo FPGA platform confirms efficient resource utilization, robust timing performance, and effective power management, underscoring its suitability for both high-performance and resource-constrained platforms. Its application in prime number calculation further validates its practical use, showing substantial performance improvements over the Python software implementation and performance approaching that of the C implementation. The findings suggest that this hardware algorithm can notably accelerate cryptographic protocols and other applications reliant on modular arithmetic, offering a promising direction for further research and development in hardware-based arithmetic operations.
References
- [1] A. A. Oke, B. A. Nathaniel, B. F. Bukola, O. A. Ayopo, Residue number system based applications: A literature review, Annals. Computer Science Series 19 (2021).
- [2] S. Tynymbayev, R. Berdibayev, T. Omar, Y. Aitkhozhayeva, A. Shaikulova, S. Adilbekkyzy, High-speed devices for modular reduction with minimal hardware costs, Cogent Engineering 6 (2019) 1697555.
- [3] A. Parihar, S. Nakhate, Low latency high throughput Montgomery modular multiplier for RSA cryptosystem, Engineering Science and Technology, an International Journal 30 (2022) 101045.
- [4] A. A. Abd-Elkader, M. Rashdan, E.-S. A. Hasaneen, H. F. Hamed, A compact FPGA-based Montgomery modular multiplier, Indonesian Journal of Electrical Engineering and Computer Science 21 (2021) 735–743.
- [5] J. Ding, S. Li, A low-latency and low-cost Montgomery modular multiplier based on NLP multiplication, IEEE Transactions on Circuits and Systems II: Express Briefs 67 (2019) 1319–1323.
- [6] M. M. Islam, M. S. Hossain, M. Shahjalal, M. K. Hasan, Y. M. Jang, Area-time efficient hardware implementation of modular multiplication for elliptic curve cryptography, IEEE Access 8 (2020) 73898–73906.
- [7] J. W. Bos, S. J. Friedberger, Faster modular arithmetic for isogeny-based crypto on embedded devices, Journal of Cryptographic Engineering 10 (2020) 97–109.
- [8] R. Sivakumar, N. Dimopoulos, VLSI architectures for computing x mod m, IEE Proceedings-Circuits, Devices and Systems 142 (1995) 313–320.
- [9] R. Müller, W. Meier, C. F. Wildfeuer, Area efficient modular reduction in hardware for arbitrary static moduli, arXiv preprint arXiv:2308.15079 (2023).
- [10] M. A. Will, R. K. Ko, Computing mod with a variable lookup table, in: Security in Computing and Communications: 4th International Symposium, SSCC 2016, Jaipur, India, September 21-24, 2016, Proceedings 4, Springer, 2016, pp. 3–17.
- [11] S. Vollala, N. Ramasubramanian, U. Tiwari, Review of Algorithmic Techniques for Improving the Performance of Modular Exponentiation, Springer International Publishing, 2021.
- [12] M. R. Hossain, M. S. Hossain, Efficient FPGA implementation of modular arithmetic for elliptic curve cryptography, in: 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), IEEE, 2019, pp. 1–6.
- [13] M. Langhammer, B. Pasca, Efficient FPGA modular multiplication implementation, in: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021, pp. 217–223.
- [14] A. Ibrahim, F. Gebali, Word-based processor structure for Montgomery modular multiplier suitable for compact IoT edge devices, Mathematics 11 (2023) 328.
- [15] J. T. Butler, T. Sasao, Fast hardware computation of x mod z, in: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IEEE, 2011, pp. 294–297.
- [16] G. Alia, E. Martinelli, A VLSI structure for x (mod m) operation, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 1 (1990) 257–264.