High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA
Abstract
1. Introduction
1. We propose a high-speed, unified ECC processor that is generic for arbitrary prime moduli on Weierstrass curves. To the best of our knowledge, it is the fastest generic implementation reported in the existing literature.
2. For the underlying architecture, we propose a novel and fast pipelined Montgomery Modular Multiplier (pMMM), which is constructed from an n-bit pipelined multiplier-accumulator. The speed-up comes from combining two existing multiplication algorithms, schoolbook long multiplication and Karatsuba–Ofman multiplication, which enables parallelization of the digit multiplications while preserving low complexity. Moreover, to further optimize the process, we utilize DSP cores as digit multipliers, resulting in a higher-speed multiplier compared to other existing methods.
3. To match the speed of our fast pMMM, we also propose a unified and pipelined Modular Adder/Subtractor (pMAS) for the underlying field arithmetic operations. In particular, we modify the modular adder/subtractor in [11] to support pipelining and employ an adjustable radix. The proposed design offers better flexibility in adjusting the performance of the ECC processor.
4. Additionally, we propose a more efficient and compact scheduling of the Montgomery ladder algorithm for ECPM in [22]. Our implementation does not require any additional temporary register, as opposed to the one additional register in the original algorithm. As a result, it needs only 97 clock cycles per scalar bit to perform the ladder operation (for 256-bit size).
5. Since our ECC processor and the underlying field multiplier (i.e., pMMM) are generic for arbitrary prime moduli, we can support multiple curve parameters in a single ECC processor, forming a unified ECC architecture.
6. Lastly, our architecture performs the ECPM in constant time by employing a time-invariant algorithm for each module, including the use of Fermat's little theorem to carry out field inversion, which makes the algorithm resistant to timing-based side-channel attacks.
2. Preliminaries
2.1. Hamburg’s Formula for ECPM with Montgomery Ladder
Algorithm 1 Hamburg's Montgomery Ladder Formula [22].
Algorithm 2 Montgomery Ladder.
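To make the ladder structure concrete, the following is a minimal Python sketch of a generic Montgomery ladder. It is only an illustration under stated assumptions: it uses textbook affine addition instead of Hamburg's formulas [22], and the toy curve y^2 = x^3 + 2x + 3 over GF(97), the base point (3, 6), and all function names are hypothetical choices that do not come from this work. The sketch branches on the scalar bit, so unlike the hardware it is not constant-time.

```python
# Toy curve y^2 = x^3 + A*x + B over GF(P); parameters are illustrative only.
P, A, B = 97, 2, 3
INF = None  # point at infinity

def inv(x):
    # Field inversion via Fermat's little theorem: x^(P-2) mod P.
    return pow(x % P, P - 2, P)

def add(p1, p2):
    """Textbook affine addition; also handles doubling and infinity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF  # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P  # tangent slope
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P         # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def ladder(k, point, bits):
    """Montgomery ladder: one addition and one doubling per scalar bit."""
    r0, r1 = INF, point  # invariant: r1 - r0 == point
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), add(r1, r1)
        else:
            r0, r1 = add(r0, r0), add(r0, r1)
    return r0

def naive(k, point):
    """Reference result by repeated addition, used only for checking."""
    acc = INF
    for _ in range(k):
        acc = add(acc, point)
    return acc

G = (3, 6)
assert (G[1] ** 2 - (G[0] ** 3 + A * G[0] + B)) % P == 0  # G lies on the curve
assert ladder(29, G, 8) == naive(29, G)
```

Every iteration performs the same pair of operations regardless of the bit value, which is the property that a fixed per-bit schedule in hardware relies on.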
2.1.1. Ladder Setup
2.1.2. Ladder Final
2.2. Montgomery Modular Multiplication
Algorithm 3 Montgomery Multiplication.
Input: an odd modulus p of n bits, together with the operands and the precalculated constant. Output: the Montgomery product.
1: ▹ 1st multiplication
2: ▹ 2nd multiplication
3: ▹ 3rd multiplication
4: ▹ subtraction, producing u
5: if (test on the MSB of u) then
6: return t
7: else
8: return u
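The three multiplications, the subtraction, and the final selection listed in Algorithm 3 can be sketched with the textbook Montgomery multiplication below. This is a hedged illustration, not a transcription of the algorithm box: only `p`, `n`, `t`, and `u` follow the names visible in the text, while `R = 2^n`, `c`, `q`, `p_neg_inv`, and both function names are assumptions introduced here. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
def montgomery_setup(p, n):
    """Precompute R = 2^n and -p^(-1) mod R for an odd modulus p < R."""
    r = 1 << n
    p_neg_inv = (-pow(p, -1, r)) % r
    return r, p_neg_inv

def montgomery_multiply(a, b, p, n, p_neg_inv):
    """Return a*b*R^(-1) mod p for 0 <= a, b < p, with R = 2^n."""
    mask = (1 << n) - 1
    c = a * b                            # 1st multiplication: 2n-bit product
    q = ((c & mask) * p_neg_inv) & mask  # 2nd multiplication: low n bits only
    t = (c + q * p) >> n                 # 3rd multiplication, add c, drop low n bits
    u = t - p                            # subtraction
    return t if u < 0 else u             # the sign (MSB) of u selects the result

# Usage with an illustrative 32-bit odd modulus (not a curve parameter).
p, n = 0xFFFFFFFB, 32
r, p_neg_inv = montgomery_setup(p, n)
a, b = 123456789, 987654321
a_m, b_m = a * r % p, b * r % p          # operands in Montgomery form
assert montgomery_multiply(a_m, b_m, p, n, p_neg_inv) == a * b * r % p
```

Because `c + q*p` is always a multiple of `R`, the right shift is exact, and since `t < 2p` a single conditional subtraction is enough to bring the result below `p`.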
3. Proposed Architecture
3.1. Pipelined Montgomery Modular Multiplication (pMMM)
3.1.1. Overview of pMMM
3.1.2. Proposed Pipelined Multiplier-Accumulator
- Stage-1: The two inputs A and B are split into digits according to the radix (digit size), which is 16 bits in our design. Parallel 16-bit RCAs then compute the digit sums needed by the Karatsuba–Ofman method. At the same time, parallel DSP cores are utilized as 16-bit digit multipliers to compute the digit products. As shown in Figure 2a, we employ a two-stage pipeline for the DSP cores to achieve better performance, as recommended in [32].
- Stage-2: We again utilize the DSP cores, this time as 17-bit Multiply-Accumulate (MAC) units, to compute the Karatsuba–Ofman multiplication. Its operands are obtained from the outputs of the RCAs at the first stage, as shown in Figure 2b.
- Stage-3: The outputs of the 16-bit multipliers are routed to the accumulator inputs of the MAC modules.
- Stage-4: The final accumulation for Karatsuba–Ofman is computed by a 34-bit RCA, and the result is 33 bits long. At this stage, an additional input is set when the CTL value is 3, meaning that this input is ready to be included in the CSAT at Stage 5 as the final accumulation of the Montgomery reduction algorithm presented in Algorithm 3.
- Stage-5: Before being processed by the CSAT, all intermediate values are aligned to reduce the number of CSAT inputs as well as the depth of the tree. Alignment is needed because each intermediate value carries one additional bit, i.e., it is 33 bits long instead of 32. Figure 3 shows an example of the alignment process for a four-input CSAT. All aligned intermediate values, including the additional input, are summed by the CSAT, whose compressor components in the CSA use LUT6_2, similar to the 3:2 compressor circuit proposed in [11]. However, while they use multiple compressor circuits (e.g., a 4:2 compressor in [11]) to construct the multiplier, we employ a homogeneous 3:2 compressor to achieve a balanced performance, as illustrated in Figure 4.
- Stage-6 and 7: The sum and carry outputs of the CSAT are then fed to the carry-select adder to obtain the final product. Note that we use the carry-select adder proposed by Nguyen et al. [33] due to its relatively short propagation delay. In the carry-select adder of [33], both options for the carry are computed. Subsequently, the carry is resolved similarly to that of the carry-lookahead adder (CLA). Lastly, the sum output is generated with the final carry for each bit [34].
- Stage-8: A register is used to hold the output product. The accompanying outputs correspond to their respective input values, which are shifted through the stages via a shift register.
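The arithmetic behind Stages 1 to 5 can be sketched in software as follows. This is a minimal model under assumptions, not the exact dataflow of the pMMM: it assumes that the schoolbook and Karatsuba–Ofman combination means computing each digit product once and deriving every pair of cross terms from one extra multiplication of digit sums. The function names and the `num_digits` parameter are hypothetical.

```python
RADIX = 16                 # digit size in bits, as stated in the text
MASK = (1 << RADIX) - 1

def split_digits(x, num_digits):
    """Split x into num_digits little-endian 16-bit digits."""
    return [(x >> (RADIX * i)) & MASK for i in range(num_digits)]

def ko_schoolbook_multiply(a, b, num_digits):
    """Multiply a and b using d*(d+1)/2 digit multiplications."""
    ad = split_digits(a, num_digits)
    bd = split_digits(b, num_digits)
    # Independent 16-bit digit products a_i * b_i.
    diag = [x * y for x, y in zip(ad, bd)]
    result = 0
    for i in range(num_digits):
        result += diag[i] << (2 * RADIX * i)
        for j in range(i + 1, num_digits):
            # One multiplication of two 17-bit digit sums yields
            # a_i*b_j + a_j*b_i after subtracting the two digit products.
            cross = (ad[i] + ad[j]) * (bd[i] + bd[j]) - diag[i] - diag[j]
            assert cross < (1 << 33)  # the cross term fits in 33 bits
            result += cross << (RADIX * (i + j))
    return result

# Usage: two 256-bit operands split into 16 digits.
a = (1 << 256) - 189
b = (1 << 255) + 0x1234567890ABCDEF
assert ko_schoolbook_multiply(a, b, 16) == a * b
```

For 16 digits this scheme needs 16 + 120 = 136 digit multiplications instead of the 256 required by plain schoolbook multiplication, a figure that coincides with the 136 DSP slices listed in the resource tables. All partial products are independent, so they can be computed in parallel and summed afterwards by an adder tree.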
3.1.3. Montgomery Modular Multiplication Using pMMM
1. The pMMM starts by multiplying the two n-bit inputs, resulting in a 2n-bit product, which is then stored in the first-in, first-out (FIFO) buffer. This product will be used later in the third multiplication. Note that our FIFO buffer uses block RAM (BRAM) to reduce the required number of registers, where the depth of the FIFO buffer depends on the number of multiplication processes that can be executed concurrently.
2. The n least significant bits of the product of Step 1 are multiplied by the precalculated constant.
3. Accordingly, the n least significant bits of the product of Step 2 are multiplied by the modulus p. In this multiplication, the product that was previously stored in the FIFO at Step 1 is used as an additional input to the CSAT in the multiplier module. The benefit is that no separate 2n-bit adder is needed, because the addition is absorbed into the CSAT.
4. The n most significant bits of the third multiplication product are then evaluated and corrected using the carry-select subtractor, so that the output of pMMM is within the range [0, p).
3.2. Pipelined Modular Adder/Subtractor (pMAS)
3.3. Modular Inversion Implementation
Algorithm 4 Constant-time Field Inversion algorithm.
Input: a and prime modulus p of n bits. Output: $a^{-1} \bmod p$.
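Since the steps of Algorithm 4 are not reproduced here, the following Python sketch shows one generic way to invert with Fermat's little theorem, $a^{-1} \equiv a^{p-2} \pmod{p}$, using a fixed number of iterations. It is an assumption-laden illustration: the right-to-left exponentiation order, the function name, and the variable names are choices made for this example, and Python integers are not constant-time, so only the operation sequence is fixed.

```python
def fermat_inverse(a, p, n):
    """Return a^(p-2) mod p with exactly n multiply and n square steps."""
    e = p - 2
    result, base = 1, a % p
    for i in range(n):
        product = result * base % p  # always computed, even if discarded
        square = base * base % p     # independent of product
        result = product if (e >> i) & 1 else result  # select, no early exit
        base = square
    return result

# Usage with the NIST P-256 prime as a 256-bit example modulus.
p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
a = 0x1234567890ABCDEF1234567890ABCDEF
assert a * fermat_inverse(a, p256, 256) % p256 == 1
```

The multiplication and the squaring in each iteration do not depend on each other, so a pipelined multiplier could in principle process them back to back; whether the hardware schedules them this way is not stated in the text above.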
3.4. Montgomery Ladder Scheduling
3.5. Generic ECC Architecture
Unified Architecture
4. Hardware Implementation Result and Discussion
4.1. Result and Analysis of Generic Implementation on Weierstrass Curve
4.2. Result and Analysis of Unified ECC Architecture
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ullah, H.; Nair, N.G.; Moore, A.; Nugent, C.; Muschamp, P.; Cuevas, M. 5G communication: An overview of vehicle-to-everything, drones, and healthcare use-cases. IEEE Access 2019, 7, 37251–37268.
2. Park, J.H.; Park, J.H. Blockchain security in cloud computing: Use cases, challenges, and solutions. Symmetry 2017, 9, 164.
3. Suárez-Albela, M.; Fernández-Caramés, T.M.; Fraga-Lamas, P.; Castedo, L. A Practical Performance Comparison of ECC and RSA for Resource-Constrained IoT Devices. In Proceedings of the 2018 Global Internet of Things Summit (GIoTS), Bilbao, Spain, 4–7 June 2018; pp. 1–6.
4. Wood, G. Ethereum: A secure decentralised generalised transaction ledger. arXiv 2014, arXiv:1011.1669v3.
5. Mai, L.; Yan, Y.; Jia, S.; Wang, S.; Wang, J.; Li, J.; Ma, S.; Gu, D. Accelerating SM2 Digital Signature Algorithm Using Modern Processor Features. In Proceedings of the International Conference on Information and Communications Security, Beijing, China, 15–17 December 2019; pp. 430–446.
6. Yang, A.; Nam, J.; Kim, M.; Choo, K.K.R. Provably-secure (Chinese government) SM2 and simplified SM2 key exchange protocols. Sci. World J. 2014, 2014, 825984.
7. Blake-Wilson, S.; Bolyard, N.; Gupta, V.; Hawk, C.; Moeller, B. Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS). RFC 4492, IETF. 2006. Available online: https://tools.ietf.org/html/rfc4492 (accessed on 28 December 2020).
8. National Institute of Standards and Technology. FIPS 186-4–Digital Signature Standard (DSS); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2013.
9. Mehrabi, M.A.; Doche, C.; Jolfaei, A. Elliptic curve cryptography point multiplication core for hardware security module. IEEE Trans. Comput. 2020, 69, 1707–1718.
10. Gallant, R.P.; Lambert, R.J.; Vanstone, S.A. Faster point multiplication on elliptic curves with efficient endomorphisms. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; pp. 190–200.
11. Roy, D.B.; Mukhopadhyay, D. High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 1587–1600.
12. Costello, C.; Longa, P.; Naehrig, M. Efficient algorithms for supersingular isogeny Diffie-Hellman. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 14–18 August 2016; pp. 572–601.
13. Miller, V.S. The Weil pairing, and its efficient calculation. J. Cryptol. 2004, 17, 235–261.
14. Asif, S.; Hossain, M.S.; Kong, Y.; Abdul, W. A fully RNS based ECC processor. Integration 2018, 61, 138–149.
15. Bajard, J.C.; Merkiche, N. Double level Montgomery Cox-Rower architecture, new bounds. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Paris, France, 5–7 November 2014; pp. 139–153.
16. Ma, Y.; Liu, Z.; Pan, W.; Jing, J. A High-Speed Elliptic Curve Cryptographic Processor for Generic Curves over GF(p). In Proceedings of the International Conference on Selected Areas in Cryptography, Burnaby, BC, Canada, 14–16 August 2013; pp. 421–437.
17. Shah, Y.A.; Javeed, K.; Azmat, S.; Wang, X. A high-speed RSD-based flexible ECC processor for arbitrary curves over general prime field. Int. J. Circuit Theory Appl. 2018, 46, 1858–1878.
18. Lai, J.Y.; Wang, Y.S.; Huang, C.T. High-performance architecture for elliptic curve cryptography over prime fields on FPGAs. Interdiscip. Inf. Sci. 2012, 18, 167–173.
19. Vliegen, J.; Mentens, N.; Genoe, J.; Braeken, A.; Kubera, S.; Touhafi, A.; Verbauwhede, I. A compact FPGA-based architecture for elliptic curve cryptography over prime fields. In Proceedings of the ASAP 2010-21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Rennes, France, 7–9 July 2010; pp. 313–316.
20. Hu, X.; Zheng, X.; Zhang, S.; Cai, S.; Xiong, X. A low hardware consumption elliptic curve cryptographic architecture over GF(p) in embedded application. Electronics 2018, 7, 104.
21. Karatsuba, A.A.; Ofman, Y.P. Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk. Russ. Acad. Sci. 1962, 145, 293–294.
22. Hamburg, M. Faster Montgomery and double-add ladders for short Weierstrass curves. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020.
23. Ding, J.; Li, S.; Gu, Z. High-Speed ECC Processor Over NIST Prime Fields Applied With Toom–Cook Multiplication. IEEE Trans. Circuits Syst. Regul. Pap. 2019, 66, 1003–1016.
24. Devlin, B. Blockchain Acceleration Using FPGAs—Elliptic Curves, zk-SNARKs, and VDFs; ZCASH Foundation, 2019; Available online: https://github.com/ZcashFoundation/zcash-fpga (accessed on 28 December 2020).
25. Alrimeih, H.; Rakhmatov, D. Fast and Flexible Hardware Support for ECC Over Multiple Standard Prime Fields. IEEE Trans. Very Large Scale Integr. Syst. 2014, 22, 2661–2674.
26. Güneysu, T.; Paar, C. Ultra High Performance ECC over NIST Primes on Commercial FPGAs. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2008; pp. 62–78.
27. Fan, J.; Verbauwhede, I. An updated survey on secure ECC implementations: Attacks, countermeasures and cost. In Cryptography and Security: From Theory to Applications; Springer: Berlin/Heidelberg, Germany, 2012; pp. 265–282.
28. Galbally, J. A new Foe in biometrics: A narrative review of side-channel attacks. Comput. Secur. 2020, 96, 101902.
29. Montgomery, P.L. Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. 1987, 48, 243–264.
30. Montgomery, P.L. Modular Multiplication Without Trial Division. Math. Comput. 1985.
31. Xilinx. UG953: Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide. Available online: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug953-vivado-7series-libraries.pdf (accessed on 28 December 2020).
32. Xilinx. 7 Series DSP48E1 Slice User Guide. 2018. Available online: https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf (accessed on 28 December 2020).
33. Nguyen, H.D.; Pasca, B.; Preußer, T.B. FPGA-specific arithmetic optimizations of short-latency adders. In Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, Chania, Greece, 5–7 September 2011; pp. 232–237.
34. Massolino, P.M.C.; Longa, P.; Renes, J.; Batina, L. A Compact and Scalable Hardware/Software Co-design of SIKE. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020.
35. Liskov, M. Fermat's Little Theorem. In Encyclopedia of Cryptography and Security; Springer: Boston, MA, USA, 2005; p. 221.
36. Kawamura, S.; Koike, M.; Sano, F.; Shimbo, A. Cox-rower architecture for fast parallel montgomery multiplication. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Bruges, Belgium, 14–18 May 2000; pp. 523–538.
37. Qu, M. Sec 2: Recommended Elliptic Curve Domain Parameters; Tech. Rep. SEC2-Ver-0.6; Certicom Res.: Mississauga, ON, Canada, 1999.
38. Hu, X.; Zheng, X.; Zhang, S.; Li, W.; Cai, S.; Xiong, X. A high-performance elliptic curve cryptographic processor of SM2 over GF(p). Electronics 2019, 8, 431.
39. Lochter, M.; Merkle, J. Elliptic Curve Cryptography (ECC) Brainpool Standard-Curves and Curve Generation. RFC 5639, IETF. 2010. Available online: https://tools.ietf.org/html/rfc5639 (accessed on 28 December 2020).
40. Amiet, D.; Curiger, A.; Zbinden, P. Flexible FPGA-Based Architectures for Curve Point Multiplication over GF(p). In Proceedings of the 19th Euromicro Conference on Digital System Design, DSD 2016, Limassol, Cyprus, 31 August–2 September 2016.
41. Wu, T.; Wang, R. Fast unified elliptic curve point multiplication for NIST prime curves on FPGAs. J. Cryptogr. Eng. 2019, 9, 401–410.
42. Morales-Sandoval, M.; Diaz-Perez, A. Novel algorithms and hardware architectures for Montgomery Multiplication over GF(p). IACR Cryptol. ePrint Arch. 2015, 2015, 696.

Designs | Platform | Slices | DSP | BRAM | Max. Freq. (MHz) | Cycles | Time (ms) | Time × Area |
---|---|---|---|---|---|---|---|---|
This work | Virtex-7 | 6909 | 136 | 15 | 232.3 | 32.3k | 0.139 | 0.96 |
This work | Kintex-7 | 7115 | 136 | 15 | 234.1 | 32.3k | 0.138 | 0.98 |
This work | XC7Z020 | 7077 | 136 | 15 | 156.8 | 32.3k | 0.206 | 1.46 |
Roy et al. [11] | XC7Z020 | 2223 | 40 | 9 | 208.3 | 95.5k | 0.459 | 1.02 |
Bajard et al. [15] | Kintex-7 | 1630 | 46 | 16 | 281.5 | 172.3k | 0.612 | 1.00 |
Asif et al. [14] | Virtex-7 | 18.8k (LUT) | - | - | 86.6 | 63.2k | 0.730 | 3.43 |
Ma et al. [16] | Virtex-5 | 1725 | 37 | - | 291 | 110.6k | 0.380 | 0.66 |
Lai et al. [18] | Virtex-5 | 3657 | 10 | 10 | 263 | 226.2k | 0.860 | 3.15 |
Shah et al. [17] | Virtex-6 | 44.3k (LUT) | - | - | 221 | 143.7k | 0.650 | 7.20 |
Vliegen et al. [19] | Virtex-II Pro | 1947 | 7 | 9 | 68.17 | 1074.4k | 15.760 | 30.68 |
Hu et al. [20] | Virtex-4 | 9370 | - | - | 20.44 | 609.9k | 29.840 | 279.60 |

Operation | Clock Cycles | Latency @234.1 MHz (ns) |
---|---|---|
1 × Input Modular Addition | 5 | 21.36 |
3 × Input Modular Addition | 7 | 29.90 |
1 × Modular Multiplication | 26 | 111.07 |
4 × Modular Multiplication | 29 | 123.89 |
Modular Inverse | 6911 | 29,523.79 |
Ladder Setup | 131 | 559.63 |
One Step Ladder Update | 97 | 414.38 |
Ladder Finish | 7050 | 30,117.60 |
One ECC Scalar Multiplication | 32,272 | 137,865.98 |

Resource | Used | Available | Utilization % |
---|---|---|---|
LUT | 22,736 | 433,200 | 5.25 |
FF | 12,511 | 866,400 | 1.44 |
Slice | 6909 | 108,300 | 6.38 |
DSP48E1 | 136 | 3600 | 3.78 |
BRAM | 15 | 1470 | 1.02 |

Designs | Curve | Modulus Size (Bits) | Slices | DSP | BRAM | Max. Freq. (MHz) | Time (ms) |
---|---|---|---|---|---|---|---|
This work | Any | 192 | 7281 | 136 | 15 * | 204.2 | 0.119 |
This work | Any | 224 | 7281 | 136 | 15 * | 204.2 | 0.138 |
This work | Any | 256 | 7281 | 136 | 15 * | 204.2 | 0.158 |
Wu et al. [41] | NIST | 192 | 8411 | 32 | - | 310 | 0.296 |
Wu et al. [41] | NIST | 224 | 8411 | 32 | - | 310 | 0.389 |
Wu et al. [41] | NIST | 256 | 8411 | 32 | - | 310 | 0.526 |
Wu et al. [41] | NIST | 384 | 8411 | 32 | - | 310 | 1.070 |
Wu et al. [41] | NIST | 521 | 8411 | 32 | - | 310 | 1.860 |
Amiet et al. [40] | Any | 192 | 6816 (LUT) | 20 | - | 225 | 0.690 |
Amiet et al. [40] | Any | 256 | 6816 (LUT) | 20 | - | 225 | 1.490 |
Amiet et al. [40] | Any | 384 | 6816 (LUT) | 20 | - | 225 | 4.080 |
Amiet et al. [40] | Any | 521 | 6816 (LUT) | 20 | - | 225 | 9.700 |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Awaludin, A.M.; Larasati, H.T.; Kim, H. High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA. Sensors 2021, 21, 1451. https://doi.org/10.3390/s21041451