J. Inst. Eng. India Ser. B
https://doi.org/10.1007/s40031-019-00384-1

ORIGINAL CONTRIBUTION

High-Speed High-Throughput VLSI Architecture for RSA Montgomery Modular Multiplication with Efficient Format Conversion

Aashish Parihar1 • Sangeeta Nakhate1

Received: 24 May 2018 / Accepted: 16 January 2019
© The Institution of Engineers (India) 2019

Abstract Modular multiplication is a key operation in RSA cryptosystems. Modular multipliers can be realized using the Montgomery algorithm, and employing carry save adders in the Montgomery algorithm makes modular multiplication efficient. Montgomery modular multiplication can be carried out in two ways. In the first, all the operands are kept in carry save form. In the second, the input and output are kept in binary form and the intermediate operands are kept in carry save form, which requires an efficient format converter. This paper proposes a fast and high-throughput Montgomery modular multiplier which employs an efficient format conversion method. Format conversion is carried out through a format conversion unit which consists of a carry look-ahead unit and a multiplexer unit. In addition, this multiplier merges two iterations, which reduces the number of clock cycles significantly. Merging iterations requires integer multiples of the inputs, which are computed using the same format converter. Critical path delay of the multiplier is minimized by multiplying one of the inputs by four, which simplifies the necessary intermediate calculations. The total time required for one complete multiplication is significantly reduced because fewer clock cycles are required while the critical path delay stays optimized. Experimental results show that the proposed multiplier achieves significant speed and throughput improvement as compared to previous designs.
Aashish Parihar
parihar.aashish@gmail.com

Sangeeta Nakhate
sangeetanakhate@manit.ac.in

1 Department of Electronics and Communication, Maulana Azad National Institute of Technology, Bhopal, India

Keywords Carry save addition · VLSI · Modular exponentiation · Montgomery modular multiplier · Rivest, Shamir, and Adleman (RSA) cryptosystem

Introduction

RSA [1] is a widespread public key cryptography algorithm. Encryption and decryption in RSA involve modular exponentiation, which is efficiently carried out by a Montgomery modular multiplier (MM) [2]. The modulus in MM is generally kept at least 2048 bits in size for long-term security. However, a large modulus limits the performance of the multiplier. For an efficient RSA cryptosystem, a high-speed and high-throughput MM is required. Carry save adders (CSA) are an integral part of MM for efficient operation.

MM can be roughly classified into two categories. In the first category [3–7], input and output are kept in binary form and intermediate results are kept in carry save form. However, format conversion of the output of each modular multiplication accounts for additional clock cycles. The work in [6] reused a two-level CSA for format conversion. However, the two-level CSA requires n/2 clock cycles for format conversion in the worst case, where n denotes the number of input bits. Kuang et al. [7] proposed a one-level configurable CSA for MM with optimized critical path delay and a detect-and-skip mechanism to minimize the number of clock cycles. However, this multiplier also requires n/2 cycles for format conversion in the worst case.

In the second category [8–12], all the operands are kept in carry save form and conversion takes place only at the end of modular exponentiation. However, these multipliers require extra hardware for storing additional operands C. McIvor et al. [9] proposed two variants of this category. One of them is based on 5-to-2 CSA and the other is based on 4-to-2 CSA.
The latter multiplier has the benefit of one less input at the cost of one multiplexer. The work in [10] proposed a simple algorithm using 5-to-2 CSA, which modified one of the inputs to reduce critical path delay. The work in [12] employed a bypass mechanism to skip unnecessary iterations. However, bypassing of iterations depends on the input bit pattern. Parallel processing multipliers [13, 14], high radix algorithms [15–17] and systolic arrays [18, 19] improved the performance of MM. However, these techniques are likely to increase power, hardware and cost.

In this paper, we propose a fast and high-throughput Montgomery modular multiplier with an efficient format converter. The proposed multiplier combines two iterations while maintaining optimum critical path delay. Optimization of critical path delay is achieved by multiplying one of the operands by four. This modification simplifies and speeds up the calculations required for the next iteration. Consequently, the required number of clock cycles, and hence the total time required to complete one MM, is reduced significantly, and a higher throughput rate and speed can be achieved. As this algorithm belongs to the first category, format conversion is required at the end of the multiplication. Format conversion is carried out by a novel format conversion unit (FCU), which consists of a 64-bit carry look-ahead unit (CLU), a multiplexer unit (MU), a 32-bit register and CSA. Note that addition and format conversion are carried out by using the same CSA architecture. The proposed format converter requires only n/64 + 8 cycles, which is far fewer than the n/2 worst-case cycles of the format converters of [6] and [7].

The remainder of the paper is organized as follows. In ‘‘Proposed Algorithm’’ section, we propose a high-throughput and high-speed Montgomery modular multiplication algorithm. ‘‘Hardware Architecture’’ section provides the architecture of the proposed multiplier. The result is explained in ‘‘Experimental Results’’ section. Finally, we present concluding remarks in ‘‘Conclusion’’ section.

Proposed Algorithm

The Montgomery modular product S[n] of X, Y and modulus N (odd) can be obtained as S[n] = XY·2^(-n) mod N. Here, X, Y and N are n-bit numbers. S[n] can be obtained iteratively by defining S[i] as:

$$S[i] = \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} X_j 2^{j} \right) Y \pmod{N}. \tag{1}$$

Now, S[i + 2] can be obtained as:

$$S[i+2] = \frac{1}{2^{i+2}} \left( \sum_{j=0}^{i+1} X_j 2^{j} \right) Y \pmod{N} \tag{2}$$

$$S[i+2] = \frac{1}{4} \left[ \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} X_j 2^{j} \right) Y + (X_i + 2X_{i+1})\, Y \right] \bmod N \tag{3}$$

$$S[i+2] = \frac{1}{4} \left[ S[i] + kY \right] \bmod N. \tag{4}$$

Here, k = Xi + 2Xi+1 can vary from 0 to 3 depending on the values of Xi and Xi+1. S[i] + kY can be made divisible by 4 by adding an integer multiple of N, denoted pN. Here, p ranges from 0 to 3 and depends on the 2 LSBs of S[i] + kY and N. In the proposed algorithm, Y is modified to Ym by multiplying it by 4. Therefore, the 2 LSBs of the sum S[i] + kYm are equal to the 2 LSBs of S[i]. Also, N[0] is always 1 as N is always odd for RSA cryptosystems. Therefore, p depends only on N[1] and the 2 LSBs of S[i]. Let K and P represent kYm and pN, respectively, and let (S[i], C[i]) represent the carry save form of S[i]. Appropriate values of operands K and P are given in Table 1. The proposed algorithm is given in Algorithm 1.

Algorithm 1: Proposed Algorithm
Inputs: X, Y, N (n bit numbers)
Output: S[n]
1. Pre-compute Ym = 4Y, 3Ym and 3N;
2. S[0] = 0; C[0] = 0;
3. for i = 0:2:n+3 {
4.     Choose K and P according to Table 1.
5.     (S[i+2], C[i+2]) = (S[i] + C[i] + K + P)/4;
6. };
7. FCU(S[n+5], C[n+5]);
8. return S;

Table 1 Operands K and P
Xi+1 Xi | K | S[i]1:0 N[1] | P | S[i]1:0 N[1] | P
00 | 0 | 000 | 0 | 100 | {N, 0}
01 | Ym | 001 | 0 | 101 | {N, 0}
10 | {Ym, 0} | 010 | 3N | 110 | N
11 | 3Ym | 011 | N | 111 | 3N

The convergence range of the proposed algorithm is 0 ≤ S < 3N/2 + 3N/8 + ... < 2N. Therefore, an additional subtraction operation is required at the end for keeping S within N. Subtraction can be avoided by employing Walter’s notion [20], at the cost of one extra clock cycle. It is also necessary to divide the output by 4 to compensate for the pre-multiplication of the input by 4. Division by 4 is equivalent to 2 additional clock cycles. Hence, n/2 + 3 clock cycles are required for calculation of S[n] in carry save form. FCU is employed for format conversion and operand (3Y and 3N) pre-computation. Format conversion requires n/64 + 8 clock cycles and operand pre-computation requires 2(n/64 + 9) clock cycles. Therefore, a total of n/2 + 3n/64 + 29 clock cycles are required for one complete multiplication.

Hardware Architecture

Proposed Multiplier Architecture

Hardware architecture of the proposed multiplier is shown in Fig. 1. Two n-bit CSA adders, one n-bit 4-to-1 multiplexer (M1), four n-bit 2-to-1 multiplexers (M2, M3 and M4) and a format conversion unit (FCU) are the main components. Multiplexer M1 selects input operand K from [0, Ym, {Ym, 0}, 3Ym] based on the selection lines Xi and Xi+1. Multiplexers M2, M3 and M4 select P from [0, N, {N, 0}, 3N]. Register RX stores input X and is right-shifted by two bit positions after every negative edge of the clock to capture Xi and Xi+1. Operands Ym, 3Ym, N and 3N are stored in registers RYm, R3Ym, RN and R3N, respectively. S[i + 2] and C[i + 2] are obtained by adding S[i], C[i], K and P and are stored in registers ERS and ERC, respectively. Note that ERS and ERC are registers of extended bit length equal to n + 65. The additional length is required for format conversion. The FCU, shown in Fig. 2, is used for format conversion and pre-computation of 3Y and 3N.

Fig. 1 Hardware architecture of the proposed multiplier

Fig. 2 a Format conversion unit (FCU). b Multiplexer unit (MU)

Format Conversion Unit (FCU)

The format conversion unit (FCU) employs a CLU, a multiplexer unit (MU) and a 32-bit register (SEL) as shown in Fig. 2.
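Before turning to the FCU internals, the arithmetic that this datapath implements (Algorithm 1 with the operand selection of Table 1, plus the clock-cycle formula) can be sketched in software. This is a minimal value-level model, not the authors' implementation: the carry save pair (S[i], C[i]) is merged into one integer `acc`, the iteration count is assumed to equal the stated n/2 + 3 cycles (the loop bound printed in Algorithm 1 reads differently), and the names `P_TABLE`, `multiply_cycles`, `total_cycles` and `montgomery_radix4` are hypothetical. Under that assumption the returned value is congruent to X·Y·2^(-(n+4)) mod N.

```python
"""Value-level sketch of Algorithm 1 (radix-4 Montgomery step with Ym = 4Y)."""

# Multiplier p of Table 1, indexed by (2 LSBs of the running sum, N[1]).
P_TABLE = {
    (0b00, 0): 0, (0b00, 1): 0,
    (0b01, 0): 3, (0b01, 1): 1,
    (0b10, 0): 2, (0b10, 1): 2,
    (0b11, 0): 1, (0b11, 1): 3,
}


def multiply_cycles(n_bits: int) -> int:
    """Cycles to obtain S[n] in carry save form, as stated in the text: n/2 + 3."""
    return n_bits // 2 + 3


def total_cycles(n_bits: int) -> int:
    """Multiplication + format conversion + pre-computation of two operands."""
    format_conversion = n_bits // 64 + 8
    precompute = 2 * (n_bits // 64 + 9)
    return multiply_cycles(n_bits) + format_conversion + precompute


def montgomery_radix4(x: int, y: int, n_mod: int, n_bits: int) -> tuple:
    """Return (result, iterations) with result * 4**iterations == 4*x*y (mod n_mod)."""
    if n_mod % 2 == 0:
        raise ValueError("modulus must be odd")
    ym = 4 * y                            # pre-multiplied operand Ym
    acc = 0                               # stands for S[i] + C[i]
    iterations = multiply_cycles(n_bits)  # assumption: one iteration per cycle
    n1 = (n_mod >> 1) & 1                 # bit N[1]
    for t in range(iterations):
        k = (x >> (2 * t)) & 0b11         # k = 2*X[i+1] + X[i]
        p = P_TABLE[(acc & 0b11, n1)]     # depends only on 2 LSBs of the sum and N[1]
        total = acc + k * ym + p * n_mod  # S[i] + C[i] + K + P
        assert total % 4 == 0             # Table 1 makes the division exact
        acc = total // 4
    return acc, iterations


if __name__ == "__main__":
    s, t = montgomery_radix4(0x1234, 0x5678, 0xC35B, 16)
    print(hex(s), t, total_cycles(2048))
```

Because K = k·Ym always has two zero LSBs, the lookup for p never has to wait for K, which is the simplification the pre-multiplication by 4 is meant to buy. For n = 2048 the cycle formula evaluates to 1027 + 40 + 82 = 1149, the clock-cycle figure reported for the proposed design in Table 4.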
Architecture of the CLU [21] is shown in Fig. 3. The CLU receives the 64 LSBs of the output in carry save form and generates carries C16, C32, C48 and C64 in one clock cycle. During format conversion, registers ERS and ERC are circularly right-shifted by 64 bits, and the extended part stores the 64 LSBs of S[n] and C[n] with one gap to avoid any carry. C64 becomes C0 for the next cycle. This cycle continues until all n bits have been scanned. The MU selects the proper carry for maintaining correctness of the system. During the multiplication process, the MU selects the normal carry generated from the CSA. During format conversion, the MU selects look-ahead carry bits (C16, C32 and C48) for circularly shifted bits, 0 for all scanned bits and the normal CSA carry for un-scanned bits. A 32-bit select (SEL) register is used as selection lines for the MU. The select register is initialized with all ones to select the normal CSA carry outputs. During format conversion, this register is right-shifted by one bit to select the proper carry as shown in Fig. 2b. After n/64 cycles, all carry bits (multiples of 16, i.e., Cn16) are propagated and are set to zero. Now, only 8 additional cycles are required to generate the output in binary form.

Fig. 3 a 64-bit CLU. b 4-bit CLU. c 16-bit CLU. d 2-bit CLU

Experimental Results

This section first analyzes and compares critical path delay and area of the proposed multiplier with previous designs according to the information given in Table 2. Note that Table 2 denotes critical path delay and area of a cell by Tcell and Acell, respectively. Finally, the proposed multiplier is implemented on a Nexys 4 DDR XC7A100T FPGA using Vivado 2016.2 and compared with previous multipliers.

Analysis of Critical Path Delay and Area

Table 2 Normalized area and delay of standard cells [7]
Cell | FA | REG | 2-Input NAND | 2-Input NOR | 2-Input AND | 2-Input XOR | 2-to-1 MUX | 4-to-1 MUX
Area ratio | 1.00 | 0.88 | 0.16 | 0.16 | 0.20 | 0.32 | 0.36 | 0.96
Delay ratio | 1.00 | – | 0.12 | 0.16 | 0.34 | 0.34 | 0.45 | 0.71

Critical path delay can be analyzed as follows. Register RX is negative edge-triggered; therefore, the required input K is propagated through M1 during the negative half of the clock cycle and is available at the positive edge of every clock cycle. Input P propagates in parallel with CSA1. Registers ERS and ERC are circularly shifted by either 2 or 64 bits using an n-bit 2-to-1 multiplexer. Therefore, a delay of TMUX2 is required to select and assign the proper intermediate output, and the maximum delay to generate S[i + 2] and C[i + 2] is 2TFA + TMUX2, i.e., 2.45TFA. The FCU is employed for pre-computation and format conversion of the final output. The propagation delay of the FCU is 4T4 bit CLU + T2 bit CLU = 2.52TFA. Hence, the critical path delay of the proposed multiplier is 2.52TFA.

On the other hand, the proposed multiplier consists of two n-bit CSA adders, seven registers, one n-bit 4-to-1 multiplexer (MUX4), four n-bit 2-to-1 multiplexers and the FCU. Therefore, the approximate hardware complexity of the proposed multiplier can be expressed as 2nAFA + 7nAREG + nAMUX4 + 4nAMUX2 + AFCU, i.e., approximately equal to 11nAFA. Table 3 compares the area (Area), critical path delay (Delay), number of clock cycles (Clock Cycles), total time required for one complete multiplication (Total Time) and area–time product (ATP) of the previous multipliers with the proposed multiplier. Note that the total time required for multiplication is calculated by multiplying clock cycles and critical path delay.

Table 3 Analysis of area, delay, clock cycles, total time and area–time product of various multipliers
Multiplier | Area | Area ratio | Critical path delay | Delay ratio | Clock cycles | Total time (nTFA) | ATP (nTFA × nAFA)
MM_CCSA [7] | nAFA + 6nAREG + 3nANAND2 + 2nAMUX4 + 5nAMUX2 | 10 nAFA | 2TMUX2 + TXOR2 + TXOR3 | 2.17TFA | 1.25n | 2.71 | 27.1
MM_CSA52 [9] | 3nAFA + 7nAREG + 3nANAND2 | 9.76 nAFA | 3TFA + TXOR2 + TAND2 | 4.12TFA | n | 4.12 | 40.21
MM_CSA42 [9] | 2nAFA + 9nAREG + 2nAMUX4 | 11.84 nAFA | 2TFA + 2TXOR2 + TAND2 + TMUX4 | 3.73TFA | n + 1 | 3.73 | 44.16
MMM42 [12] | 2nAFA + 9nAREG + 2nAMUX4 + 2nAMUX2 | 12.56 nAFA | 2TFA + TMUX4 | 2.71TFA | 0.75n | 2.03 | 25.49
Proposed multiplier | 2nAFA + 7nAREG + nAMUX4 + 3nAMUX2 + ACLU | 11 nAFA | 4T4 bit CLU + T2 bit CLU | 2.52TFA | n/2 + 3n/64 + 29 | ≈1.4 | 17.76

Implementation Result

In this section, we implement the proposed multiplier on a Nexys 4 DDR XC7A100T FPGA using Vivado 2016.2 and compare it with previous multipliers. FPGA implementation of the proposed multiplier is shown in Fig. 4. The implementation result is shown in Table 4. Table 4 includes area, critical path delay, number of clock cycles required for one MM, total time required for one MM, throughput (the bit length divided by the product of clock cycles and delay) and area–time product (ATP). Note that the clock cycles of MMM42 and MM_CCSA are the average of best case and worst case. From the experimental results, we can conclude that the proposed Montgomery modular multiplier achieves the highest throughput rate and the smallest execution time for one complete multiplication.

Fig. 4 FPGA implementation of the proposed multiplier

Table 4 Implementation result with 2048 key bit size
Multiplier | Area (LUT + REG) | Delay (ns) | Clock cycles | Total time (µs) | Throughput (Mbps) | ATP (LUT + REG) × (10^3 µs)
MM_CCSA [7] | 28,014 | 4.39 | 2560 | 11.24 | 182.20 | 314.8
MM_CSA52 [9] | 29,287 | 8.20 | 2048 | 16.79 | 121.98 | 491.7
MM_CSA42 [9] | 34,542 | 7.46 | 2049 | 15.28 | 134.03 | 527.8
MMM42 [12] | 37,680 | 5.17 | 1536 | 7.94 | 257.93 | 299.1
Proposed work | 30,815 | 4.92 | 1149 | 5.65 | 362.47 | 174.1

Conclusion

Experimental results indicate that the proposed multiplier needs significantly fewer clock cycles as compared to previous multipliers. The proposed multiplier is very fast as it takes the least time to execute one complete MM.
The comparatively small number of clock cycles and the high speed of the multiplier result in a very high throughput rate. The proposed multiplier needs more area because of the extra hardware required for format conversion. However, the area of the proposed multiplier is comparable to that of other multipliers. Compared with MM_CCSA, the proposed multiplier requires about 55.1% fewer clock cycles (1149 versus 2560) and about 49.7% less execution time (5.65 µs versus 11.24 µs) to complete one MM, which results in significant throughput enhancement. In future, we will try to reduce the hardware requirement and skip an iteration when both intermediate operands K and P are equal to zero.

Acknowledgement Authors are thankful to the project ‘‘Special Manpower Development Program for Chip to System Design (SMDPC2SD)’’ sponsored by Ministry of Electronics and Information Technology (MeitY), Government of India, for providing technical facility.

References

1. R.L. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signature and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)
2. P.L. Montgomery, Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
3. Y.S. Kim, W.S. Kang, J.R. Choi, Implementation of 1024-bit modular processor for RSA cryptosystem, in Proceedings of Second IEEE Asia Pacific Conference on ASICs (2000), pp. 187–190
4. V. Bunimov, M. Schimmler, B. Tolg, A complexity-effective version of Montgomery’s algorithm, in Proceedings of the Workshop on Complexity Effects Designs (2002), pp. 1–7
5. Z.B. Hu, R.M. A. Shboul, V.P. Shirochin, An efficient architecture of 1024-bits Cryptoprocessor for RSA cryptosystem based on modified Montgomery’s algorithm, in Proceedings of the Fourth IEEE Workshop on Intelligent Data Acquisition and Advanced Computing Systems (2007), pp. 643–646
6. Y.-Y. Zhang, Z. Li, L. Yang, S.-W. Zhang, An efficient CSA architecture for Montgomery modular multiplication. Microprocess. Microsyst. 31(7), 456–459 (2007)
7. S.-R. Kuang, K.-Y. Wu, R.-Y. Lu, Low-cost high-performance VLSI architecture for Montgomery modular multiplications. IEEE Trans. Very Large Scale Integr. Syst. 24(2), 434–443 (2016)
8. K. Manochehri, S. Pourmozafari, Fast Montgomery modular multiplication by pipelined CSA architecture, in Proceedings of the IEEE International Conference on Microelectronics (2004), pp. 144–147
9. C. McIvor, M. McLoone, J.V. McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques. IEE Proc. Comput. Digit. Technol. 151(6), 402–408 (2004)
10. K. Manochehri, S. Pourmozafari, Modified radix-2 Montgomery modular multiplication to make it faster and simpler. Proc. IEEE Int. Conf. Inf. Technol. 1, 598–602 (2005)
11. M.-D. Shieh, J.-H. Chen, H.-H. Wu, W.-C. Lin, A new modular exponentiation architecture for efficient design of RSA cryptosystem. IEEE Trans. Very Large Scale Integr. Syst. 16(9), 1151–1161 (2008)
12. S.-R. Kuang, J.-P. Wang, K.-C. Chang, H.-W. Hsu, Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems. IEEE Trans. Very Large Scale Integr. Syst. 21(11), 1999–2009 (2013)
13. J.C. Neto, A.F. Tenca, W.V. Ruggiero, A parallel k-partition method to perform Montgomery multiplication, in Proceedings of the IEEE International Conference on Application-Specific Systems, Architecture Processors (2011), pp. 251–254
14. J. Han, S. Wang, W. Huang, Z. Yu, X. Zeng, Parallelization of radix-2 Montgomery multiplication on multicore platform. IEEE Trans. Very Large Scale Integr. Syst. 21(12), 2325–2330 (2013)
15. G. Sassaw, C.J. Jimenez, M. Valencia, High radix implementation of Montgomery multipliers with CSA, in Proceedings of the International Conference on Microelectronic (2010), pp. 315–318
16. A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on high-radix Montgomery multipliers. IEEE Trans. Very Large Scale Integr. Syst. 19(7), 1136–1146 (2011)
17. S.-H. Wang, W.-C. Lin, J.-H. Ye, M.-D. Shieh, Fast scalable radix-4 Montgomery modular multiplier, in Proceedings of IEEE International Symposium on Circuits and Systems (2012), pp. 3049–3052
18. F. Gang, Design of modular multiplier based on improved Montgomery algorithm and systolic array. Proc First Int. MultiSymp. Comput. Comput. Sci. 2, 356–359 (2006)
19. G. Perin, D.G. Mesquita, F.L. Herrmann, J.B. Martins, Montgomery modular multiplication on reconfigurable hardware: fully systolic array vs parallel implementation, in Proceedings of the 6th Southern Programmable Logic Conference (2010), pp. 61–66
20. C.D. Walter, Montgomery exponentiation needs no final subtractions. Electron Lett. 35(21), 1831–1832 (1999)
21. F. Vahid, Digital Design (Wiley, London, 2006), pp. 296–316

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.