on FPGAs K. Ja rvinen, M. Tommiska and J. Skytta Abstract: The authors present a comparative survey of private-key cryptographic algorithm implementations on eld programmable gate arrays (FPGAs). The performance and exibility of FPGAs make them almost ideal implementation platforms for cryptographic algorithms, and therefore the FPGA-based implementation of cryptographic algorithms has been widely studied during the past few years. However, a complete analysis of published implementations has not been presented previously. The authors analyse FPGA-based implementations of certain widely used cryptographic algorithms in terms of speed, area and implementation techniques. The algorithms studied in this article include the private-key cryptographic algorithms advanced encryption standard and international data encryption algorithm and certain hash algorithms. These algorithm implementations provide a good overview of the eld of private-key cryptographic algorithm implementation. 1 Introduction This article presents a thorough study of the state of private-key cryptographic algorithm implementation on eld programmable gate arrays (FPGAs). As cryptographic algorithms become more widely used, the need for high-speed implementations of the algo- rithms increases. Software-based implementations of cryptographic algorithms fall short in performance in many applications, e.g. on heavily loaded servers. Therefore, an obvious need for high-speed implementa- tions exists. Reprogrammable hardware is almost ideal for cryp- tographic implementations because high speed can be achieved without signicant reduction in exibility. Flexibility, meaning that the design can be easily changed or modied, is of especially great importance in cryptographic implementations for the following reasons. First, a cryptographic algorithm can be con- sidered secure only until proven otherwise. If a severe aw in an algorithm is found, the algorithm must be replaced with a more secure one. Second, in many applications, a large variety of different algorithms are in use, and, therefore, it must be easy to change from one algorithm to another. This article concentrates on implementations of the advanced encryption standard (AES), the international data encryption algorithm (IDEA) and certain hash algorithms. These algorithms represent both private-key cryptographic algorithms and hash algorithms, thus giving a good overview of the state of cryptographic algorithm implementation. All these algorithms have been implemented by the authors of this article in the Signal Processing Laboratory at Helsinki University of Technology, and these designs are here referred to as the SIG designs, e.g. SIG-AES. Although FPGA-based cryptographic algorithm implementation has been widely studied during the past few years, a thorough comparative study of published implementations has not been presented, at least to the authors knowledge. The article by Wollinger et al. [1] included a review of implementations, but otherwise the article concentrated more on security questions of FPGAs as implementation platforms. In this article, implementations of cryptographic algo- rithms are compared in terms of speed, area and implementation techniques. Finally, certain conclusions on cryptographic algorithm implementation on FPGAs are presented. 2 Private-key cryptographic algorithms Implementation of private-key cryptographic algo- rithms on reprogrammable hardware has been widely studied for several years. Block ciphers are well suited to hardware implementation, because parallelisation, unrolling and pipelining can usually be efciently exploited. Throughput can be increased by pipelining an unrolled design and then calculating a different encryp- tion in each pipelined stage. Pipelining, unfortunately, restricts the use of feedback cipher modes which require the value of the previous ciphertext in the generation of the next one, e.g. cipher block chaining mode [2]. The data encryption standard (DES) and its variant 3DES were the most popular block ciphers for decades. Some recent papers on DES implementation have been published, e.g. [3] and [4]. However DES is currently being replaced by AES, and the role of DES will be marginal in the future. Thus, DES is not considered further. IEE 2005 IEE Proceedings online no. 20055004 doi:10.1049/ip-ifs: 20055004 Paper received 23 June 2005 The authors are with the Signal Processing Laboratory, Helsinki University of Technology, Otakaari 5A, 02150 Finland E-mail: kimmo.jarvinen@hut.fi IEE Proc. Inf. Secur. 3
2.1 Advanced encryption standard (AES) AES is a NIST (National Institute of Standards and Technology) standard introduced in 2001 [5]. AES was developed by two Belgians, Vincent Rijmen and Joan Daemen, and it was originally named Rijndael. AES processes a 128-bit data block with a key of either 128, 192 or 256 bits. The different versions of AES are here referred to as AES-128, AES-192 and AES-256. The 128-bit data block is represented as 4 4 rectangular array of bytes called the State. Depending on the key size, AES consists of either 10, 12 or 14 rounds which include four transformations: SubBytes, ShiftRows, MixColumns and AddRoundKey (the last round does not include MixColumns), where the rows and columns refer to the rows and columns of the State. Keys for every round are derived from the original cipher key using the KeyExpansion routine [5]. The SubBytes operation is the most crucial for both the speed and area requirements of an AES implementa- tion [6]. It operates independently on each byte of the State, and it consists of nding a multiplicative inverse in the Galois eld GF(2 8 ) followed by an afne transformation [5]. Traditionally, these operations are combined and implemented as a single 256 8-bit look- up table (LUT) called the S-box. The inverse transfor- mation of SubBytes, called InvSubBytes, is utilised in the AES decryption. It consists of an inverse afne transformation followed by an inversion in GF(2 8 ) [5]. The rst FPGA-based implementations of AES (at that time known only as Rijndael) were published during the selection process of AES. In the last phase of the selection, there were ve nalist algorithms: Mars, RC6, Rijndael, Serpent and Twosh. Because all the algorithms were considered secure, hardware efciency was given great importance in selecting Rijndael as the winning algorithm [7]. During the selection process Dandalis et al. [8], Elbirt et al. [9], Fischer [10], Gaj and Chodowiec [11] and Mroczkowski [12] published FPGA implementations on both Altera and Xilinx devices. Their studies concluded that Rijndael and Serpent had the highest throughputs [8, 9, 11], while Twosh and RC6 provided compact implementations with medium speed [11]. Mars clearly had the worst hardware characteristics [11]. Since the selection of Rijndael as AES, an enormous number of FPGA-based implementations have been published. Certain trends in these publications are considered next. Several publications have presented studies of unrol- ling and pipelining, e.g. [6, 1319]. The previously mentioned AES nalist algorithm implementations also considered unrolling and pipelining [9, 11]. Very high throughput can be achieved by pipelining unrolled rounds of the algorithm, but, as mentioned, pipelining cannot be efciently used in feedback modes. To the authors knowledge, the fastest published FPGA-based implementation of AES was presented by Zambreno et al. in [19]. They used aggressive pipelining and achieved throughput of 23.57 Gbps on a Xilinx Virtex-II XC2V4000. SubBytes can be implemented as an S-box (LUT) which includes precalculated values of the transforma- tion. One 256 8-bit S-box is required for each byte of the State, and therefore 16 parallel S-boxes are required if SubBytes is performed for the entire State at once. Thus, the total number of S-boxes, without KeyExpan- sion, is 16 times the number of unrolled rounds, and a fully unrolled AES-128 (10 rounds) requires 160 S-boxes. KeyExpansion requires four additional S- boxes per round, and therefore a total number of 200 S- boxes are required in a fully unrolled key agile AES-128 implementation. If S-boxes are implemented on Xilinx FPGAs using BlockRAMs, 100 BlockRAMs are needed, because one dual-port BlockRAM can imple- ment two S-boxes. BlockRAM-based S-boxes have been used in many publications, e.g. [13, 15, 1723]. Different S-boxes are used in SubBytes, and InvSub- Bytes, which makes the combination of encryption and decryption difcult without doubling the BlockRAM need. McLoone and McCanny presented in [20] AES- 128, AES-192 and AES-256 implementations combining encryption and decryption. They introduced two ROMs including S-box values for encryption and decryption. The BlockRAMs implementing SubBytes and InvSub- Bytes were programmed using values from ROMs every time encryption was changed to decryption or vice versa [20]. Another solution was presented by Rodr quez- Henr quez et al. in [16]. They combined SubBytes and InvSubBytes so that the inversion in GF(2 8 ) (utilised by both) was implemented in BlockRAMs, but the afne transformations were implemented with logic. This allowed the same BlockRAMs to be utilised in encryp- tion and decryption. In addition to (Inv)SubBytes, the (Inv)MixColumns transformation can also be performed using the LUT approach. An LUT combining (Inv)SubBytes and (Inv)MixColumns is called the T-box. Fischer and Drutarovsky studied implementation techniques based on S-boxes and T-boxes on an Altera FPGA in [24]. They concluded that slightly faster performance was attained with the T-box approach, but the memory need increased [24]. In [25], McLoone and McCanny pre- sented an AES-128 encryption implementation utilising T-boxes. Their implementation had high throughput and occupied only a small number of slices, but it required a very large number of BlockRAMs. A device with a large amount of embedded memory, e.g. the Virtex-E Extended Memory [26], is therefore required. In the above implementations, SubBytes was imple- mented as an LUT. Another approach is to calculate the multiplicative inverse and the afne transformation using combinatorial logic. Inversion in GF(2 8 ) can be reduced into an inversion in GF(2 4 ) or in GF(2 2 ) accompanied by Galois eld additions and multiplica- tions. That is, the problem is mapped from GF(2 8 ) to another representation of the eld, which in these cases is either GF((2 4 ) 2 ) or GF(((2 2 ) 2 ) 2 ). This approach is here referred to as combinatorial implementation of Sub- Bytes, and certain implementations using such methods are considered next. In [27], we presented a design called SIG-AES which implements SubBytes combinatorially as suggested in [28]. Because this approach requires mappings from GF(2 8 ) to GF((2 4 ) 2 ) and vice versa, other transforma- tions were also mapped to GF((2 4 ) 2 ) in order to reduce area and latency. Hodjat and Verbauwhede explored the optimal number of pipelined stages in the combinato- rially implemented SubBytes in [15], and they compared combinatorial implementation with implementation using BlockRAMs. Zhang and Parhi presented a careful analysis of the combinatorial implementation of Sub- Bytes and introduced highly optimised implementations, one of which exceeds 20 Gbps on a Virtex-E FPGA [29]. The high efciency was attained through detailed analysis and careful implementation using combinator- ial SubBytes in GF((2 4 ) 2 ). 4 IEE Proc. Inf. Secur.
The largest benet of the combinatorial implementa- tion is that the SubBytes can be pipelined and thus higher throughput can be attained. This does, however, increase the latency of the implementation. The slice requirements also increase compared with BlockRAM- based implementations, because SubBytes is implemen- ted with logic. In many applications, it is more important to minimise area than to maximise throughput. Therefore, several implementations with small logic requirements have been published. Pramstaller and Wolkerstorfer presented a compact implementation of AES encryption and decryption with all key lengths using a novel State representation, which solves the problem of accessing both rows and columns of the State [30]. A very compact implementation was presented by Chodowiec and Gaj in [31]. They efciently exploited the structure of FPGA and were able to t AES-128 encryption and decryption into 222 slices and three BlockRAMs on a low-cost Xilinx Spartan-II XC2S30-5/6. The design achieved a throughput of 166 Mbps on an XC2S30-6 [31]. At least to the authors knowledge, the most compact AES implementation published in the literature so far was presented by Rouvroy et al. in [32]. They were able to t AES-128 with KeyExpansion into only 163 slices and three BlockRAMs on a Xilinx Spartan-3 XC3S50-4 and achieved a throughput of 208 Mbps. They also imple- mented the same design on a Virtex-II, and it has been included in the comparison below. The key to lower area consumption compared with [31] was the combination of SubBytes and MixColumns transformations [32]. The throughputs of the designs are not comparable because different FPGAs were used. Even Gbps-level through- puts can be achieved with small logic requirements as was shown by Standaert et al. in [33], where a throughput of 2.085 Gbps was achieved with only 1769 slices. Other compact designs targeting resource-limited FPGAs include [34] and [35]. Also many of the above- mentioned papers include implementations requiring a low area. The comparison of different AES implementations is hard for many reasons. First, the large variety of different target devices makes a fair comparison difcult. Second, many authors do not specify their devices well enough to ensure easy comparison, e.g. the size or the speed grade of the device has not been provided. Third, comparison of area requirements is difcult because both slices and embedded memory, i.e. BlockRAMs in the Xilinx devices, are used. Xilinx Virtex-family FPGAs (i.e. the Virtex [36], Virtex-E [37], Virtex-E Extended Memory (Virtex- EM)[26] and Virtex-II [38]) are clearly the most used implementation platforms for the published designs. Therefore, this comparison concentrates on designs implemented on these devices. Performances on differ- ent devices should not be compared, because the device greatly determines the performance of an implementa- tion. Devices are therefore clearly differentiated in the tables and gures in this paper. A summary of open-literature FPGA-based AES implementations on Xilinx Virtex-family devices is presented in Table 1, and it includes implementations published in [6, 8, 9, 11, 1323, 25, 27, 29, 30, 32, 33, 39, 40]. In order to compare the area requirements of BlockRAM-based implementations with those of imple- mentations which use only slices, a method introduced in [17] is used. Because a dual-port 256 8-bit Block- RAM can be replaced by distributed memory consisting of 256 LUTs, one BlockRAM can be replaced by 128 slices [17]. Thus, the area value in Table 1 was calculated using the following formula: area slices 128 BlockRAMs 1 The performancearea relationship is studied using two different metrics. The rst one is the traditional throughput per slice (TPS) value [9]. The other, which also takes into account the BlockRAM utilisation, is called the throughput per area (TPA) value and is calculated using the area value obtained using (1). TPA offers a better impression of the performancearea relationship than TPS, which neglects the usage of BlockRAMs. An extreme example of this is the implementation presented in [25], which attains an extremely high TPS value but nonetheless requires a large target device because of the large number of BlockRAMs. However, in such extreme cases, the TPA method yields estimates that are too pessimistic and, therefore, TPA should be used together with TPS to ensure fair comparison. Throughput-slice and throughput-area scatters are presented in Figs 1 and 2, respectively. There exists only little correlation between slice usage and throughput in Fig. 1. In Fig. 2, however, there is a signicantly higher correlation, which again validates the use of TPA. As stated earlier, the fastest reported AES implemen- tation achieves a throughput of 23.57 Gbps with 16 938 slices on a Virtex-II XC2V4000 [19]. Although it is the fastest implementation, it is not the most efcient of the high-throughput implementations, if TPS and TPA are considered as the efciency metrics. As can be seen from Fig. 2, implementations by Hodjat and Verbauwhede [15] and Zhang and Parhi [29] achieve almost the same level of throughput with fewer logic resources. Con- sidering the high TPS and TPA values as well as the slower Virtex-E device compared with the Virtex-II devices used in [15] and [19], Zhangs design can be considered the most efcient fully unrolled and pipe- lined AES-128 implementation published so far. Combinatorial implementation of SubBytes results in higher TPA values than LUT-based implementation. This is due to the fact that SubBytes can be pipelined, and therefore very high throughput can be achieved. Combinatorial SubBytes also results in moderate area requirements. This can be seen in Fig. 2, where the combinatorial implementations, i.e. [15, 27, 29], are situated in the upper left corner. If embedded memory can be used, a considerable reduction in slice requirements can be achieved by using BlockRAM-based S-boxes. Also the latency of S-boxes is shorter than that of the combinatorial SubBytes. The T-box approach seems infeasible if TPA is considered, but a very high TPS can be achieved. T-boxes are therefore very inviting if the slice requirement needs to be minimised in a high-throughput design. The approach presentedin[25], however, requires aVirtex-EMXCV812E FPGA because, with the exception of the new Virtex-4 FPGAs [41], no other device in the Xilinx Virtex family contains enough BlockRAMs (244) [26, 3638]. Many implementations with an area in the range 16002000 have been published [1719, 33, 40]. These compact implementations achieve relatively high throughputs of >1 Gbps on Virtex-E and Virtex-II FPGAs. The implementation by Standaert et al. [33] has the highest TPA and throughput among these implementations, as it achieves throughput of 2.085 Gbps with only 1769 slices and no BlockRAMs IEE Proc. Inf. Secur. 5
on a Virtex-E XCV3200E-8, and thus has a TPA (and TPS) of 1.179 Mbps/area. If even smaller area consumption is required, one most tolerate a large slowdown in throughput. The smallest implementations have throughputs measured in hundreds of Mbps; e.g. Rouvroy et al. achieved 358 Mbps with only 146 slices and three BlockRAMs (area 530) on a Virtex-II XC2V40-6 [32]. This survey has shown that various different methods have been presented to implement AES. It is impossible to point out the absolutely best method, because all methods have their advantages and disadvantages. As a conclusions to the AES study it is stated that AES can be efciently implemented on FPGAs for applications with various requirements. Both very high performance and low area requirements can be efciently achieved using the methods presented in the literature. 2.2 International Data Encryption Algorithm IDEA was introduced by Lai and Massay in 1990 [42] and modied the following year [43]. IDEA is considered highly secure, and no published attack Table 1: AES implementations on Xilinx Virtex-family FPGAs Authors Key Device Slices BRAM Area Throughput (Gbps) TPS (Mbps/slice) TPA (Mbps/area) Chodowiec and coworkers [13, 14] [Cho] Virtex 1000-6 12 600 80 22 840 12.16 0.965 0.532 Chodowiec et al. [13] [Cho] Virtex 1000-6 2 057 8 3 081 1.265 0.615 0.411 Chodowiec and coworkers [13, 14] [Cho] Virtex 1000-6 2 507 0 2 507 0.414 0.165 0.165 Dandalis et al. [8] [Dan] Virtex -6 5 673 0 5 673 0.353 0.062 0.062 Elbirt et al. [6, 9] [Elb] Virtex 1000-4 10 992 0 10 992 1.938 0.176 0.176 Elbirt et al. [6, 9] [Elb] Virtex 1000-4 4 871 0 4 871 0.949 0.195 0.195 Gaj and Chodowiec [11] [Gaj] Virtex 1000-6 2 902 0 2 902 0.332 0.114 0.114 Hodjat and Verbauwhede [15] [Hod] Virtex-II VP20-7 9 446 0 9 446 21.64 2.291 2.291 Hodjat and Verbauwhede [15] [Hod] Virtex-II VP20-7 5 177 84 15 929 21.54 4.161 1.352 Ja rvinen et al. [27] [Ja r] Virtex-II 2000-5 10 750 0 10 750 17.8 1.656 1.656 Ja rvinen et al. [27] [Ja r] Virtex-E 1000-8 11 719 0 11 719 16.54 1.411 1.411 Labbe and Pe rez [39] [Lab] Virtex 1000-4 2 151 4 2 663 0.394 0.183 0.148 Labbe and Pe rez [39] [Lab] Virtex 1000-4 3 543 4 4 055 0.796 0.225 0.196 Labbe and Pe rez [39] [Lab] Virtex 1000-4 8 767 4 9 279 1.911 0.218 0.206 McLoone and McCanny [20] [ML1] Virtex-E 3200-8 2 222 100 15 022 6.956 3.131 0.463 McLoone and McCanny [25] [ML2] Virtex-EM 812-8 2 000 244 33 232 12.02 6.010 0.362 McLoone and McCanny [21] [ML3] Virtex-EM 812-8 2 679 82 13 175 6.956 2.596 0.528 Pramstaller and Wolkerstrofer [30] [Pra] Virtex-E 1000-8 1 125 0 1 125 0.215 0.191 0.191 Rodr quez-H et al. [16] [Rod] Virtex-E 2600 5 677 80 15 917 4.121 0.726 0.259 Rouvroy et al. [32] [Rou] Virtex-II 40-6 146 3 530 0.358 2.452 0.675 Saggese et al. [17] [Sag] Virtex-E 2000-8 2 778 100 15 578 8.9 3.204 0.571 Saggese et al. [17] [Sag] Virtex-E 2000-8 446 10 1 726 1 2.242 0.579 Saggese et al. [17] [Sag] Virtex-E 2000-8 5 810 100 18 610 20.3 3.494 1.091 Saggese et al. [17] [Sag] Virtex-E 2000-8 648 10 1 928 1.82 2.809 0.944 Saqib et al. [22] [Saq] Virtex-EM 812 2 744 0 2 744 0.259 0.094 0.094 Saqib et al. [22] [Saq] Virtex-EM 812 2 136 100 14 936 2.868 1.343 0.192 Standaert et al. [18] [St1] Virtex 1000-6 2 257 0 2 257 1.563 0.693 0.693 Standaert et al. [18] [St1] Virtex-E 3200-8 2 784 100 15 584 11.776 4.230 0.756 Standaert et al. [18] [St1] Virtex-E 3200-8 542 10 1 822 1.45 2.675 0.796 Standaert et al. [33] [St2] Virtex-E 3200-8 1 769 0 1 769 2.085 1.179 1.179 Standaert et al. [33] [St2] Virtex-E 3200-8 15 112 0 15 112 18.560 1.228 1.228 Wang and Ni [40] [Wan] Virtex-E 1000-8 1 857 0 1 857 1.604 0.864 0.864 Weaver and Wawrzynek [23] [Wea] Virtex-E 600-8 770 10 2 050 1.75 2.273 0.854 Zambreno et al. [19] [Zam] Virtex-II 4000 1 254 20 3 814 4.44 3.541 1.164 Zambreno et al. [19] [Zam] Virtex-II 4000 16 938 0 16 938 23.57 1.392 1.392 Zambreno et al. [19] [Zam] Virtex-II 4000 2 206 50 8 606 10.88 4.932 1.264 Zambreno et al. [19] [Zam] Virtex-II 4000 3 766 100 16 566 22.93 6.089 1.384 Zambreno et al. [19] [Zam] Virtex-II 4000 387 10 1 667 1.41 3.643 0.846 Zhang and Parhi [29] [Zha] Virtex 1000-6 11 014 0 11 014 16.032 1.456 1.456 Zhang and Parhi [29] [Zha] Virtex 800-6 9 406 0 9 406 9.184 0.976 0.976 Zhang and Parhi [29] [Zha] Virtex-E 1000-8 11 022 0 11 022 21.556 1.956 1.956 Zhang and Parhi [29] [Zha] Virtex-EM 812-8 9 406 0 9 406 11.965 1.272 1.272 The authors have selected the most relevant implementations (in their opinion) from those publications which include several different implementations. Keys are used in Figs 1 and 2. 6 IEE Proc. Inf. Secur.
0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 5 10 15 20 25 slices t h r o u g h p u t
( G b p s ) [Dan] [Elb] [Elb] [Lab] [Lab] [Lab] [Cho] [Cho] [Gaj] [St1] [Zha] [Zha] [Jr] [ML1] [ML2] [ML3] [Pra] [Rod] [Sag] [Sag] [Sag] [Sag] [Saq] [Saq] [St1] [St1] [St2] [St2] [Wan] [Wea] [Zha] [Zha] [Hod] [Hod] [Jr] [Rou] [Zam] [Zam] [Zam] [Zam] [Zam] Virtex 4 Virtex 6 Virtex E8 Virtex II Fig. 1 Throughputslice chart of FPGA-based AES implementations 0 5000 10000 15000 20000 25000 30000 35000 0 5 10 15 20 25 area (slices + BlockRAMs) t h r o u g h p u t
( G b p s ) [Dan] [Elb] [Elb] [Lab] [Lab] [Cho] [Cho] [Gaj] [St1] [Zha] [Zha] [Jr] [ML1] [ML2] [ML3] [Pra] [Rod] [Sag] [Sag] [Sag] [Sag] [Saq] [Saq] [St1] [St1] [St2] [St2] [Wan] [Wea] [Zha] [Zha] [Hod] [Hod] [Jr] [Rou] [Zam] [Zam] [Zam] [Zam] [Zam] Virtex 4 Virtex 6 Virtex E8 Virtex II [Lab] Fig. 2 Throughputarea chart of FPGA-based AES implementations IEE Proc. Inf. Secur. 7
(with the exception of attacks on weak keys) is better than an exhaustive search on the 128-bit key space, which is computationally infeasible. The security of IDEA appears to be bounded only by the weaknesses arising from the relatively small (compared with its keylength) blocklength of 64 bits [44]. It has been stated that before the introduction of the AES, IDEA may have been the most secure private-key cryptographic algorithm available to the public [2]. IDEA encrypts 64-bit plaintext blocks into 64-bit ciphertext blocks using a 128-bit input key K. The algorithm consists of eight identical rounds followed by an output transformation. Each round uses six 16-bit subkeys K r i , 1 4i 46, to transform a 64-bit input X into an output of four 16-bit blocks, which are then input to the next round. All subkeys are derived from the 128-bit input key K. The subkey derivation process is different in decryption mode from the encryption mode, but otherwise encryption and decryption are performed using identical hardware. IDEA uses only three operations on 16-bit sub-blocks a and b: bitwise XOR, unsigned addition mod (2 16 ) and modulo (2 16 1) multiplication. All three operations are derived from different algebraic groups of 2 16 elements, which is crucial to the algorithmic strength of IDEA. Of the three arithmetic operations, bitwise XOR and unsigned addition mod (2 16 ) are trivial to implement, whereas an implementation of modulo (2 16 1) multi- plication that is both area efcient and fast requires careful design and bit-level optimisation. Two early FPGA-based IDEA implementations were published by Mencer et al. [45] and Mosanya et al. [46] in 1998 and 1999, respectively. Mencer et al. studied the benets and limitations of FPGA systems compared with processors and application-specic integrated circuits (ASICs) using IDEA encryption as a bench- mark. Their IDEA implementation had a throughput of 528 Mbps, and it covered four Xilinx XC4020 FPGAs (3200 CLBs) [45]. Mosanya et al. presented a recon- gurable cryptoprocessor called CryptoBooster in [46]. They did not present any exact performance gures for their implementation but they estimated that through- puts of 2001500 Mbps could be achieved on a state-of- the-art FPGA of that time [46]. Representative compact and high-speed FPGA-based implementations of IDEA include a bit-serial imple- mentation described by Leong et al. in [47], with a throughput of 500 Mbps on a Xilinx Virtex XCV300-6, and the bit-parallel implementation described by Cheung et al. in [48], which achieved throughput of 5.25 Gbps on a Virtex XCV1000-6. The bit-parallel implementation also included bespoke software to customise the FPGA reprogramming bitstream for different key schedules. In 2002, the SIG-IDEA implementation was described in [49], and, at 6.78 Gbps, its throughput represented the fastest published FPGA-based imple- mentation of IDEA at that time. Other contributions of SIG-IDEA include implementing a fully pipelined algorithm with both inner and outer loop pipelining on a single Xilinx Virtex-E XCV1000E-6 device, the efcient usage of the diminished-one number system and an area-efcient implementation of the modulo (2 16 ) multiplication. Currently, the fastest FPGA-based implementation of IDEA is probably [50], where Gonzalez et al. achieved a throughput of 8.3 Gbps on a Xilinx Virtex XCV600-6 device. The key to high throughput was replacing all the operational units involving the key with its constant- operand equivalents by partial reconguration, the over- head for which was 4 ms. However, only a few devices support partial reconguration, and the scheme requires a controlling microprocessor. Another recent IDEA implementation by Pan et al. achieved a throughput of 6 Gbps by utilising the embedded multipliers for the modulo (2 16 1) multiplication algorithm [51]. The FPGA-based IDEA implementations mentioned above are summarised in Table 2. 3 Hash algorithms Commonly used hash algorithms, e.g. MD5 [52] and the secure hash algorithm (SHA) [53], are not as well suited to high-speed hardware implementations as most of the private-key or public-key algorithms, mainly because parallelisation cannot be used as efciently. Hash algorithms can be implemented efciently on software as they use common modulo (2 32 ) additions, which are easy and fast to perform with traditional microprocessors. However, signicant accelerations from 25 to 31 times for SHA-1 and SHA-512 have been reported [54]. There are certain applications which greatly benet from hardware acceleration. For example, if a crypto- graphic scheme requiring hash calculations, e.g. the digital signature algorithm [55], is implemented on an FPGA, it is well-grounded to implement a hash module on the chip too. Certain very demanding hash calcula- tions, e.g. long chains of hash rounds, benet from hardware acceleration [56]. Hash algorithm implemen- tations presented in [54, 5669] are considered here. A throughput of several hundred Mbps can be achieved with small logic requirements using a basic iterative architecture [54, 56, 57, 63, 65, 67, 69]. However, the throughput and efciency of an imple- mentation can be increased considerably by partially unrolling the algorithm rounds [54, 58, 66]. This is because the structure of hash algorithms favours Table 2: IDEA implementations on Xilinx FPGAs Authors Device Slices Throughput (Gbps) TPS (Mbps/slice) Cheung et al. [48] Virtex 1000-6 11 602 5.24 0.452 Gonzalez et al. [50] Virtex 600-6 6 078 8.3 1.366 Ha ma la inen et al. [49] Virtex-E 1000-6 9 855 a 6.78 0.688 Leong et al. [47] Virtex 300-6 2 801 0.5 0.179 Mencer et al. [45] XC4000 n.a. (3 200 CLBs) 0.528 n.a. Pan et al. [51] Virtex-II 1000-6 4 221 6.0 1.421 a The value is not available in the original publication. Received from the design les. 8 IEE Proc. Inf. Secur.
unrolling; i.e. if k rounds are unrolled, the critical path increases for fewer than k times [54]. Pipelining, however, cannot be used for increasing throughput as efciently as for block ciphers [56]. Unrolling was used in the fastest published SHA-1 implementations, where Lien et al. reported a throughput of 1024 Mbps on a Virtex XCV1000-6 [54] and Sklavos et al. achieved a throughput of 1339 Mbps on a Virtex-II XC2V500 with their combined SHA-1 and RIPEMD design [66]. Because many of the commonly used hash algorithms share resources and have a similar kind of structure, many implementations combining several hash algo- rithms have been proposed. Dominikus presented a general hash processor architecture which can be used for MD5, SHA-1, SHA-256 and RIPEMD calculations in [59]. A general processor architecture naturally achieves slower performance than algorithm-specic implementations, such as [54, 5658, 67], but algorithm exibility may be an essential feature in certain applications. A sufcient balance between speed and algorithm exibility may be achieved by implementing an algorithm-specic design of two or more commonly used algorithms. Implementations combining certain algorithms have been published; e.g. MD5 and SHA-1 were combined in [61, 62, 68], different SHA algorithms were combined in [63] and MD5 was combined with RIPEMD in [64]. Algorithm support of published FPGA-based implementations is presented in Table 3. The performance and area requirements of FPGA- based open-literature hash algorithm implementations are presented in Table 4. Hash algorithm implementa- tions achieving throughputs of several hundred Mbps require only a minimal amount of logic resources. Hash algorithms with different cryptographic strengths, e.g. SHA-1 and SHA-512, have almost similar throughputs, but stronger algorithms have larger logic requirements [54, 60]. Increasing throughput to several Gbps is difcult because of the structures of commonly used hash algorithms. In high-speed implementations of the AES, for example, aggressive parallelisation, unrolling and pipelining can be used efciently, whereas the structures of hash algorithms usually make the efcient use of such methods difcult [56]. Even for hash algorithms, parallel hash blocks or unrolling and pipelining can be used for increasing throughput, as we have shown in [56]. However, very high-speed implementations of hash algorithms require considerably more area than imple- mentations of block ciphers, e.g. the AES, of the same speed [56]. This can be veried, for example, by comparing the SIG-MD5 implementation of four parallel MD5 blocks [56] with the AES implementation by Standaert et al. [33]. Both have a similar level of throughput (2395 and 2085 Mbps), but the MD5 implementation consumes a lot more area (5732 slices) than the AES implementation (1769 slices). MD5, RIPEMD and SHA-1 were recently compro- mised so that nding collisions is possible with much less effort than exhaustive searching [70, 71]; therefore these algorithms can no longer be considered secure. Although nding collisions is not a problem in every application using hash algorithms, e.g. HMAC (the keyed-hash message authentication code), these algo- rithms will certainly be replaced with stronger ones in the future. Hardware implementation of hash algo- rithms will probably be studied actively when MD5 and SHA-1 are replaced with new algorithms. 4 Conclusions and future work We have shown that FPGAs can be used very efciently for high-speed implementations of cryptographic algo- rithms. The eld has been studied extensively for the past few years and very efcient implementations have been presented regardless of the implemented algorithm. Similar design methodologies apply to all algorithms studied in this survey. The key to a high-speed implementation is to identify the critical operation, e.g. SubBytes in the AES, and implement it efciently. In general, operations, or at least the critical operation, Table 3: Algorithm support of published FPGA-based implementations of hash algorithms Authors MD5 SHA-1 SHA-256 SHA-384 SHA-512 RIPEMD HAS-160 Deepakumara et al. [57] Diez et al. [58] Diez et al. [58] Dominikus [59] Grembowski et al. [60] Grembowski et al. [60] Ja rvinen et al. [56] Ja rvinen et al. [61] Kang et al. [62] Lien et al. [54] Lien et al. [54] McLoone and McCanny [63] Ng et al. [64] Selimis et al. [65] Sklavos et al. [66] Ting et al. [67] Wang et al. [68] Zibin and Ning [69] IEE Proc. Inf. Secur. 9
should be implemented on as low a level as possible in order to guarantee maximum performance with mini- mum resources. Although all the designs considered here were implemented on FPGAs, only a small number of them specically target FPGAs as they are merely general hardware implementations which could be implemented on ASICs as well. Exploiting the special properties of FPGAs has not yet been thoroughly studied, with the exception of embedded memory usage. However, partial recongurability was used in an IDEA implementation [50] as discussed in Section 2.2. Certain implementations which have been optimised especially for the slice structure have been published, e.g. [18]. Many cryptographic algorithms use similar kinds of operations, so it may be possible to combine several algorithms into a single design efciently by exploiting these similarities. Such combinations of cryptographic algorithms have not been studied extensively, except in the case of hash algorithms. There exist dedicated embedded blocks for certain commonly used operations in modern FPGAs, e.g. the Altera Stratix-II architecture includes dedicated blocks for digital signal processing (DSP) [73]. Dedicated blocks for cryptography do not yet exist on any device, but if such blocks were implemented they could speed up the performance of cryptographic algorithms substan- tially. The question of how these blocks should be arranged and which operations should be implemented is an open research problem. Based on our observations, certain possible research topics in the future include architectures for constrained environments an increase of generality, general cryptographic architectures implementing several cryptographic algorithms in a single design (cf. MD5/SHA-1 implementations) dedicated blocks for cryptographic operations into FPGAs (cf. DSP blocks) efcient utilisation of the special abilities of FPGAs (e.g. partial reconguration) efcient implementations of strong hash algorithms. It was concluded that regardless of the algorithm, very efcient implementations have been published in terms of both speed and logic requirements. Although cryptographic algorithm implementation has been widely studied, certain open problems remain. Table 4: Performance and area requirements of published FPGA-based implementations of hash algorithms Authors Device Algorithm Slices BlockRAMs Throughput (Mbps) Deepakumara et al. [57] Virtex 1000-6 MD5 880 2 165 Deepakumara et al. [57] Virtex 1000-6 MD5 4 763 0 354 Diez et al. [58] Virtex-II 3000 MD5 1 369 0 467.3 Diez et al. [58] Virtex-II 3000 SHA-1 1 550 0 899.8 Dominikus [59] Virtex-E 300 MD5 1 004 0 146 Dominikus [59] Virtex-E 300 RIPEMD 1 004 0 89 Dominikus [59] Virtex-E 300 SHA-1 1 004 0 119 Dominikus [59] Virtex-E 300 SHA-256 1 004 0 77 Grembowski et al. [60] Virtex 1000-6 SHA-1 1 475 a 0 a 462 Grembowski et al. [60] Virtex 1000-6 SHA-512 2 826 a 2 a 616 Ja rvinen et al. [56] Virtex-II 4000-6 MD5 1 325 0 607 Ja rvinen et al. [56] Virtex-II 4000-6 MD5 5 732 0 2 395 Ja rvinen et al. [56] Virtex-II 4000-6 MD5 11 498 10 5 857 Ja rvinen et al. [61] Virtex-II 2000-6 MD5 1 882 0 602 Ja rvinen et al. [61] Virtex-II 2000-6 SHA-1 1 882 0 485 Kang et al. [62] Apex 20K 1000-3 MD5 10 573 (LE) 0 142 Kang et al. [62] Apex 20K 1000-3 SHA-1 10 573 (LE) 0 114 Kang et al. [62] Apex 20K 1000-3 HAS-160 10 573 (LE) 0 160 Lien et al. [54] Virtex 1000-6 SHA-1 480 0 544 Lien et al. [54] Virtex 1000-6 SHA-1 1 480 0 1 024 Lien et al. [54] Virtex 1000-6 SHA-512 2 384 0 717 Lien et al. [54] Virtex 1000-6 SHA-512 3 521 0 929 McLoone and McCanny [63] Virtex-E 600-8 SHA-384/512 2 914 2 479 Ng et al. [64] Flex 50-1 MD5 1 964 (LE) 0 206 Ng et al. [64] Flex 50-1 RIPEMD 1 964 (LE) 0 84 Selimis et al. [65] Virtex 150 SHA-1 518 0 518 Sklavos et al. [66] Virtex-II 500 SHA-1 2 245 0 1 339 Sklavos et al. [66] Virtex-II 500 RIPEMD 2 245 0 1 656 Ting et al. [67] Virtex-E 300-8 SHA-256 1 261 0 693 Wang et al. [68] Apex 20K 1000-3 MD5 3 040 (LE) 1 (ESB) 178.6 Wang et al. [68] Apex 20K 1000-3 SHA-1 3 040 (LE) 1 (ESB) 143.3 Zibin and Ning [69] Acex 100-1 SHA-1 1 622 (LE) 0 268.99 Notice that 1 slice 2 logic elements (LEs), 1 BlockRAM4096 bits [36, 37, 26] and 1 embedded system block (ESB) 2048 bits [72]. a Calculated from the percentages presented in the paper. 10 IEE Proc. Inf. Secur.
5 Acknowledgments This paper was written as part of the GO-SEC project at Helsinki University of Technology. GO-SEC is nanced by the National Technology Agency of Finland and several Finnish telecommunications companies. Finally, we would like to thank the anonymous reviewers for their helpful comments and proposals for improvement. 6 References 1 Wollinger, T., Guajardo, J., and Paar, C.: Security on FPGAs: state of the art implementations and attacks, ACM Trans. Embed. Comput. Syst., 2004, 3, pp. 534574 2 Schneier, B.: Applied cryptography (John Wiley & Sons, 2nd ed. 1996) 3 McLoone, M., and McCanny, J.V.: High-performance FPGA implementation of DES using novel method for implementing the key schedule, IEE Proc. Circ. Dev. Syst., 2003, 150 (5), pp. 373378 4 Rouvroy, G., Standaert, F.-X., Quisquater, J.-J., and Legat, J.-D.: Efcient uses of FPGAs for implementations of DES and its experimental linear cryptanalysis, IEEE Trans. Comput., 2003, 52 (4), pp. 473482 5 National Institute of Standards and Technology.: Advanced Encryption Standard (AES). Federal Information Processing Standards Publication (FIPS PUB) 197, 26 November 2001, http://csrc.nist.gov/publications/ps/ps197/ps-197.pdf, accessed June 2005 6 Elbirt, A.J., Yip, W., Chetwynd, B., and Paar, C.: An FPGA- based performance evaluation of the AES block cipher candidate algorithm nalists, IEEE Trans. VLSI Syst., 2001, 9 (4), pp. 545557 7 Nechvatal, J., Barker, E., Bassham, L., Burr, W., Dworkin, M., Foti, J., and Roback, E.: Report on the development of the Advanced Encryption Standard (AES), 2 October 2000, http:// csrc.nist.gov/Cryptotoolkit/aes/round2/r2report.pdf, accessed June 2005 8 Dandalis, A., Prasanna, V.K, and Rolim, J.D.P.: A comparative study of performance of AES nal candidates using FPGAs. Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2000, Worcester, MA, USA, August 2000, pp. 125140 9 Elbirt, A.J., Yip, W., Chetwynd, B., and Paar, C.: An FPGA implementation and performance evaluation of the AES block cipher candidate algorithm nalists. Proc. Third Advanced Encryption Conf., AES3, New York, NY, USA, April 2000, pp. 1327 10 Fischer, V.: Realization of the round 2 AES candidates using Altera FPGA. Proc. 3rd Advanced Encryption Standard Candidate Conf., AES3, New York, NY, USA, April 2000, http://csrc.nist. gov/CryptoToolkit/aes/round2/conf3/papers/ 24-vscher.pdf, accessed June 2005 11 Gaj, K., and Chodowiec, P.: Comparison of the hardware performance of the AES candidates using recongurable hardware. Proc. 3rd Advanced Encryption Standard Candidate Conf., AES3, New York, NY, USA, April 2000, pp. 4054, http://csrc.nist.gov/CryptoToolkit/aes/round2/conf3/papers/ AES3Proceedings.pdf accessed June 2005 12 Mroczkowski, P.: Implementation of the block cipher Rijndael using altera FPGA. Public Comments on AES Candidate AlgorithmsRound 2, May 2000, http://csrc.nist.gov/Crypto- Toolkit/aes/round2/pubcmnts.htm, accessed June 2005 13 Chodowiec, P., Khuon, P., and Gaj, K.: Fast implementation of secret-key block ciphers using mixed inner- and outer-round pipelining. Proc. 2001 ACM/SIGDA 9th Int. Symp. on Field Programmable Gate Arrays, FPGA 2001, Monterey CA, USA, February 2001, pp. 94102 14 Gaj, K., and Chodowiec, P.: Fast implementation and fair comparison of the nal candidates for advanced encryption standard using eld programmable gate arrays. Proc. Topics in CryptologyCT-RSA 2001, The Cryptographers Track at RSA Conf. 2001, San Francisco, CA, USA, April 2001 pp. 8499 15 Hodjat, A., and Verbauwhede, I.: A 21.54 Gbits/s fully pipelined AES processor on FPGA. Proc. 12th Annual IEEE Symp. Field- Programmable Custom Computing Machines, FCCM04, Napa, CA, USA, April 2004, pp. 308309 16 Rodr quez-Henr quez, F., Saqib, N.A., and D az-Pe rez, A.: 4.2 Gbit/s single-chip FPGA implementation of AES algorithm, Electr. Lett., 2003, 39 (15), pp. 11151116 17 Saggese, G.P., Mazzeo, A., Mazzocca, N., and Strollo, A.G.M.: An FPGA-based performance analysis of the unrolling, tiling, and pipelining of the AES algorithm. Proc. 13th Int. Conf. Field Programmable Logic and Applications, FPL 2003, Lisbon, Portugal, September 2003, pp. 292302 18 Standaert, F.-X., Rouvroy, G., Quisquater, J.-J., and Legat, J.-D.: A methodology to implement block ciphers in recongurable hardware and its application to fast and compact AES RIJNDAEL. Proc. ACM/SIGDA 11th ACM Int. Symp. Field-Programmable Gate Arrays, FPGA 2003, Monterey, CA, USA, February 2003, pp. 216224 19 Zambreno, J., Nguyen, D., and Choudhary, A.: Exploring area/ delay tradeoffs in an AES FPGA implementation. Proc. 14th Int. Conf. Field-Programmable Logic and its Applications, FPL 2004, Antwerp, Belgium, AugustSeptember 2004, pp. 575585 20 McLoone, M., and McCanny, J.V.: High performance single- chip FPGA Rijndael algorithm implementation. Proc. Work- shop on Cryptographic Hardware and Embedded Systems, CHES 2001, Paris, France, May 2001, pp. 6576 21 McLoone, M., and McCanny, J.V.: Single-chip FPGA imple- mentation of the advanced encryption standard algorithm. Proc. 11th Int. Conf. Field-Programmable Logic and Applications, FPL 2001, Belfast, Northern Ireland, UK, August 2001, pp. 152161 22 Saqib, N.A., Rodr quez-Henr quez, F., and D az-Pe rez, A.: AES algorithm implementationan efcient approach for sequential and pipeline architectures. Proc. 4th Mexican Int. Computer Science, ENC 2003, Tlaxcala, Mexico, September 2003, pp. 126130 23 Weaver, N., and Wawrzynek, J.: High performance, compact AES implementations in Xilinx FPGAs, 27 September 2002, http://www.cs.berkeley.edu/nweaver/sfra/rijndael.pdf, accessed June 2005 24 Fischer, V., and Drutarovsky , M.: Two methods of Rijndael implementation in recongurable hardware. Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2001, Paris, France, May 2001, pp. 7792 25 McLoone, M., and McCanny, J.V.: Rijndael FPGA implemen- tation utilizing look-up tables. Proc. 2001 IEEE Workshop on Signal Processing Systems, SIPS01, Antwerp, Belgium, September 2001, pp. 349360 26 Xilinx, Inc.: Virtex-E 1.8 V extended memory eld program- mable gate arrays, 17 July 2002, http://www.xilinx.com/bvdocs/ publications/ds025.pdf, accessed June 2005 27 Ja rvinen, K., Tommiska, M., and Skytta , J.: A fully pipelined memoryless 17.8 Gbps AES-128 encryptor. Proc. ACM/SIGDA 11th ACM Int. Symp. on Field-Programmable Gate Arrays, FPGA 2003, Monterey, CA, USA, February 2003, pp. 207215 28 Daemen, J., and Rijmen, V.: The design of Rijndael (Springer- Verlag, 2002) 29 Zhang, X., and Parhi, K.K.: High-Speed VLSI architectures for the AES algorithm, IEEE Trans. VLSI Syst., 2004, 12 (9), pp. 957967 30 Pramstaller, N., and Wolkerstorfer, J.: A universal and efcient AES co-processor for eld programmable logic arrays. Proc. 14th Int. Conf. Field-Programmable Logic and its Applications, FPL 2004, Antwerp, Belgium, AugustSeptember 2004, pp. 565574 31 Chodowiec, P., and Gaj, K.: Very compact FPGA implementa- tion of the AES algorithm. Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2003, Cologne, Germany, September 2003, pp. 319333 32 Rouvroy, G., Standaert, F.-X., Quisquater, J.-J., and Legat, J.-D.: Compact and efcient encryption/decryption module for FPGA implementation of the AES Rijndael very well suited for small embedded applications. Proc. Int. Conf. Information Technology: Coding and Computing, ITCC04, Las Vegas, NV, USA, April 2004, Vol. 2, pp. 583587 33 Standaert, F.-X., Rouvroy, G., Quisquater, J.-J., and Legat, J.-D.: Efcient implementation of Rijndael encryption in recongurable hardware: improvements and design tradeoffs. Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2003, Cologne, Germany, September 2003, pp. 334350 34 Caltagirone, C., and Anantha, K.: High throughput, parallelized 128-bit AES encryption in a resource-limited FPGA. Proc. 15th Annual ACM Symp. Parallel Algorithms and Architectures, SPAA03, San Diego, CA, USA, June 2003, pp. 240241 35 Zigiotto, A.C., and dAmore, R.: A low-cost FPGA implemen- tation of the Advanced Encryption Standard algorithm. Proc. 15th Symp. Integrated Circuits and Systems Design, SBCCI02, Porto Alegre, Brazil, September 2002, pp. 181186 IEE Proc. Inf. Secur. 11
36 Xilinx, Inc.: Virtex 2.5 V eld programmable gate arrays, 2 April 2001, http://www.xilinx.com/bvdocs/publications/ ds003.pdf, accessed June, 2005 37 Xilinx, Inc.: Virtex-E 1.8 V eld programmable gate arrays, 17 July 2002, http://www.xilinx.com/bvdocs/publications/ ds022.pdf, accessed June 2005 38 Xilinx, Inc.: Virtex-II platform FPGAs: complete data sheet, 1 March 2005, http://www.xilinx.com/bvdocs/publications/ ds031.pdf, accessed June 2005 39 Labbe , A., and Pe rez, A.: AES implementation on FPGA: timeexibility tradeoff. Proc. 12th Int. Conf. Field- Programmable Logic and its Applications, FPL 2002, Montpellier, France, September 2002, pp. 836844 40 Wang, S.-S., and Ni, W.-S.: An efcient FPGA implementation of Advanced Encryption Standard algorithm. Proc. 2004 IEEE Int. Symp. on Circuits and Systems, ISCAS04, Vancouver, British Columbia, Canada, May 2004, pp. 597600 41 Xilinx, Inc.: Virtex-4 family overview, 17 June 2005, http:// www.xilinx.com/bvdocs/publications/ds112.pdf, accessed June 2005 42 Lai, X., and Massey, J.L.: A proposal for a new block encryption standard. Proc. Advances in CryptologyEUROCRYPT 90 pp. 389404 43 Lai, X., Massey, J.L., and Murphy, S.: Markov ciphers and differential cryptanalysis. Proc. Advances in Cryptology EUROCRYPT 91, pp. 1738 44 Menezes, A.J., van Oorschot, P.C., and Vanstone, S.A.: Handbook of applied cryptography (CRC Press Ltd., 1997) 45 Mencer, O., Morf, M., and Flynn, M.J.: Hardware software tri- design of encryption for mobile communication units. Proc. 1998 IEEE Int. Acoustics, Speech, and Signal Processing, ICASSP 98, Seattle, WA, USA, May 1998, Vol. 5, pp. 30453048 46 Mosanya, E., Teuscher, C., Restrepo, H.F., Galley, P., and Sanchez, E.: CryptoBooster: a recongurable and modular cryptographic coprocessor. Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES 1999, Worcester, MA, USA, August 1999, pp. 246256 47 Leong, M.P., Cheung, O.Y.H., Tsoi, K.H., and Leong, P.H.W.: A bit-serial implementation of the international data encryption algorithm IDEA. Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM00), Napa Valley, CA, USA, April 2000, pp. 122131 48 Cheung, O.Y.H., Tsoi, K.H., Wai Leong, P.H., and Leong, M.P.: Tradeoffs in parallel and serial implementations of the interna- tional data encryption algorithm IDEA. Proc. Third Int. Workshop on Cryptographic Hardware and Embedded Systems, CHES 2001, Paris, France, May 2001, pp. 333347 49 Ha ma la inen, A., Tommiska, M., and Skytta , J.: 8 Gigabits per second implementation of the IDEA cryptographic algorithm. Proc. 12th Int. Conf. Field-Programmable Logic and its Applications, FPL 2002, Montpellier, France, September 2002, pp. 760769 50 Gonzalez, I., Lo pez-Buedo, S., Go mez, F.J., and Mart nez, J.: Using partial reconguration in cryptographic applications: an implementation of the IDEA algorithm. Proc. 13th International Workshop on Field-Programmable Logic and Applications (FPL03), Lisbon, Portugal, September 2003, pp. 194203 51 Pan, Z., Venkateshwaran, S., Gurumani, S.T., and Wells, B.E.: Exploiting ne-grain parallelism of IDEA using Xilinx FPGA. Proc. 16th Int. Conf. Parallel and Distributed Computing Systems (PDCS-2003), Reno, NV, USA, August 2003, pp. 122131 52 Rivest, R.L.: The MD5 message-digest algorithm, RFC 1321 (MIT Laboratory for Computer Science and RSA Data Security, Inc., 1992) 53 National Institute of Standards and Technology.: Secure hash standard. Federal Information Processing Standards Publication (FIPS PUB) 180-2, 1 August 2002, with changes, 25 February 2004, http://www.csrc.nist.gov/publications/ps/ps180-2/ ps180-2withchangenotice.pdf, accessed June 2005 54 Lien, R., Grembowski, T., and Gaj, K.: A 1 Gbit/s partially unrolled architecture of hash functions SHA-1 and SHA-512. Proc. Topics in Cryptology, CT-RSA 2004, The Cryptographers Track at the RSA Conf. 2004, San Francisco, CA, USA, February 2004, pp. 324338 55 National Institute of Standards and Technology.: Digital signature standard (DSS), Federal Information Processing Standards Publication (FIPS PUB) 186-2, 27 January 2000, http://csrc.nist.gov/publications/ps/ps186-2/ps186-2- change1.pdf, accessed June2005 56 Ja rvinen, K., Tommiska, M., and Skytta , J.: Hardware implementation analysis of the MD5 hash algorithm. Proc. 38th Hawaii Int. Conf. System Sciences HICSS-38, Big Island, HI, USA, January 2005, p. 298 (abstract) 57 Deepakumara, J., Heys, H.M., and Venkatesan, R.: FPGA implementation of MD5 hash algorithm. Proc. Canadian Conf. Electrical and Computer Engineering, CCECE 2001, Toronto, Canada, May 2001, Vol. 2, pp. 919924 58 Diez, J.M., Bojanic , S., Stanimirovicc , Lj., Carreras, C., and Nieto-Taladriz, O.: Hash algorithms for cryptographic protocols: FPGA implementations. Proc. 10th Telecommunica- tions Forum, TELFOR2002, Belgrade, Yugoslavia, November 2002, 59 Dominikus, S.: A hardware implementation of MD4-family hash algorithms. Proc. 9th IEEE Int. Conf. Electronics, Circuits and Systems, ICECS 2002, Dubrovnik, Croatia, September 2002, Vol. 3, pp. 11431146 60 Grembowski, T., Lien, R., Gaj, K., Nguyen, N., Bellows, P., Flidr, J., Lehman, T., and Schott, B.: Comparative analysis of the hardware implementations of hash functions SHA-1 and SHA-512. Proc. 5th Int. Conf. Information Security, ISC 2002, Sao Paulo, Brazil, SeptemberOctober 2002, pp. 7589 61 Ja rvinen, K., Tommiska, M., and Skytta , J.: A compact MD5 and SHA-1 co-implementation utilizing algorithm simila- rities. Proc. Int. Conf. Engineering of Recongurable Systems and Algorithms, ERSA05, Las Vegas, NV, USA, June 2005, pp. 4854 62 Kang, Y.K., Kim, D.W., Kwon, T.W., and Choi, J.R.: An efcient implementation of hash function processor for IPSEC. Proc. IEEE Asia-Pacic Conf. on ASIC, AP-ASIC 2002, Taipei, Taiwan, August 2002, pp. 9396 63 McLoone, M., and McCanny, J.V.: Efcient single-chip implementation of SHA-384 and SHA-512. Proc. 2002 Int. Conf. Field-Programmable Technology, FPT 2002, Hong Kong, China, December 2002, pp. 311314 64 Ng, C.-W., Ng, T.-S., and Yip, K.-W.: A uninied architecture of MD5 and RIPEMD-160 hash algorithms. Proc. 2004 IEEE Int. Symp. on Circuits and Systems, ISCAS04, Vancouver, British Columbia, Canada, May 2004, Vol. 2, pp. 889892 65 Selimis, G., Sklavos, N., and Koufopavlou, O.: VLSI implementation of the keyed-hash message authentication code for the wireless application protocol. Proc. 2003 10th IEEE Int. Conf. Electronics, Circuits and Systems, ICECS 2003, Sharjah, United Arab Emirates, December 2003, Vol. 1, pp. 2427 66 Sklavos, N., Dimitroulakos, G., and Koufopavlou, O.: An ultra high speed architecture for VLSI implementation of hash functions. Proc. 2003 10th IEEE Int. Conf. Electronics, Circuits and Systems, ICECS 2003, Sharjah, United Arab Emirates, December 2003, Vol. 3, pp. 990993 67 Ting, K.K., Yuen, S.C.L., Lee, K.H., and Leong, P.H.W.: An FPGA based SHA-256 processor. Proc. 12th Int. Conf. Field-Programmable Logic and its Applications, FPL 2002, Montpellier, France, September 2002, pp. 577585 68 Wang, M.-Y., Su, C.-P., Huang, C.-T., and Wu, C.-W.: An HMAC processor with integrated SHA-1 and MD5 algorithms. Proc. Asia and South Pacic Design Automation Conf. 2004, Yokohama, Japan, January 2004, pp. 456458 69 Zibin, D., and Ning, Z.: FPGA Implementation of SHA-1 algorithm. Proc. 2003 5th Int. Conf. ASIC, ASICON 2003, Beijing, China, October 2003, Vol. 2, pp. 13211324 70 Wang, X., Yin, Y.L., and Yu, H.: Collision search attacks on SHA1, 13 February 2005, http://theory.csail.mit.edu/yiqun/ shanote.pdf, accessed June 2005 71 Wang, X., and Yu, H.: How to break MD5 and other hash functions. Proc. Advances in CryptologyEUROCRYPT 2005: 24th Annual Int. Conf. the Theory and Applications of Cryptographic Techniques, Aarhus, Denmark, May 2005, pp. 1935 72 Altera Corporation: APEX 20K Programmable logic device family datasheet, http://www.altera.com/literature/ds/apex.pdf, March 2004 accessed June 2005 73 Altera Corporation: Stratix II device handbook, volume 2, http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf, May 2005 accessed June 2005 12 IEE Proc. Inf. Secur.
Echo on a Chip - Secure Embedded Systems in Cryptography: A New Perception for the Next Generation of Micro-Controllers handling Encryption for Mobile Messaging
Echo on a Chip - Secure Embedded Systems in Cryptography: A New Perception for the Next Generation of Micro-Controllers handling Encryption for Mobile Messaging