AES T-Box Slides
Agenda
Introduction
Motivation
Overview of architectures
Implementations
Key scheduling
Test vectors and tools used
Results
Conclusion
Introduction to AES
In 1997, NIST initiated a contest, known as AES, to develop a new Federal Information Processing Standard.
The standard should be capable of protecting sensitive government information well into the next century.
After 5 years of extensive analysis, Rijndael was chosen as the winner of the contest and became an official standard in Nov. 2001.
AES is expected to be used by the U.S. Government and, on a voluntary basis, by the private sector.
Motivation
In software, AES T-box implementations of decryption and of combined encryption/decryption units showed better throughput than S-box implementations.
Viktor Fischer and Milos Drutarovsky showed this performance improvement in hardware on Altera Flex devices.
Our idea is to show the same performance improvement of the T-box architecture in hardware on the Xilinx FPGA families Virtex 5 and Spartan 3E.
S-box vs T-box
The S-box architecture uses 8 x 8 look-up tables (8-bit input, 8-bit output) plus the remaining round operations for encryption and decryption.
The T-box architecture uses 8 x 32 look-up tables (8-bit input, 32-bit output) plus the remaining XOR operations for encryption and decryption.
The T-box architecture uses 4 times more table memory than the S-box architecture (S-box: 16 tables of 8 x 8; T-box: 16 tables of 8 x 32).
The S-box architecture has the same structure as the generally proposed architecture of AES.
Encryption starts with AddRoundKey and then performs the round operations: SubBytes (using 8 x 8 look-up tables), ShiftRows, MixColumns, and AddRoundKey.
The last round does not include the MixColumns operation.
S-box Enc/Dec
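[Figure: S-box architecture, a) Encryption and b) Decryption. Labels for encryption: Plaintext, K0, round counter i up to Nr, round key Ki, MixColumn, Ciphertext. Labels for decryption: Ciphertext, KNr, round counter i from Nr, round key Ki, InvSubBytes, Plaintext. Nr: total number of rounds.]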
The T-box architecture allows the computation of an entire round using only look-up tables and XOR operations.
The pre-computed look-up tables represent the combined SubBytes and MixColumns transformations.
Each T-box table has an 8-bit input and a 32-bit output (8 x 32).
Memory of one T-box table: 256 entries x 32 bits (4 B) = 1 KB.
Four T-box tables: 4 KB (fast implementations).
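The table construction and the 4 KB figure can be sketched in Python. This is a minimal sketch, not the hardware description used in the project: the helper names (`xtime`, `gmul`, `build_sbox`, `rotbyte`, `build_t_tables`) are made up for illustration, and packing row 0 of a column into the most significant byte of each 32-bit word is an assumed convention.

```python
def xtime(a):
    """Multiply by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


def rotbyte(w):
    """Rotate a 32-bit word by one byte, moving the last row to the top."""
    return ((w >> 8) | (w << 24)) & 0xFFFFFFFF


def build_t_tables(sbox):
    """T0 holds (02*S, S, S, 03*S); T1..T3 are byte rotations of T0."""
    t0 = [(gmul(2, s) << 24) | (s << 16) | (s << 8) | gmul(3, s) for s in sbox]
    t1 = [rotbyte(w) for w in t0]
    t2 = [rotbyte(w) for w in t1]
    t3 = [rotbyte(w) for w in t2]
    return t0, t1, t2, t3


SBOX = build_sbox()
T_TABLES = build_t_tables(SBOX)

BYTES_PER_TABLE = 256 * 4                      # 256 entries of 32 bits
TOTAL_BYTES = len(T_TABLES) * BYTES_PER_TABLE  # four tables

print(f"one table: {BYTES_PER_TABLE} B, four tables: {TOTAL_BYTES} B")
print(f"T0[0x00] = {T_TABLES[0][0]:08x}")
```

Because T1, T2, and T3 are rotations of T0, a design can trade the four-table layout for a single table plus rotations when memory is tighter than routing.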
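First row elements: s0, s4, s8, s12. Second row elements: s1, s5, s9, s13.
MixColumns multiplies each state column by a fixed matrix. The four matrix columns correspond to the tables T0, T1, T2, and T3:

$$
\begin{pmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{pmatrix}
\begin{pmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{pmatrix}
$$

The first output byte is $02 \cdot S_0 \oplus 03 \cdot S_1 \oplus S_2 \oplus S_3$.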
T-Box Tables
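Each encryption table is one column of the MixColumns matrix multiplied by the S-box output $S[a]$:

$$
T_0[a] = \begin{pmatrix} 02 \cdot S[a] \\ S[a] \\ S[a] \\ 03 \cdot S[a] \end{pmatrix}, \quad
T_1[a] = \begin{pmatrix} 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \\ S[a] \end{pmatrix}, \quad
T_2[a] = \begin{pmatrix} S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \end{pmatrix}, \quad
T_3[a] = \begin{pmatrix} S[a] \\ S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \end{pmatrix}
$$

Each decryption table is one column of the InvMixColumns matrix multiplied by the inverse S-box output $S^{-1}[a]$:

$$
T_0^{-1}[a] = \begin{pmatrix} 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \end{pmatrix}, \quad
T_1^{-1}[a] = \begin{pmatrix} 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \end{pmatrix}, \quad
T_2^{-1}[a] = \begin{pmatrix} 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \end{pmatrix}, \quad
T_3^{-1}[a] = \begin{pmatrix} 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \end{pmatrix}
$$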
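Because T1, T2, and T3 are byte rotations of T0, a round output column can be computed from T0 alone:

$$
\text{output column } j = T_0[a_{0,j}] \oplus \mathrm{Rotbyte}\Big( T_0[a_{1,j+c_1}] \oplus \mathrm{Rotbyte}\big( T_0[a_{2,j+c_2}] \oplus \mathrm{Rotbyte}( T_0[a_{3,j+c_3}] ) \big) \Big) \oplus K_j
$$

Here $j$ indicates the key word (the column index), $K_j$ is the round key word for that column, and $c_1$, $c_2$, $c_3$ are the ShiftRows offsets.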
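The single-table round formula can be checked against a plain SubBytes, ShiftRows, MixColumns, AddRoundKey computation. The sketch below is illustrative only: the function names are hypothetical, the ShiftRows offsets are taken as c_r = r for a 128-bit block, and row 0 is assumed to sit in the most significant byte of each word. The sample state and key word are the round 1 values of the cipher example in FIPS 197, Appendix B.

```python
def xtime(a):
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def build_sbox():
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


SBOX = build_sbox()
# T0[a] packs (02*S[a], S[a], S[a], 03*S[a]) with row 0 in the top byte.
T0 = [(gmul(2, s) << 24) | (s << 16) | (s << 8) | gmul(3, s) for s in SBOX]


def rotbyte(w):
    return ((w >> 8) | (w << 24)) & 0xFFFFFFFF


def tbox_column(state, j, key_word):
    """Output column j from one table and three rotations; state[r][c]."""
    a = [state[r][(j + r) % 4] for r in range(4)]  # ShiftRows folded into indexing
    w = T0[a[3]]
    w = T0[a[2]] ^ rotbyte(w)
    w = T0[a[1]] ^ rotbyte(w)
    w = T0[a[0]] ^ rotbyte(w)
    return w ^ key_word


def reference_column(state, j, key_word):
    """Same column via explicit SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    a = [SBOX[state[r][(j + r) % 4]] for r in range(4)]
    matrix = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
    out = 0
    for row in matrix:
        byte = 0
        for coef, value in zip(row, a):
            byte ^= gmul(coef, value)
        out = (out << 8) | byte
    return out ^ key_word


# State at the start of round 1 (rows x columns) and the first word of round key 1.
STATE = [
    [0x19, 0xA0, 0x9A, 0xE9],
    [0x3D, 0xF4, 0xC6, 0xF8],
    [0xE3, 0xE2, 0x8D, 0x48],
    [0xBE, 0x2B, 0x2A, 0x08],
]
KEY_WORD_0 = 0xA0FAFE17

col0 = tbox_column(STATE, 0, KEY_WORD_0)
print(f"column 0 via T0: {col0:08x}")
print("matches reference:", col0 == reference_column(STATE, 0, KEY_WORD_0))
```

The rotation works because XOR commutes with byte rotation, so nesting Rotbyte three times is equivalent to looking up T1, T2, and T3 directly.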
T-box Architecture
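[Figure: T-box architecture, a) Encryption and b) Decryption. Labels for encryption: Plaintext (128), K[0] (128), T Tables (32-bit outputs), Ki (128), Derived SubBytes, ShiftRows, KNr (128), Ciphertext (128). Labels for decryption: Ciphertext (128), K[Nr] (128), T-1 Tables (32-bit outputs), Inv Ki (128), Derived InvSubBytes, InvShiftRows, K0 (128), Plaintext.]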
[Figure: decryption data path. Labels: InvShiftRows, InvSubBytes, AddRoundKey, InvMixColumns.]
[Figure: block diagram. Labels: ShiftRows, Round key, Data output.]
This architecture can only process one block of data at a time.
The number of clock cycles needed to encrypt or decrypt a block follows from the number of cipher rounds Nr: 11, 13, and 15 clock cycles (Nr + 1) for key sizes of 128, 192, and 256 bits.
The critical path is located in the decryption circuit and includes InvShiftRows, AddRoundKey, InvMixColumns, a 3-to-1 multiplexer, and InvSubBytes.
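For a design that processes one block at a time, throughput is commonly estimated as block size times clock frequency divided by cycles per block. The sketch below applies that relation to the cycle counts above. The relation itself is a standard estimate rather than something stated on the slides, and the 100 MHz clock is a made-up illustration value, not a measured result.

```python
# Number of AES rounds for each key size (bits).
ROUNDS = {128: 10, 192: 12, 256: 14}


def cycles_per_block(key_bits):
    """Cycle count quoted above: one cycle more than the number of rounds."""
    return ROUNDS[key_bits] + 1


def throughput_gbps(clock_mhz, key_bits, block_bits=128):
    """Estimated throughput in Gbit/s for one block in flight at a time."""
    bits_per_second = block_bits * clock_mhz * 1e6 / cycles_per_block(key_bits)
    return bits_per_second / 1e9


HYPOTHETICAL_CLOCK_MHZ = 100.0  # assumption for illustration only

for key_bits in (128, 192, 256):
    print(key_bits, cycles_per_block(key_bits),
          round(throughput_gbps(HYPOTHETICAL_CLOCK_MHZ, key_bits), 3))
```

At a fixed clock, the larger key sizes lose throughput only through the extra rounds, which is why a shorter critical path matters as much as the cycle count.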
[Figure: block diagram with an Enc Unit and a Dec Unit. Labels: Enc Unit, Enc round, Round Key, Dec Unit, Dec round, Inv Round Key, InvSubBytes, InvShiftRows, Round Key, Data output.]
Key Scheduling
The key scheduling unit supports all three key sizes, i.e., 128, 192, and 256 bits.
It requires a key setup phase, during which the round keys are computed and stored in internal memory.
The unit produces 64 bits of key material per clock cycle, independent of the size of the main key.
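As a software reference for the key setup phase, the following is a minimal sketch of the standard AES key expansion for all three key sizes. It is not the hardware key scheduling unit: the names `expand_key` and `sub_word` are illustrative, and the setup-cycle figure at the end is only a rough estimate that assumes every 32-bit round key word, including the original key words, passes through the 64-bit output.

```python
def xtime(a):
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def build_sbox():
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


SBOX = build_sbox()


def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key(key):
    """Return the round key words w[0 .. 4*(Nr+1)-1] as 32-bit integers."""
    nk = len(key) // 4          # 4, 6, or 8 key words
    nr = nk + 6                 # 10, 12, or 14 rounds
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        t = w[i - 1]
        if i % nk == 0:
            t = ((t << 8) | (t >> 24)) & 0xFFFFFFFF      # RotWord
            t = sub_word(t) ^ (rcon << 24)               # SubWord, then Rcon
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            t = sub_word(t)                              # extra step for 256-bit keys
        w.append(w[i - nk] ^ t)
    return w


for key_len in (16, 24, 32):
    words = expand_key(bytes(range(key_len)))
    setup_cycles = len(words) * 32 // 64   # rough estimate, see assumption above
    print(key_len * 8, len(words), setup_cycles)
```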
[Figure: key scheduling unit. Labels: Ki, Rot, Sub, Rcon, Ki+1, Register, Ki-2 through Ki-8, 32-bit data paths, 64-bit output.]
Interface
Interface - Virtex
[Figure: interface block diagram. Signals: CLK, RESET, DATA_IN (128), DATA_OUT (128), DATA_IN_WRITE, DATA_IN_READY, FULL, KEY_IN (128), WRITE, KEY_IN_WRITE.]
Interface - Spartan
Test Vectors
Test vectors are provided by NIST in the FIPS 197 publication.
They contain intermediate state values.
Test vectors for encryption and decryption are available for the different key sizes.
Separate decryption test vectors are available for decryption schemes using normal keys and inverse keys.
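A software model is a convenient way to see how such vectors are used. The sketch below is a plain, unoptimized AES encryption in Python checked against the example vectors of FIPS 197, Appendix C. It is a reference model only, not the test bench used for the hardware; all function names are illustrative, and the state is assumed to be stored column by column in a 16-byte string.

```python
def xtime(a):
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def build_sbox():
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


SBOX = build_sbox()


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def expand_key(key):
    """Return the Nr + 1 round keys as 16-byte strings."""
    nk = len(key) // 4
    words = [key[4 * i:4 * i + 4] for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nk + 7)):
        t = words[i - 1]
        if i % nk == 0:
            t = bytes(SBOX[b] for b in t[1:] + t[:1])   # RotWord, then SubWord
            t = bytes([t[0] ^ rcon]) + t[1:]
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            t = bytes(SBOX[b] for b in t)
        words.append(xor(words[i - nk], t))
    return [b"".join(words[4 * r:4 * r + 4]) for r in range(nk + 7)]


def mix_columns(s):
    out = bytearray()
    for c in range(0, 16, 4):
        a = s[c:c + 4]
        for r in range(4):
            out.append(gmul(2, a[r]) ^ gmul(3, a[(r + 1) % 4])
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return bytes(out)


def encrypt_block(block, key):
    keys = expand_key(key)
    nr = len(keys) - 1
    s = xor(block, keys[0])                                     # initial AddRoundKey
    for rnd in range(1, nr + 1):
        s = bytes(SBOX[b] for b in s)                           # SubBytes
        s = bytes(s[(i + 4 * (i % 4)) % 16] for i in range(16))  # ShiftRows
        if rnd != nr:
            s = mix_columns(s)                                  # skipped in last round
        s = xor(s, keys[rnd])                                   # AddRoundKey
    return s


PLAINTEXT = bytes.fromhex("00112233445566778899aabbccddeeff")
# (key length in bytes, expected ciphertext) from FIPS 197, Appendix C.
VECTORS = [
    (16, "69c4e0d86a7b0430d8cdb78070b4c55a"),
    (24, "dda97ca4864cdfe06eaf70a0ec0d7191"),
    (32, "8ea2b7ca516745bfeafc49904b496089"),
]

for key_len, expected in VECTORS:
    result = encrypt_block(PLAINTEXT, bytes(range(key_len))).hex()
    print(key_len * 8, result, result == expected)
```

Printing the state after each step of the loop gives the intermediate values that can be compared line by line with the published ones.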
Aldec Active-HDL 7.2 was used for functional simulation.
Xilinx ISE Design Suite 10.1 was used for synthesis and implementation.
Results
Throughput (Gbps)
Architecture  Device   128-bit key  192-bit key  256-bit key
S-box         Virtex   1.53         1.35         1.01
S-box         Spartan  0.426        0.403        0.355
T-box         Virtex   1.18         1.02         0.907
T-box         Spartan  0.376        0.338        0.319
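The relative advantage of the S-box designs can be recomputed directly from the throughput values above. This is a small sketch using only those numbers as transcribed here; the dictionary layout and the function name `sbox_gain_percent` are illustrative.

```python
# Throughput in Gbit/s, keyed by (architecture, device) and key size in bits.
THROUGHPUT_GBPS = {
    ("S-box", "Virtex"): {128: 1.53, 192: 1.35, 256: 1.01},
    ("S-box", "Spartan"): {128: 0.426, 192: 0.403, 256: 0.355},
    ("T-box", "Virtex"): {128: 1.18, 192: 1.02, 256: 0.907},
    ("T-box", "Spartan"): {128: 0.376, 192: 0.338, 256: 0.319},
}


def sbox_gain_percent(device, key_bits):
    """How much higher the S-box throughput is than the T-box one, in percent."""
    s = THROUGHPUT_GBPS[("S-box", device)][key_bits]
    t = THROUGHPUT_GBPS[("T-box", device)][key_bits]
    return 100.0 * (s / t - 1.0)


for device in ("Virtex", "Spartan"):
    for key_bits in (128, 192, 256):
        print(device, key_bits, f"{sbox_gain_percent(device, key_bits):.1f}%")
```

With the values as transcribed, the Virtex gains come out at roughly 30%, 32%, and 11% for 128-, 192-, and 256-bit keys, and the Spartan gains at roughly 13%, 19%, and 11%.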
[Figure: Comparison: Throughput. Bar chart of throughput (Gbps, axis 0 to 1.8) for the S-box and T-box implementations: 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan, 256_Spartan.]
Area
[Figure: Comparison: Area. Bar chart of area (CLB slices, axis 0 to 14000) for the S-box and T-box implementations: 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan, 256_Spartan.]
Throughput/Area
Architecture  Device   128-bit key  192-bit key  256-bit key
S-box         Virtex   2415.910     2104.060     1618.846
S-box         Spartan  376.96       354.65       317.82
T-box         Virtex   693.113      602.721      538.038
T-box         Spartan  32.15        28.90        27.27
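The throughput/area figures can be compared in the same way. The first function below is unit independent. The second one back-computes an area from the 128-bit throughput values and rests on an assumption that is not stated on the slides: that the ratio is expressed in kbit/s per CLB slice. All names are illustrative.

```python
# Throughput/area values as listed above, keyed by (architecture, device).
RATIO = {
    ("S-box", "Virtex"): {128: 2415.910, 192: 2104.060, 256: 1618.846},
    ("S-box", "Spartan"): {128: 376.96, 192: 354.65, 256: 317.82},
    ("T-box", "Virtex"): {128: 693.113, 192: 602.721, 256: 538.038},
    ("T-box", "Spartan"): {128: 32.15, 192: 28.90, 256: 27.27},
}

# 128-bit throughput in Gbit/s from the throughput results.
THROUGHPUT_GBPS_128 = {
    ("S-box", "Virtex"): 1.53,
    ("S-box", "Spartan"): 0.426,
    ("T-box", "Virtex"): 1.18,
    ("T-box", "Spartan"): 0.376,
}


def ratio_advantage(device, key_bits):
    """S-box throughput/area divided by T-box throughput/area (unit independent)."""
    return RATIO[("S-box", device)][key_bits] / RATIO[("T-box", device)][key_bits]


def implied_slices(arch, device):
    """Area implied for 128-bit keys, ASSUMING the ratio is in kbit/s per slice."""
    kbps = THROUGHPUT_GBPS_128[(arch, device)] * 1e6   # Gbit/s to kbit/s
    return kbps / RATIO[(arch, device)][128]


for device in ("Virtex", "Spartan"):
    print(device,
          [round(ratio_advantage(device, k), 2) for k in (128, 192, 256)],
          round(implied_slices("S-box", device)),
          round(implied_slices("T-box", device)))
```

Under that unit assumption the implied areas stay below the 14000-slice top of the area chart, which suggests the assumption is at least plausible.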
[Figure: Comparison: Throughput/Area. Bar chart of the throughput/area ratio (axis 0 to 3000) for the S-box and T-box implementations: 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan, 256_Spartan.]
Problems encountered
We were unable to map the T tables to the BRAMs: by default, the tool implemented the tables as logic instead of BRAMs.
The T-box architectures may have higher latency due to the on-the-fly calculation of the inverse round keys.
Conclusion
Our S-box implementations perform better than the T-box implementations.
The area of the T-box implementations is nearly four times that of the S-box implementations.
Conclusion (2)
The throughputs of the S-box implementations are 11%, 29%, and 31% higher than those of the corresponding T-box implementations for key sizes of 128, 192, and 256 bits.
The throughput/area ratio (per CLB slice) of the S-box implementations is at least 10 times that of the corresponding T-box implementations.
Implement the T-box architectures such that BRAMs are used to store the T table values.
Partial or complete loop unrolling can be implemented for the S-box architectures to further increase the throughput.
For the T-box implementations, the inverse round keys can be precomputed and stored in memory, which may reduce the minimum clock period.
Questions?