
AES T-Box Slides


Comparison of Hardware Implementations of S-box and T-box architectures of AES

ECE 746: Secure Telecommunication Systems
Instructor: Dr. Kris Gaj

Bhupathi Kakarlapudi and Nitin Alabur

Agenda

- Introduction
- Motivation
- Overview of architectures
- Implementations
- Key scheduling
- Test vectors and tools used
- Results
- Conclusion

Introduction to AES

In 1997, NIST initiated a contest, known as AES, to develop a new Federal Information Processing Standard. The standard should be capable of protecting sensitive government information well into the next century. After extensive analysis, Rijndael was chosen as the winner of the contest in 2000 and became an official standard (FIPS 197) in November 2001, nearly five years after the contest began. AES is expected to be used by the U.S. Government and, on a voluntary basis, by the private sector.

Motivation

In software, AES T-box implementations of decryption and of combined encryption/decryption units showed better throughput than S-box implementations. Viktor Fischer and Milos Drutarovsky demonstrated this performance improvement in hardware on Altera FLEX devices. Our idea is to show the same improvement for the T-box architecture in hardware on the Xilinx FPGA families Virtex-5 and Spartan-3E.

S-box vs T-box

The S-box architecture uses 8 × 8 look-up tables (8-bit input, 8-bit output) for SubBytes and computes the remaining round operations of encryption/decryption separately. The T-box architecture uses 8 × 32 look-up tables (8-bit input, 32-bit output) and combines their outputs with XOR operations. The T-box architecture therefore uses 4 times more memory than the S-box architecture (S-box: 16 tables of 8 × 8; T-box: 16 tables of 8 × 32).
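The 4× memory figure is plain arithmetic. A minimal Python sketch, assuming (as the slide states) 16 tables per architecture with 256 entries each; the helper `table_bits` is a hypothetical name used only for illustration:

```python
# Memory estimate for the slide's configuration: 16 look-up tables per
# architecture (one per state byte), each indexed by an 8-bit value.
ENTRIES = 2 ** 8   # 256 entries per table
TABLES = 16        # assumption taken from the slide: 16 parallel tables


def table_bits(output_bits: int, tables: int = TABLES) -> int:
    """Total ROM size in bits for `tables` tables of 256 entries each."""
    return tables * ENTRIES * output_bits


sbox_bits = table_bits(8)    # 16 x (8 x 8) S-box tables
tbox_bits = table_bits(32)   # 16 x (8 x 32) T-box tables

print(f"S-box: {sbox_bits} bits = {sbox_bits // 8 // 1024} KB")  # 32768 bits = 4 KB
print(f"T-box: {tbox_bits} bits = {tbox_bits // 8 // 1024} KB")  # 131072 bits = 16 KB
print(f"T-box / S-box = {tbox_bits // sbox_bits}")               # 4
```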

S-box Architecture Overview

This architecture follows the standard AES structure. Encryption starts with AddRoundKey and then performs the round operations: SubBytes (using 8 × 8 look-up tables), ShiftRows, MixColumns and AddRoundKey. The last round does not include MixColumns.
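The S-box round structure can be sketched in Python. This is an illustrative software model, not the hardware design: the helper names (`xtime`, `gmul`, `build_sbox`, `sbox_round`) are hypothetical, the S-box is generated from the GF(2^8) inverse and affine map of FIPS 197 instead of being typed in, and the state is assumed to be stored column by column. The usage lines apply round 1 of the FIPS 197 Appendix B example.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    """S[a] = affine(a^-1) as in FIPS 197; 0 has no inverse and is mapped to 0 first."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):  # b ^ rotl(b,1) ^ rotl(b,2) ^ rotl(b,3) ^ rotl(b,4)
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()  # the 8 x 8 look-up table


# The state is 16 bytes stored column by column: s[r + 4*c] is row r, column c.
def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    # Row r is rotated left by r: output (r, c) takes input (r, (c + r) mod 4).
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a0, a1, a2, a3 = s[4 * c:4 * c + 4]
        out += [gmul(a0, 2) ^ gmul(a1, 3) ^ a2 ^ a3,
                a0 ^ gmul(a1, 2) ^ gmul(a2, 3) ^ a3,
                a0 ^ a1 ^ gmul(a2, 2) ^ gmul(a3, 3),
                gmul(a0, 3) ^ a1 ^ a2 ^ gmul(a3, 2)]
    return out


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def sbox_round(s, k, last=False):
    """One S-box-architecture round; the last round skips MixColumns."""
    s = shift_rows(sub_bytes(s))
    if not last:
        s = mix_columns(s)
    return add_round_key(s, k)


state = list(bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808"))
round_key = list(bytes.fromhex("a0fafe1788542cb123a339392a6c7605"))
print(bytes(sbox_round(state, round_key)).hex())  # a49c7ff2689f352b6b5bea43026a5049
```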

S-box Enc/Dec
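[Figure: S-box encryption/decryption flow. (a) Encryption: the plaintext is XORed with K0, then each round i applies SubBytes, ShiftRows, MixColumn (skipped when i = Nr) and the round key Ki, until i = Nr and the ciphertext is produced. (b) Decryption: the ciphertext is XORed with KNr, then each round applies InvShiftRows, InvSubBytes, the round key Ki and InvMixColumn (skipped in the last round), counting i down to 0, to produce the plaintext. Nr: total number of rounds.]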

T-box architecture overview

This architecture computes an entire round using only look-up tables and XOR operations. The pre-computed look-up tables represent the combined SubBytes and MixColumns transformations; ShiftRows is absorbed into the choice of table inputs. Each T-box table is 8 × 32 bits (8-bit input, 32-bit output).
Memory of the T-box tables:
- One T-box table: 256 entries × 32 bits (4 B) = 1 KB
- Four T-box tables: 4 KB (fast implementations)
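A Python sketch of how the four T tables can be pre-computed from the S-box, assuming each 32-bit entry stores row 0 in its most significant byte (a packing convention chosen here, not taken from the slides). It also checks the table size and shows one possible way to recover S[a] from T0[a] (its second byte). All function names are hypothetical.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    """AES S-box: multiplicative inverse followed by the FIPS 197 affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def word(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack a 4-byte column into 32 bits, row 0 in the most significant byte."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


def rotr8(w: int) -> int:
    """Rotate right by one byte: the last byte of the column moves to the top."""
    return ((w >> 8) | (w << 24)) & 0xFFFFFFFF


SBOX = build_sbox()
# Each entry fuses SubBytes with one column of the MixColumns matrix.
T0 = [word(gmul(s, 2), s, s, gmul(s, 3)) for s in SBOX]
T1 = [word(gmul(s, 3), gmul(s, 2), s, s) for s in SBOX]
T2 = [word(s, gmul(s, 3), gmul(s, 2), s) for s in SBOX]
T3 = [word(s, s, gmul(s, 3), gmul(s, 2)) for s in SBOX]

table_bytes = len(T0) * 32 // 8                  # 256 entries x 32 bits
print(hex(T0[0]), table_bytes, 4 * table_bytes)  # 0xc66363a5 1024 4096

# One possible "derived SubBytes": the second byte of T0[a] is S[a].
print(all((T0[a] >> 16) & 0xFF == SBOX[a] for a in range(256)))  # True
```

Because T1, T2 and T3 are byte rotations of T0, a design can also store only T0 and rotate its output, trading memory for wiring.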

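Description of T-box Tables

Mix Column Operation in AES

The 128-bit state consists of 16 bytes s0 … s15 stored column by column: the first row holds s0, s4, s8, s12, the second row holds s1, s5, s9, s13, and so on. MixColumns multiplies each column by a fixed matrix over GF(2^8). For the first column (s0, s1, s2, s3):

$$
\begin{pmatrix} s_0' \\ s_1' \\ s_2' \\ s_3' \end{pmatrix}
=
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix}
=
s_0 \begin{pmatrix} 02 \\ 01 \\ 01 \\ 03 \end{pmatrix}
\oplus s_1 \begin{pmatrix} 03 \\ 02 \\ 01 \\ 01 \end{pmatrix}
\oplus s_2 \begin{pmatrix} 01 \\ 03 \\ 02 \\ 01 \end{pmatrix}
\oplus s_3 \begin{pmatrix} 01 \\ 01 \\ 03 \\ 02 \end{pmatrix}
$$

The four coefficient columns, applied to the S-box output, define the tables T0, T1, T2 and T3.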

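T-Box Tables

$$
T_0[a] = \begin{pmatrix} 02 \cdot S[a] \\ S[a] \\ S[a] \\ 03 \cdot S[a] \end{pmatrix},\quad
T_1[a] = \begin{pmatrix} 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \\ S[a] \end{pmatrix},\quad
T_2[a] = \begin{pmatrix} S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \\ S[a] \end{pmatrix},\quad
T_3[a] = \begin{pmatrix} S[a] \\ S[a] \\ 03 \cdot S[a] \\ 02 \cdot S[a] \end{pmatrix}
$$

where S is the AES S-box and · denotes multiplication in GF(2^8).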

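For decryption, the inverse tables combine InvSubBytes (the inverse S-box S^-1) with the InvMixColumns coefficients:

$$
T_0^{-1}[a] = \begin{pmatrix} 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \end{pmatrix},\quad
T_1^{-1}[a] = \begin{pmatrix} 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \end{pmatrix},\quad
T_2^{-1}[a] = \begin{pmatrix} 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \\ 09 \cdot S^{-1}[a] \end{pmatrix},\quad
T_3^{-1}[a] = \begin{pmatrix} 09 \cdot S^{-1}[a] \\ 0D \cdot S^{-1}[a] \\ 0B \cdot S^{-1}[a] \\ 0E \cdot S^{-1}[a] \end{pmatrix}
$$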
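The inverse tables can be generated the same way from the inverse S-box. A Python sketch under an assumed packing (row 0 in the most significant byte) with hypothetical helper names; the first entry can be checked by hand, since S^-1[00] = 52 and 0E·52, 09·52, 0D·52, 0B·52 = 51, F4, A7, 50.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    """AES S-box: multiplicative inverse followed by the FIPS 197 affine map."""
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def word(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack a 4-byte column into 32 bits, row 0 in the most significant byte."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3


def rotr8(w: int) -> int:
    """Same byte rotation that relates the forward tables."""
    return ((w >> 8) | (w << 24)) & 0xFFFFFFFF


SBOX = build_sbox()
INV_SBOX = [0] * 256
for a, s in enumerate(SBOX):
    INV_SBOX[s] = a  # S^-1[S[a]] = a

# InvMixColumns coefficients applied to the inverse S-box output.
T0_INV = [word(gmul(s, 0x0E), gmul(s, 0x09), gmul(s, 0x0D), gmul(s, 0x0B)) for s in INV_SBOX]
T1_INV = [word(gmul(s, 0x0B), gmul(s, 0x0E), gmul(s, 0x09), gmul(s, 0x0D)) for s in INV_SBOX]
T2_INV = [word(gmul(s, 0x0D), gmul(s, 0x0B), gmul(s, 0x0E), gmul(s, 0x09)) for s in INV_SBOX]
T3_INV = [word(gmul(s, 0x09), gmul(s, 0x0D), gmul(s, 0x0B), gmul(s, 0x0E)) for s in INV_SBOX]

print(hex(INV_SBOX[0]), hex(T0_INV[0]))  # 0x52 0x51f4a750
```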


Round Operation Computation


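$$
e_j = T_0[a_{0,j}] \oplus T_1[a_{1,(j+c_1) \bmod 4}] \oplus T_2[a_{2,(j+c_2) \bmod 4}] \oplus T_3[a_{3,(j+c_3) \bmod 4}] \oplus K_j
$$

$$
e_j = T_0[a_{0,j}] \oplus \mathrm{RotByte}\big(T_0[a_{1,(j+c_1) \bmod 4}]\big) \oplus \mathrm{RotByte}^2\big(T_0[a_{2,(j+c_2) \bmod 4}]\big) \oplus \mathrm{RotByte}^3\big(T_0[a_{3,(j+c_3) \bmod 4}]\big) \oplus K_j
$$

where e_j = (e_{0,j}, e_{1,j}, e_{2,j}, e_{3,j}) is column j of the round output, a_{i,j} are the bytes of the round input, K_j = (k_{0,j}, k_{1,j}, k_{2,j}, k_{3,j}) is the j-th word of the round key, c_1, c_2, c_3 = 1, 2, 3 are the ShiftRows offsets, and RotByte rotates a 4-byte column by one position (the last byte moves to the top), so that T_i[a] = RotByte^i(T_0[a]).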
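To check that the table-based formula reproduces a full round, the Python sketch below computes e_j from T0 to T3 (built with RotByte as a right rotation by 8 bits of a word whose row 0 is the most significant byte) and compares it with SubBytes, ShiftRows, MixColumns and AddRoundKey computed separately, on random states. Function names and the packing convention are assumptions for illustration.

```python
import random


def xtime(b: int) -> int:
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
# T0 fuses SubBytes with the first MixColumns column; T1..T3 follow by RotByte.
T = [[(gmul(s, 2) << 24) | (s << 16) | (s << 8) | gmul(s, 3) for s in SBOX]]
for _ in range(3):
    T.append([((w >> 8) | (w << 24)) & 0xFFFFFFFF for w in T[-1]])


def tbox_round(a, k):
    """a: 16 state bytes with a[r + 4*j] = a_{r,j}; k: four 32-bit round-key words K_j."""
    out = []
    for j in range(4):
        e = k[j]
        for r in range(4):  # ShiftRows offset c_r = r
            e ^= T[r][a[r + 4 * ((j + r) % 4)]]
        out += [(e >> 24) & 0xFF, (e >> 16) & 0xFF, (e >> 8) & 0xFF, e & 0xFF]
    return out


def reference_round(a, k):
    """SubBytes, ShiftRows, MixColumns and AddRoundKey as separate steps."""
    s = [SBOX[a[r + 4 * ((c + r) % 4)]] for c in range(4) for r in range(4)]
    out = []
    for c in range(4):
        b0, b1, b2, b3 = s[4 * c:4 * c + 4]
        col = [gmul(b0, 2) ^ gmul(b1, 3) ^ b2 ^ b3,
               b0 ^ gmul(b1, 2) ^ gmul(b2, 3) ^ b3,
               b0 ^ b1 ^ gmul(b2, 2) ^ gmul(b3, 3),
               gmul(b0, 3) ^ b1 ^ b2 ^ gmul(b3, 2)]
        out += [x ^ ((k[c] >> (24 - 8 * i)) & 0xFF) for i, x in enumerate(col)]
    return out


rng = random.Random(1)
for _ in range(100):
    state = [rng.randrange(256) for _ in range(16)]
    key = [rng.getrandbits(32) for _ in range(4)]
    assert tbox_round(state, key) == reference_round(state, key)
print("T-box round matches the reference round")
```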


T-box Architecture
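[Figure: T-box architecture with a 128-bit datapath. (a) Encryption: the plaintext is XORed with K[0]; in each round the state bytes index the T tables, and an Enc XOR network combines the 32-bit table outputs with the round key Ki; the final round uses derived SubBytes and ShiftRows with K[Nr] to produce the ciphertext. (b) Decryption: the ciphertext is XORed with K[Nr]; the T-1 tables and a Dec XOR network use the inverse round key Inv Ki; the final round uses derived InvSubBytes and InvShiftRows with K0 to produce the plaintext.]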


Modified Decryption in T-box


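[Figure: (a) Standard decryption round: InvShiftRows, InvSubBytes, AddRoundKey, InvMixColumns. (b) Modified decryption round: InvSubBytes, InvShiftRows, InvMixColumns, AddRoundKey with the inverse round key. Both start from KNr.]

The two rounds are equivalent because InvSubBytes and InvShiftRows commute (one changes byte values, the other only moves bytes) and because InvMixColumns is linear over XOR: InvMixColumns(x ⊕ K) = InvMixColumns(x) ⊕ InvMixColumns(K). The inverse round key is therefore InvMixColumns applied to the ordinary round key, and the modified round has the same structure as an encryption round, so it can be computed with the T-1 tables.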
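The equivalence of the two decryption rounds can be checked numerically. In this Python sketch (hypothetical helper names, random test data), the inverse round key is taken to be InvMixColumns of the ordinary round key, and the modified round is compared with the standard one.

```python
import random


def xtime(b: int) -> int:
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for a, s in enumerate(SBOX):
    INV_SBOX[s] = a

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def inv_sub_bytes(s):
    return [INV_SBOX[b] for b in s]


def inv_shift_rows(s):
    # Row r is rotated right by r: output (r, c) takes input (r, (c - r) mod 4).
    return [s[r + 4 * ((c - r) % 4)] for c in range(4) for r in range(4)]


def inv_mix_columns(s):
    out = []
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        for row in INV_MIX:
            v = 0
            for coef, x in zip(row, col):
                v ^= gmul(coef, x)
            out.append(v)
    return out


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def standard_round(s, k):
    return inv_mix_columns(add_round_key(inv_sub_bytes(inv_shift_rows(s)), k))


def modified_round(s, inv_k):
    return add_round_key(inv_mix_columns(inv_shift_rows(inv_sub_bytes(s))), inv_k)


rng = random.Random(7)
for _ in range(100):
    state = [rng.randrange(256) for _ in range(16)]
    k = [rng.randrange(256) for _ in range(16)]
    inv_k = inv_mix_columns(k)  # inverse round key
    assert standard_round(state, k) == modified_round(state, inv_k)
print("modified round matches the standard round")
```

Precomputing `inv_k` once per round key is what allows the decryption datapath to mirror the encryption datapath.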


S-box Basic Iterative Architecture


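[Figure: S-box basic iterative architecture. The data input and round key feed an encryption circuit (SubBytes, ShiftRows, MixColumns, round-key addition) and a decryption circuit (InvSubBytes, InvShiftRows, round-key addition, InvMixColumns), with a combined SubBytes & InvSubBytes block and a register R; the result is the data output.]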

Ref: Dr Gaj and Chodowiec Publication


S-box Basic Iterative Architecture (1)

This architecture can encrypt only one block of data at a time, and the number of clock cycles needed to encrypt or decrypt one block is the number of cipher rounds Nr plus one: 11, 13 and 15 clock cycles for 128-, 192- and 256-bit keys (Nr = 10, 12 and 14). The critical path is located in the decryption circuit and includes InvShiftRows, AddRoundKey, InvMixColumns, a 3-to-1 multiplexer and InvSubBytes.
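The cycle counts follow from Nr = Nk + 6, where Nk is the key length in 32-bit words. The sketch below assumes one clock cycle per round plus one, as the 11/13/15 figures imply, and the usual basic-iterative throughput formula, throughput = 128 bits × f_clk / cycles per block. The 110 MHz clock in the usage lines is a hypothetical value, not a result from this project.

```python
BLOCK_BITS = 128


def rounds(key_bits: int) -> int:
    """Nr = Nk + 6, with Nk the key length in 32-bit words (FIPS 197)."""
    return key_bits // 32 + 6


def cycles_per_block(key_bits: int) -> int:
    # Assumption based on the slide: one cycle per round plus one.
    return rounds(key_bits) + 1


def throughput_gbps(key_bits: int, clock_mhz: float) -> float:
    """One 128-bit block every cycles_per_block cycles (no pipelining)."""
    return BLOCK_BITS * clock_mhz / cycles_per_block(key_bits) / 1000


for key_bits in (128, 192, 256):
    print(key_bits, cycles_per_block(key_bits),
          f"{throughput_gbps(key_bits, 110.0):.3f} Gbps")
# 128 11 1.280 Gbps
# 192 13 1.083 Gbps
# 256 15 0.939 Gbps
```

At a fixed clock, throughput drops with key size only because the number of rounds grows; the minimum clock period set by the critical path scales all three figures together.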


T-box Iterative architecture


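[Figure: T-box iterative architecture. The data input and round key feed an encryption unit (Enc round, then SubBytes and ShiftRows with a round key) and a decryption unit (Dec round with the inverse round key, then InvSubBytes and InvShiftRows with a round key), producing the data output.]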

Ref: Dr Gaj and Chodowiec Publication


Key Scheduling

The key scheduling unit supports all three key sizes, i.e., 128, 192 and 256 bits. It requires a key setup phase, during which the round keys are computed and stored in internal memory. The unit produces 64 bits of round-key material per clock cycle, independent of the size of the main key.
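The round-key computation itself follows the FIPS 197 key expansion. A Python sketch for all three key sizes, with the schedule then grouped into 64-bit values to mimic a unit that emits 64 bits per clock; the grouping function `in_64_bit_steps` is a hypothetical illustration of the output width only, not a model of the actual key-scheduling datapath.

```python
def xtime(b: int) -> int:
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(w: int) -> int:
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w: int) -> int:
    """Cyclic left rotation by one byte: [a0, a1, a2, a3] -> [a1, a2, a3, a0]."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key: bytes) -> list:
    """FIPS 197 key expansion for 128-, 192- or 256-bit keys (Nk = 4, 6, 8)."""
    nk = len(key) // 4
    nr = nk + 6
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        t = w[i - 1]
        if i % nk == 0:
            t = sub_word(rot_word(t)) ^ (rcon << 24)
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:  # extra SubWord for 256-bit keys
            t = sub_word(t)
        w.append(w[i - nk] ^ t)
    return w


def in_64_bit_steps(words: list) -> list:
    """Group the schedule into 64-bit values, one per (hypothetical) clock cycle."""
    return [(words[i] << 32) | words[i + 1] for i in range(0, len(words), 2)]


for key_bits in (128, 192, 256):
    words = expand_key(bytes(key_bits // 8))
    print(key_bits, len(words), "words,", len(in_64_bit_steps(words)), "64-bit steps")
# 128 44 words, 22 64-bit steps
# 192 52 words, 26 64-bit steps
# 256 60 words, 30 64-bit steps
```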


Key: Block Diagram


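[Figure: Key scheduling block diagram. A 64-bit input is split into two 32-bit words; a register chain stores the previous words Ki-1 … Ki-8; the next words Ki and Ki+1 are computed from the stored words (including Ki-Nk and Ki+1-Nk) with the Rot, Sub and Rcon operations, giving a 64-bit output.]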

Ref: Dr Gaj and Chodowiec Publication



Interface


Interface - Virtex
[Figure: AES ENC/DEC UNIT interface for Virtex. Signals: CLK, RESET, DATA_IN (128 bits), DATA_IN_WRITE, DATA_IN_READY, KEY_IN (128 bits), KEY_IN_WRITE, KEY_IN_READY, ENC/DEC, DATA_OUT (128 bits), WRITE, FULL.]


Interface - Spartan


Test Vectors

- Test vectors provided by NIST in the FIPS 197 publication
- They contain intermediate state values
- Encryption and decryption test vectors are available for all three key sizes
- Separate decryption test vectors are available for decryption using the normal round keys and using the inverse round keys
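Checking an implementation against the FIPS 197 Appendix C vectors can be sketched in Python as follows. This is a reference software model with hypothetical function names and a plain S-box round structure, not the hardware design; it encrypts the Appendix C plaintext with the three Appendix C keys (00 01 02 … for 16, 24 or 32 bytes) and compares the results with the published ciphertexts.

```python
def xtime(b: int) -> int:
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox() -> list:
    sbox = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gmul(a, x) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key: bytes) -> list:
    """FIPS 197 key expansion; returns one 16-byte round key per round."""
    nk = len(key) // 4
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        t = list(w[i - 1])
        if i % nk == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]  # RotWord, then SubWord
            t[0] ^= rcon
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            t = [SBOX[b] for b in t]
        w.append([x ^ y for x, y in zip(w[i - nk], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(nr + 1)]


def mix_columns(s):
    out = []
    for c in range(4):
        a0, a1, a2, a3 = s[4 * c:4 * c + 4]
        out += [gmul(a0, 2) ^ gmul(a1, 3) ^ a2 ^ a3,
                a0 ^ gmul(a1, 2) ^ gmul(a2, 3) ^ a3,
                a0 ^ a1 ^ gmul(a2, 2) ^ gmul(a3, 3),
                gmul(a0, 3) ^ a1 ^ a2 ^ gmul(a3, 2)]
    return out


def encrypt_block(block: bytes, key: bytes) -> bytes:
    round_keys = expand_key(key)
    nr = len(round_keys) - 1
    s = [a ^ b for a, b in zip(block, round_keys[0])]
    for rnd in range(1, nr + 1):
        # SubBytes fused with ShiftRows (state stored column by column).
        s = [SBOX[s[r + 4 * ((c + r) % 4)]] for c in range(4) for r in range(4)]
        if rnd < nr:
            s = mix_columns(s)
        s = [a ^ b for a, b in zip(s, round_keys[rnd])]
    return bytes(s)


PLAINTEXT = bytes.fromhex("00112233445566778899aabbccddeeff")
VECTORS = [(128, "69c4e0d86a7b0430d8cdb78070b4c55a"),
           (192, "dda97ca4864cdfe06eaf70a0ec0d7191"),
           (256, "8ea2b7ca516745bfeafc49904b496089")]
for key_bits, expected in VECTORS:
    result = encrypt_block(PLAINTEXT, bytes(range(key_bits // 8))).hex()
    print(key_bits, result, "OK" if result == expected else "MISMATCH")
```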


Design tools used


- Aldec Active-HDL 7.2: functional simulation
- Xilinx ISE Design Suite 10.1: synthesis and implementation


Results


Throughput (Gbps)

Architecture  Device      128-bit key  192-bit key  256-bit key
S-box         Virtex-5    1.53         1.35         1.01
S-box         Spartan-3E  0.426        0.403        0.355
T-box         Virtex-5    1.18         1.02         0.907
T-box         Spartan-3E  0.376        0.338        0.319


Throughput
[Chart: Comparison: Throughput. Throughput (Gbps) of the S-box and T-box implementations for 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan and 256_Spartan.]


Area (CLB slices)

Architecture  Device      128-bit key  192-bit key  256-bit key
S-box         Virtex-5    633          641          622
S-box         Spartan-3E  3019         2913         3019
T-box         Virtex-5    1696         1693         1686
T-box         Spartan-3E  11687        11687        11687


Area
[Chart: Comparison: Area. Area (CLB slices) of the S-box and T-box implementations for 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan and 256_Spartan.]


Throughput/Area

Architecture  Device      128-bit key  192-bit key  256-bit key
S-box         Virtex-5    2415.910     2104.060     1618.846
S-box         Spartan-3E  376.96       354.65       317.82
T-box         Virtex-5    693.113      602.721      538.038
T-box         Spartan-3E  32.15        28.90        27.27
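The percentages and ratios used to compare the two architectures can be recomputed from the three result tables. A small Python sketch; the dictionaries simply copy the table values, and the helper name `ratios` is hypothetical.

```python
# Values copied from the result tables (key sizes 128, 192, 256 bits).
THROUGHPUT = {  # Gbps
    ("S-box", "Virtex-5"): (1.53, 1.35, 1.01),
    ("S-box", "Spartan-3E"): (0.426, 0.403, 0.355),
    ("T-box", "Virtex-5"): (1.18, 1.02, 0.907),
    ("T-box", "Spartan-3E"): (0.376, 0.338, 0.319),
}
AREA = {  # CLB slices
    ("S-box", "Virtex-5"): (633, 641, 622),
    ("S-box", "Spartan-3E"): (3019, 2913, 3019),
    ("T-box", "Virtex-5"): (1696, 1693, 1686),
    ("T-box", "Spartan-3E"): (11687, 11687, 11687),
}
TP_PER_AREA = {
    ("S-box", "Virtex-5"): (2415.910, 2104.060, 1618.846),
    ("S-box", "Spartan-3E"): (376.96, 354.65, 317.82),
    ("T-box", "Virtex-5"): (693.113, 602.721, 538.038),
    ("T-box", "Spartan-3E"): (32.15, 28.90, 27.27),
}


def ratios(table, device):
    """S-box value divided by T-box value, per key size."""
    s, t = table[("S-box", device)], table[("T-box", device)]
    return [x / y for x, y in zip(s, t)]


for device in ("Virtex-5", "Spartan-3E"):
    gain = ", ".join(f"{(r - 1) * 100:.0f}%" for r in ratios(THROUGHPUT, device))
    area = ", ".join(f"{1 / r:.1f}x" for r in ratios(AREA, device))
    tpa = ", ".join(f"{r:.1f}x" for r in ratios(TP_PER_AREA, device))
    print(f"{device}: throughput gain {gain}; T-box/S-box area {area}; "
          f"S-box/T-box throughput/area {tpa}")
# Virtex-5: throughput gain 30%, 32%, 11%; T-box/S-box area 2.7x, 2.6x, 2.7x; S-box/T-box throughput/area 3.5x, 3.5x, 3.0x
# Spartan-3E: throughput gain 13%, 19%, 11%; T-box/S-box area 3.9x, 4.0x, 3.9x; S-box/T-box throughput/area 11.7x, 12.3x, 11.7x
```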


Throughput/Area
[Chart: Comparison: Throughput/Area. Throughput/area ratio of the S-box and T-box implementations for 128_Virtex, 192_Virtex, 256_Virtex, 128_Spartan, 192_Spartan and 256_Spartan.]


Problems encountered

- We were unable to map the T tables to BRAMs: by default, the tool implemented the tables in logic instead of BRAMs.
- The T-box architectures may have higher latency because the inverse round keys are calculated on the fly.


Conclusion

Our S-box implementations perform better than the T-box implementations. The area of the T-box implementations is about 2.6 to 2.7 times that of the S-box implementations on Virtex-5 and nearly four times on Spartan-3E.


Conclusion (2)

Compared with the corresponding T-box implementations, the S-box implementations achieve about 30%, 32% and 11% higher throughput on Virtex-5 for 128-, 192- and 256-bit keys (about 13%, 19% and 11% on Spartan-3E). The throughput/area of the S-box implementations is about 3 to 3.5 times that of the corresponding T-box implementations on Virtex-5 and more than 10 times on Spartan-3E.


Scope for future work

- Implement the T-box architectures so that BRAMs store the T-table values.
- Apply partial or complete loop unrolling to the S-box architectures to further increase throughput.
- For the T-box implementations, precompute the inverse round keys and store them in memory, which may reduce the minimum clock period.


Questions?

