CS252 Graduate Computer Architecture Error Correction Codes April 20, 2011
$$\text{Size} = \sum_{e=0}^{(d-1)/2} \binom{n}{e}$$
$$2^k \sum_{e=0}^{(d-1)/2} \binom{n}{e} \le 2^n$$
$$2^k (1 + n) \le 2^n, \quad d = 3$$
4/20/2011 cs252-S11, Lecture 23 5
How to Generate code words?
Consider a linear code. Need a Generator Matrix.
Let $v_i$ be the data value (k bits), $C_i$ be resulting code (n bits):
$$C_i = G \cdot v_i$$
G must be an n×k matrix
Are there $2^k$ unique code values?
Only if the k columns of G are linearly independent!
Of course, need some way of decoding as well:
$$v_i = f_d(C_i')$$
Is this linear??? Why or why not?
A code is systematic if the data is directly encoded within the code words.
Means Generator has form:
$$G = \begin{pmatrix} I \\ P \end{pmatrix}$$
Can always turn non-systematic code into a systematic one (row ops)
But what is the distance of the code? Not obvious!
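A minimal sketch of encoding with a systematic generator matrix over GF(2). The parity block `P` and the helper names `systematic_generator` and `encode` are hypothetical, chosen only to illustrate $C_i = G \cdot v_i$; they are not taken from the lecture.

```python
def systematic_generator(P):
    """Stack a k x k identity on top of the (n-k) x k parity block P."""
    k = len(P[0])
    identity = [[int(r == c) for c in range(k)] for r in range(k)]
    return identity + [list(row) for row in P]


def encode(G, v):
    """Compute C = G . v over GF(2); G has n rows of k bits, v has k bits."""
    return [sum(g * b for g, b in zip(row, v)) % 2 for row in G]


# Hypothetical parity block: yields a (5, 3) systematic code.
P = [[1, 1, 0],
     [0, 1, 1]]
G = systematic_generator(P)
print(encode(G, [1, 0, 1]))  # [1, 0, 1, 1, 1]: data bits first, then parity
```

Because the identity block copies the data into the code word, the $2^k$ data values necessarily map to $2^k$ distinct code words.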
Implicitly Defining Codes by Check Matrix
Consider a parity-check matrix H ([n-k]×n)
Define valid code words $C_i$ as those that give $S_i = 0$ (null space of H):
$$S_i = H \cdot C_i = 0$$
Size of null space?
(null-rank H) = k if (n-k) linearly independent columns in H
Suppose we transmit code word C with error:
Model this as vector E which flips selected bits of C to get R (received):
$$R = C \oplus E$$
Consider what happens when we multiply by H:
$$S = H \cdot R = H \cdot (C \oplus E) = H \cdot E$$
What is distance of code?
Code has distance d if no sum of d-1 or fewer columns yields 0
I.e. no non-zero error vector E of weight < d has a zero syndrome
So code design is designing the H matrix
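The syndrome identity and the column-sum definition of distance can be checked numerically. This is a sketch under assumptions: the small check matrix `H` below is hypothetical, and `min_distance` is a brute-force search that is only practical for tiny codes.

```python
from itertools import combinations

# Hypothetical (5, 3) check matrix of the form (P | I).
H = [[1, 1, 0, 1, 0],
     [0, 1, 1, 0, 1]]


def syndrome(H, word):
    """Compute S = H . word over GF(2)."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def min_distance(H):
    """Smallest number of columns of H that sum to zero (brute force)."""
    n = len(H[0])
    for weight in range(1, n + 1):
        for cols in combinations(range(n), weight):
            if all(sum(row[c] for c in cols) % 2 == 0 for row in H):
                return weight
    return None


C = [1, 0, 1, 1, 1]                 # a valid code word: H . C = 0
E = [0, 1, 0, 0, 0]                 # error vector that flips bit 1
R = [c ^ e for c, e in zip(C, E)]   # received word
print(syndrome(H, C), syndrome(H, R), syndrome(H, E))  # [0, 0] [1, 1] [1, 1]
print(min_distance(H))              # 2: columns 0 and 3 are identical
```

The received word and the bare error vector give the same syndrome, which is why the decoder never needs to know which code word was sent.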
How to relate G and H (Binary Codes)
Defining H makes it easy to understand distance of code, but hard to generate code (H defines code implicitly!)
However, let H be of following form:
$$H = (P \mid I)$$
P is (n-k)×k, I is (n-k)×(n-k)
Result: H is (n-k)×n
Then, G can be of following form (maximal code size):
$$G = \begin{pmatrix} I \\ P \end{pmatrix}$$
P is (n-k)×k, I is k×k
Result: G is n×k
Notice: G generates values in null-space of H and has k independent columns so generates $2^k$ unique values:
$$S_i = H \cdot G \cdot v_i = (P \mid I) \begin{pmatrix} I \\ P \end{pmatrix} v_i = 0$$
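A quick numerical check that $H \cdot G = 0$ for this pair of forms. The parity block `P` and the helper names are assumptions made for illustration; any binary `P` should give the same all-zero product, since the product reduces to $P + P$ over GF(2).

```python
def identity(m):
    return [[int(r == c) for c in range(m)] for r in range(m)]


def build_H(P):
    """H = (P | I): append an (n-k) x (n-k) identity to the right of P."""
    eye = identity(len(P))
    return [list(row) + eye[i] for i, row in enumerate(P)]


def build_G(P):
    """G = (I over P): stack a k x k identity on top of P."""
    return identity(len(P[0])) + [list(row) for row in P]


def matmul_gf2(A, B):
    """Matrix product with every entry reduced modulo 2."""
    return [[sum(a * b for a, b in zip(row, col)) % 2 for col in zip(*B)]
            for row in A]


# Hypothetical 3 x 4 parity block, so n = 7 and k = 4.
P = [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
H, G = build_H(P), build_G(P)
print(matmul_gf2(H, G))  # every entry is 0 because P + P = 0 over GF(2)
```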
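Simple example (Parity, d=2)
Parity code (8-bits):
$$H = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$
$$G = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{pmatrix}$$
Note: Complexity of logic depends on number of 1s in row!
[Figure: data bits $v_7 \dots v_0$ feed an XOR (+) that produces check bit $c_8$; received bits $C_8 \dots C_0$ feed an XOR (+) that produces syndrome bit $s_0$]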
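A minimal sketch of the 8-bit parity code as a pair of XOR reductions. The function names are hypothetical, and placing the check bit last is an assumption about bit ordering.

```python
def parity_encode(data_bits):
    """Append one check bit so the whole code word has even parity."""
    return list(data_bits) + [sum(data_bits) % 2]


def parity_syndrome(code_bits):
    """XOR of all received bits: 0 for a valid word, 1 after an odd number of flips."""
    return sum(code_bits) % 2


word = parity_encode([1, 0, 1, 1, 0, 0, 1, 0])
one_flip = list(word)
one_flip[3] ^= 1
two_flips = list(one_flip)
two_flips[5] ^= 1
print(parity_syndrome(word), parity_syndrome(one_flip), parity_syndrome(two_flips))  # 0 1 0
```

The double flip goes unnoticed, which is the practical meaning of d=2: one error is detected, two are not.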
Simple example: Repetition (voting, d=3)
Repetition code (1-bit):
$$G = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \qquad H = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$
[Figure: circuit with signals $v_0$, $C_0$, $C_1$, $C_2$ and an Error output]
Positives: simple
Negatives:
Expensive: only 33% of code word is data
Not packed in Hamming-bound sense (only d=3). Could get much more efficient coding by encoding multiple bits at a time
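The voting decoder can be sketched in a few lines. The names `rep_encode` and `rep_decode` are hypothetical, and the error flag here simply reports disagreement among the three copies.

```python
def rep_encode(bit):
    """Send the same bit three times."""
    return [bit, bit, bit]


def rep_decode(received):
    """Majority vote over three copies; also report whether any copy disagreed."""
    value = 1 if sum(received) >= 2 else 0
    error = len(set(received)) > 1
    return value, error


print(rep_decode(rep_encode(1)))  # (1, False)
print(rep_decode([1, 0, 1]))      # (1, True): single error corrected
```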
Binary Hamming code meets Hamming bound
Recall bound for d=3:
$$2^k (1 + n) \le 2^n$$
So, rearranging:
$$(1 + n) \le 2^{n-k}$$
$$k \le 2^c - (c + 1), \quad c = n - k$$
Thus, for:
c=2 check bits, k ≤ 1 (Repetition code)
c=3 check bits, k ≤ 4
c=4 check bits, k ≤ 11, use k=8?
c=5 check bits, k ≤ 26, use k=16?
c=6 check bits, k ≤ 57, use k=32?
c=7 check bits, k ≤ 120, use k=64?
H matrix consists of all unique, non-zero vectors
There are $2^c - 1$ vectors, c used for parity, so remaining $2^c - c - 1$
Example: Hamming Code (d=3)
$$H = \begin{pmatrix}
1 & 0 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
G = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0
\end{pmatrix}$$
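A sketch of a (7,4) Hamming encoder and single-error corrector, plus the bound $k \le 2^c - c - 1$. The parity block `P` is one assumed ordering of the columns in the H = (P | I), G = (I over P) form; other orderings give equivalent codes. All function names are hypothetical.

```python
# Assumed parity block; H = (P | I) then contains all 7 non-zero 3-bit columns.
P = [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
H = [row + [int(i == j) for j in range(3)] for i, row in enumerate(P)]


def max_data_bits(c):
    """Largest k allowed by the d=3 Hamming bound with c check bits."""
    return 2**c - c - 1


def encode(data):
    """Systematic encoding: 4 data bits followed by 3 check bits."""
    checks = [sum(p * d for p, d in zip(row, data)) % 2 for row in P]
    return list(data) + checks


def correct(received):
    """Flip the bit whose H column equals the syndrome, then return the data."""
    s = tuple(sum(h * r for h, r in zip(row, received)) % 2 for row in H)
    fixed = list(received)
    if any(s):
        fixed[list(zip(*H)).index(s)] ^= 1
    return fixed[:4]


print([max_data_bits(c) for c in range(2, 8)])  # [1, 4, 11, 26, 57, 120]
word = encode([1, 0, 1, 1])
word[2] ^= 1                                    # inject a single-bit error
print(correct(word))                            # [1, 0, 1, 1]
```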
Example, d=4 code (SEC-DED)
Design H with:
All columns non-zero, odd-weight, distinct
Note that odd-weight refers to Hamming weight, i.e. the number of ones
Why does this generate d=4?
Any single bit error will generate a distinct, non-zero value
Any double error will generate a non-zero value that differs from every single-error value
Why? Add together two distinct odd-weight columns, get a non-zero, even-weight result
Any triple error will generate a non-zero value
Why? Add together three odd-weight values, get an odd-weight value
So: need four errors before indistinguishable from code word
Because d=4:
Can correct 1 error (Single Error Correction, i.e. SEC)
Can detect 2 errors (Double Error Detection, i.e. DED)
Example:
$$\begin{pmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \\ C_6 \\ C_7 \end{pmatrix}$$
Note: log size of nullspace will be (columns − rank) = 4, so:
Rank = 4, since rows independent, 4 cols independent
Clearly, 8 bits in code word
Thus: (8,4) code
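The SEC-DED decision rule can be sketched as follows, assuming an (8,4) check matrix in the H = (P | I) form whose columns are all distinct and odd-weight. The column ordering, the status strings and the function names are assumptions for illustration.

```python
# Assumed parity block: every column of H = (P | I) has odd weight.
P = [[1, 1, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]]
H = [row + [int(i == j) for j in range(4)] for i, row in enumerate(P)]
COLUMNS = list(zip(*H))


def encode(data):
    """Systematic encoding: 4 data bits followed by 4 check bits."""
    return list(data) + [sum(p * d for p, d in zip(row, data)) % 2 for row in P]


def decode(received):
    """Return (status, data): correct single errors, flag double errors."""
    s = tuple(sum(h * r for h, r in zip(row, received)) % 2 for row in H)
    if not any(s):
        return "ok", list(received[:4])
    if s in COLUMNS:                    # odd-weight syndrome matching one column
        fixed = list(received)
        fixed[COLUMNS.index(s)] ^= 1
        return "corrected", fixed[:4]
    return "double error detected", None  # non-zero syndrome matching no column


word = encode([1, 0, 0, 1])
word[1] ^= 1
print(decode(word))   # ('corrected', [1, 0, 0, 1])
word[6] ^= 1
print(decode(word))   # ('double error detected', None)
```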
Tweaks:
No reason cannot make code shorter than required
Suppose n-k=8 bits of parity. What is max code size (n) for d=4?
Maximum number of unique, odd-weight columns: $2^7 = 128$
So, n = 128. But, then k = n − (n − k) = 120. Weird!
Just throw out columns of high weight and make (72, 64) code!
Circuit optimization: if throwing out column vectors, pick ones of highest weight (# bits=1) to simplify circuit
But shortened codes like this might have d > 4 in some special directions
Example: Kaneda paper, catches failures of groups of 4 bits
Good for catching chip failures when DRAM has groups of 4 bits
What about EVENODD code?
Can be used to handle two erasures
What about two dead DRAMs? Yes, if you can really know they are dead
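The column-counting argument can be worked through with binomial coefficients. This is only a sketch: `pick_lowest_weight` is a hypothetical helper that keeps the lightest odd-weight columns first, which is one way to read "throw out columns of high weight"; a real design would also balance the ones across rows.

```python
from math import comb


def odd_weight_counts(c):
    """Number of distinct c-bit columns for each odd Hamming weight."""
    return {w: comb(c, w) for w in range(1, c + 1, 2)}


def pick_lowest_weight(c, n):
    """Choose n odd-weight columns, taking the lowest weights first."""
    chosen, remaining = {}, n
    for weight, count in sorted(odd_weight_counts(c).items()):
        take = min(count, remaining)
        if take:
            chosen[weight] = take
        remaining -= take
    return chosen


print(odd_weight_counts(8))                # {1: 8, 3: 56, 5: 56, 7: 8}
print(sum(odd_weight_counts(8).values()))  # 128
print(pick_lowest_weight(8, 72))           # {1: 8, 3: 56, 5: 8}
```

Under this selection a (72, 64) code needs only eight columns of weight 5 and none of weight 7.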
How to correct errors?
Consider a parity-check matrix H ([n-k]×n)
Compute the following syndrome $S_i$ given code element $C_i$:
$$S_i = H \cdot C_i = H \cdot E$$
Suppose that two correctable error vectors $E_1$ and $E_2$ produce same syndrome:
$$H \cdot E_1 = H \cdot E_2 \;\Rightarrow\; H \cdot (E_1 + E_2) = 0 \;\Rightarrow\; E_1 + E_2 \text{ has d or more bits set}$$
But, since both $E_1$ and $E_2$ have ≤ (d-1)/2 bits set, $E_1 + E_2$ has ≤ d-1 bits set, so this conclusion cannot be true!
So, syndrome is unique indicator of correctable error vectors
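The uniqueness argument can be checked by building a syndrome lookup table for every error of weight at most t and watching for collisions. The check matrix below is an assumed (7,4) Hamming matrix with d=3, and `syndrome_table` is a hypothetical helper.

```python
from itertools import combinations

# Assumed (7,4) Hamming check matrix with d = 3, so t = (d-1)/2 = 1.
H = [[1, 0, 1, 1, 1, 0, 0],
     [1, 1, 0, 1, 0, 1, 0],
     [1, 1, 1, 0, 0, 0, 1]]


def syndrome_table(H, t):
    """Map each syndrome to the error positions (weight <= t) that cause it."""
    n = len(H[0])
    table = {}
    for weight in range(t + 1):
        for positions in combinations(range(n), weight):
            s = tuple(sum(row[p] for p in positions) % 2 for row in H)
            if s in table:
                raise ValueError(f"errors {table[s]} and {positions} share a syndrome")
            table[s] = positions
    return table


table = syndrome_table(H, 1)
print(len(table))  # 8: the zero syndrome plus one per single-bit error
```

Calling `syndrome_table(H, 2)` raises `ValueError`, because weight-2 errors exceed (d-1)/2 for this code and collide with single-bit syndromes.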
Galois Field
Definition: Field: a complete group of elements with:
Addition, subtraction, multiplication, division
Completely closed under these operations
Every element has an additive inverse
Every element except zero has a multiplicative inverse
Examples:
Real numbers
Binary, called GF(2) : Galois Field with base 2
Values 0, 1. Addition/subtraction: use xor. Multiplicative inverse of 1 is 1
Prime field, GF(p) : Galois Field with base p
Values 0 … p-1
Addition/subtraction/multiplication: modulo p
Multiplicative Inverse: every value except 0 has inverse
Example: GF(5): 1·1 ≡ 1 mod 5, 2·3 ≡ 1 mod 5, 4·4 ≡ 1 mod 5
General Galois Field: GF($p^m$) : base p (prime!), dimension m
Values are vectors of elements of GF(p) of dimension m
Add/subtract: vector addition/subtraction
Multiply/divide: more complex
Just like real numbers but finite!
Common for computer algorithms: GF($2^m$)
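For a prime field, multiplicative inverses can be found with Fermat's little theorem ($a^{p-2} \equiv a^{-1} \pmod p$). This sketch uses a hypothetical helper name and assumes p is prime; it does not check primality.

```python
def inverses(p):
    """Multiplicative inverse of every non-zero element of GF(p), p prime."""
    return {a: pow(a, p - 2, p) for a in range(1, p)}


print(inverses(5))  # {1: 1, 2: 3, 3: 2, 4: 4}
```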
Specific Example: Galois Fields GF($2^n$)
Consider polynomials whose coefficients come from GF(2).
Each term of the form $x^n$ is either present or absent.
Examples: 0, 1, x, $x^2$, and $x^7 + x^6 + 1$
$= 1 \cdot x^7 + 1 \cdot x^6 + 0 \cdot x^5 + 0 \cdot x^4 + 0 \cdot x^3 + 0 \cdot x^2 + 0 \cdot x^1 + 1 \cdot x^0$
With addition and multiplication these form a ring (not quite a field: still missing division):
Add: XOR each element individually with no carry:
$$(x^4 + x^3 + x + 1) + (x^4 + x^2 + x) = x^3 + x^2 + 1$$
Multiply: multiplying by x is like shifting to the left.
$$(x^2 + x + 1)(x + 1) = (x^2 + x + 1) + (x^3 + x^2 + x) = x^3 + 1$$
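Both operations map directly onto integer bit operations if bit i of an integer stores the coefficient of $x^i$. That encoding and the function names are assumptions of this sketch.

```python
def poly_add(a, b):
    """Add two GF(2) polynomials: coefficient-wise XOR, no carries."""
    return a ^ b


def poly_mul(a, b):
    """Multiply two GF(2) polynomials by shift-and-XOR (no modular reduction)."""
    result = 0
    while b:
        if b & 1:          # this power of x is present in b
            result ^= a
        a <<= 1            # multiply a by x
        b >>= 1
    return result


print(bin(poly_add(0b11011, 0b10110)))  # 0b1101  -> x^3 + x^2 + 1
print(bin(poly_mul(0b111, 0b11)))       # 0b1001  -> x^3 + 1
```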
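So what about division (mod)
$$\frac{x^4 + x^2}{x} = x^3 + x \text{ with remainder } 0$$
$$\frac{x^4 + x^2 + 1}{x + 1} = x^3 + x^2 \text{ with remainder } 1$$
Long division of $x^4 + 0x^3 + x^2 + 0x + 1$ by $x + 1$:
Quotient term $x^3$: subtract $x^4 + x^3$, leaving $x^3 + x^2$
Quotient term $x^2$: subtract $x^3 + x^2$, leaving $0x^2 + 0x$
Quotient term $0x$: subtract $0$, leaving $0x + 1$
Quotient term $0$: subtract $0$
Remainder 1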
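Long division over GF(2) can be sketched with shifts and XORs, again storing the coefficient of $x^i$ in bit i of an integer. The function name and the encoding are assumptions of this example.

```python
def poly_divmod(dividend, divisor):
    """Divide GF(2) polynomials; return (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        shift = dividend.bit_length() - width
        quotient |= 1 << shift          # next quotient term is x^shift
        dividend ^= divisor << shift    # subtract (XOR) the shifted divisor
    return quotient, dividend


print(poly_divmod(0b10100, 0b10))  # (10, 0): x^3 + x, remainder 0
print(poly_divmod(0b10101, 0b11))  # (12, 1): x^3 + x^2, remainder 1
```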
Producing Galois Fields
These polynomials form a Galois (finite) field if we take the results of this multiplication modulo a prime polynomial p(x)
A prime polynomial cannot be written as product of two non-trivial polynomials q(x)r(x)
For any degree, there exists at least one prime polynomial.
With it we can form GF($2^n$)
Every Galois field has a primitive element, α, such that all non-zero elements of the field can be expressed as a power of α
Certain choices of p(x) make the simple polynomial x the primitive element. These polynomials are called primitive
For example, $x^4 + x + 1$ is primitive. So α = x is a primitive element and successive powers of α will generate all non-zero elements of GF(16).
Example on next slide.
Galois Fields with primitive $x^4 + x + 1$
$α^0 = 1$
$α^1 = x$
$α^2 = x^2$
$α^3 = x^3$
$α^4 = x + 1$
$α^5 = x^2 + x$
$α^6 = x^3 + x^2$
$α^7 = x^3 + x + 1$
$α^8 = x^2 + 1$
$α^9 = x^3 + x$
$α^{10} = x^2 + x + 1$
$α^{11} = x^3 + x^2 + x$
$α^{12} = x^3 + x^2 + x + 1$
$α^{13} = x^3 + x^2 + 1$
$α^{14} = x^3 + 1$
$α^{15} = 1$
Primitive element α = x in GF($2^n$)
$α^4 = x^4 \bmod (x^4 + x + 1) = x^4 \text{ xor } (x^4 + x + 1) = x + 1$
In general finding primitive polynomials is difficult. Most people just look them up in a table, such as:
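The table of powers can be regenerated by repeated shift-and-reduce. In this sketch, bit i of an integer holds the coefficient of $x^i$, so $x^4 + x + 1$ is `0b10011`; the function name is hypothetical.

```python
def powers_of_x(poly, n):
    """Return [x^0, x^1, ..., x^(2^n - 2)] reduced modulo poly (degree n)."""
    elements = []
    value = 1
    for _ in range(2**n - 1):
        elements.append(value)
        value <<= 1                  # multiply by x
        if value & (1 << n):         # degree reached n: reduce modulo poly
            value ^= poly
    return elements


print(powers_of_x(0b10011, 4))
# [1, 2, 4, 8, 3, 6, 12, 11, 5, 10, 7, 14, 15, 13, 9]
```

Reading each integer in binary gives the polynomial, for example 11 = `0b1011` is $x^3 + x + 1$. All 15 non-zero elements of GF(16) appear exactly once.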
Primitive Polynomials
$x^2 + x + 1$
$x^3 + x + 1$
$x^4 + x + 1$
$x^5 + x^2 + 1$
$x^6 + x + 1$
$x^7 + x^3 + 1$
$x^8 + x^4 + x^3 + x^2 + 1$
$x^9 + x^4 + 1$
$x^{10} + x^3 + 1$
$x^{11} + x^2 + 1$
$x^{12} + x^6 + x^4 + x + 1$
$x^{13} + x^4 + x^3 + x + 1$
$x^{14} + x^{10} + x^6 + x + 1$
$x^{15} + x + 1$
$x^{16} + x^{12} + x^3 + x + 1$
$x^{17} + x^3 + 1$
$x^{18} + x^7 + 1$
$x^{19} + x^5 + x^2 + x + 1$
$x^{20} + x^3 + 1$
$x^{21} + x^2 + 1$
$x^{22} + x + 1$
$x^{23} + x^5 + 1$
$x^{24} + x^7 + x^2 + x + 1$
$x^{25} + x^3 + 1$
$x^{26} + x^6 + x^2 + x + 1$
$x^{27} + x^5 + x^2 + x + 1$
$x^{28} + x^3 + 1$
$x^{29} + x + 1$
$x^{30} + x^6 + x^4 + x + 1$
$x^{31} + x^3 + 1$
$x^{32} + x^7 + x^6 + x^2 + 1$
Galois Field Hardware
Multiplication by x ⇔ shift left
Taking the result mod p(x) ⇔ XOR-ing with the coefficients of p(x) when the most significant coefficient is 1.
Obtaining all $2^n - 1$ non-zero elements by evaluating $x^k$ for k = 1, …, $2^n - 1$ ⇔ Shifting and XOR-ing $2^n - 1$ times.
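The shift-and-XOR procedure also gives a brute-force primitivity test for small degrees: a degree-n polynomial is primitive exactly when x first returns to 1 after $2^n - 1$ steps. This is a sketch with hypothetical function names, using bit i for the coefficient of $x^i$; the loop runs up to $2^n$ times, so it is not suitable for the larger entries of the table.

```python
def order_of_x(poly, n):
    """Number of multiply-by-x steps until the value returns to 1 (None if never)."""
    value = 1
    for step in range(1, 2**n):
        value <<= 1                  # multiply by x: shift left
        if (value >> n) & 1:         # top coefficient set: XOR with p(x)
            value ^= poly
        if value == 1:
            return step
    return None


def is_primitive(poly, n):
    """True when the powers of x visit all 2^n - 1 non-zero elements."""
    return order_of_x(poly, n) == 2**n - 1


print(is_primitive(0b10011, 4))      # True:  x^4 + x + 1
print(is_primitive(0b11111, 4))      # False: x^4 + x^3 + x^2 + x + 1, x has order 5
print(is_primitive(0b100011101, 8))  # True:  x^8 + x^4 + x^3 + x^2 + 1
```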
Reed-Solomon Codes
Galois field codes: code words consist of symbols
Rather than bits
Reed-Solomon codes:
Based on polynomials in GF($2^k$) (I.e. k-bit symbols)
Data as coefficients, code space as values of polynomial:
$P(x) = a_0 + a_1 x^1 + \dots + a_{k-1} x^{k-1}$
Coded: P(0), P(1), P(2), …, P(n-1)
Can recover polynomial as long as get any k of n
Properties: can choose number of check symbols
Reed-Solomon codes are maximum distance separable (MDS)
Can add d symbols for distance d+1 code
Often used in erasure code mode: as long as no more than n-k coded symbols erased, can recover data
Side note: Multiplication by constant in GF($2^k$) can be represented by k×k matrix: a·x
Decompose unknown vector into k bits: $x = x_0 + 2x_1 + \dots + 2^{k-1} x_{k-1}$
Each column is result of multiplying a by $2^i$
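The side note about constant multiplication can be sketched for GF($2^4$). The field polynomial $x^4 + x + 1$, the bit-mask representation of matrix columns and all function names are assumptions of this example.

```python
POLY, K = 0b10011, 4   # assumed field: GF(2^4) built from x^4 + x + 1


def gf_mul(a, b):
    """Multiply two field elements with shift-and-XOR, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << K):
            a ^= POLY
    return result


def mul_matrix(a):
    """Columns of the K x K binary matrix for 'multiply by a': column i is a * 2^i."""
    return [gf_mul(a, 1 << i) for i in range(K)]


def apply_matrix(columns, x):
    """Matrix-vector product over GF(2): XOR the columns selected by the bits of x."""
    out = 0
    for i, column in enumerate(columns):
        if (x >> i) & 1:
            out ^= column
    return out


columns = mul_matrix(0b0110)
print(apply_matrix(columns, 0b1011) == gf_mul(0b0110, 0b1011))  # True
```

Since multiplication by a constant is linear over GF(2), a whole Reed-Solomon encoder can be expanded into one large binary matrix of XORs.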
Reed-Solomon Codes (cont)
Reed-Solomon codes (Non-systematic):
Data as coefficients, code space as values of polynomial:
$P(x) = a_0 + a_1 x^1 + \dots + a_6 x^6$
Coded: P(0), P(1), P(2), …, P(6)
Called Vandermonde Matrix: maximum rank
$$G \cdot a = \begin{pmatrix}
1^0 & 1^1 & 1^2 & 1^3 & 1^4 \\
2^0 & 2^1 & 2^2 & 2^3 & 2^4 \\
3^0 & 3^1 & 3^2 & 3^3 & 3^4 \\
4^0 & 4^1 & 4^2 & 4^3 & 4^4 \\
5^0 & 5^1 & 5^2 & 5^3 & 5^4 \\
6^0 & 6^1 & 6^2 & 6^3 & 6^4 \\
7^0 & 7^1 & 7^2 & 7^3 & 7^4
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}$$
Different representation (This H' and G not related)
$$H' = \begin{pmatrix}
1^0 & 2^0 & 3^0 & 4^0 & 5^0 & 6^0 & 7^0 \\
1^1 & 2^1 & 3^1 & 4^1 & 5^1 & 6^1 & 7^1
\end{pmatrix}$$
Clear that all combinations of two or fewer columns are independent ⇒ d=3
Very easy to pick whatever d you happen to want: add more rows
Fast, Systematic version of Reed-Solomon:
Cauchy Reed-Solomon, others
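A minimal sketch of non-systematic Reed-Solomon used in erasure mode: encode by evaluating the data polynomial, then recover the coefficients from any k surviving symbols by solving the Vandermonde system. The field GF($2^4$) with $x^4 + x + 1$, the evaluation points 1..n, and every function name are assumptions for illustration; production codes use faster decoders.

```python
POLY, M = 0b10011, 4   # assumed field: GF(2^4) built from x^4 + x + 1


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result


def gf_pow(x, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, x)
    return result


def gf_inv(a):
    """Inverse of a non-zero element: a^(2^M - 2)."""
    return gf_pow(a, 2**M - 2)


def rs_encode(data, n):
    """Evaluate P(x) = sum(data[j] * x^j) at the points 1..n."""
    code = []
    for point in range(1, n + 1):
        value = 0
        for j, coeff in enumerate(data):
            value ^= gf_mul(coeff, gf_pow(point, j))
        code.append(value)
    return code


def rs_recover(survivors, k):
    """Solve the Vandermonde system for k (point, value) pairs by Gauss-Jordan."""
    rows = [[gf_pow(p, j) for j in range(k)] + [v] for p, v in survivors[:k]]
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, e) for e in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [e ^ gf_mul(factor, c) for e, c in zip(rows[r], rows[col])]
    return [rows[j][k] for j in range(k)]


data = [3, 0, 7, 12, 5]                      # k = 5 data symbols
code = rs_encode(data, 7)                    # n = 7 coded symbols
survivors = [(i + 1, v) for i, v in enumerate(code) if i not in (1, 4)]
print(rs_recover(survivors, 5) == data)      # True
```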
Aside: Why erasure coding?
High Durability/overhead ratio!
Exploit law of large numbers for durability!
6 month repair, FBLPY:
Replication: 0.03
Fragmentation: $10^{-35}$
[Figure: Fraction Blocks Lost Per Year (FBLPY)]
Statistical Advantage of Fragments
Latency and standard deviation reduced:
Memory-less latency model
Rate 1/2 code with 32 total fragments
[Figure: Time to Coalesce vs. Fragments Requested (TI5000); x-axis: Objects Requested (16 to 31); y-axis: Latency (0 to 180)]
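The latency benefit can be illustrated under a simple memoryless assumption: each requested fragment arrives after an independent exponential delay, and reconstruction finishes when the 16th fragment arrives. This is a sketch of that model only, not of the measurement in the figure; the function name and the unit mean latency are assumptions.

```python
def expected_time_to_coalesce(requested, needed=16, mean_latency=1.0):
    """Expected arrival time of the `needed`-th of `requested` i.i.d. exponential delays."""
    if requested < needed:
        raise ValueError("must request at least as many fragments as are needed")
    # Order statistics of exponentials: successive gaps have means 1/r, 1/(r-1), ...
    return mean_latency * sum(1.0 / (requested - i) for i in range(needed))


for requested in (16, 20, 24, 31):
    print(requested, round(expected_time_to_coalesce(requested), 3))
```

In this model the expected wait falls steadily as more than the minimum 16 fragments are requested, because the slowest responses no longer sit on the critical path.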
Conclusion
ECC: add redundancy to correct for errors
(n,k,d) ⇒ n code bits, k data bits, distance d
Linear codes: code vectors computed by linear transformation
Erasure code: after identifying erasures, can correct
Reed-Solomon codes
Based on GF($p^n$), often GF($2^n$)
Easy to get distance d+1 code with d extra symbols
Often used in erasure mode