Data Representation
• This chapter describes the various ways in which computers can store and manipulate
numbers and characters.
• Bit: The most basic unit of information in a digital computer is called a bit, which is a
contraction of binary digit.
• Byte: In 1964, the designers of the IBM System/360 mainframe computer established
a convention of using groups of 8 bits as the basic unit of addressable computer
storage. They called this collection of 8 bits a byte.
• Word: Computer words consist of two or more adjacent bytes that are sometimes
addressed and almost always are manipulated collectively. The word size represents
the data size that is handled most efficiently by a particular architecture. Words are
commonly 16, 32, or 64 bits.
• Nibbles: Eight-bit bytes can be divided into two 4-bit halves called nibbles.
• Radix (or Base): The general idea behind positional numbering systems is that a
numeric value is represented through increasing powers of a radix (or base).
104₁₀ = 10212₃ (repeated division by 3; read the remainders from bottom to top)
3 | 104   remainder 2
3 |  34   remainder 1
3 |  11   remainder 2
3 |   3   remainder 0
3 |   1   remainder 1
      0
147₁₀ = 10010011₂ (repeated division by 2; read the remainders from bottom to top)
2 | 147   remainder 1
2 |  73   remainder 1
2 |  36   remainder 0
2 |  18   remainder 0
2 |   9   remainder 1
2 |   4   remainder 0
2 |   2   remainder 0
2 |   1   remainder 1
      0
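The repeated-division procedure above is easy to automate. A minimal Python sketch
(the function name to_base is our own, not from the text):

```python
def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to the given base by repeated division."""
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    out = []
    while n > 0:
        n, r = divmod(n, base)      # quotient feeds the next step; remainder is a digit
        out.append(digits[r])
    return "".join(reversed(out))   # remainders come out low-order first

print(to_base(104, 3))   # 10212
print(to_base(147, 2))   # 10010011
```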
0.4304₁₀ = 0.2034₅
• EXAMPLE 2.7 Convert 0.34375₁₀ to binary with 4 bits to the right of the binary point.
Multiply the fractional part by 2 repeatedly; the integer part produced at each step is
the next binary digit:
0.34375 × 2 = 0.68750   → 0
0.68750 × 2 = 1.37500   → 1
0.37500 × 2 = 0.75000   → 0
0.75000 × 2 = 1.50000   → 1
Reading the integer parts from top to bottom, 0.34375₁₀ = 0.0101₂ to four binary
places. We simply discard (or truncate) our answer when the desired accuracy has been
achieved.
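The same repeated-multiplication procedure as a Python sketch (frac_to_base is our own
name; Fraction keeps the arithmetic exact so truncation is the only source of error):

```python
from fractions import Fraction

def frac_to_base(frac: Fraction, base: int, places: int) -> str:
    """Convert a fraction in [0, 1) to `places` digits in the given base
    by repeated multiplication, truncating at the desired precision."""
    digits = []
    for _ in range(places):
        frac *= base
        digit = int(frac)        # the integer part is the next digit
        digits.append(str(digit))
        frac -= digit            # keep only the fractional part
    return "0." + "".join(digits)

print(frac_to_base(Fraction(11, 32), 2, 4))        # 0.0101  (11/32 = 0.34375)
print(frac_to_base(Fraction(4304, 10000), 5, 4))   # 0.2034
```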
110 010 011 101₂ = 6235₈   (separate into groups of 3 bits for octal conversion)
1100 1001 1101₂ = C9D₁₆    (separate into groups of 4 bits for hexadecimal conversion)
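Python's built-in formatting performs the same regrouping, which makes for a quick
sanity check:

```python
bits = 0b110010011101
print(f"{bits:o}")   # 6235  (octal: 3-bit groups)
print(f"{bits:X}")   # C9D   (hexadecimal: 4-bit groups)
```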
• A signed-magnitude number has a sign as its left-most bit (also referred to as the
high-order bit or the most significant bit) while the remaining bits represent the
magnitude (or absolute value) of the numeric value.
• N bits can represent −(2^(N−1) − 1) to 2^(N−1) − 1 in signed magnitude (for example,
8 bits represent −127 to +127).
• EXAMPLE 2.10 Add 010011112 to 001000112 using signed-magnitude arithmetic.
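Both operands are positive, so we add the magnitudes and keep the common sign:
01001111₂ (+79) + 00100011₂ (+35) = 01110010₂ (+114).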
• Signed magnitude has two representations for zero, 10000000 and 00000000
(and mathematically speaking, this simply shouldn’t happen!).
• One’s Complement
o The one’s complement of a number is formed by flipping every bit. This sort of
bit-flipping is very simple to implement in computer hardware.
o EXAMPLE 2.16 Express 2310 and -910 in 8-bit binary one’s complement form.
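  23₁₀ = 00010111₂; for −9₁₀, start with 9₁₀ = 00001001₂ and flip every bit: 11110110₂.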
• The primary disadvantage of one’s complement is that we still have two
representations for zero: 00000000 and 11111111
o EXAMPLE 2.21 Find the sum of 2310 and -910 in binary using two’s complement
arithmetic.
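  23₁₀ = 00010111₂ and −9₁₀ = 11110111₂ in two’s complement. 00010111 + 11110111 =
  1 00001110; discarding the carry out of the leftmost bit leaves 00001110₂ = 14₁₀.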
o A Simple Rule for Detecting an Overflow Condition: If the carry into the sign bit
equals the carry out of the sign bit, no overflow has occurred. If the carry into the
sign bit is different from the carry out of the sign bit, overflow (and thus an error)
has occurred.
o EXAMPLE 2.22 Find the sum of 12610 and 810 in binary using two’s complement
arithmetic.
A one is carried into the leftmost bit, but a zero is carried out. Because these
carries are not equal, an overflow has occurred.
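In binary: 01111110 (126) + 00001000 (8) = 10000110, which reads as −122 in two’s
complement even though the true sum is 134.
The carry-in/carry-out rule is easy to check in code. A minimal Python sketch
(add_twos_complement is our own name) of n-bit two’s complement addition with overflow
detection:

```python
def add_twos_complement(a: int, b: int, n: int = 8):
    """Add two n-bit two's complement values; report the result and overflow."""
    mask = (1 << n) - 1
    raw = (a & mask) + (b & mask)
    result = raw & mask
    carry_out = raw >> n                                # carry out of the sign bit
    low_mask = (1 << (n - 1)) - 1
    carry_in = ((a & low_mask) + (b & low_mask)) >> (n - 1)   # carry into the sign bit
    overflow = carry_in != carry_out
    signed = result - (1 << n) if result >> (n - 1) else result
    return signed, overflow

print(add_twos_complement(23, -9))    # (14, False)
print(add_twos_complement(126, 8))    # (-122, True): carries differ, so overflow
```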
o When the divisor is much smaller than the dividend, we get a condition known as
divide underflow, which the computer sees as the equivalent of division by zero.
o Computers make a distinction between integer division and floating-point division.
With integer division, the answer comes in two parts: a quotient and a
remainder.
Floating-point division results in a number that is expressed as a binary
fraction.
Floating-point calculations are carried out in dedicated circuits called floating-
point units, or FPUs.
• If the 4-bit binary value 1101 is unsigned, then it represents the decimal value 13, but
as a signed two’s complement number, it represents -3.
• The C programming language has int and unsigned int as possible types for integer
variables.
• If we are using 4-bit unsigned binary numbers and we add 1 to 1111, we get 0000
(“return to zero”).
• If we add 1 to the largest positive 4-bit two’s complement number 0111 (+7), we get
1000 (-8).
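Both wraparound behaviors can be reproduced by masking to 4 bits; a quick Python
illustration (to_signed is our own helper):

```python
N = 4
MASK = (1 << N) - 1                    # keep only the low 4 bits

def to_signed(v: int) -> int:
    """Interpret a 4-bit pattern as a two's complement value."""
    return v - (1 << N) if v & (1 << (N - 1)) else v

print((0b1111 + 1) & MASK)             # 0: unsigned "return to zero"
print(to_signed((0b0111 + 1) & MASK))  # -8: +7 + 1 wraps to the most negative value
```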
• Consider the following standard pencil-and-paper method applied to the two’s
complement numbers −5 × −4:
     1011        (−5)
   × 1100        (−4)
   + 0000        (0 in multiplier means simple shift)
   + 0000        (0 in multiplier means simple shift)
   + 1011        (1 in multiplier means add multiplicand and shift)
 + 1011          (1 in multiplier means add multiplicand and shift)
 10000100        (reads as −124, but −5 × −4 = 20!)
The naive method treats the operands as unsigned, so it fails on signed values; this
is one motivation for Booth’s algorithm, below.
• Research into finding better arithmetic algorithms has continued apace for over 50
years. One of the many interesting products of this work is Booth’s algorithm.
• In most cases, Booth’s algorithm carries out multiplication faster and more accurately
than naïve pencil-and-paper methods.
• The general idea of Booth’s algorithm is to increase the speed of a multiplication
when there are consecutive zeros or ones in the multiplier.
• Consider the following standard multiplication example (3 × 6):
     0011        (3)
   × 0110        (6)
   + 0000        (0 in multiplier means simple shift)
   + 0011        (1 in multiplier means add multiplicand and shift)
   + 0011        (1 in multiplier means add multiplicand and shift)
 + 0000          (0 in multiplier means simple shift)
   0010010       (3 × 6 = 18)
Note: ignore any extended sign bits that go beyond 2n bit positions (here, 2 × 4 = 8).
• Booth’s algorithm not only allows multiplication to be performed faster in most cases,
but it also has the added bonus that it works correctly on signed numbers, as the
sketch below shows.
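A minimal Python sketch of radix-2 Booth multiplication (booth_multiply is our own
name): scan adjacent multiplier bit pairs; the start of a run of 1s subtracts the
shifted multiplicand, and the end of a run adds it, so runs of identical bits cost
nothing:

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Multiply two n-bit two's complement integers using Booth's algorithm."""
    m_bits = multiplier & ((1 << n) - 1)   # raw n-bit pattern of the multiplier
    product = 0
    prev = 0                               # implicit 0 to the right of bit 0
    for i in range(n):
        cur = (m_bits >> i) & 1
        if cur == 1 and prev == 0:         # start of a run of 1s: subtract
            product -= multiplicand << i
        elif cur == 0 and prev == 1:       # end of a run of 1s: add
            product += multiplicand << i
        prev = cur
    return product

print(booth_multiply(-5, -4))   # 20: correct even for signed operands
print(booth_multiply(3, 6))     # 18
```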
• For unsigned numbers, a carry (out of the leftmost bit) indicates the total number of
bits was not large enough to hold the resulting value, and overflow has occurred.
• For signed numbers, if the carry into the sign bit and the carry out of the sign bit
differ, then overflow has occurred.
• EXAMPLE 2.28: Divide the value 12 (expressed using 8-bit signed two’s
complement representation) by 2.
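12 = 00001100₂; dividing by 2 amounts to a one-bit arithmetic right shift, giving
00000110₂ = 6.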
• In scientific notation, numbers are expressed in two parts: a fractional part called a
mantissa, and an exponential part that indicates the power of ten by which the
mantissa should be multiplied to obtain the value we need.
• Unbiased Exponent: 17₁₀ = 0.10001₂ × 2⁵ is stored with sign 0, exponent 00101 (= 5),
and significand 10001000:
0 00101 10001000
• A normalized form is used for storing a floating-point number in memory. A
normalized form is a floating-point representation where the leftmost bit of the
significand is always a 1.
Example: the internal representation of 10.25₁₀ in a 14-bit format (1 sign bit,
5-bit excess-16 exponent, 8-bit significand) is 0 10100 10100100, since
10.25₁₀ = 1010.01₂ = 0.101001₂ × 2⁴ and the exponent field is 4 + 16 = 20 = 10100₂.
Example: add 0 10010 11001000 and 0 10000 10011010 in the same format. Writing out
each value with its binary point placed by its exponent and adding:
   11.001000
+   0.10011010
   11.10111010
Renormalizing, we retain the larger exponent and truncate the low-order bits:
0 10010 11101110
Multiplying the same two operands (add the exponents, multiply the significands)
and renormalizing gives instead:
0 10001 11110000
• The IEEE-754 single-precision floating-point standard uses a bias of 127 over its
8-bit exponent. An exponent of 255 indicates a special value.
• The double-precision standard has a bias of 1023 over its 11-bit exponent. The
“special” exponent value for a double-precision number is 2047, instead of the 255
used by the single-precision standard.
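The bias is easy to observe by unpacking a float’s raw bits. A small Python sketch
using the standard struct module (float_fields is our own name):

```python
import struct

def float_fields(x: float):
    """Unpack the sign, exponent, and fraction fields of an IEEE-754 single."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # stored exponent, biased by 127
    fraction = bits & 0x7FFFFF           # 23-bit fraction field
    return sign, exponent, fraction

# 10.25 = 1.01001_2 x 2^3, so the stored exponent is 3 + 127 = 130
print(float_fields(10.25))               # (0, 130, 2359296)
```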
• The range of a numeric integer format is the difference between the largest and
smallest values that it can express.
• The precision of a number indicates how much information we have about a value.
• Accuracy refers to how closely a numeric representation approximates a true value.
• Because of truncated bits, you cannot always assume that a particular floating-point
operation is associative or distributive.
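A two-line Python demonstration of lost associativity (the values are arbitrary):

```python
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6
```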
• Both EBCDIC and ASCII were built around the Latin alphabet.
• In 1991, a new international information exchange code called Unicode was introduced.
• Unicode is a 16-bit alphabet that is downward compatible with the ASCII and Latin-1
character sets.
• Because the base coding of Unicode is 16 bits, it has the capacity to encode the
majority of characters used in every language of the world.
• Unicode is currently the default character set of the Java programming language.
• The Unicode codespace is divided into six parts. The first part is for Western
alphabet codes, including English, Greek, and Russian.
• The lowest-numbered Unicode characters comprise the ASCII code.
• The highest-numbered sections provide for user-defined codes.
• Error-control codes such as CRC rely on modulo-2 arithmetic, where addition is
carried out without carries:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0
• EXAMPLE 2.36 Find the quotient and remainder when 1001011₂ is divided by 1011₂
using modulo-2 division.
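Using modulo-2 long division (XOR at each step, no borrows), the quotient is 1010₂
and the remainder is 101₂. A short Python sketch (mod2_divide is our own name):

```python
def mod2_divide(dividend: int, divisor: int):
    """Modulo-2 (XOR) long division on bit patterns, as used by CRC."""
    dlen = divisor.bit_length()
    quotient = 0
    rem = dividend
    # Walk the quotient bit positions from high to low.
    for shift in range(dividend.bit_length() - dlen, -1, -1):
        quotient <<= 1
        if (rem >> (shift + dlen - 1)) & 1:   # leading bit set: divisor "goes in"
            rem ^= divisor << shift           # subtraction is XOR in modulo-2
            quotient |= 1
    return quotient, rem

q, r = mod2_divide(0b1001011, 0b1011)
print(bin(q), bin(r))   # 0b1010 0b101
```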
• We are focused on single-bit errors. An error could occur in any of the n bits, so
each code word can be associated with n erroneous words at a Hamming distance of 1.
• Therefore, we have n + 1 bit patterns for each code word: one valid code word and n
erroneous words. With n-bit code words, we have 2^n possible bit patterns, of which
2^m are valid code words built from m data bits and r check bits (where n = m + r).
This gives us the inequality:
(n + 1) × 2^m ≤ 2^n
(m + r + 1) × 2^m ≤ 2^(m + r), or
(m + r + 1) ≤ 2^r
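For example, to protect m = 8 data bits we need r = 4 check bits, since
8 + 4 + 1 = 13 ≤ 2⁴ = 16 (whereas r = 3 fails: 8 + 3 + 1 = 12 > 2³ = 8).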
Let’s introduce an error in bit position b9, resulting in the code word:
0 1 0 1 1 1 0 1 0 1 1 0
12 11 10 9 8 7 6 5 4 3 2 1
We find that parity bits 1 and 8 flag an error, and 1 + 8 = 9, which is exactly
where the error occurred.
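The parity-group check can be sketched in a few lines of Python (hamming_syndrome is
our own name); each parity bit p covers every position whose binary index has bit p
set, and the failing groups sum to the error position:

```python
def hamming_syndrome(bits):
    """bits[i] is the bit at position i + 1 (positions numbered from 1).
    Returns the position of a single-bit error (0 if none), assuming even parity."""
    syndrome = 0
    p = 1
    while p <= len(bits):
        parity = 0
        for pos in range(1, len(bits) + 1):
            if pos & p:                 # position belongs to parity group p
                parity ^= bits[pos - 1]
        if parity:                      # even parity violated in this group
            syndrome += p
        p <<= 1
    return syndrome

# Code word from the example, listed from position 1 up to position 12:
word = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0]
print(hamming_syndrome(word))           # 9: the error is in bit position 9
```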
• If we expect errors to occur in blocks, it stands to reason that we should use an error-
correcting code that operates at a block level, as opposed to a Hamming code, which
operates at the bit level.
• A Reed-Solomon (RS) code can be thought of as a CRC that operates over entire
characters instead of only a few bits.
• RS codes, like CRCs, are systematic: the parity bytes are appended to a block of
information bytes.
• RS(n, k) codes are defined using the following parameters:
o s = The number of bits in a character (or “symbol”).
o k = The number of s-bit characters comprising the data block.
o n = The number of bits in the code word.
• RS (n, k) can correct (n-k)/2 errors in the k information bytes.
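For example, the widely used RS(255, 223) code with 8-bit symbols can correct up to
(255 − 223) / 2 = 16 symbol errors per block.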
• Reed-Solomon error-correction algorithms lend themselves well to implementation in
computer hardware.
• They are implemented in high-performance disk drives for mainframe computers as
well as compact disks used for music and data storage. These implementations will be
described in Chapter 7.
• Computers store data in the form of bits, bytes, and words using the binary
numbering system.
• Hexadecimal numbers are formed using four-bit groups called nibbles (or nybbles).
• Signed integers can be stored in one’s complement, two’s complement, or signed
magnitude representation.
• Floating-point numbers are usually coded using the IEEE 754 floating-point
standard.
• Character data is stored using ASCII, EBCDIC, or Unicode.
• Error detecting and correcting codes are necessary because we can expect no
transmission or storage medium to be perfect.
• CRC, Reed-Solomon, and Hamming codes are three important error-control codes.