A Tutorial On Data Representation
10110B = 10000B + 0000B + 100B + 10B + 0B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0 = 22D
We shall denote a binary number with a suffix B. Some programming languages denote binary
numbers with prefix 0b or 0B (e.g., 0b1001000), or prefix b with the bits quoted
(e.g., b'10001111').
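For instance, Java (since JDK 7) supports the 0b prefix for binary literals, so the example above can be checked directly:

int i = 0b10110;         // the binary number 10110B
System.out.println(i);   // prints 22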
A binary digit is called a bit. Eight bits is called a byte (why an 8-bit unit? Probably because 8 = 2^3).
1.3 Hexadecimal (Base 16) Number System
The hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F,
called hex digits. It is a positional notation, for example,
1A3H = 1×16^2 + 10×16^1 + 3×16^0 = 419D
We shall denote a hexadecimal number (in short, hex) with a suffix H. Some programming
languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F), or prefix x with hex digits
quoted (e.g., x'C3A4D98B').
Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F' for hex digits.
Computers use the binary system in their internal operations, as they are built from binary digital
electronic components with two states: on and off. However, writing or reading a long sequence of
binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011
0001 1101 0001 1000B, which is the same as hexadecimal B343 1D18H). The hexadecimal system is
used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits,
as follows:
Hexadecimal Binary Decimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
A 1010 10
B 1011 11
C 1100 12
D 1101 13
E 1110 14
F 1111 15
1.4 Conversion from Hexadecimal to Binary
Replace each hex digit by the 4 equivalent bits (as listed in the above table). For example,
A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B
The above procedure is applicable to conversion between any two base systems where one base is a
power of the other. For example, since 8 = 2^3, each octal digit is equivalent to 3 binary bits,
e.g., octal 527 = 101 010 111B.
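Such conversions can be checked in Java with the standard methods Integer.parseInt (with an explicit radix) and Integer.toBinaryString/Integer.toHexString (note that leading zeros are not printed):

System.out.println(Integer.toBinaryString(Integer.parseInt("A3C5", 16)));           // 1010001111000101
System.out.println(Integer.toHexString(Integer.parseInt("1010001111000101", 2)));   // a3c5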
Egyptian hieroglyphs were used by the ancient Egyptians since 4000BC. Unfortunately, after
500AD no one could any longer read them, until the re-discovery of the Rosetta Stone in 1799 by
Napoleon's troops (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the
Nile Delta.
The Rosetta Stone is inscribed with a decree issued in 196BC on behalf of King Ptolemy V. The
decree appears in three scripts: the upper text in Ancient Egyptian hieroglyphs, the middle portion
in Demotic script, and the lowest in Ancient Greek. Because it presents essentially the same text in
all three scripts, and Ancient Greek could still be understood, it provided the key to the
decipherment of the Egyptian hieroglyphs.
The moral of the story is that unless you know the encoding scheme, there is no way you can
decode the data.
3. Integer Representation
Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-
significant bit. They stand in contrast to real numbers or floating-point numbers, where the
position of the radix point varies. It is important to note that integers and floating-point
numbers are treated differently in computers: they have different representations and are
processed differently (e.g., floating-point numbers are processed in a so-called floating-point
processor). Floating-point numbers will be discussed later.
Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for
integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation
schemes for integers:
1. Unsigned Integers: can represent zero and positive integers.
2. Signed Integers: can represent zero, positive and negative integers. Three representation
schemes have been proposed for signed integers:
a. Sign-Magnitude representation
b. 1's Complement representation
c. 2's Complement representation
You, as the programmer, need to decide on the bit-length and representation scheme for your
integers, depending on your application's requirements. Suppose you need a counter to count a
small quantity from 0 up to 200; you might choose the 8-bit unsigned integer scheme, as there
are no negative numbers involved.
Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's complement signed integer
has a certain range. For example, for n=8, the range of 2's complement signed integers
is -128 to +127. During addition (and subtraction), it is important to check whether the result
exceeds this range, in other words, whether overflow or underflow has occurred.
Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the
range)
Two's complement works by re-arranging the number line so that it wraps around: the values
from -128 to +127 are represented contiguously, and the carry bit out of the most significant
position is simply ignored, as the sketch below demonstrates.
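A quick way to see this wrap-around in Java is the 8-bit byte type (the cast back to byte discards everything beyond the least-significant 8 bits):

byte b = (byte) (127 + 2);   // 129D = 1000 0001B does not fit into 8 bits
System.out.println(b);       // prints -127: the result has wrapped around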
4. Floating-Point Number Representation
A floating-point number is typically expressed in scientific notation, with a fraction (F) and
an exponent (E) of a certain radix (r), in the form F×r^E. Decimal numbers use a radix of 10
(F×10^E), while binary numbers use a radix of 2 (F×2^E).
The representation of a floating-point number is not unique. For example, the number 55.66 can be
represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can
be normalized. In the normalized form, there is only a single non-zero digit before the radix
point. For example, the decimal number 123.4567 is normalized as 1.234567×10^2; the binary
number 1010.1011B is normalized as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when represented
with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real
numbers (even within a small range, say, 0.0 to 0.1). On the other hand, an n-bit binary pattern
can represent only a finite number (2^n) of distinct values. Hence, not all real numbers can be
represented. The nearest approximation is used instead, resulting in a loss of accuracy.
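For example, in Java (the same happens in any language using IEEE 754 arithmetic), the decimal values 0.1 and 0.2 have no exact binary representation:

System.out.println(0.1 + 0.2);          // prints 0.30000000000000004, not 0.3
System.out.println(0.1 + 0.2 == 0.3);   // prints false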
It is also important to note that floating-point arithmetic is much less efficient than integer
arithmetic. It can be sped up with a dedicated floating-point co-processor. Hence, use
integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation of fraction (F)
and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as
negative. Modern computers adopt IEEE 754 standard for representing floating-point numbers.
There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
4.1 IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative
numbers.
The following 8 bits represent the exponent (E).
The remaining 23 bits represent the fraction (F).
Normalized Form
Let's illustrate with an example, suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000
0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form
of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2
+ 1×2^-3 = 1.375D.
The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative
number. In this example with S=1, this is a negative number, i.e., -1.375D.
In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is
because we need to represent both positive and negative exponent. With an 8-bit E, ranging
from 0 to 255, the excess-127 scheme could provide actual exponent of -127 to 128. In this
example, E-127=129-127=2D.
Hence, the number represented is -1.375×2^2=-5.5D.
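This can be verified in Java with the standard method Float.intBitsToFloat; the 32-bit pattern above, written in hex, is C0B0 0000H:

System.out.println(Float.intBitsToFloat(0xC0B00000));   // prints -5.5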
De-Normalized Form
Normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot
represent the number zero! Convince yourself of this!
The largest denormalized numbers (E = 0, F = all 1's) can be inspected in Java:

// Largest denormalized single: E = 0, F = 23 1's (007F FFFFH); prints ≈1.2E-38
System.out.println(Float.intBitsToFloat(0x7fffff));
// Largest denormalized double: E = 0, F = 52 1's (000F FFFF FFFF FFFFH); prints ≈2.2E-308
System.out.println(Double.longBitsToDouble(0xfffffffffffffL));
Denormalized form can represent very small numbers close to zero, as well as zero itself, which
cannot be represented in normalized form.
The minimum and maximum of denormalized floating-point numbers are:
Precision  Denormalized D(min)                                      Denormalized D(max)
Single     0000 0001H                                               007F FFFFH
           0 00000000 00000000000000000000001B                      0 00000000 11111111111111111111111B
           E = 0, F = 00000000000000000000001B                      E = 0, F = 11111111111111111111111B
           D(min) = 0.0...1 × 2^-126                                D(max) = 0.1...1 × 2^-126
                  = 1 × 2^-23 × 2^-126 = 2^-149                            = (1 - 2^-23) × 2^-126
           (≈1.4 × 10^-45)                                          (≈1.2 × 10^-38)
Double     0000 0000 0000 0001H                                     000F FFFF FFFF FFFFH
           D(min) = 0.0...1 × 2^-1022                               D(max) = 0.1...1 × 2^-1022
                  = 1 × 2^-52 × 2^-1022 = 2^-1074                          = (1 - 2^-52) × 2^-1022
           (≈4.9 × 10^-324)                                         (≈2.2 × 10^-308)
Special Values
Zero: Zero cannot be represented in the normalized form, and must be represented in
denormalized form with E=0 and F=0. There are two representations for zero: +0 with S=0 and -
0 with S=1.
Infinity: The values +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented with an
exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision), F=0,
and S=0 (for +INF) or S=1 (for -INF).
Not a Number (NaN): NaN denotes a value that cannot be represented as a real number
(e.g., 0/0). NaN is represented with an exponent of all 1's (E = 255 for single-precision and E =
2047 for double-precision) and any non-zero fraction.
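These special values are easy to produce and inspect in Java:

System.out.println(1.0 / 0.0);                          // Infinity
System.out.println(-1.0 / 0.0);                         // -Infinity
System.out.println(0.0 / 0.0);                          // NaN
System.out.println(Float.intBitsToFloat(0x7F800000));   // Infinity (S=0, E=255, F=0)
System.out.println(Float.intBitsToFloat(0x7FC00000));   // NaN (E=255, F≠0)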
5. Character Encoding
In computer memory, characters are "encoded" (or "represented") using a chosen "character
encoding scheme" (aka "character set", "charset", "character map", or "code page").
For example, in ASCII (as well as Latin1, Unicode, and many other character sets):
code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.
code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.
code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.
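In Java, for instance, casting between char and int exposes these code numbers directly:

System.out.println((int) 'A');   // prints 65
System.out.println((int) 'a');   // prints 97
System.out.println((int) '0');   // prints 48
System.out.println((char) 90);   // prints Z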
It is important to note that the representation scheme must be known before a binary pattern
can be interpreted. E.g., the 8-bit pattern "0100 0010B" could represent anything under the sun,
known only to the person who encoded it.
The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit
Latin-x (ISO/IEC 8859-x) for Western European characters, and Unicode (ISO/IEC 10646) for
internationalization (i18n).
A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit
character encoding scheme (such as Latin-x) can represent 256 characters and symbols; whereas
a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and
symbols.
ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used
encoding scheme for Western European languages. It has 191 printable characters from the Latin
script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is
backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1
(code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH)
are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:
Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central
European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North
European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for
Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 is for
Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.
Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is,
the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as
Latin-1.
Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can
represent up to 65,536 characters. It has since been expanded beyond 16 bits and currently
stands at 21 bits. The range of legal codes in ISO/IEC 10646 is now U+0000 to
U+10FFFF (21 bits, or 1,114,112 code points), covering all current and ancient historical
scripts. The original 16-bit range of U+0000 to U+FFFF (65,536 characters) is known as the Basic
Multilingual Plane (BMP), covering all the major languages in current use. The characters
outside the BMP are called Supplementary Characters; they are not frequently used.
In UTF-8, the Unicode numbers corresponding to the 7-bit ASCII characters are padded with a
leading zero bit, and thus have the same values as in ASCII. Hence, UTF-8 can be used with all
software that processes ASCII. Unicode numbers of 128 and above, which are less frequently used,
are encoded using more bytes (2 to 4 bytes). UTF-8 generally requires less storage and is
compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack
the code, due to its variable length. UTF-8 is the most popular format for Unicode.
Notes:
UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary
characters outside BMP (21-bit).
The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte.
Most European and Middle East characters use a 2-byte sequence, which includes extended
Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew,
Arabic, and others. Chinese, Japanese and Korean (CJK) use three-byte sequences.
All the bytes, except the 128 ASCII characters, have a leading '1' bit. In other words, the
ASCII bytes, with a leading '0' bit, can be identified and decoded easily.
Example: 您好 (Unicode: 60A8H 597DH)
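A minimal sketch in Java (the class name is my own) confirming the three-byte UTF-8 sequences for these two characters:

import java.nio.charset.StandardCharsets;

public class TestUtf8 {
   public static void main(String[] args) {
      byte[] utf8 = "您好".getBytes(StandardCharsets.UTF_8);
      for (byte b : utf8) System.out.printf("%02X ", b);   // prints E6 82 A8 E5 A5 BD
      System.out.println();
   }
}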
Take note that for the 65,536 characters in the BMP, UTF-16 is the same as UCS-2 (2 bytes). For
the supplementary characters outside the BMP, however, each character requires a pair of 16-bit
values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the
low-surrogates range (\uDC00-\uDFFF).
5.7 UTF-32 (Unicode Transformation Format - 32-bit)
UTF-32 is the same as UCS-4: it uses a fixed 4 bytes for each character, with no variable-length
encoding.
Worse still, there are also various Chinese character sets, which are not compatible with Unicode:
GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese
character. The most significant bit (MSB) of both bytes is set to 1, so as to co-exist with 7-bit
ASCII (whose MSB is 0). There are about 6,700 characters. GBK is an extension of GB2312,
which includes more characters as well as traditional Chinese characters.
BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character, with
the most significant bit of both bytes also set to 1. BIG5 is not compatible with GBK, i.e., the
same code number is assigned to different characters.
The world is made more interesting (and more complicated) by these many co-existing standards.
Collating sequence is often language dependent, as different languages use different sets of
characters (e.g., á, é, a, α) with their own orders.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
public class TestCharsetEncodeDecode {
   public static void main(String[] args) {
      // Try these charsets for encoding
      String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
                               "UTF-16BE", "UTF-16LE", "GBK", "BIG5"};
      for (String charsetName : charsetNames) {
         System.out.printf("%-10s: ", charsetName);
         // Encode a test string (my choice) and print the resulting bytes in hex
         ByteBuffer bytes = Charset.forName(charsetName).encode(CharBuffer.wrap("Hi,您好!"));
         while (bytes.hasRemaining()) System.out.printf("%02X ", bytes.get());
         System.out.println();
      }
   }
}
This is meant to be an academic discussion. I have yet to encounter the use of supplementary
characters!
Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between
Hex and normal view, free, ....
The following Java program can be used to display hex code for Java Primitives (integer,
character and floating-point):
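A minimal sketch of such a program (the class name is my own), using the standard conversion methods Integer.toHexString, Float.floatToIntBits, and Double.doubleToLongBits:

public class TestHexRepresentation {
   public static void main(String[] args) {
      System.out.println(Integer.toHexString(1234));                          // 4d2 (int)
      System.out.println(Integer.toHexString('A'));                           // 41 (char code number)
      System.out.println(Integer.toHexString(Float.floatToIntBits(-5.5f)));   // c0b00000 (IEEE-754 single, cf. Section 4.1)
      System.out.println(Long.toHexString(Double.doubleToLongBits(1.0)));     // 3ff0000000000000 (IEEE-754 double)
   }
}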