Lecture 04: ASCII vs Unicode
Course Objectives
ASCII Codes
ASCII stands for American Standard Code for Information Interchange. The ASCII code is a
popular coding scheme used in digital computing systems to encode characters.
In the ASCII code, a unique integer value is assigned to each character, such as a digit, letter, or
symbol. The standard ASCII code defines a set of 128 characters, each represented by a unique
7-bit binary code. A 7-bit code can take 2^7 = 128 distinct values, which is why standard ASCII
represents exactly 128 characters.
In digital electronics, the characters in ASCII code are generally represented in decimal or
hexadecimal notation. Overall, the ASCII code is a standard encoding scheme for representing
characters in digital computers and communication systems.
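The relationship between a character, its decimal code, its hexadecimal code, and its 7-bit binary
code can be checked with a short Python sketch. The helper name `ascii_row` is made up for this
illustration; `ord` and `format` are standard Python built-ins.

```python
def ascii_row(ch: str) -> tuple:
    """Return (character, decimal, hexadecimal, 7-bit binary) for one ASCII character."""
    code = ord(ch)  # integer value assigned to the character
    if code > 127:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return ch, code, format(code, "02X"), format(code, "07b")


# 7 bits give 2**7 = 128 distinct codes, numbered 0 to 127.
print(2 ** 7)
for ch in ("A", "a", "0"):
    print(ascii_row(ch))
```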
The following table lists the name, symbol, and ASCII code in decimal and binary form for
selected characters in the range from 0 to 127.
Name Symbol Decimal Binary
Cancel CAN 24 00011000
Space 32 00100000
Hash # 35 00100011
Dollar $ 36 00100100
Percentage % 37 00100101
Asterisk * 42 00101010
Plus + 43 00101011
Comma , 44 00101100
Hyphen - 45 00101101
Zero 0 48 00110000
One 1 49 00110001
Two 2 50 00110010
Three 3 51 00110011
Four 4 52 00110100
Five 5 53 00110101
Six 6 54 00110110
Seven 7 55 00110111
Eight 8 56 00111000
Nine 9 57 00111001
Colon : 58 00111010
Semicolon ; 59 00111011
Equals = 61 00111101
At symbol @ 64 01000000
Uppercase A A 65 01000001
Uppercase B B 66 01000010
Uppercase C C 67 01000011
Uppercase D D 68 01000100
Uppercase E E 69 01000101
Uppercase F F 70 01000110
Uppercase G G 71 01000111
Uppercase H H 72 01001000
Uppercase I I 73 01001001
Uppercase J J 74 01001010
Uppercase K K 75 01001011
Uppercase L L 76 01001100
Uppercase M M 77 01001101
Uppercase N N 78 01001110
Uppercase O O 79 01001111
Uppercase P P 80 01010000
Uppercase Q Q 81 01010001
Uppercase R R 82 01010010
Uppercase S S 83 01010011
Uppercase T T 84 01010100
Uppercase U U 85 01010101
Uppercase V V 86 01010110
Uppercase W W 87 01010111
Uppercase X X 88 01011000
Uppercase Y Y 89 01011001
Uppercase Z Z 90 01011010
Backslash \ 92 01011100
Underscore _ 95 01011111
Lowercase a a 97 01100001
Lowercase b b 98 01100010
Lowercase c c 99 01100011
Lowercase i i 105 01101001
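The following table shows the name, symbol, and code in decimal and binary form for selected
values in the range from 128 to 255. These values lie outside standard 7-bit ASCII. The symbols
shown match the Windows-1252 code page, in which codes 129, 141, 143, 144, and 157 are
unassigned and therefore have no name or symbol.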
129 10000001
141 10001101
143 10001111
144 10010000
Small tilde ˜ 152 10011000
157 10011101
Superscript three - cubed ³ 179 10110011
Latin capital letter I with circumflex Î 206 11001110
Latin small letter e with acute é 233 11101001
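Codes above 127 only have a meaning once a code page is chosen. The sketch below assumes the
table above follows the Windows-1252 code page (an assumption based on the symbols listed, not
something the notes state). The helper name `decode_extended` is made up for this example.

```python
from typing import Optional


def decode_extended(code: int, codepage: str = "cp1252") -> Optional[str]:
    """Decode one 8-bit code with the given code page; None if the code is unassigned."""
    try:
        return bytes([code]).decode(codepage)
    except UnicodeDecodeError:
        return None  # the code page defines no character for this value


for code in (129, 152, 179, 206, 233):
    print(code, format(code, "08b"), repr(decode_extended(code)))
```

Decoding the same byte with a different code page can give a different character, which is one
reason the 8-bit extensions are not interchangeable.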
The following are the key benefits of the ASCII (American Standard Code for Information
Interchange) code −
• The ASCII code provides a simple and straightforward encoding scheme to represent
letters, numbers, and symbols.
• ASCII code is compatible with a wide range of programming languages and computing
devices.
• ASCII code provides a compact character representation, where each character can be
represented using 7 or 8 bits. Hence, it is a space-efficient encoding standard.
• ASCII code is a widely adopted encoding standard in the field of digital electronics.
• ASCII code is easy to implement in hardware and software.
The ASCII code also has the following limitations −
• The standard ASCII code has a limited set of 128 characters. This makes it unsuitable for
representing characters of languages other than English.
• The ASCII code can be extended to 8 bits, but it is not standardized beyond 7 bits.
• ASCII code is not suitable for systems that require a broad range of characters.
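The 128-character limit is easy to observe in practice. The following is a minimal sketch, with a
hypothetical helper `fits_in_ascii`, that relies on Python's built-in `ascii` codec rejecting any
character whose code is above 127.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text has a 7-bit ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False  # at least one character has no ASCII code


print(fits_in_ascii("Hello"))  # plain English letters
print(fits_in_ascii("café"))   # contains an accented letter
```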
Conclusion
In conclusion, ASCII (American Standard Code for Information Interchange) is a character
encoding scheme widely used in digital systems. It is a 7-bit standard code used to represent a total
of 128 characters including numbers, letters, symbols, and control characters.
Internal Storage Encoding of Characters
A computer understands only binary language (0 and 1). It cannot directly understand or store
letters, decimal numbers, pictures, symbols, and so on. Therefore, we use coding schemes that
map each of these to binary values so that the computer can store and interpret them correctly.
The codes used for letters, digits, and symbols are called alphanumeric codes.
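As a small illustration of this idea, the sketch below shows the bit patterns a computer could
store for a short text, assuming one byte per character and the ASCII code. The function name
`to_bits` is chosen only for this example.

```python
def to_bits(text: str) -> list:
    """Return the 8-bit pattern stored for each character of an ASCII text."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


# 'H' is 72 and 'i' is 105 in the ASCII table.
print(to_bits("Hi"))
```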
UNICODE
Unicode is a universal character encoding standard. It defines well over 100,000 characters
covering the writing systems of many different languages. Standard ASCII fits every character
in 1 byte, whereas a Unicode character takes between 1 and 4 bytes depending on the encoding
form used. Hence, Unicode can encode a far wider variety of characters. It has three encoding
forms, namely UTF-8, UTF-16, and UTF-32. Among them, UTF-8 is the most widely used, and it is
also the default encoding in many programming languages.
UCS
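It is a very common acronym in the Unicode scheme. It stands for Universal Character
Set. It is the character set defined by the ISO/IEC 10646 standard, which is kept synchronized
with Unicode. How its characters are stored as bytes is defined by the UTF encodings.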
UTF
UTF is the most important part of this encoding scheme. It stands for Unicode
Transformation Format. It defines how Unicode code points are represented as bytes. The
common formats are as follows:
UTF-7
This scheme represents Unicode text using only 7-bit ASCII characters. It was designed for
emails and messages sent over systems that can carry only 7-bit data. It is not part of the
Unicode Standard itself.
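A quick way to see the 7-bit property is to encode a non-ASCII string and check that every
resulting byte is below 128. This sketch uses Python's built-in `utf-7` codec. The helper
`is_seven_bit_safe` and the sample text are made up for illustration.

```python
def is_seven_bit_safe(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (value below 128)."""
    return all(byte < 0x80 for byte in data)


text = "Price: £5"
as_utf7 = text.encode("utf-7")
as_utf8 = text.encode("utf-8")

print(as_utf7, is_seven_bit_safe(as_utf7))
print(as_utf8, is_seven_bit_safe(as_utf8))
print(as_utf7.decode("utf-7") == text)  # the encoding round-trips
```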
UTF-8
It is the most commonly used form of encoding. It uses between 1 and 4 bytes to represent each
character:
• 1 byte to represent the ASCII characters, which makes it compatible with the ASCII standard.
• 2 bytes to represent additional Latin and Middle Eastern letters and symbols.
• 3 bytes to represent most other characters in everyday use, including most Asian scripts.
• 4 bytes to represent the remaining characters, including most emojis, which are today a very
important feature of most apps.
UTF-16
UTF-32
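As a rough illustration of how these formats differ, the sketch below counts the bytes each one
needs for a few sample characters. The `-le` codec variants are used so that Python does not
prepend a byte order mark, and the helper name `encoded_sizes` is made up for this example.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes needed for one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# An ASCII letter, an accented Latin letter, the euro sign, and an emoji.
for ch in ("A", "é", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from 1 to 4 bytes, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.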
Category Character Description Unicode Code Point
f Lowercase f U+0066
g Lowercase g U+0067
h Lowercase h U+0068
i Lowercase i U+0069
j Lowercase j U+006A
Numbers 0 Digit Zero U+0030
1 Digit One U+0031
2 Digit Two U+0032
3 Digit Three U+0033
4 Digit Four U+0034
5 Digit Five U+0035
6 Digit Six U+0036
7 Digit Seven U+0037
8 Digit Eight U+0038
9 Digit Nine U+0039
Punctuation Marks . Period U+002E
, Comma U+002C
; Semicolon U+003B
: Colon U+003A
! Exclamation Mark U+0021
? Question Mark U+003F
- Hyphen U+002D
_ Underscore U+005F
' Single Quote U+0027
" Double Quote U+0022
Mathematical Symbols + Plus Sign U+002B
− Minus Sign U+2212
* Asterisk (used as a multiplication sign) U+002A
÷ Division Sign U+00F7
= Equal Sign U+003D
≠ Not Equal U+2260
≤ Less Than or Equal To U+2264
≥ Greater Than or Equal To U+2265
∑ Summation Symbol U+2211
∞ Infinity U+221E
Greek Letters α Greek Alpha U+03B1
β Greek Beta U+03B2
γ Greek Gamma U+03B3
δ Greek Delta U+03B4
ε Greek Epsilon U+03B5
θ Greek Theta U+03B8
λ Greek Lambda U+03BB
μ Greek Mu U+03BC
π Greek Pi U+03C0
σ Greek Sigma U+03C3
Currency Symbols $ Dollar Sign U+0024
€ Euro Sign U+20AC
£ Pound Sterling U+00A3
¥ Yen Sign U+00A5
₹ Indian Rupee Sign U+20B9
Special Characters ♥ Heart Symbol U+2665
☺ Smiling Face U+263A
☀ Sun Symbol U+2600
★ Black Star U+2605
✈ Airplane U+2708
✔ Check Mark U+2714
Emojis 😀 Grinning Face U+1F600
😂 Face with Tears of Joy U+1F602
❤ Red Heart U+2764
🌍 Globe Showing Europe-Africa U+1F30D
🎉 Party Popper U+1F389
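The code points in the table can be reproduced programmatically. The sketch below formats a
character's code point in the usual U+XXXX notation and looks up its official name with the
standard `unicodedata` module. The function name `describe` is chosen only for this example.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point in U+XXXX notation, official Unicode name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


for ch in ("f", "9", "€", "π", "\U0001F600"):
    print(ch, *describe(ch))

# chr() goes the other way, from code point to character.
print(chr(0x03C0))
```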
Importance of Unicode
• We can use it to convert from one coding scheme to another. Because Unicode is a
superset of most other character sets, we can convert text from one encoding into
Unicode and then convert it into another encoding.
• It is preferred by many programming languages and tools. For example, XML uses
Unicode as its character set, and XML processors are required to support UTF-8 and UTF-16.
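The conversion route described above, from one encoding into Unicode and then into another
encoding, might look like this in Python. The function name `transcode` is hypothetical, and the
codec names are standard Python codec names.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another, using Unicode as the pivot."""
    text = data.decode(source)   # step 1: source encoding -> Unicode string
    return text.encode(target)   # step 2: Unicode string -> target encoding


latin1_bytes = b"caf\xe9"  # the word "café" stored in ISO-8859-1 (Latin-1)
print(transcode(latin1_bytes, "latin-1", "utf-8"))
```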
Advantages of Unicode
• Unicode has more than 128,000 characters. In contrast, ISCII (Indian Script Code for
Information Interchange) has only 256 characters.
• The characters of Unicode include all the characters of the ISCII encoding. Therefore, we
can say that Unicode is a superset of ISCII. Each ISCII character has an equivalent
character in Unicode.
Unicode is a standard for character encoding. ASCII alone was not enough to cover the
characters of all languages, so Unicode was introduced to overcome this limitation.
The Unicode Consortium introduced this encoding scheme.
What is Unicode?
Unicode is a standard encoding system that assigns a unique numeric value to every character,
regardless of the platform, program, or language. It allows computers to represent and
manipulate text from different writing systems, including alphabets, ideographs, and symbols.
Unicode uses a set of code points, which are numerical values assigned to each character. These
code points can be stored in various encoding forms, such as UTF-8 or UTF-16 (UTF stands for
Unicode Transformation Format), which differ in the size of the code units they use. The code
points map to specific characters, allowing computers to display and interpret text correctly.
What is the difference between Unicode and ASCII (American Standard Code for Information
Interchange)?
ASCII only supports a limited set of characters found in the English language. Unicode, on the
other hand, encompasses a much broader range of characters from various writing systems
around the world. It provides a universal standard for character encoding, making it possible
to represent text from multiple languages.
Unicode aims to encompass the characters used in the world's writing systems, including
historical scripts, symbols, and emoji. As of Unicode 14.0, it covers over 150 scripts and
includes more than 144,000 characters. The Unicode Consortium regularly updates and expands
the standard to include newly proposed characters.
Unicode assigns a unique code point to each character, regardless of its script or language. It
categorizes characters into blocks based on their script, such as Latin, Cyrillic, Arabic, and
Chinese. This allows computers to correctly interpret and display text in different languages
without conflicts or ambiguity.
One of the main benefits of Unicode is its ability to support multilingual environments. By
using a unified encoding system, it enables seamless communication and data exchange across
different platforms and devices. It also promotes interoperability, as software developers can
rely on a single standard when handling text input, storage, and display.
Unicode provides a universal standard for character encoding, which means that text can be
accurately represented and interpreted across different platforms, operating systems, and
programming languages. This eliminates the need for complex conversion schemes and ensures
seamless communication between different systems.
How does Unicode handle characters that are not supported by all fonts?
Unicode defines a list of characters, but it does not dictate how they should be visually
represented. Fonts are responsible for rendering the characters, and not all fonts support every
Unicode character. In cases where a character is not supported by a specific font, a fallback
mechanism is used to display a placeholder or substitute symbol instead.
Unicode includes a wide range of symbols, currency signs, mathematical operators, and
other special characters. These characters are assigned specific code points within the Unicode
standard, allowing them to be accurately represented and interpreted.
Unicode introduced skin tone modifiers for emoji characters, allowing users to specify different
skin tones for certain emoji. This allows for greater representation and inclusivity. Skin tone
modifiers are applied using specific code points that modify the base emoji character to reflect
the desired skin tone.
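A modified emoji is simply the base emoji followed by a modifier code point. The sketch below
assumes the waving hand (U+1F44B) as the base and U+1F3FD as the modifier. The constant and
function names are made up for illustration, and how the result is drawn depends on the font.

```python
import unicodedata

WAVING_HAND = "\U0001F44B"       # base emoji
MEDIUM_SKIN_TONE = "\U0001F3FD"  # one of the five skin tone modifiers


def with_skin_tone(base: str, modifier: str) -> str:
    """Append a skin tone modifier code point to a base emoji."""
    return base + modifier


waving = with_skin_tone(WAVING_HAND, MEDIUM_SKIN_TONE)
print(waving, len(waving))  # one visible symbol, two code points
print([f"U+{ord(ch):04X}" for ch in waving])
print(unicodedata.name(MEDIUM_SKIN_TONE))
```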
Unicode includes blocks for various ancient and historical scripts. This allows the
representation of characters from ancient writing systems such as Egyptian hieroglyphs,
cuneiform, and others. The inclusion of these scripts in Unicode enables the study,
preservation, and digital representation of historical texts.
The most common Unicode encodings are UTF-8 and UTF-16 (UTF stands for Unicode
Transformation Format). UTF-8 is a variable-width encoding that uses 8-bit code units, making
it efficient for representing ASCII characters while still supporting the full Unicode range.
UTF-16, on the other hand, uses 16-bit code units. Most characters in common use take one
unit, and the rest take a pair of units called a surrogate pair, so UTF-16 is also a
variable-width encoding.
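For code points above U+FFFF, UTF-16 splits the value across two 16-bit code units. The
following is a minimal sketch of that calculation. The function name `to_surrogate_pair` is made
up for this example, and the result is compared with Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from U+10000 to U+10FFFF need two units")
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the second unit
    return high, low


high, low = to_surrogate_pair(0x1F600)  # grinning face emoji
print(f"{high:04X} {low:04X}")
print("\U0001F600".encode("utf-16-be").hex())
```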
How does Unicode handle complex scripts like Indic scripts or Thai?
Unicode includes specific blocks for complex scripts like Indic scripts (such as Devanagari,
Tamil, Bengali) and Thai. These scripts have unique features such as conjuncts, stacking, and
contextual shaping. Unicode provides rules and guidelines for rendering and processing these
scripts, ensuring correct display and text manipulation within software applications.
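A conjunct is stored as a sequence of ordinary code points, and the rendering system combines
them into one shape. The sketch below builds one common Devanagari conjunct as an example. The
constant names and the helper `code_points` are chosen only for this illustration.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, joins two consonants
SSA = "\u0937"     # DEVANAGARI LETTER SSA

conjunct = KA + VIRAMA + SSA  # usually drawn as a single combined shape


def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(conjunct, code_points(conjunct))
print(unicodedata.name(VIRAMA), unicodedata.category(VIRAMA))
```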
What is the difference between Unicode and UTF-8 (Unicode Transformation Format)?
Unicode is a character encoding standard that assigns unique code points to every character,
while UTF-8 is one of the encoding schemes used to represent Unicode characters as bytes.
UTF-8 is a variable-width encoding that uses 8-bit code units to represent characters, making it
efficient for ASCII characters and compatible with legacy systems.
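The distinction can be made concrete by turning a code point (the Unicode part) into bytes (the
UTF-8 part) by hand. This is a simplified sketch that does not reject the surrogate range. The
function name `utf8_encode` is made up, and the result is checked against Python's built-in
encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (simplified sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:      # 1 byte: 0xxxxxxx (same as ASCII)
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ("A", "é", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```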
Can Unicode handle bidirectional text, like mixing English and Arabic in the same
paragraph?
Yes, Unicode supports bidirectional text by defining rules and algorithms for proper rendering
and display. It allows the mixing of left-to-right scripts (like English) and right-to-left scripts
(like Arabic or Hebrew) within the same document or paragraph, ensuring correct ordering and
alignment of the text.
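The ordering rules start from a direction class that Unicode assigns to every character. The
sketch below reads that class with the standard `unicodedata.bidirectional` function. The helper
name `bidi_classes` and the sample characters are chosen only for illustration.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


LATIN_A = "a"            # left-to-right letter
HEBREW_ALEF = "\u05D0"   # right-to-left letter
ARABIC_ALEF = "\u0627"   # right-to-left Arabic letter

# 'L' = left-to-right, 'R' = right-to-left, 'AL' = right-to-left Arabic
print(bidi_classes(LATIN_A + HEBREW_ALEF + ARABIC_ALEF))
```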
How does Unicode handle character rendering across different devices and operating
systems?
Unicode provides a standard for character encoding, but the visual representation depends on
the font rendering system of each device or operating system. Fonts play a crucial role in
displaying characters accurately, including their shape, size, and style. The availability and
quality of fonts can affect how Unicode characters are rendered.
How does Unicode handle text input methods for languages with large character sets?
Unicode supports various input methods and techniques for entering text in languages with
large character sets. These methods include keyboard layouts specifically designed for the
script, input methods that leverage phonetic conversions, and software applications that provide
character pickers or predictive text suggestions.