Module 2

Bachelor of Computer Applications, Sem-V
Introduction to Multimedia, BCAC502

Class
2023-24
Study Material
_____________________________________________________________________________________________
Table of Contents
Sl No. Topic Name
1 Text
2 Types of Text: Unformatted Text
3 Types of Text: Formatted Text
4 Types of Text: Hypertext and Hypermedia
5 Leading
6 Kerning
7 Tracking
8 Unicode Standard, ASCII Standard
9 Text Formats
10 Text Compression Techniques – RLE, LZW, Huffman Coding
Soumya Roy
Assistant Professor,
Department of Computational Science
Brainware University, Kolkata
MODULE 2 [TEXT]
Text:
Text is a human-readable sequence of characters and the words they
form that can be encoded into computer-readable formats such as
ASCII. Text is usually distinguished from non-character-encoded data,
such as graphic images in the form of bitmaps and program code,
which is sometimes referred to as being in "binary" (but is actually in
its own computer-readable format).
Types of text:
• Unformatted Text
• Formatted Text
• Hypertext and Hypermedia
Unformatted Text:
Unformatted text, also known as plain text, is the contents of an
ordinary sequential file readable as textual material without much
processing. Plain text is different from formatted text, where style
information is included, and from "binary files", in which some portions
must be interpreted as binary objects (encoded integers, real numbers,
images, etc.).
The encoding has traditionally been either ASCII, one of its many
derivatives such as ISO/IEC 646, or sometimes EBCDIC. Unicode-based
encodings such as UTF-8 and UTF-16 are gradually replacing the older
ASCII derivatives limited to 7- or 8-bit codes.
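As a rough illustration of how these encodings differ, the sketch below
counts how many bytes a string occupies in ASCII, UTF-8 and UTF-16. The
helper name encoded_sizes is an assumption made for this example; it
relies only on Python's built-in str.encode.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Return the byte length of `text` in several encodings.

    A value of None means the text cannot be represented in that encoding.
    """
    sizes: Dict[str, Optional[int]] = {}
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    for encoding in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            sizes[encoding] = None
    return sizes


print(encoded_sizes("Text"))    # plain English text fits in 7-bit ASCII
print(encoded_sizes("\u092A"))  # a Devanagari letter does not
```

English text takes one byte per character in both ASCII and UTF-8, while
a character outside the ASCII range cannot be stored in ASCII at all.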
Formatted Text:
Formatted text is text in which, apart from the actual alphanumeric
characters, control characters are used to change the appearance of
the characters, e.g. bold, underline, italics, and varying shapes, sizes
and colors. Most text-processing software uses such formatting options
to change text appearance. To print such a document, the printer should
also be capable of interpreting these control codes so that the
appropriate appearance may be reproduced.
Hypertext and Hypermedia:

Hypertext is inherently nonlinear. It is composed of many interlinked
chunks of self-contained text. Readers are not bound to a particular
sequence, but can browse through information intuitively by
association, pursuing their interests by following a highlighted
keyword or phrase in one piece of text to bring up another, associated
piece of text. The following figure illustrates this difference.
Hypermedia is the generalization of hypertext to include other kinds of
media: images, audio clips and video clips are typically supported in
addition to text. Individual chunks of information are usually referred to
as documents or nodes, and the connections between them as links or
hyperlinks (the so-called node-link hypermedia model). The entire set of
nodes and links forms a graph network. A distinct set of nodes and links
which constitutes a logical entity or work is called a hyperdocument; a
distinct subset of hyperlinks is often called a hyperweb.
A source anchor is the starting point of a hyperlink and specifies the
part of a document from which an outgoing link can be activated.
Typically, the user is given visual cues as to where source anchors are
located in a document (for example, a highlighted phrase in a text
document). A destination anchor is the endpoint of a hyperlink and
determines what part of a document should be on view upon arrival at
that node (for example, a text might be scrolled to a specific paragraph).
Often, an entire document is specified as the destination and viewing
commences at some default location within the document (for example,
the start of a text).
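The node-link model described above can be sketched as a small graph
structure. This is only an illustration: the class names Node and Link,
the follow helper, and the sample documents are all hypothetical and do
not come from any particular hypermedia system.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """A hyperlink from a source anchor to a destination anchor."""
    source_anchor: str                 # e.g. a highlighted phrase
    target: str                        # name of the destination node
    destination_anchor: str = "start"  # default viewing location


@dataclass
class Node:
    """A document (node) holding one kind of media and its outgoing links."""
    name: str
    media_type: str                    # "text", "image", "audio", "video"
    links: list = field(default_factory=list)


def follow(nodes: dict, node_name: str, source_anchor: str):
    """Activate a link; return (destination node, destination anchor)."""
    for link in nodes[node_name].links:
        if link.source_anchor == source_anchor:
            return link.target, link.destination_anchor
    raise KeyError(f"no link at {source_anchor!r} in {node_name!r}")


# A tiny hyperdocument: one text node linking to a text node and a video.
nodes = {
    "intro": Node("intro", "text", [
        Link("compression", "lzw", "paragraph-2"),
        Link("demo", "clip"),  # whole document as destination
    ]),
    "lzw": Node("lzw", "text"),
    "clip": Node("clip", "video"),
}

print(follow(nodes, "intro", "compression"))
print(follow(nodes, "intro", "demo"))
```

The second link names no destination anchor, so viewing starts at the
default location, matching the "entire document as destination" case.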
Leading:
When working with a paragraph, or just more than one line of type,
leading is the distance between the baselines in the paragraph. A
baseline is the imaginary guideline that type sits on. The standard
proportion of leading to type size is typically 120%. So if the type size
is 20 point, then the most standard leading would be 24 point. The term
originated in the days of hand-typesetting, when thin strips of lead were
inserted into the forms to increase the vertical distance between lines of
type. The term is still used in modern page layout software such as
QuarkXPress and Adobe InDesign.
In consumer-oriented word processing software, this concept is usually
referred to as "line spacing" or "interline spacing".
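The 120% rule of thumb is simple arithmetic, shown here as a small
sketch. The function name leading_pt and its percent parameter are
assumptions made for the example.

```python
def leading_pt(type_size_pt: float, percent: float = 120) -> float:
    """Return the leading (baseline-to-baseline distance) in points."""
    return type_size_pt * percent / 100


print(leading_pt(20))       # 20 pt type at the standard 120% proportion
print(leading_pt(10, 150))  # looser leading for a more open paragraph
```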
Kerning:
Kerning is an adjustment of space between two specific letters. The goal
of kerning is to create a consistent rhythm of space within a group of
letters and to create an appearance of even spacing between letters.
Fonts have exact amounts of spacing between letter combinations
already built into them, which is called Metric Kerning. Type takes on
Metric Kerning by default. The goal of kerning is for the type to look
optically correct. There is no mathematical formula, and it often
just takes practice.
Tracking:
Kerning should not be confused with tracking, which refers to uniform
spacing between all of the letters in a group of text. By increasing
tracking in a word, line of text, or paragraph, a designer can create a
more open and airy element.
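The difference between the two adjustments can be seen in a small width
calculation. Everything numeric here is hypothetical: the advance widths
and kerning pairs are invented for illustration and are not taken from
any real font.

```python
# Hypothetical advance widths and kerning pairs, in font units.
ADVANCE = {"A": 600, "V": 600, "T": 550, "o": 500}
KERN_PAIRS = {("A", "V"): -80, ("V", "A"): -80, ("T", "o"): -60}


def text_width(text: str, tracking: int = 0, use_kerning: bool = True) -> int:
    """Total width of `text`: advances plus per-pair kerning plus tracking."""
    width = 0
    for i, ch in enumerate(text):
        width += ADVANCE[ch]
        if i + 1 < len(text):
            # Tracking is added uniformly between every pair of letters.
            width += tracking
            # Kerning applies only to specific letter pairs.
            if use_kerning:
                width += KERN_PAIRS.get((ch, text[i + 1]), 0)
    return width


print(text_width("AVA", use_kerning=False))  # no adjustment
print(text_width("AVA"))                     # pair kerning only
print(text_width("AVA", tracking=20))        # kerning plus uniform tracking
```

Kerning changes the gap only where a listed pair occurs, while tracking
adds the same amount to every gap in the run of text.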
Unicode Standard:
The Unicode Standard is a universal character coding scheme for written
characters and text. It defines a consistent way of encoding
multilingual text, which enables textual data to be exchanged
universally. The Unicode Consortium was incorporated in 1991 to promote
the Unicode Standard. The UTC (Unicode Technical Committee) is the
working group within the consortium responsible for the creation,
maintenance and quality of the Unicode Standard. For example, the Hindi
character “Pa” is represented by the Unicode sequence 0000 1001
0010 1010 (U+092A); how it will be rendered on the screen is decided by
the font vendor. In this two-byte form, the first byte indicates the
script area (here Devanagari) while the next byte identifies the actual
character within it.
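The code point quoted above can be inspected with Python's standard
library. This is a minimal sketch; the helper name describe is an
assumption, while ord, str.encode and unicodedata.name are standard.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the name, code point, bit pattern and encoded bytes of `ch`."""
    code_point = ord(ch)
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{code_point:04X}",
        "bits": f"{code_point:016b}",
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),
    }


for key, value in describe("\u092A").items():
    print(f"{key:10} {value}")
```

The 16-bit pattern matches the sequence given in the text, and the
UTF-16 (big-endian) bytes are the same two bytes, whereas UTF-8 spreads
the same code point over three bytes.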
ASCII Character Set:
Extended ASCII Character Set:
Font:
In traditional typography, a font is a particular size, weight and style of
a typeface. On the Windows platform, font files are stored in a specific
folder called Fonts under the Windows folder. These files are usually
in vector format, meaning that character descriptions are stored
mathematically. Windows calls these fonts TrueType fonts.
Text File Formats:
• TXT: TXT (Text) is an unformatted text document created by an
editor like Notepad on the Windows platform.
• DOC, DOCX: DOC (Document) is a proprietary document file
format developed by Microsoft as the native format for storing
documents created by the MS-Word package in 1989. From Word 2007, a
new file format, DOCX, is used, which emphasizes an XML-based format.
• RTF: RTF (Rich Text Format) is a proprietary document file format
developed by Microsoft in 1987 for cross-platform document
exchange. It is the default format for Mac OS X's editor TextEdit.
• PS: PS (PostScript) is a page description language used mainly for
desktop publishing. A page description language is a high-level
language that can describe the contents of a page such that it can
be accurately displayed on output devices, usually a printer.
PostScript was developed at Adobe in the early 1980s and released
in 1984; in 1985, the Apple LaserWriter was the first printer to
ship with PostScript. PostScript offered a universal language that
could be used for any brand of printer. PostScript represents all
graphics and even text as vectors, i.e. as combinations of lines
and curves. A PostScript-compatible program converts an input
document into PS format, which is sent to the printer. A PostScript
interpreter inside the printer converts the vectors back into the
raster dots to be printed. Because PostScript is run in an
interpreter to generate an image, the process requires many
resources. It can handle not just graphics, but standard features
of programming languages such as if and loop commands. PDF is
largely based on PostScript but simplified to remove flow control
features like these, while graphics commands such as lineto remain.
• PDF: PDF (Portable Document Format) is a file format developed
by Adobe Systems in 1993 for cross-platform exchange of
documents. Each PDF file encapsulates a complete description of a
fixed-layout flat document, including the text, fonts, graphics, and
other information needed to display it. PDF has several advantages
over PostScript:
  • PDF contains tokenized and interpreted results of the
  PostScript source code, for direct correspondence between
  changes to items in the PDF page description and changes to
  the resulting page appearance.
  • PDF (from version 1.4) supports true graphic transparency;
  PostScript does not.

Text Compression:
Text compression should be lossless. There are several types of
algorithm available for text compression:
• Run-length Encoding
• Huffman Coding
• LZW Coding
• Shannon-Fano Coding
LZW: An adaptive Lossless compression technique:

A different approach to adaptation is taken by the popular
Lempel-Ziv-Welch (LZW) algorithm. This method was developed originally
by Ziv and Lempel, and subsequently improved by Welch. As the message to
be encoded is processed, the LZW algorithm builds a string table that
maps symbol sequences to/from an N-bit index. The string table
has 2^N entries, and the transmitted code can be used at the decoder as
an index into the string table to retrieve the corresponding original
symbol sequence. The sequences stored in the table can be arbitrarily
long. The algorithm is designed so that the string table can be
reconstructed by the decoder based on information in the encoded
stream: the table, while central to the encoding and decoding process,
is never transmitted! This property is crucial to the understanding of
the LZW method.
When encoding a byte stream, the first 2^8 = 256 entries of the string
table, numbered 0 through 255, are initialized to hold all the possible
one-byte sequences. The other entries will be filled in as the message
byte stream is processed. First, accumulate message bytes as long as the
accumulated sequences appear as some entry in the string table. At some
point, appending the next byte b to the accumulated sequence S would
create a sequence S + b that’s not in the string table, where + denotes
appending b to S. The encoder then executes the following steps:
1. It transmits the N-bit code for the sequence S.
2. It adds a new entry to the string table for S + b. If the encoder finds
the table full when it goes to add an entry, it reinitializes the table before
the addition is made.
3. It resets S to contain only the byte b.
This process repeats until all the message bytes are consumed, at which
point the encoder makes a final transmission of the N-bit code for the
current sequence S.
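The encoder steps above can be sketched in Python as follows. This is a
minimal sketch under stated assumptions, not a reference implementation:
the function name lzw_encode, the default N = 12, and returning the codes
as a list of integers (rather than packing them into N-bit fields) are
choices made for illustration.

```python
from typing import Dict, List


def lzw_encode(data: bytes, n_bits: int = 12) -> List[int]:
    """Encode `data` with LZW, returning one integer code per transmission."""
    if n_bits < 9:
        raise ValueError("n_bits must leave room beyond the 256 byte entries")
    max_entries = 1 << n_bits  # the string table holds 2^N entries

    def fresh_table() -> Dict[bytes, int]:
        # Entries 0..255 hold all possible one-byte sequences.
        return {bytes([i]): i for i in range(256)}

    table = fresh_table()
    codes: List[int] = []
    s = b""  # the accumulated sequence S
    for value in data:
        b = bytes([value])
        if s + b in table:
            s += b  # keep accumulating while S + b is in the table
            continue
        codes.append(table[s])       # step 1: transmit the code for S
        if len(table) >= max_entries:
            table = fresh_table()    # table full: reinitialize first
        table[s + b] = len(table)    # step 2: add an entry for S + b
        s = b                        # step 3: reset S to the byte b
    if s:
        codes.append(table[s])       # final transmission for the last S
    return codes


print(lzw_encode(b"abcabcabcabc"))
```

For the first four repetitions of abc this yields seven codes for twelve
input bytes; codes 256 and above refer to multi-byte table entries such
as ab, bc and ca that were added while encoding.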
Example:
The following table shows the encoder in action on a repeating
sequence of abc. The string: abcabcabcabcabcabcabcabcabcabcabcabc
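To see the other half of the process, the sketch below decodes a short
LZW code sequence by rebuilding the string table on the fly, as the
decoder is said to do. The code list was worked out by hand from the
encoder steps for the first four repetitions of abc. The function name
lzw_decode is an assumption, and for brevity the sketch assumes the
table never fills, so reinitialization is not handled.

```python
from typing import List


def lzw_decode(codes: List[int]) -> bytes:
    """Decode LZW codes, reconstructing the string table while decoding."""
    table = {i: bytes([i]) for i in range(256)}
    if not codes:
        return b""
    prev = table[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry the encoder has just created:
            # it must be the previous sequence plus its own first byte.
            entry = prev + prev[:1]
        else:
            raise ValueError(f"invalid LZW code {code}")
        out += entry
        # Mirror the encoder: add previous sequence + first byte of this one.
        table[len(table)] = prev + entry[:1]
        prev = entry
    return bytes(out)


# Hand-derived codes for b"abcabcabcabc" (an assumption of this example).
print(lzw_decode([97, 98, 99, 256, 258, 257, 259]))
```

No table is passed to the decoder; entries 256 onward are recreated from
the code stream alone.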
Disadvantages with LZ compression:

LZ compression substitutes the detected repeated patterns with
references to a dictionary. Unfortunately, the larger the dictionary, the
greater the number of bits that are necessary for the references. The
optimal size of the dictionary also varies for different types of data; the
more variable the data, the smaller the optimal size of the dictionary.
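The link between dictionary size and reference size is a base-2
logarithm, which the following sketch makes concrete. It assumes
fixed-width references; the function name reference_bits is hypothetical.

```python
def reference_bits(dictionary_entries: int) -> int:
    """Bits needed for a fixed-width reference into the dictionary."""
    if dictionary_entries < 1:
        raise ValueError("dictionary must have at least one entry")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2.
    return max(1, (dictionary_entries - 1).bit_length())


for entries in (256, 4096, 65536):
    print(entries, "entries ->", reference_bits(entries), "bits per reference")
```

Growing the dictionary from 4096 to 65536 entries raises the cost of
every reference from 12 to 16 bits, which only pays off if the extra
entries are actually matched.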
Lempel-Ziv/Huffman practical compression (Examples):
DriveSpace
DriveSpace and DoubleSpace are programs used in PC systems to
compress files on hard disk drives. They use a mixture of Huffman and
Lempel-Ziv coding, where Huffman codes are used to differentiate
between data (literal values) and back references and LZ coding is used
for back references.
GIF files
The Graphics Interchange Format (GIF) uses a compression algorithm
based on the Lempel-Ziv-Welch (LZW) compression scheme. When
compressing an image, the compression program maintains a list of
substrings that have been found previously. When a repeated string is
found, the repeated item is replaced with a pointer to the original. Since
images tend to contain many repeated values, GIF compresses such
images well.
UNIX compress/uncompress
The UNIX programs compress and uncompress use adaptive Lempel-Ziv
coding. They are generally better than pack and unpack, which are
based on Huffman coding. Where possible, the compress program adds
a ‘.Z’ onto a file name when compressed. Compressed files can be
restored using the uncompress or zcat programs.
UNIX archive/zoo
The UNIX-based zoo freeware file compression utility employs the
Lempel-Ziv algorithm. It can store and selectively extract multiple
generations of the same file. Data can also be recovered from damaged
archives by skipping the damaged portion and locating undamaged data.
CODEC:
After an analog quantity has been digitized, it is stored on the disk as a
digital file. Such files are referred to as raw or uncompressed media
data. To compress the file and reduce its size, it needs to be filtered
through specialized software called a CODEC, which is short for
Compression/Decompression or Coder/Decoder. The software reads the
media data and applies mathematical algorithms to reduce its size. The
algorithms work by trying to find redundant information within the
media files. Redundant information is data that can either be
discarded without appreciably affecting the media quality, or be
expressed in a more compact form. In either case this leads to a
reduction in file size, but the actual amount of reduction depends on a
large number of factors involving both the media data and the CODEC.
Compression:
The process of converting an input data stream (the source stream or the
original raw data) into another data stream (the output, or the
compressed stream) that has a smaller size (low-redundancy). The
decompressor or decoder converts in the opposite direction.
Data compression is popular for two reasons:
1. Faster transmission.
2. Storing data in less memory.
Lossless Compression vs. Lossy Compression:
Run Length Encoding(RLE):

Run-Length Encoding (RLE) is a simple form of lossless data
compression that works on sequences in which the same value occurs
many consecutive times. It encodes such a sequence by storing only a
single value and its count.
In 'lossless compression', the codecs keep all of the information about a
file. The compressed file, once decompressed, can be reconstructed so it
is exactly like the file before it was compressed, with no loss of any
information at all.
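The lossless property can be checked directly: compress, decompress, and
compare with the original. The sketch below uses Python's standard zlib
module purely as a convenient lossless codec; zlib implements DEFLATE
(a combination of LZ77 and Huffman coding), which is a swapped-in
technique rather than RLE. The helper name roundtrip is an assumption.

```python
import zlib
from typing import Tuple


def roundtrip(data: bytes) -> Tuple[int, int, bool]:
    """Return (original size, compressed size, restored == original)."""
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return len(data), len(compressed), restored == data


original, packed, identical = roundtrip(b"abc" * 1000)
print(original, packed, identical)
```

Highly repetitive input shrinks considerably, and the final flag confirms
that every byte is recovered after decompression.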
RLE works by looking through the data in a file and identifying
strings of repeated characters, each called a 'run'. A run is then
encoded into a small number of bytes, usually two. The first byte,
called the 'run count', holds the number of characters in the run. The
second byte, called the 'run value', is the actual character in the run.
For example, the following data would take 20 bytes to store (because
there are twenty characters): GGGGGGGGGGGGGGGGGGGG but the
same data could be encoded as 20G using RLE, and you would only
need two bytes to do it. 20G is called a 'run packet'. The first byte of this
run packet, the run count, holds 20 and the second byte, the run value,
holds G.
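The two-byte run packet can be sketched as follows. The function name
rle_pack is an assumption, and because the run count is stored in a
single byte, this sketch caps each run at 255 and splits longer runs
into several packets.

```python
def rle_pack(data: bytes) -> bytes:
    """Encode `data` as (run count, run value) byte pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run_value = data[i]
        run_count = 1
        # Extend the run while the same byte repeats (at most 255 per packet).
        while (i + run_count < len(data)
               and data[i + run_count] == run_value
               and run_count < 255):
            run_count += 1
        out += bytes([run_count, run_value])  # one run packet
        i += run_count
    return bytes(out)


packet = rle_pack(b"G" * 20)
print(len(packet), packet[0], chr(packet[1]))
```

Twenty G bytes collapse to a single packet of two bytes: the count 20
followed by the value G.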
RLE algorithms are fast and simple. How well they compress data
depends upon what is being encoded. Suppose you are encoding a
picture of a page in a book that you've just scanned. If the page is
mostly white, then RLE will compress the file very well because there
will be lots of runs of the same pixel value for white. If the page is
mostly a busy color photo, then there will be far fewer runs of the same
color value.
Example: If the input string is “WWWWAAADEXXXXXX”, then the
run-length encoding is W4A3D1E1X6 (here each run is written as the
value followed by its count).
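The string example above can be reproduced with a few lines of Python.
This is a sketch under one assumption stated loudly: the input contains
no digits, since digits in the data would be confused with the counts.
The function names rle_encode and rle_decode are hypothetical.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Write each run as its character followed by the run length."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Invert rle_encode (assumes the original text had no digits)."""
    pairs = re.findall(r"(\D)(\d+)", encoded)
    return "".join(ch * int(count) for ch, count in pairs)


encoded = rle_encode("WWWWAAADEXXXXXX")
print(encoded)
print(rle_decode(encoded))
```

Note that the single characters D and E grow from one symbol to two
(D1, E1), which shows why RLE helps only when runs are long.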
Run Length Encoding is one of the oldest compression methods. It is a
data compression algorithm that is supported by most bitmap file
formats, such as TIFF, BMP, and PCX. RLE can be applied to any type
of data regardless of its information content, but the content of the
data will affect the compression ratio achieved by RLE.
Huffman Coding:
File compression, particularly for multimedia data, is widely used to
reduce Internet traffic and transfer times. Two common compression
formats for images are GIF and JPEG. JPEG throws away information
about the image, so the original image cannot be reconstructed exactly
from the compressed image; it is a lossy compression technique. GIF
compresses its pixel data losslessly with LZW, although reducing an
image to GIF's palette of at most 256 colors can itself discard color
information. Lossy compression can be very effective for multimedia
data.
Huffman Coding is an entropy encoding algorithm used for lossless data
compression. Entropy encoding is a generic term for compression
techniques that do not take into account the nature of the
information to be compressed. Such techniques are also
known as statistical compression. Huffman coding is one of the most
popular coding methods for data compression, especially text
compression. It was developed by David A. Huffman while he was a PhD
student at MIT and published in the 1952 paper ‘A Method for the
Construction of Minimum-Redundancy Codes’.
Instead of using a fixed-length code for each symbol:
1. Represent a frequently occurring character in a source with a shorter
code.
2. Represent a less frequently occurring one with a longer code.
3. The total number of bits in this way of representation is, hopefully,
significantly reduced.
Huffman Encoding Algorithm:

1. Construct a frequency table sorted in descending order.
2. Build a binary tree, iterating until a single tree containing every
symbol remains:
a) Merge the last two items (which have the minimum frequencies)
of the frequency table to form a new combined item whose
frequency is the sum of the two.
b) Insert the combined item and update the frequency table.
3. Derive the Huffman tree: starting at the root, trace down to every
leaf (mark ‘0’ for a left branch and ‘1’ for a right branch).
4. Generate the Huffman code: collect the 0s and 1s on each path
from the root to a leaf and assign a 0-1 code word to each symbol.
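The algorithm above can be sketched in Python. One substitution should
be named plainly: instead of re-sorting a frequency table after every
merge, this sketch keeps the items in a priority queue (heapq), which
always yields the two minimum-frequency items and so performs the same
merges. The function name huffman_codes and the tie-breaking counter are
assumptions; with equal frequencies, a different tie-break may produce
different but equally short codes.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a Huffman code (symbol -> bit string) for `text`."""
    freq = Counter(text)                      # step 1: frequency table
    if not freq:
        return {}
    tie = count()                             # keeps heap entries comparable
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                        # a lone symbol still needs a bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:                      # step 2: merge the two minima
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    root = heap[0][2]

    codes: Dict[str, str] = {}

    def walk(node, prefix: str) -> None:      # steps 3 and 4: trace the paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")       # left branch is 0
            walk(node[1], prefix + "1")       # right branch is 1
        else:
            codes[node] = prefix

    walk(root, "")
    return codes


codes = huffman_codes("hhhheggggabgggfffbcceeeeddhhhhddcfff")
for symbol in sorted(codes):
    print(symbol, codes[symbol])
```

Frequent symbols such as g and h receive two-bit codes, while the rare
symbols a and b receive five-bit codes.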
Huffman Coding Characteristics:

The Huffman coding method consists of identifying the most frequent
bit or byte patterns in a file and coding these patterns with fewer
bits. Less frequent patterns are coded with more bits; the most
frequent patterns use the shortest codes.
• A table of correspondence called the code-book is prepared between
the initial patterns and their new representation, and this must be
available at both the encoding and decoding ends.
• Frequencies of the occurrences of each character are analyzed during
the encoding process.
• Code words are generated by using a binary tree whose branches are
assigned as follows: a binary 0 (zero) for the left branch and a binary
1 (one) for the right branch.
• Huffman encoding is a type of variable-length encoding that is based
on the actual character frequencies in a given document. A Huffman
code is a prefix-free code: no code word is the prefix of another, so
the encoded bit stream can be decoded unambiguously.
• Time complexity: O(n log n), where n is the number of distinct
symbols.
Example:
Consider the following example. The frequencies of characters are
calculated in a tabular format. Draw the Huffman tree and derive the
Huffman code.
String: hhhheggggabgggfffbcceeeeddhhhhddcfff
Comparative study of bits length:

For uncompressed data:
Each character takes a 7-bit ASCII code for representation.
Total frequency = 36
Total bits required = 36 × 7 = 252 bits
After compression:
Total bits = sum of (number of bits required to encode the character
× character frequency)
= (5 × 1)+(5 × 2)+(4 × 3)+(3 × 4)+(3 × 5)+(3 × 6)+(2 × 7)+(2 × 8)
= 5+10+12+12+15+18+14+16 = 102 bits
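The comparison can be re-checked mechanically. In the sketch below the
frequencies are counted from the example string, while the code lengths
are copied from the sum above; which letter gets which length is
inferred by matching each frequency in the sum to the character counts
in the string, so that mapping is an assumption of this example. The
name bit_totals is hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

TEXT = "hhhheggggabgggfffbcceeeeddhhhhddcfff"

# Code lengths in bits, taken from the (bits x frequency) terms above.
CODE_LENGTHS = {"a": 5, "b": 5, "c": 4, "d": 3, "e": 3, "f": 3, "g": 2, "h": 2}


def bit_totals(text: str, code_lengths: Dict[str, int],
               fixed_bits: int = 7) -> Tuple[int, int]:
    """Return (bits with a fixed-length code, bits with the given lengths)."""
    freq = Counter(text)
    uncompressed = len(text) * fixed_bits
    compressed = sum(code_lengths[ch] * n for ch, n in freq.items())
    return uncompressed, compressed


before, after = bit_totals(TEXT, CODE_LENGTHS)
print(before, after, f"saving = {1 - after / before:.1%}")
```

The encoded text needs 102 of the original 252 bits, a saving of roughly
60%, not counting the space needed to store the code-book itself.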
Huffman Decoding:
To decode the encoded data we require the Huffman tree.
To decode a Huffman-encoded bit string, start at the root of the
Huffman tree and use the input bits to determine the path to the leaf:
1. Start at the root of the tree.
2. For each bit in the input stream:
• If the bit is a 0, take the left branch.
• If the bit is a 1, take the right branch.
• If at a leaf, output the leaf's byte value and reset position to the root.
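The decoding loop above translates almost line for line into code. In
this sketch the tree is a nested tuple (left, right) with symbols at the
leaves; the three-symbol TREE and the name huffman_decode are
hypothetical and chosen only to keep the example small.

```python
def huffman_decode(bits: str, tree) -> str:
    """Decode a string of '0'/'1' characters using a Huffman tree."""
    out = []
    node = tree                                       # 1. start at the root
    for bit in bits:                                  # 2. for each input bit
        node = node[0] if bit == "0" else node[1]     # 0 = left, 1 = right
        if not isinstance(node, tuple):               # reached a leaf
            out.append(node)                          # output its symbol
            node = tree                               # reset to the root
    if node is not tree:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)


# Hypothetical tree: a = 0, b = 10, c = 11.
TREE = ("a", ("b", "c"))
print(huffman_decode("010110", TREE))
```

Because the code is prefix-free, the decoder never needs separators
between code words: reaching a leaf marks the end of each one.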