
CD Notes

1.1 Language Processors: Assembler, Compiler and Interpreter

Computer programs are generally written in high-level languages (like C++, Python, and
Java). A language processor, or language translator, is a computer program that converts
source code from one programming language into another language or into machine code
(also known as object code). Language processors also report errors found during translation.
What is a Language Processor?
Compilers and interpreters translate programs written in high-level languages into machine
code that a computer understands, while assemblers translate programs written in low-level
(assembly) language into machine code. The compilation process consists of several stages,
and tools are available to help programmers write error-free code.
Assembly language is machine-dependent, yet the mnemonics used to represent its
instructions are not directly understandable by the machine, whereas a high-level language
is machine-independent. A computer understands instructions only in machine code, i.e., in
the form of 0s and 1s, and it is a tedious task to write a computer program directly in
machine code. Programs are therefore written mostly in high-level languages like Java, C++,
and Python, and this text is called source code. Source code cannot be executed directly by
the computer and must first be converted into machine language. Hence, a special piece of
translator system software, called a language processor, is used to translate a program
written in a high-level language into machine code; the program after translation into
machine code is called the object program (or object code).
Types of Language Processors
The language processors can be any of the following three types:
1. Compiler
The language processor that reads the complete source program written in a high-level
language as a whole in one go and translates it into an equivalent program in machine
language is called a compiler. Example languages: C, C++, C#.
In a compiler, the source code is translated to object code successfully only if it is free of
errors. When there are errors in the source code, the compiler reports them at the end of
compilation, with line numbers. The errors must be removed before the compiler can
successfully recompile the source code. Once compiled, the object program can be
executed any number of times without translating it again.

2. Assembler
The assembler is used to translate a program written in assembly language into machine
code. The source program given as input to the assembler contains assembly language
instructions; the output generated by the assembler is the object code, or machine code,
understandable by the computer. The assembler is basically the first interface that lets
humans communicate with the machine; we need an assembler to fill the gap between
human and machine. Code written in assembly language consists of mnemonics
(instructions) such as ADD, MUL, SUB, DIV, and MOV, and the assembler converts these
mnemonics into binary code. These mnemonics depend on the architecture of the machine;
for example, the architectures of the Intel 8085 and Intel 8086 are different.
3. Interpreter
The language processor that translates a single statement of the source program into
machine code and executes it immediately, before moving on to the next line, is called an
interpreter. If there is an error in a statement, the interpreter terminates its translation at
that statement and displays an error message; it moves on to the next line for execution
only after the error has been removed. An interpreter thus directly executes instructions
written in a programming or scripting language without previously converting them to
object code or machine code: it translates one line at a time and then executes it.
Examples: Perl, Python, and MATLAB.

Difference Between Compiler and Interpreter

Compiler: A compiler is a program that converts the entire source code of a programming
language into executable machine code for a CPU.
Interpreter: An interpreter takes a source program and runs it line by line, translating each
line as it comes to it.

Compiler: The compiler takes a large amount of time to analyze the entire source code, but
the overall execution time of the program is comparatively faster.
Interpreter: An interpreter takes less time to analyze the source code, but the overall
execution time of the program is slower.

Compiler: The compiler generates error messages only after scanning the whole program,
so debugging is comparatively hard, as an error can be present anywhere in the program.
Interpreter: Debugging is easier, as the interpreter continues translating the program only
until the error is met.

Compiler: The compiler requires a lot of memory for generating object code.
Interpreter: It requires less memory than a compiler because no object code is generated.

Compiler: Generates intermediate object code.
Interpreter: No intermediate object code is generated.

Compiler: For security purposes, the compiler is more useful.
Interpreter: The interpreter is a little more vulnerable in terms of security.

Compiler examples: C, C++, C#.
Interpreter examples: Python, Perl, JavaScript, Ruby.

1.2 Phases of a Compiler


Introduction to Compiler Design
We basically have two phases of compilers, namely the Analysis phase and Synthesis
phase. The analysis phase creates an intermediate representation from the given source
code. The synthesis phase creates an equivalent target program from the intermediate
representation.
A compiler is a software program that converts the high-level source code written in a
programming language into low-level machine code that can be executed by the
computer hardware. The process of converting the source code into machine code
involves several phases or stages, which are collectively known as the phases of a
compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as
scanning. This phase reads the source code and breaks it into a stream of
tokens, which are the basic units of the programming language. The tokens
are then passed on to the next phase for further processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also
known as parsing. This phase takes the stream of tokens generated by the
lexical analysis phase and checks whether they conform to the grammar of
the programming language. The output of this phase is usually an Abstract
Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This
phase checks whether the code is semantically correct, i.e., whether it
conforms to the language’s type system and other semantic rules. In this
stage, the compiler checks the meaning of the source code to ensure that it
makes sense. The compiler performs type checking, which ensures that
variables are used correctly and that operations are performed on compatible
data types. The compiler also checks for other semantic errors, such as
undeclared variables and incorrect function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate
code generation. This phase generates an intermediate representation of the
source code that can be easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies
various optimization techniques to the intermediate code to improve the
performance of the generated machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase
takes the optimized intermediate code and generates the actual machine code
that can be executed by the target hardware.
In summary, the phases of a compiler are: lexical analysis, syntax analysis, semantic
analysis, intermediate code generation, optimization, and code generation.
Symbol Table – It is a data structure used and maintained by the compiler, containing all
the identifiers' names along with their types. It helps the compiler function smoothly by
finding identifiers quickly.
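As a rough illustration, a symbol table can be as simple as an array of name/type entries
with linear lookup; the C sketch below is a minimal, hypothetical version (real compilers
typically use hash tables for fast lookup):

#include <stdio.h>
#include <string.h>

/* Minimal symbol-table sketch: each entry stores an identifier's
   name and its type. */
struct symbol { char name[32]; char type[16]; };

static struct symbol table[100];
static int count = 0;

void insert_symbol(const char *name, const char *type) {
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    count++;
}

/* Linear search; returns the entry, or NULL if the name is unknown. */
struct symbol *lookup_symbol(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    insert_symbol("rate", "float");
    struct symbol *s = lookup_symbol("rate");
    if (s) printf("%s has type %s\n", s->name, s->type);
    return 0;
}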
The analysis of a source program is divided into mainly three phases. They are:
1. Linear Analysis-
This involves a scanning phase where the stream of characters is read from
left to right. It is then grouped into various tokens having a collective
meaning.
2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are
categorized hierarchically into nested groups.
3. Semantic Analysis-
This phase is used to check whether the components of the source program
are meaningful or not.
The compiler has two modules, namely the front end and the back end. The front end
comprises the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code
generator; the remaining phases together form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor (which
performs file inclusion and macro expansion) as the input which is in a pure
high-level language. It reads the characters from the source program and
groups them into lexemes (sequence of characters that “go together”). Each
lexeme corresponds to a token. Tokens are defined by regular expressions
which are understood by the lexical analyzer. It also removes lexical errors
(e.g., erroneous characters), comments, and white space.
2. Syntax Analyzer – It is sometimes called a parser. It constructs the parse
tree. It takes all the tokens one by one and uses Context-Free Grammar to
construct the parse tree.
Why Grammar?
The rules of programming can be entirely represented in a few productions.
Using these productions we can represent what the program actually is. The
input has to be checked whether it is in the desired format or not.
The parse tree is also called the derivation tree. Parse trees are generally
constructed to check for ambiguity in the given grammar. There are certain
rules associated with the derivation tree.
 Any identifier is an expression
 Any number can be called an expression
 Performing any operations in the given expression will always result
in an expression. For example, the sum of two expressions is also
an expression.
 The parse tree can be compressed to form a syntax tree
Syntax errors can be detected at this level if the input is not in accordance with the
grammar.

 Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful,
and produces a verified parse tree as output. It also performs type checking,
label checking, and flow-control checking.
 Intermediate Code Generator – It generates intermediate code, a form that can
be readily translated into machine code. There are many popular intermediate
representations, e.g., three-address code. The intermediate code is converted to
machine language by the last two phases, which are platform dependent.
Up to the intermediate code, the pipeline is essentially the same for every
compiler; after that, it depends on the platform. To build a new compiler we do
not need to build it from scratch: we can take the intermediate code from an
already existing compiler and build only the last two parts.
 Code Optimizer – It transforms the code so that it consumes fewer resources
and produces more speed. The meaning of the code being transformed is not
altered. Optimization can be categorized into two types: machine-dependent
and machine-independent.
 Target Code Generator – The main purpose of the target code generator is to
produce code that the machine can understand, performing register allocation,
instruction selection, etc. The output depends on the type of assembler. This is
the final stage of compilation: the optimized code is converted into relocatable
machine code, which then forms the input to the linker and loader.
All these six phases are associated with the symbol-table manager and the error handler.
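To make the pipeline concrete, here is a rough sketch of how the classic statement
position = initial + rate * 60 might pass through the phases (the exact form of each
representation varies between compilers):

Source:              position = initial + rate * 60
Lexical analysis:    id1 = id2 + id3 * 60            (names entered in the symbol table)
Syntax analysis:     a syntax tree with = at the root, id1 on the left,
                     and + above (id2, * (id3, 60)) on the right
Semantic analysis:   60 is converted to the type of rate, e.g. inttofloat(60)
Intermediate code:   t1 = inttofloat(60)
                     t2 = id3 * t1
                     t3 = id2 + t2
                     id1 = t3
Optimization:        t1 = id3 * 60.0
                     id1 = id2 + t1
Code generation:     target instructions such as a load, a multiply,
                     an add, and a store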
The advantages of using a compiler to translate high-level programming
languages into machine code are:
1. Portability: Compilers allow programs to be written in a high-level
programming language, which can be executed on different hardware
platforms without the need for modification. This means that programs can be
written once and run on multiple platforms, making them more portable.
2. Optimization: Compilers can apply various optimization techniques to the
code, such as loop unrolling, dead code elimination, and constant propagation,
which can significantly improve the performance of the generated machine
code.
3. Error Checking: Compilers perform a thorough check of the source code, which
can detect syntax and semantic errors at compile-time, thereby reducing the
likelihood of runtime errors.
4. Maintainability: Programs written in high-level languages are easier to
understand and maintain than programs written in low-level assembly
language. Compilers help in translating high-level code into machine code,
making programs easier to maintain and modify.
5. Productivity: High-level programming languages and compilers help in
increasing the productivity of developers. Developers can write code faster in
high-level languages, which can be compiled into efficient machine code.
In summary, compilers provide advantages such as portability, optimization, error
checking, maintainability, and productivity.

1.3 Introduction to Lexical Analysis


Lexical analysis is the first phase of the compiler, also known as scanning. It converts the
high-level input program into a sequence of tokens.
1. Lexical Analysis can be implemented with the Deterministic finite Automata.
2. The output is a sequence of tokens that is sent to the parser for syntax
analysis

What is a Token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar
of the programming languages. Example of tokens:
 Type token (id, number, real, . . . )
 Punctuation tokens (IF, void, return, . . . )
 Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name, etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc
Example of Non-Tokens:
 Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding
token or a sequence of input characters that comprises a single token is called a lexeme.
eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
How the Lexical Analyzer Works
1. Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
3. Token classification: In this stage, the lexer determines the type of each
token. For example, in a programming language, the lexer might classify
keywords, identifiers, operators, and punctuation symbols as separate token
types.
4. Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation (a
minimal tokenizer sketch follows this list).
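As a minimal, hypothetical sketch of stages 2 and 3 (tokenization and classification), the C
program below splits an input line into keywords, identifiers, numbers, and single-character
symbols; it is for illustration only and handles none of the harder cases (strings, comments,
multi-character operators):

#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main(void) {
    const char *src = "int a = b + 42;";
    const char *keywords[] = { "int", "return", "if", "while" };
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace */
        if (isalpha((unsigned char)*p) || *p == '_') {       /* identifier or keyword */
            char buf[64]; int n = 0;
            while (isalnum((unsigned char)*p) || *p == '_') buf[n++] = *p++;
            buf[n] = '\0';
            int kw = 0;
            for (int i = 0; i < 4; i++)
                if (strcmp(buf, keywords[i]) == 0) kw = 1;
            printf("%s: %s\n", kw ? "KEYWORD" : "IDENTIFIER", buf);
        } else if (isdigit((unsigned char)*p)) {             /* number */
            char buf[64]; int n = 0;
            while (isdigit((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("NUMBER: %s\n", buf);
        } else {
            printf("SYMBOL: %c\n", *p++);                    /* operator or punctuation */
        }
    }
    return 0;
}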

 The lexical analyzer identifies errors with the help of the automaton and the
grammar of the given language (e.g., C or C++), and reports the row number
and column number of each error.
Suppose we pass the statement a = b + c; through the lexical analyzer. It will
generate a token sequence like id = id + id;, where each id refers to its
variable in the symbol table, which references all its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:


'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

Above are the valid tokens. You can observe that we have omitted comments. As another
example, consider a printf statement such as printf("hello");. It contains 5 valid tokens:
printf, (, "hello", ), and ;.
Exercise 1: Count the number of tokens:
int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
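For verification, the 27 tokens can be enumerated as:
int, main, (, ), {, int, a, =, 10, ,, b, =, 20, ;, printf, (, "sum is:%d", ,, a, +, b, ), ;,
return, 0, ;, }
Note that the string literal "sum is:%d" counts as a single token.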

Exercise 2: Count number of tokens: int max(int i);


 Lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
 max is read and found to be a valid function name after reading (.
 int is also a token, then i is another token, and finally ;.
Answer: Total number of tokens: 7:
int, max, (, int, i, ), ;

 We can represent these in the form of lexemes and tokens as follows (the lexemes
correspond to a statement like while (a >= b) a = a - 2;):

Lexeme    Token
while     WHILE
(         LPAREN
a         IDENTIFIER
>=        COMPARISON
b         IDENTIFIER
)         RPAREN
a         IDENTIFIER
=         ASSIGNMENT
a         IDENTIFIER
-         ARITHMETIC
2         INTEGER
;         SEMICOLON

Advantages
1. Simplifies Parsing: Breaking down the source code into tokens makes it
easier for computers to understand and work with the code. This helps
programs like compilers or interpreters to figure out what the code is
supposed to do. It’s like breaking down a big puzzle into smaller pieces, which
makes it easier to put together and solve.
2. Error Detection: Lexical analysis will detect lexical errors such as misspelled
keywords or undefined symbols early in the compilation process. This helps in
improving the overall efficiency of the compiler or interpreter by identifying
errors sooner rather than later.
3. Efficiency: Once the source code is converted into tokens, subsequent
phases of compilation or interpretation can operate more efficiently. Parsing
and semantic analysis become faster and more streamlined when working
with tokenized input.
Disadvantages
1. Limited Context: Lexical analysis operates based on individual tokens and
does not consider the overall context of the code. This can sometimes lead to
ambiguity or misinterpretation of the code’s intended meaning especially in
languages with complex syntax or semantics.
2. Overhead: Although lexical analysis is necessary for the compilation or
interpretation process, it adds an extra layer of overhead. Tokenizing the
source code requires additional computational resources which can impact the
overall performance of the compiler or interpreter.
3. Debugging Challenges: Lexical errors detected during the analysis phase
may not always provide clear indications of their origins in the original source
code. Debugging such errors can be challenging especially if they result from
subtle mistakes in the lexical analysis process.

1.4 Bootstrapping in Compiler Design



Bootstrapping is a process in which a simple language is used to translate a more
complicated program, which in turn may handle an even more complicated program, and
so on. Writing a compiler for any high-level language is a complicated process and takes
a lot of time from scratch; hence a simple language is used to generate the target code
in stages. To clearly understand the bootstrapping technique, consider the following
scenario. Suppose we want to write a cross-compiler for a new language X. The
implementation language of this compiler is, say, Y, and the target code being generated
is in language Z; that is, we create XYZ. Now, if the existing compiler for Y runs on
machine M and generates code for M, it is denoted YMM. If we run XYZ using YMM, we
get a compiler XMZ: a compiler for source language X that generates target code in
language Z and runs on machine M.
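In plain text, the composition above can be sketched as follows (a rough illustration of
the notation, not a formal T-diagram):

X Y Z : compiler for source language X, written in Y, emitting target code Z
Y M M : existing compiler for Y that runs on machine M and emits M-code
running X Y Z through Y M M  =>  X M Z
X M Z : compiler for X that runs on machine M and emits Z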
We can create compilers of many different forms in this way. As an example, we will now
generate

a compiler that takes the C language and generates assembly language as output, given
the availability of a machine that runs assembly language.
 Step-1: First we write a compiler for a small subset of C in assembly language.
 Step-2: Then, using this small subset of C (call it C0) as the implementation
language, a compiler for the full source language C is written.
 Step-3: Finally we compile the second compiler: compiler 2 is compiled using
compiler 1.
 Step-4: Thus we get a compiler written in ASM that compiles C and generates
code in ASM.
Bootstrapping is the process of writing a compiler for a programming language using
the language itself. In other words, it is the process of using a compiler written in a
particular programming language to compile a new version of the compiler written in
the same language.
1. The process of bootstrapping typically involves several stages. In the first
stage, a minimal version of the compiler is written in a different language,
such as assembly language or C. This minimal version of the compiler is then
used to compile a slightly more complex version of the compiler written in
the target language. This process is repeated until a fully functional version
of the compiler is written in the target language.
2. There are several advantages to bootstrapping. One advantage is that it
ensures that the compiler is compatible with the language it is designed to
compile. This is because the compiler is written in the same language, so it is
better able to understand and interpret the syntax and semantics of the
language.
3. Another advantage is that it allows for greater control over the optimization
and code generation process. Since the compiler is written in the target
language, it can be optimized to generate code that is more efficient and
better suited to the target platform.
4. However, bootstrapping also has some disadvantages. One disadvantage is
that it can be a time-consuming process, especially for complex languages or
compilers. It can also be more difficult to debug a bootstrapped compiler,
since any errors or bugs in the compiler will affect the subsequent versions
of the compiler.
Overall, bootstrapping is an important technique in compiler design that allows for
greater control over the optimization and code generation process, while ensuring
compatibility between the compiler and the target language.
As for the advantages and disadvantages of bootstrapping in compiler design:

Advantages:

1. Bootstrapping ensures that the compiler is compatible with the language it is
designed to compile, as it is written in the same language.
2. It allows for greater control over the optimization and code generation
process.
3. It provides a high level of confidence in the correctness of the compiler
because it is self-hosted.

Disadvantages:
1. It can be a time-consuming process, especially for complex languages or
compilers.
2. Debugging a bootstrapped compiler can be challenging since any errors or
bugs in the compiler will affect the subsequent versions of the compiler.
3. Bootstrapping requires that a minimal version of the compiler be written in a
different language, which can introduce compatibility issues between the two
languages.
Overall, bootstrapping is a useful technique in compiler design, but it requires
careful planning and execution to ensure that the benefits outweigh the
drawbacks.

1.5 Input Buffering in Compiler Design

The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers begin ptr(bp) and forward ptr(fp) to keep track of the pointer of the input
scanned.
Input buffering is an important concept in compiler design that refers to the way in which
the compiler reads input from the source code. In many cases, the compiler reads input
one character at a time, which can be a slow and inefficient process. Input buffering is a
technique that allows the compiler to read input in larger chunks, which can improve
performance and reduce overhead.
1. The basic idea behind input buffering is to read a block of input from the
source code into a buffer, and then process that buffer before reading the next
block. The size of the buffer can vary depending on the specific needs of the
compiler and the characteristics of the source code being compiled. For
example, a compiler for a high-level programming language may use a larger
buffer than a compiler for a low-level language, since high-level languages
tend to have longer lines of code.
2. One of the main advantages of input buffering is that it can reduce the
number of system calls required to read input from the source code. Since
each system call carries some overhead, reducing the number of calls can
improve performance. Additionally, input buffering can simplify the design of
the compiler by reducing the amount of code required to manage input.
However, there are also some potential disadvantages to input buffering. For example, if
the size of the buffer is too large, it may consume too much memory, leading to slower
performance or even crashes. Additionally, if the buffer is not properly managed, it can
lead to errors in the output of the compiler.
Overall, input buffering is an important technique in compiler design that can help
improve performance and reduce overhead. However, it must be used carefully and
appropriately to avoid potential problems.

Initially both pointers point to the first character of the input string, as shown below.
The forward pointer moves ahead in search of the end of the lexeme; as soon as a blank
space is encountered, it indicates the end of the lexeme. In the above example, as soon
as the forward pointer (fp) encounters a blank space, the lexeme "int" is identified. When
fp encounters white space, it ignores it and moves ahead; then both the begin pointer
(bp) and forward pointer (fp) are set at the next token. The input characters are thus
read from secondary storage, but reading this way from secondary storage is costly,
hence a buffering technique is used: a block of data is first read into a buffer and then
scanned by the lexical analyzer. Two methods are used in this context, the One Buffer
Scheme and the Two Buffer Scheme, explained below.

1. One Buffer Scheme: In this scheme, only one buffer is used to store the
input string. The problem with this scheme is that if a lexeme is very long,
it crosses the buffer boundary; to scan the rest of the lexeme the buffer
has to be refilled, which overwrites the first part of the lexeme.

2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in
this method two buffers are used to store the input string. The first and
second buffers are scanned alternately: when the end of the current buffer
is reached, the other buffer is filled. The only problem with this method is
that if the length of a lexeme is longer than the length of a buffer, the
input cannot be scanned completely. Initially both bp and fp point to the
first character of the first buffer. Then fp moves toward the right in search
of the end of the lexeme; as soon as a blank character is recognized, the
string between bp and fp is identified as the corresponding token. To
identify the boundary of the first buffer, an end-of-buffer character is
placed at the end of the first buffer; similarly, the end of the second buffer
is recognized by the end-of-buffer mark at its end. When fp encounters the
first eof, it recognizes the end of the first buffer, and filling of the second
buffer starts; in the same way, when the second eof is reached, it indicates
the end of the second buffer. The buffers are filled alternately until the end
of the input program, and the stream of tokens is identified. The eof
character introduced at the end is called a sentinel and is used to identify
the end of a buffer.
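A rough C sketch of the two-buffer scheme with sentinels is shown below; the buffer size,
the sentinel value, and the refill helper are illustrative assumptions, and using '\0' as the
sentinel assumes the source text itself contains no NUL bytes:

#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'   /* eof marker placed after each buffer half */

static char buf[2 * (BUF_SIZE + 1)];  /* two halves, one sentinel slot each */
static char *fp;                      /* forward pointer */

/* Fill one half from the source and place the sentinel after the bytes
   actually read (so a short read marks the true end of input). */
static void refill(FILE *src, int half) {
    char *start = buf + half * (BUF_SIZE + 1);
    size_t n = fread(start, 1, BUF_SIZE, src);
    start[n] = SENTINEL;
}

static int next_char(FILE *src) {
    char c = *fp++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (fp - 1 == buf + BUF_SIZE) {              /* sentinel ends first half */
        refill(src, 1);
        fp = buf + BUF_SIZE + 1;
        return next_char(src);
    }
    if (fp - 1 == buf + 2 * BUF_SIZE + 1) {      /* sentinel ends second half */
        refill(src, 0);
        fp = buf;
        return next_char(src);
    }
    return EOF;                                  /* sentinel inside a half: real end of input */
}

int main(void) {
    refill(stdin, 0);
    fp = buf;
    int c;
    while ((c = next_char(stdin)) != EOF)
        putchar(c);                              /* a real scanner would form lexemes here */
    return 0;
}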

Advantages:

Input buffering can reduce the number of system calls required to read input from the
source code, which can improve performance.
Input buffering can simplify the design of the compiler by reducing the amount of code
required to manage input.

Disadvantages:

If the size of the buffer is too large, it may consume too much memory, leading to slower
performance or even crashes.
If the buffer is not properly managed, it can lead to errors in the output of the compiler.
Overall, the advantages of input buffering generally outweigh the disadvantages when
used appropriately, as it can improve performance and simplify the compiler design.

1.6 Specification of Tokens in Compiler Design

What are Tokens?


A token is the smallest individual element of a program that is meaningful to the compiler. It cannot be
further broken down. Identifiers, strings, keywords, etc., can be the example of the token. In the lexical
analysis phase of the compiler, the program is converted into a stream of tokens.
Different Types of Tokens
There can be multiple types of tokens. Some of them are-
1. Keywords
Keywords are words reserved for particular purposes and imply a special meaning to the compilers.
The keywords must not be used for naming a variable, function, etc.
2. Identifier
The names given to various components in the program, like the function's name or variable's name,
etc., are called identifiers. Keywords cannot be identifiers.
3. Operators
Operators are different symbols used to perform different operations in a programming language.
4. Punctuations
Punctuations are special symbols that separate different code elements in a programming language.
Consider the following line of code in C++ language -
int x = 45;
The above statement has multiple tokens, which are:
Keyword: int
Identifier: x
Constant (literal): 45
Operator: =
Punctuator: ;
Specification of Token
In compiler design, there are three specifications of token-
1. String
2. Language
3. Regular Expressions
1. Strings
Strings are a finite set of symbols or characters. These symbols can be a digit or an alphabet. There
is also an empty string which is denoted by ε.
Operations on String
The operations that can be performed on a string are-
1. Prefix
The prefix of String S is any string that is extracted by removing zero or more characters from the end
of string S. For example, if the String is "NINJA", the prefix can be "NIN" which is obtained by
removing "JA" from that String. A string is a prefix in itself.
Proper prefixes are special types of prefixes that are not equal to the String itself or equal to ε. We
obtain it by removing at least one character from the end of the String.
2. Suffix
The suffix of string S is any string that is extracted by removing any number of characters from the
beginning of string S. For example, if the String is "NINJA", the suffix can be "JA," which is obtained
by removing "NIN" from that String. A string is a suffix of itself.
Proper suffixes are special types of suffixes that are not equal to the String itself or equal to ε. It is
obtained by removing at least one character from the beginning of the String.
3. Substring
A substring of a string S is any string obtained by removing a prefix and a suffix of that String.
For example, if the String is "AYUSHI", then a substring can be "US", which is formed by removing
the prefix "AY" and the suffix "HI". Every String is a substring of itself.
Proper substrings are special types that are not equal to the String itself or equal to ε. It is obtained
by removing at least one prefix or suffix from the String.
4. Subsequence
The subsequence of the String is a string obtained by eliminating zero or more symbols from the
String. The symbols that are removed need not be consecutive. For example, if the String is
"NINJAMANIA," then a subsequence can be "NIJAANIA," which is produced by removing "N" and
"M."
Proper subsequences are special subsequences that are not equal to the String itself or equal to ε.
It is obtained by removing at least one symbol from the String.
5. Concatenation
Concatenation is defined as the joining of two strings. For example, if we have two strings
S = "Cod" and T = "ing", then the concatenation ST would be "Coding".
2. Language
A language can be defined as a finite set of strings over some symbols or alphabets.
Operations on Language
The following operations are performed on a language in the lexical analysis phase-
1. Union
Union is one of the most common operations we perform on a set. In terms of languages also, it will
hold a similar meaning.

Suppose there are two languages, L and S. Then the union of these two languages will be

L ∪ S = { x | x belongs to either L or S }

For example, if L = {a, b} and S = {c, d}, then L ∪ S = {a, b, c, d}.


2. Concatenation
Concatenation links two languages by linking the strings from one language to all the strings of the
other language.
If there are two languages, L and S, then the concatenation of L and S will be LS equal to { ls |
where l belongs to L and s belongs to S }.
For example, suppose {L1, L2} is the set of strings belonging to language L and {S1, S2} is the
set of strings belonging to language S. Then the concatenation LS will be {L1S1, L1S2, L2S1, L2S2}.
3. Kleene Closure
The Kleene closure of a language L, denoted by L*, provides the set of all strings that can be
obtained by concatenating strings of L zero or more times.
If L = {a, b}
then L* = {ε, a, b, aa, ab, ba, bb, aaa, …}
Positive Closure
L+ denotes the positive closure of a language L and provides the set of all strings that can be
obtained by concatenating strings of L one or more times.
If L = {a, b}
then L+ = {a, b, aa, ab, ba, bb, aaa, …}
3. Regular Expression
Regular expressions are strings of characters that define a searching pattern with the help of which
we can form a language, and each regular expression represents a language.
A regular expression r can denote a language L(r) which can be built recursively over the smaller
regular expression by following some rules.
Writing Regular Expressions
Following symbols are used very frequently to write regular expressions
 The asterisk symbol ( * ): It is used in our regular expression to instruct the compiler
that the symbol that preceded the * symbol can be repeated any number of times in the
pattern. For example, if the expression is ab*c, then it gives the following string- ac,
abc, abbc, abbbc, abbbbbc.. and so on.
 The plus symbol ( + ): It is used in our regular expression to tell the compiler that the
symbol that preceded + can be repeated one or more times in the pattern. For example,
if the expression is ab+c, then it gives the following string- abc, abbc, abbbc,
abbbbbc.. and so on.
 Wildcard Character ( . ): The '.' symbol, known as the wildcard character, matches
any single character at that position in the pattern.
 Character Class: It is a way of representing multiple characters. For example, [a –
z] denotes the regular expression a | b | c | d | ….|z.
The following rules are used to define a regular expression r over some alphabet Σ and the languages
denoted by these regular expressions.
 ε is a regular expression that denotes the language L(ε) = {ε}, i.e., the language
containing the single empty string.
 If there is a symbol 'a' in Σ, then 'a' is a regular expression that denotes the
language L(a) = {a}, i.e., the language containing only the one-character string "a".
 Given two regular expressions r and s:
r | s denotes the language L(r) ∪ L(s).
(r)(s) denotes the language L(r) ⋅ L(s).
(r)* denotes the language (L(r))*.
(r)+ denotes the language (L(r))+.
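As a practical aside, such regular expressions can be exercised from C through the POSIX
regex API; the sketch below (the identifier pattern and sample strings are illustrative
assumptions) tests strings against letter ( letter | digit )*:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* POSIX extended RE: a letter followed by letters or digits */
    const char *pattern = "^[A-Za-z][A-Za-z0-9]*$";
    const char *samples[] = { "rate", "x1", "273" };

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    for (int i = 0; i < 3; i++)
        printf("%-8s -> %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "matches" : "does not match");
    regfree(&re);
    return 0;
}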

1.7 Recognition of Tokens in Compiler Design

Recognition of Tokens

 Tokens obtained during lexical analysis are recognized by Finite Automata.


 Finite Automata (FA) is a simple idealized machine that can be used to recognize
patterns within input taken from a character set or alphabet (denoted as C). The
primary task of an FA is to accept or reject an input based on whether the defined
pattern occurs within the input.
 There are two notations for representing Finite Automata. They are:

1. Transition Table
2. Transition Diagram

1. Transition Table
It is a tabular representation that lists all possible transitions for each state and input
symbol combination.
EXAMPLE
Assume the following grammar fragment (the classic textbook fragment) to generate a
specific language:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

where the terminals if, then, else, relop, id, and num generate sets of strings given by
the following regular definitions:

if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

 where letter and digit are defined as letter → [A-Za-z] and digit → [0-9].
 For this language, the lexical analyzer will recognize the keywords if, then,
and else, as well as lexemes that match the patterns for relop, id, and number.
 To simplify matters, we make the common assumption that keywords are also
reserved words: that is they cannot be used as identifiers.
 The num represents the unsigned integer and real numbers of Pascal.
 In addition, we assume lexemes are separated by white space, consisting of
nonnull sequences of blanks, tabs, and newlines.
 Our lexical analyzer will strip out white space. It will do so by comparing the input
against the regular definition ws below:
ws → ( blank | tab | newline )+
 If a match for ws is found, the lexical analyzer does not return a token to the
parser.
 It is the following token that gets returned to the parser.
2. Transition Diagram
It is a directed labeled graph consisting of nodes and edges. Nodes represent states,
while edges represent state transitions.
Components of Transition Diagram

1. One state is labelled the Start State. It is the initial state of the transition diagram,
where control resides when we begin to recognize a token.
2. Positions in a transition diagram are drawn as circles and are called states.
3. The states are connected by arrows called edges. Labels on edges indicate the
input characters.
4. Zero or more final states, or accepting states, in which a token has been found,
are represented by double circles.
5. Example: a diagram where state 1 is the initial state and state 3 is a final state.

Here is the transition diagram of Finite Automata that recognizes the lexemes matching
the token relop.
Here is the Finite Automata Transition Diagram for recognizing white spaces.
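A transition diagram like the one for relop is typically coded as a switch over the current
state and input character. The following is a hedged C sketch of that idea (the token
names and the retract behavior are illustrative):

#include <stdio.h>

enum relop_token { LT, LE, NE, EQ, GT, GE };

/* Recognize a relop at the start of s. Returns the token and stores the
   number of characters consumed in *len; returns -1 if no relop starts
   here. Consuming only one character after peeking at the second models
   the "retract" action of the transition diagram. */
int recognize_relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }   /* Pascal-style not-equal */
        *len = 1; return LT;                        /* retract: only '<' belongs to the lexeme */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;                        /* retract */
    default:
        return -1;
    }
}

int main(void) {
    int len;
    int tok = recognize_relop("<= 42", &len);
    printf("token code %d, lexeme length %d\n", tok, len);  /* LE, 2 */
    return 0;
}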

1.8 LEX in Compiler Design


Lexical Analysis
It is the first step of compiler design: it takes a stream of characters as input and gives
tokens as output, a process also known as tokenization. The tokens can be classified into
identifiers, separators, keywords, operators, constants, and special characters.
It has three phases:
1. Tokenization: takes the stream of characters and converts it into tokens.
2. Error messages: reports errors related to lexical analysis, such as exceeding
the maximum identifier length or an unmatched string.
3. Eliminating comments: removes comments, blank spaces, new lines,
and indentation.
Lex
Lex is a tool or a computer program that generates Lexical Analyzers (converts the
stream of characters into tokens). The Lex tool itself is a compiler. The Lex compiler
takes the input and transforms that input into input patterns. It is commonly used
with YACC(Yet Another Compiler Compiler). It was written by Mike Lesk and Eric Schmidt.
Function of Lex
1. In the first step, the source code, written in the Lex language in a file named File.l, is
given as input to the Lex compiler (commonly known simply as Lex), which produces
lex.yy.c as output.
2. After that, the output lex.yy.c is used as input to the C compiler, which gives its output
in the form of an a.out file; finally, a.out takes a stream of characters and generates
tokens as output.
lex.yy.c: It is a C program.
File.l: It is a Lex source program.
a.out: It is the lexical analyzer.
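On a typical Unix system, this flow corresponds to commands along the following lines (a
sketch; the Lex library flag is -ll with classic lex and -lfl with flex):

lex File.l          (produces lex.yy.c)
cc lex.yy.c -ll     (produces a.out)
./a.out < input.txt (reads characters, emits tokens)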

Block Diagram of Lex

Lex File Format


A Lex program consists of three parts and is separated by %% delimiters:-
Declarations
%%
Translation rules
%%
Auxiliary procedures

Declarations: The declarations section includes declarations of variables and regular
definitions.
Translation rules: These rules consist of a pattern and an action.
Auxiliary procedures: The auxiliary section holds auxiliary functions used in the
actions.
For example:
number [0-9]+
%%
if        { return (IF); }
{number}  { return (NUMBER); }
%%
int numberSum()
{ /* auxiliary function body */ }
DFA (Deterministic finite automata)

o DFA refers to deterministic finite automata. Deterministic refers to the uniqueness
of the computation: the machine reads the input string one symbol at a time, and
each symbol determines a unique next state.

o In DFA, there is only one path for specific input from the current state to the next
state.
o DFA does not accept the null move, i.e., the DFA cannot change state without any
input character.

o DFA can contain multiple final states. It is used in Lexical Analysis in Compiler.

In the following diagram, we can see that from state q0 for input a, there is only one path
which is going to q1. Similarly, from q0, there is only one path for input b going to q2.

Formal Definition of DFA

A DFA is a collection of 5-tuples same as we described in the definition of FA.

1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: transition function

The transition function is defined as δ: Q x ∑ → Q

Graphical Representation of DFA

A DFA can be represented by digraphs called state diagram. In which:

1. The state is represented by vertices.

2. The arc labeled with an input character show the transitions.

3. The initial state is marked with an arrow.

4. The final state is denoted by a double circle.

Example 01: Construct a DFA with ∑ = {0,1} that accepts only the input string “10”.
In this example, the language contains only one string:
L = {10}
The DFA that accepts only the input string “10” is given below.
Example 02: Construct a DFA with ∑ = {a, b} that accepts only the input string “aaab”.
The given question provides the following language:
L = {aaab}
The following diagram represents the DFA accepter for L = {aaab}.

Example 03
Construct a DFA that accepts all strings over the alphabet ∑ = {0,1} that start with “0”.
Solution:
The given question provides the following language:
L = {0, 01, 00, 010, 011, 000, 001, …}
Let’s draw the DFA, which accepts all the strings starting with “0”.

Example 04
Construct a DFA that accepts all strings over the alphabet ∑ = {0,1} that start with “01”.
Solution:
The given question provides the following language:
L = {01, 010, 011, 0100, 0101, 0110, 0111, …}
Let’s draw the DFA which accepts all the strings of the above language.

Example 05
Construct a DFA that accepts all strings over the alphabet ∑ = {0,1} representing binary
integers divisible by 3.
Solution:
The given question provides the following language:
L = {0, 11, 110, 1001, 1100, 1111, …}
The explanation for the string 1001 is given below:
1001 in binary is 9, and 9/3 = 3; hence, 1001 is divisible by 3.
The following diagram represents the DFA accepter for the language of binary integers
divisible by three.
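This DFA translates directly into code: keep the remainder of the binary value read so far
as the state, since appending a bit b to a number with remainder r gives remainder
(2r + b) mod 3, and accept in state 0. A minimal C sketch (the function name and test
strings are illustrative):

#include <stdio.h>

/* DFA with states 0, 1, 2 = value-so-far mod 3; accepting state is 0. */
int divisible_by_3(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = (2 * state + (*s - '0')) % 3;
    return state == 0;
}

int main(void) {
    const char *tests[] = { "0", "11", "110", "1001", "111", "10" };
    for (int i = 0; i < 6; i++)
        printf("%s -> %s\n", tests[i], divisible_by_3(tests[i]) ? "accept" : "reject");
    return 0;
}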

Example 06
Construct a DFA with ∑ = {0, 1} for the language of strings ending with “0011”.
Solution:
The given question provides the following language (L) and DFA diagram:
L = {0011, 00011, 10011, 110011, 000011, …}
NFA (Non-Deterministic finite automata)

o NFA stands for non-deterministic finite automata. It is easier to construct an NFA
than a DFA for a given regular language.

o Finite automata are called NFA when there exist many paths for a specific input
from the current state to the next state.

o Not every NFA is a DFA, but each NFA can be translated into a DFA.

o An NFA is defined in the same way as a DFA, but with the following two exceptions:
it may contain multiple next states, and it may contain ε-transitions.

In the following image, we can see that from state q0 for input a, there are two next
states q1 and q2, similarly, from q0 for input b, the next states are q0 and q1. Thus it is
not fixed or determined that with a particular input where to go next. Hence this FA is
called non-deterministic finite automata.

Formal definition of NFA:

An NFA is also a 5-tuple, the same as a DFA, but with a different transition function, as
shown below:

δ: Q x ∑ → 2^Q (the power set of Q, i.e., a set of next states)

where,

1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: transition function

Graphical Representation of an NFA

An NFA can be represented by digraphs called state diagram. In which:

1. The state is represented by vertices.

2. The arc labeled with an input character show the transitions.

3. The initial state is marked with an arrow.

4. The final state is denoted by the double circle.

Example 1:

1. Q = {q0, q1, q2}


2. ∑ = {0, 1}
3. q0 = {q0}
4. F = {q2}

Solution:

Transition diagram:

Transition Table:

Present State   Next State for Input 0   Next State for Input 1

→q0 q0, q1 q1

q1 q2 q0

*q2 q2 q1, q2

In the above diagram, we can see that when the current state is q0, on input 0, the next
state will be q0 or q1, and on 1 input the next state will be q1. When the current state is
q1, on input 0 the next state will be q2 and on 1 input, the next state will be q0. When
the current state is q2, on 0 input the next state is q2, and on 1 input the next state will
be q1 or q2.

NFA Example
Design an NFA with ∑ = {0, 1} for all binary strings where the second last bit is 1.

Solution
The language generated by this example will include all strings in which the second-last bit is 1.
L= {10, 010, 000010, 11, 101011……..}
The following NFA automaton machine accepts all strings where the second last bit is 1.

 {q0, q1, q2} refers to the set of states


 {0,1} refers to the set of input alphabets
 δ refers to the transition function
 q0 refers to the initial state
 {q2} refers to the set of final states
Transition function δ is defined as
 δ (q0, 0) = q0
 δ (q0, 1) = q0, q1
 δ (q1, 0) = q2
 δ (q1, 1) = q2
 δ (q2, 0) = ϕ
 δ (q2, 1) = ϕ
Transition Table for the above Non-Deterministic Finite Automata is-

States   0    1

q0       q0   q0, q1

q1       q2   q2

q2       –    –
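An NFA can be run directly by tracking the set of currently active states. The C sketch
below simulates the automaton above with a bitmask (the encoding bit 0 = q0,
bit 1 = q1, bit 2 = q2 is an assumption for illustration):

#include <stdio.h>

/* Accept binary strings whose second-last bit is 1, by subset simulation. */
int second_last_is_1(const char *s) {
    unsigned states = 1u << 0;                    /* start in {q0} */
    for (; *s; s++) {
        unsigned next = 0;
        if (states & (1u << 0)) {                 /* from q0 */
            next |= 1u << 0;                      /* 0 or 1 stays in q0 */
            if (*s == '1') next |= 1u << 1;       /* on 1, also reach q1 */
        }
        if (states & (1u << 1))                   /* from q1, 0 or 1 goes to q2 */
            next |= 1u << 2;
        /* q2 has no outgoing transitions */
        states = next;
    }
    return (states & (1u << 2)) != 0;             /* accept if q2 is active */
}

int main(void) {
    const char *tests[] = { "10", "010", "11", "00", "101011", "1" };
    for (int i = 0; i < 6; i++)
        printf("%-8s -> %s\n", tests[i], second_last_is_1(tests[i]) ? "accept" : "reject");
    return 0;
}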

NFA Example
Draw an NFA with ∑ = {0, 1} such that the third symbol from the right is “1”.
Solution
The language generated by this example will include all strings where the third symbol from the
right is “1”.
L= {100, 111, 101, 110, 0100, 1100, 00100, 100101……..}
The NFA for the above language is as follows, where:
 {q0, q1, q2, q3} refers to the set of states
 {0,1} refers to the set of input alphabets
 δ refers to the transition function
 q0 refers to the initial state
 {q3} refers to the set of final states
Transition function δ is defined as
 δ (q0, 0) = q0
 δ (q0, 1) = q0,q1
 δ (q1, 0) = q2
 δ (q1, 1) = q2
 δ (q2, 0) = q3
 δ (q2, 1) = q3
 δ (q3, 0) = ϕ
 δ (q3, 1) = ϕ
Transition Table for the above Non-Deterministic Finite Automata is:

States   0    1
q0       q0   q0, q1
q1       q2   q2
q2       q3   q3
q3       –    –

NFA Example
Construct an NFA with ∑ = {a, b, c} where strings contain some a’s followed by some b’s
followed by some c’s.
Solution
The language generates strings such as
L = {abc, aabbcc, aaabbcc, aabbbcc, aaaabbbbccc, … }

The NFA transition diagram for the above language is as follows, where:
 {q0, q1, q2} refers to the set of states
 {a,b,c} refers to the set of input alphabets
 δ refers to the transition function
 q0 refers to the initial state
 {q2} refers to the set of final states
Transition function δ is defined as
 δ (q0, a) = q0
 δ (q0, b) = q1
 δ (q0, c) = ϕ
 δ (q1, a) = ϕ
 δ (q1, b) = q1
 δ (q1, c) = q2
 δ (q2, a) = ϕ
 δ (q2, b) = ϕ
 δ (q2, c) = q2
Transition Table for the above Non-Deterministic Finite Automata is-
States a b c

q0 q0 q1 –

q1 – q1 q2

q2 – – q2
