Compiler Design - Software Design Project
Dedication
SE4 compiler design workgroup
Contents
Chapter One
1 Introduction
 1.1 Preface
 1.2 Project Proposal
 1.3 The Plan
 1.4 Setting Milestones
 1.5 Report Overview
Chapter Two
2 Compiler Analysis
 2.1 What Is a Compiler
 2.2 History of a Compiler
 2.3 Compiler vs. Interpreter
 2.4 Compiler Phases
  2.4.1 Front End Phases
  2.4.2 Back End Phases
 2.5 Object Oriented Development
 2.6 Compilation Processes
  2.6.1 Lexical Analyzer Phase
  2.6.2 Syntax Analyzer Phase
   2.6.2.1 Context Free Grammar (CFG)
   2.6.2.2 Deriving the Parse Tree
  2.6.3 Semantic Analyzer Phase
   2.6.3.1 Types and Declarations
   2.6.3.2 Type Checking
   2.6.3.3 Scope Checking
  2.6.4 Intermediate Code Generator Phase
  2.6.5 Code Optimizer Phase
   2.6.5.1 Types of Optimizations
  2.6.6 Code Generator Phase
  2.6.7 Error Handling Methods
   2.6.7.1 Recovery from the Lexical Phase Errors
   2.6.7.2 Recovery from the Semantic Phase Errors
  2.6.8 Symbol Table & Symbol Table Manager
Chapter Three
3 Compiler Design
 3.1 Compiler Design
 3.2 The Context of the System
  3.2.1 Compiler Architectural Design
   3.2.1.1 Compiler Structuring
    3.2.1.1.1 Compiler Organization
   3.2.1.2 Compiler Control Modeling
   3.2.1.3 Compiler Modular Decomposition
   3.2.1.4 Compiler Domain Specific Architectures
 3.3 Compiler Use Case Models
  3.3.1 Lexical Analyzer Use Case Diagram
  3.3.2 Syntax Analyzer Use Case Diagram
  3.3.3 Semantic Analyzer Use Case Diagram
  3.3.4 I.R Code Generator Use Case Diagram
  3.3.5 I.R Code Optimizer Use Case Diagram
  3.3.6 Code Generator Use Case Diagram
  3.3.7 Symbol Table Manager Use Case Diagram
  3.3.8 Error Handler Use Case Diagram
 3.4 Compiler's Subsystems Activity Diagrams
  3.4.1 Lexical Analyzer Activity Diagram
  3.4.2 Syntax Analyzer Activity Diagram
  3.4.3 Semantic Analyzer Activity Diagram
  3.4.4 I.R Code Generator Activity Diagram
  3.4.5 I.R Code Optimizer Activity Diagram
  3.4.6 Code Generator Activity Diagram
  3.4.7 Symbol Table Manager Activity Diagram
  3.4.8 Error Handler Activity Diagram
 3.5 Compiler Class Diagram
 3.6 Compiler Sequence Diagrams
  3.6.1 Lexical Analyzing Sequence Diagram
  3.6.2 Syntax Analyzing Sequence Diagram
  3.6.3 Semantic Analyzing Sequence Diagram
  3.6.4 I.R Code Generation Sequence Diagram
  3.6.5 I.R Code Optimizing Sequence Diagram
  3.6.6 Code Generation Sequence Diagram
  3.6.7 Additional Sequence Diagram
Glossary
Introduction
1.1 Preface
Compiler design and construction is an exercise in engineering design. The compiler writer must
choose a path through a decision space that is filled with diverse alternatives, each with distinct costs,
advantages, and complexity. Each decision has an impact on the resulting compiler. The quality of the end
product depends on informed decisions at each step of the way.
This project, a short study, tries to explore that design space: to present some of the ways a compiler can be designed and constructed, side by side with the principles of the software engineering phases. One advantage that we did not find in the books covering compiler construction is the set of UML (Unified Modeling Language) diagrams included here, which illustrate the compiler development process from several perspectives and also describe the smaller procedures that make up the compiler's subsystems and the communication between them that leads to producing the compiler's target file.
1.2 Project proposal
In the early days, the approach taken to compiler design used to be directly affected by the complexity of
the processing, the experience of the person(s) designing it, and the resources available.
A compiler for a relatively simple language written by one person might be a single, monolithic piece of
software. When the source language is large and complex, and high quality output is required the design may
be split into a number of relatively independent phases. Having separate phases means development can be
parceled up into small parts and given to different people. It also becomes much easier to replace a single
phase by an improved one, or to insert new phases later (eg, additional optimizations).
So, our project is to design a small functional compiler, taking the above hints into consideration, applying the software design principles in our work, and establishing it step by step until the end of the development.
1.4 Setting Milestones
In order to establish a sense of progression towards the goals of the project, certain milestones have been set:
• Implementation of the object-oriented strategy in the analysis phase of the development.
• Implementation of the object-oriented strategy in the design phase of the development.
• Implementation of the UML (Unified Modeling Language) diagrams in the design phase, as follows:
  o Implementation of the use case diagrams.
  o Implementation of the activity diagrams.
  o Implementation of the class diagrams and the relations between these classes.
  o Implementation of the sequence diagrams.
1.5 Report Overview
Chapter 1: A brief overview of what will follow in the report is given here.
Chapter 2 (Compiler Analysis): Describes why we use the object-oriented technique in the analysis phase and how the compiler is produced, by illustrating the theoretical concepts and principles behind the analysis of the compiler phases.
Chapter 3 (Compiler Design): Describes the main design aspects of the project, commencing with the requirements analysis and concluding with the specific design of the project parts, and gives the main diagrams produced during the design phase to simplify the implementation work done by the programmers.
Chapter Two
Compiler Analysis
2.2 History of a Compiler
Software for early computers was primarily written in assembly language for many years. Higher level
programming languages were not invented until the benefits of being able to reuse software on different kinds of CPUs
started to become significantly greater than the cost of writing a compiler. The very limited memory capacity of early
computers also created many technical problems when implementing a compiler.
Towards the end of the 1950s, machine‐independent programming languages were first proposed.
Subsequently, several experimental compilers were developed. The first compiler was written by Grace Hopper, in
1952, for the A‐0 programming language. The FORTRAN team led by John Backus at IBM is generally credited as having
introduced the first complete compiler, in 1957. COBOL was an early language to be compiled on multiple
architectures, in 1960.
In many application domains the idea of using a higher level language quickly caught on. Because of the
expanding functionality supported by newer programming languages and the increasing complexity of computer
architectures, compilers have become more and more complex.
Early compilers were written in assembly language. The first self‐hosting compiler — capable of compiling its
own source code in a high‐level language — was created for Lisp by Tim Hart and Mike Levin at MIT in 1962. Since the
1970s it has become common practice to implement a compiler in the language it compiles, although both Pascal and C
have been popular choices for implementation language. Building a self‐hosting compiler is a bootstrapping problem—
the first such compiler for a language must be compiled either by a compiler written in a different language, or (as in
Hart and Levin's Lisp compiler) compiled by running the compiler in an interpreter.
2.3 Compiler Vs Interpreter
We usually prefer to write computer programs in languages we understand rather than in machine
language, but the processor can only understand machine language. So we need a way of converting our
instructions (source code) into machine language. This is done by an interpreter or a compiler.
An interpreter reads the source code one instruction or line at a time, converts this line into machine
code and executes it. The machine code is then discarded and the next line is read. The advantage of this is
it's simple and you can interrupt it while it is running, change the program and either continue or start again.
The disadvantage is that every line has to be translated every time it is executed, even if it is executed many
times as the program runs. Because of this interpreters tend to be slow. Examples of interpreters are Basic on
older home computers, and script interpreters such as JavaScript, and languages such as Lisp and Forth.
A compiler reads the whole source code and translates it into a complete machine code program to
perform the required tasks which is output as a new file. This completely separates the source code from the
executable file. The biggest advantage of this is that the translation is done once only and as a separate
process. The program that is run is already translated into machine code so is much faster in execution. The
disadvantage is that you cannot change the program without going back to the original source code, editing
that and recompiling (though for a professional software developer this is more of an advantage because it
stops source code being copied). Current examples of compilers are Visual Basic, C, C++, C#, Fortran, Cobol,
Ada, Pascal and so on.
You will sometimes see reference to a third type of translation program: an assembler. This is like a compiler,
but works at a much lower level, where one source code line usually translates directly into one machine
code instruction. Assemblers are normally used only by people who want to squeeze the last bit of
performance out of a processor by working at machine code level.
2.4 Compiler Phases
In the early days, the approach taken to compiler design used to be directly affected by the
complexity of the processing, the experience of the person(s) designing it, and the resources available.
A compiler for a relatively simple language written by one person might be a single, monolithic piece
of software. When the source language is large and complex, and high quality output is required the design
may be split into a number of relatively independent phases. Having separate phases means development
can be parceled up into small parts and given to different people. It also becomes much easier to replace a
single phase by an improved one, or to insert new phases later (eg, additional optimizations).
2.4.1 Front End Phases
1. Line reconstruction. Languages which strop their keywords or allow arbitrary spaces within identifiers
require a phase before parsing, which converts the input character sequence to a canonical form
ready for the parser. The top‐down, recursive‐descent, table‐driven parsers used in the 1960s typically
read the source one character at a time and did not require a separate tokenizing phase. Atlas
Autocode, and Imp (and some implementations of Algol and Coral66) are examples of stropped
languages whose compilers would have a Line Reconstruction phase.
2. Lexical analysis breaks the source code text into small pieces called tokens. Each token is a single
atomic unit of the language, for instance a keyword, identifier or symbol name. The token syntax is
typically a regular language, so a finite state automaton constructed from a regular expression can be
used to recognize it. This phase is also called lexing or scanning, and the software doing lexical analysis
is called a lexical analyzer or scanner.
3. Preprocessing. Some languages, e.g., C, require a preprocessing phase which supports macro
substitution and conditional compilation. Typically the preprocessing phase occurs before syntactic or
semantic analysis; e.g. in the case of C, the preprocessor manipulates lexical tokens rather than
syntactic forms. However, some languages such as Scheme support macro substitutions based on
syntactic forms.
4. Syntax analysis involves parsing the token sequence to identify the syntactic structure of the program.
This phase typically builds a parse tree, which replaces the linear sequence of tokens with a tree
structure built according to the rules of a formal grammar which define the language's syntax. The
parse tree is often analyzed, augmented, and transformed by later phases in the compiler.
5. Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and
builds the symbol table. This phase performs semantic checks such as type checking (checking for type
errors), or object binding (associating variable and function references with their definitions), or
definite assignment (requiring all local variables to be initialized before use), rejecting incorrect
programs or issuing warnings. Semantic analysis usually requires a complete parse tree, meaning that
this phase logically follows the parsing phase, and logically precedes the code generation phase,
though it is often possible to fold multiple phases into one pass over the code in a compiler
implementation.
2.4.2 Back End Phases
2. Optimization: the intermediate language representation is transformed into functionally equivalent
but faster (or smaller) forms. Popular optimizations are inline expansion, dead code elimination,
constant propagation, loop transformation, register allocation or even automatic parallelization.
3. Code generation: the transformed intermediate language is translated into the output language,
usually the native machine language of the system. This involves resource and storage decisions, such
as deciding which variables to fit into registers and memory and the selection and scheduling of
appropriate machine instructions along with their associated addressing modes (see also Sethi‐Ullman
algorithm).
Figure 2.2
2.6.1 Lexical Analyzer Phase
The lexical analyzer takes a source program as input and produces a stream of tokens as output; this process is often called tokenization. The lexical analyzer might recognize particular instances of tokens such as 3 or 255 for an integer constant token, "Fred" or "Wilma" for a string constant token, or numTickets or queue for a variable token. Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token; the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants).
Regular expression notation can be used for the specification of tokens because tokens constitute a regular set. It is compact and precise, and a deterministic finite automaton (DFA) can be constructed that accepts the language specified by the regular expression. The DFA is used to recognize the language specified by the regular expression notation, making the automatic construction of a recognizer of tokens possible. Therefore, the study of regular expression notation and finite automata becomes necessary. A simple regular expression used to separate various kinds of tokens is shown in Figure 2.3.
Figure 2.3
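As a rough illustration of this idea, the following Python sketch (ours, not taken from the project; the token-class names and patterns are assumptions chosen for the example) uses the standard re module to split a small source fragment into token classes similar to those suggested by Figure 2.3.

    import re

    # Hypothetical token classes; the names and patterns are our own choices.
    TOKEN_SPEC = [
        ("NUMBER",   r"\d+"),              # integer constants such as 3 or 255
        ("IDENT",    r"[A-Za-z_]\w*"),     # identifiers such as numTickets or queue
        ("STRING",   r'"[^"]*"'),          # string constants such as "Fred"
        ("OPERATOR", r"[+\-*/=<>]"),       # arithmetic and relational operators
        ("SKIP",     r"[ \t\n]+"),         # whitespace is discarded
        ("ERROR",    r"."),                # anything else is an error character
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(source):
        """Yield (token_class, lexeme) pairs for the given source text."""
        for match in MASTER_RE.finditer(source):
            kind, lexeme = match.lastgroup, match.group()
            if kind == "SKIP":
                continue
            yield kind, lexeme

    print(list(tokenize("numTickets = 255 + queue")))
    # [('IDENT', 'numTickets'), ('OPERATOR', '='), ('NUMBER', '255'), ('OPERATOR', '+'), ('IDENT', 'queue')]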
Therefore, a suitable notation must be used to specify the constructs of a language. The notation for the construct specifications should be compact, precise, and easy to understand. The syntax-structure specification for the programming language (i.e., the valid constructs of the language) uses context-free grammar (CFG), because a regular expression is not powerful enough to represent languages which require parenthesis matching to arbitrary depths. For certain classes of grammar, we can automatically construct an efficient parser that determines whether a source program is syntactically correct. Hence, CFG notation is a required topic of study.
2.6.2.1 Context Free Grammar ( CFG )
CFG notation specifies a context‐free language that consists of terminals, nonterminals, a start
symbol, and productions. The terminals are nothing more than tokens of the language, used to form the
language constructs. Nonterminals are the variables that denote a set of strings. For example, S and E are
nonterminals that denote statement strings and expression strings, respectively, in a typical programming
language. The nonterminals define the sets of strings that are used to define the language generated by the
grammar. They also impose a hierarchical structure on the language, which is useful for both syntax analysis
and translation. Grammar productions specify the manner in which the terminals and string sets, defined by
the nonterminals, can be combined to form a set of strings
defined by a particular nonterminal. For example, consider the production S → aSb, shown in Figure 2.4. This production specifies that the strings defined by the nonterminal S are obtained by concatenating terminal a with any string belonging to the set of strings defined by nonterminal S, and then with terminal b. Each production consists of a nonterminal on the left-hand side and a string of terminals and nonterminals on the right-hand side. The left-hand side of a production is separated from the right-hand side by the "→" symbol, which identifies a relation on the set (V ∪ T)*.
Figure 2.4
Therefore, a context-free grammar is a four-tuple denoted as:
G = (V, T, P, S)
where:
1. V is a finite set of symbols called nonterminals or variables,
2. T is a finite set of symbols called terminals,
3. P is a set of productions, and
4. S is a member of V, called the start symbol.
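To make the four-tuple concrete, the following minimal sketch (our own encoding, not part of the original text; the dictionary layout and the added base production S → ab, assumed here so that derivations can terminate, are our choices) represents the grammar of Figure 2.4 and performs a leftmost derivation.

    # Minimal encoding of a context-free grammar G = (V, T, P, S).
    # The productions follow the S -> aSb example, plus an assumed
    # base case S -> ab so that derivations can terminate.
    GRAMMAR = {
        "nonterminals": {"S"},                  # V
        "terminals": {"a", "b"},                # T
        "productions": {                        # P, keyed by left-hand side
            "S": [("a", "S", "b"), ("a", "b")],
        },
        "start": "S",                           # S
    }

    def derive(symbols, depth):
        """Expand the leftmost nonterminal `depth` times (a leftmost derivation)."""
        if depth == 0:
            return symbols
        for i, sym in enumerate(symbols):
            if sym in GRAMMAR["nonterminals"]:
                body = GRAMMAR["productions"][sym][0]   # always pick S -> aSb here
                return derive(symbols[:i] + list(body) + symbols[i + 1:], depth - 1)
        return symbols

    print(derive(["S"], 2))   # ['a', 'a', 'S', 'b', 'b']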
2.6.2.2 Deriving the Parse Tree
When deriving a string w from S, if every derivation is considered to be a step in the tree construction,
then we get the graphical display of the derivation of string w as a tree. This is called a "derivation tree" or a
"parse tree" of string w. Therefore, a derivation tree or parse tree is the display of the derivations as a tree.
Note that a tree is a derivation tree if it satisfies the following requirements:
1. All the leaf nodes of the tree are labeled by terminals of the grammar.
2. The root node of the tree is labeled by the start symbol of the grammar.
3. The interior nodes are labeled by the nonterminals.
4. If an interior node has a label A, and it has n descendants with labels X1, X2, …, Xn from left to right,
then the production rule A → X1 X2 X3 …… Xn must exist in the grammar.
For example, consider a grammar whose list of productions is:
E → E + E
E → E * E
E → id
The tree shown in Figure 2.5 is a derivation tree for the string id + id * id.
Figure 2.5
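A derivation tree can be represented directly as nested nodes. The sketch below (ours; the node layout is an assumption) builds the tree of Figure 2.5 for id + id * id and reads back its frontier, i.e. the derived string of terminals.

    # A parse-tree node: a label plus an ordered list of children (empty for leaves).
    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    # The tree of Figure 2.5 for "id + id * id", built from E -> E+E | E*E | id.
    tree = Node("E", [
        Node("E", [Node("id")]),
        Node("+"),
        Node("E", [
            Node("E", [Node("id")]),
            Node("*"),
            Node("E", [Node("id")]),
        ]),
    ])

    def frontier(node):
        """Concatenate the leaf labels from left to right (the derived string)."""
        if not node.children:
            return [node.label]
        return [leaf for child in node.children for leaf in frontier(child)]

    print(" ".join(frontier(tree)))   # id + id * id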
2.6.3.1 Types and Declarations
We begin with some basic definitions to set the stage for performing semantic analysis. A type is a set of values
and a set of operations operating on those values. There are three categories of types in most programming languages:
• Base types: int, float, double, char, bool, etc. These are the primitive types provided directly by the underlying hardware. There may be a facility for user-defined variants on the base types (such as C enums).
• Compound types: arrays, pointers, records, structs, unions, classes, and so on. These types are constructed as aggregations of the base types and simple compound types.
• Complex types: lists, stacks, queues, trees, heaps, tables, etc. You may recognize these as abstract data types. A language may or may not have support for these sorts of higher-level abstractions.
In many languages, a programmer must first establish the name and type of any data object (e.g., variable,
function, type, etc.). In addition, the programmer usually defines the lifetime. A declaration is a statement in a program
that communicates this information to the compiler. The basic declaration is just a name and type, but in many
languages it may include modifiers that control visibility and lifetime (i.e., static in C, private in Java). Some languages
also allow declarations to initialize variables, such as in C, where you can declare and initialize in one statement.
2.6.3.2 Type Checking
Type checking is the process of verifying that each operation executed in a program respects the type system of
the language. This generally means that all operands in any expression are of appropriate types and number. Much of
what we do in the semantic analyzer phase is type checking. Sometimes the rules regarding operations are defined by
other parts of the code (as in function prototypes), and sometimes such rules are a part of the definition of the
language itself (as in "both operands of a binary arithmetic operation must be of the same type"). If a problem is found,
e.g., one tries to add a char pointer to a double in C, we encounter a type error. A language is considered strongly typed if each and every type error is detected during compilation. Type checking can be done during compilation, during execution, or divided across both.
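As a small illustration of the rule that both operands of a binary arithmetic operation must be of the same type, the following simplified checker (our own sketch; the type names and error messages are assumptions) type-checks a binary operation at compile time.

    # Simplified compile-time type checking for binary arithmetic expressions.
    NUMERIC_TYPES = {"int", "float", "double"}

    class SemanticError(Exception):
        """Raised when an operation violates the (assumed) type rules."""

    def check_binary(op, left_type, right_type):
        """Return the result type of `left op right`, or raise a type error."""
        if left_type not in NUMERIC_TYPES or right_type not in NUMERIC_TYPES:
            raise SemanticError(f"operands of '{op}' must be numeric, got {left_type}, {right_type}")
        if left_type != right_type:
            raise SemanticError(f"operands of '{op}' must have the same type, got {left_type}, {right_type}")
        return left_type

    print(check_binary("+", "int", "int"))          # int
    try:
        check_binary("+", "char*", "double")        # the example from the text above
    except SemanticError as err:
        print("type error:", err)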
2.6.3.3 Scope Checking
To understand how this is handled in a compiler, we need a few definitions. A scope is a section of program
text enclosed by basic program delimiters, e.g., {} in C, or begin‐end in Pascal. Many languages allow nested scopes that
are scopes defined within other scopes. The scope defined by the innermost such unit is called the current scope. The
scopes defined by the current scope and by any enclosing program units are known as open scopes. Any other scope is a closed scope. As we encounter identifiers in a program, we need to determine if the identifier is accessible at that point in
the program. This is called scope checking. If we try to access a local variable declared in one function in another
function, we should get an error message. This is because only variables declared in the current scope and in the open
scopes containing the current scope are accessible.
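The open/closed-scope rule can be modelled with a stack of scopes: the top of the stack is the current scope and the stack as a whole holds the open scopes. The sketch below is a minimal illustration under that assumption; the class and method names are our own.

    # Scope checking with a stack of scopes: the top of the stack is the current
    # scope, and the whole stack is the set of open scopes.
    class ScopeChecker:
        def __init__(self):
            self.scopes = [set()]                 # the global scope is always open

        def enter_scope(self):
            self.scopes.append(set())             # e.g. at '{' in C

        def exit_scope(self):
            self.scopes.pop()                     # e.g. at '}' in C; that scope becomes closed

        def declare(self, name):
            self.scopes[-1].add(name)             # declarations go into the current scope

        def is_accessible(self, name):
            # An identifier is accessible if it is declared in any open scope.
            return any(name in scope for scope in self.scopes)

    sc = ScopeChecker()
    sc.declare("g")
    sc.enter_scope()
    sc.declare("local")
    print(sc.is_accessible("local"), sc.is_accessible("g"))   # True True
    sc.exit_scope()
    print(sc.is_accessible("local"))                          # False: its scope is now closed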
2.6.4 Intermediate Code Generator Phase
The intermediate code generator is the part of the compilation process that converts the compiler's internal representation of the source program into an intermediate form that later phases can translate into code readily executed by a machine.
The input to the intermediate code generator typically consists of a parse tree or an abstract syntax tree. The tree is converted into a linear sequence of instructions, usually in an intermediate language such as three-address code. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program. (For example, a peephole optimization pass would not likely be called "code generation", although a code generator might incorporate a peephole optimization pass.)
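To make the idea of a linear intermediate form concrete, the sketch below (ours; the tuple layout of the three-address instructions is an assumption) walks a small abstract syntax tree for a * b + c and emits three-address code.

    # Lowering an abstract syntax tree to three-address code.
    # An AST node is either a string (a name) or a tuple (op, left, right).
    def gen_tac(node, code):
        """Return the name holding `node`'s value, appending instructions to `code`."""
        if isinstance(node, str):
            return node                             # a plain variable needs no instruction
        op, left, right = node
        l_name = gen_tac(left, code)
        r_name = gen_tac(right, code)
        temp = f"t{len(code) + 1}"                  # fresh temporary name
        code.append((op, l_name, r_name, temp))     # temp := l_name op r_name
        return temp

    code = []
    gen_tac(("+", ("*", "a", "b"), "c"), code)
    for op, x, y, dest in code:
        print(f"{dest} := {x} {op} {y}")
    # t1 := a * b
    # t2 := t1 + c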
Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi‐
stage process is used because many algorithms for code optimization are easier to apply one at a time, or
because the input to one optimization relies on the processing performed by another optimization. This
organization also facilitates the creation of a single compiler that can target multiple architectures, as only
the last of the code generation stages (the backend) needs to change from target to target.
2.6.5 Code Optimizer Phase
The code optimizer takes as its input the intermediate representation code produced by the intermediate code generator and performs the optimization process, finally producing the optimized code that becomes the input to the code generator. Compiler optimization can be defined as the process of tuning the intermediate representation code so as to minimize or maximize some attribute of the executable program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied. The growth of portable computers has also created a market for minimizing the power consumed by a program.
Code optimization refers to the techniques used by the compiler to improve the execution efficiency
of the generated object code. It involves a complex analysis of the intermediate code and the performance of
various transformations; but every optimizing transformation must also preserve the semantics of the
program. That is, a compiler should not attempt any optimization that would lead to a change in the
program's semantics.
Optimization can be machine‐independent or machine‐dependent. Machine‐independent
optimizations can be performed independently of the target machine for which the compiler is generating
code; that is, the optimizations are not tied to the target machine's specific platform or language. Examples
of machine‐independent optimizations are: elimination of loop invariant computation, induction variable
elimination, and elimination of common subexpressions.
On the other hand, machine‐dependent optimization requires knowledge of the target machine. An attempt to
generate object code that will utilize the target machine's registers more efficiently is an example of machine‐
dependent code optimization. Actually, code optimization is a misnomer; even after performing various optimizing
transformations, there is no guarantee that the generated object code will be optimal. Hence, we are actually
performing code improvement. When attempting any optimizing transformation, the following criteria
should be applied:
1‐ The optimization should capture most of the potential improvements without an unreasonable
amount of effort.
2‐ The optimization should be such that the meaning of the source program is preserved.
3‐ The optimization should, on average, reduce the time and space expended by the object code.
2.6.5.1 Types of optimizations
Techniques used in optimization can be broken up among various scopes which can affect anything
from a single statement to the entire program. Generally speaking, locally scoped techniques are easier to
implement than global ones but result in smaller gains. Some examples of scopes include:
• Local optimizations: These only consider information local to a function definition. This
reduces the amount of analysis that needs to be performed (saving time and reducing storage
requirements) but means that worst case assumptions have to be made when function calls
occur or global variables are accessed (because little information about them is available).
• Loop optimizations: These act on the statements which make up a loop, such as a for loop
(eg, loop‐invariant code motion). Loop optimizations can have a significant impact because
many programs spend a large percentage of their time inside loops.
• Peephole optimizations: Usually performed late in the compilation process after machine
code has been generated. This form of optimization examines a few adjacent instructions (like
"looking through a peephole" at the code) to see whether they can be replaced by a single
instruction or a shorter sequence of instructions. For instance, a multiplication of a value by 2
might be more efficiently executed by left‐shifting the value or by adding the value to itself.
(This example is also an instance of strength reduction.)
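As a tiny example of the peephole idea and of strength reduction, the sketch below (ours; it reuses the same assumed three-address tuple format as the earlier sketches) rewrites a multiplication by 2 into an addition of the value to itself.

    # A one-pattern peephole pass over three-address instructions of the form
    # (op, arg1, arg2, dest).  Multiplication by 2 is replaced by an addition
    # of the value to itself (an instance of strength reduction).
    def peephole(code):
        optimized = []
        for op, a, b, dest in code:
            if op == "*" and b == "2":
                optimized.append(("+", a, a, dest))   # x * 2  ->  x + x
            else:
                optimized.append((op, a, b, dest))
        return optimized

    print(peephole([("*", "x", "2", "t1"), ("+", "t1", "y", "t2")]))
    # [('+', 'x', 'x', 't1'), ('+', 't1', 'y', 't2')]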
2.6.6 Code Generator Phase
This is the last phase of the compiler's operation, and it is a machine-dependent phase: it is not possible to generate good code without considering the details of the particular machine for which the compiler is expected to generate code. Even so, a carefully selected code-generation algorithm can produce code that is twice as fast as code generated by an ill-considered code-generation algorithm.
The main operations of this phase are mapping the logical addresses of the compiled program onto physical addresses, allocating the registers of the target machine, and combining the contents of the symbol table with the optimized code produced by the code optimizer to finally produce the target file, or object file.
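A very small illustration of this last step is sketched below, under heavy simplifying assumptions (an unlimited supply of registers and an invented LOAD/ADD/MUL pseudo-assembly); it only shows the flavour of mapping three-address code onto target instructions, not the project's actual code generator.

    # Naive code generation: map each three-address instruction (op, a, b, dest)
    # to pseudo-assembly, giving every name its own register (no real allocation).
    def gen_code(tac):
        registers, asm = {}, []

        def reg(name):
            if name not in registers:
                registers[name] = f"R{len(registers)}"
                asm.append(f"LOAD  {registers[name]}, {name}")   # bring the value into a register
            return registers[name]

        mnemonic = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
        for op, a, b, dest in tac:
            ra, rb = reg(a), reg(b)
            rd = registers.setdefault(dest, f"R{len(registers)}")
            asm.append(f"{mnemonic[op]}   {rd}, {ra}, {rb}")
        return asm

    for line in gen_code([("*", "a", "b", "t1"), ("+", "t1", "c", "t2")]):
        print(line)
    # LOAD  R0, a
    # LOAD  R1, b
    # MUL   R2, R0, R1
    # LOAD  R3, c
    # ADD   R4, R2, R3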
2.6.7.1 Recovery from the Lexical Phase Errors
The lexical analyzer detects an error when it discovers that an input's prefix does not fit the
specification of any token class. After detecting an error, the lexical analyzer can invoke an error recovery
routine. This can entail a variety of remedial actions.
The simplest possible error recovery is to skip the erroneous characters until the lexical analyzer finds
another token. But this is likely to cause the parser to read a deletion error, which can cause severe difficulties in the syntax analysis and remaining phases. One way the parser can help the lexical analyzer improve its ability to recover from errors is to make its list of legitimate tokens (in the current context) available to the error recovery routine. The error-recovery routine can then decide whether a remaining input's prefix matches one of these tokens closely enough to be treated as that token.
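The simplest recovery strategy mentioned above, skipping characters until something recognizable starts again, can be sketched roughly as follows (our own minimal code; the token patterns and the way errors are reported are assumptions).

    import re

    # Panic-mode recovery in a lexer: characters that do not begin any token are
    # skipped and reported, and scanning resumes at the next recognizable token.
    TOKEN_RE = re.compile(r"(?P<IDENT>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<OP>[+\-*/=])|(?P<WS>\s+)")

    def tokenize_with_recovery(source):
        tokens, errors, pos = [], [], 0
        while pos < len(source):
            match = TOKEN_RE.match(source, pos)
            if match is None:                      # no token class matches here
                errors.append((pos, source[pos]))  # record the offending character
                pos += 1                           # ... and skip it (error recovery)
                continue
            if match.lastgroup != "WS":
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens, errors

    print(tokenize_with_recovery("count = 3 # 4"))
    # ([('IDENT', 'count'), ('OP', '='), ('NUMBER', '3'), ('NUMBER', '4')], [(10, '#')])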
2.6.8 Symbol Table & Symbol Table Manager
A symbol table is a data structure used by a compiler to keep track of scope and binding information about names.
This information is used in the source program to identify the various program elements, like variables, constants,
procedures, and the labels of statements. The symbol table is searched every time a name is encountered in the source
text. When a new name or new information about an existing name is discovered, the content of the symbol table
changes. Therefore, a symbol table must have an efficient mechanism for accessing the information held in the table as
well as for adding new entries to the symbol table.
For efficiency, our choice of the implementation data structure for the symbol table, and the organization of its contents, should impose minimal cost when adding new entries or accessing the information on existing entries. Also, if the symbol table can grow dynamically as necessary, it is more useful to a compiler. For this reason, advanced compilers include a symbol table manager that helps arrange the data in the symbol table and provides well-defined access for the other components of the compiler that need to read or update it.
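A minimal illustration of such a manager, using a Python dictionary so that insertion and lookup are cheap and the table can grow as needed, is sketched below; the attribute names stored for each entry are our own assumptions, not the project's actual design.

    # A dictionary-backed symbol table: cheap insertion and lookup on average,
    # and it grows dynamically as new names are discovered.
    class SymbolTableManager:
        def __init__(self):
            self.table = {}

        def insert(self, name, **attributes):
            """Add a new name, or merge new information about an existing name."""
            self.table.setdefault(name, {}).update(attributes)

        def lookup(self, name):
            """Return the attributes recorded for `name`, or None if unknown."""
            return self.table.get(name)

    # Different phases contribute different information about the same name.
    symbols = SymbolTableManager()
    symbols.insert("numTickets", token="IDENT", line=3)        # e.g. from the lexical analyzer
    symbols.insert("numTickets", type="int", scope="global")   # e.g. from the semantic analyzer
    print(symbols.lookup("numTickets"))
    # {'token': 'IDENT', 'line': 3, 'type': 'int', 'scope': 'global'}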
This is all the information required to build a complete picture of the compiler in the mind of any interested reader who has a little knowledge of compilers and automata theory. We are now ready to move on to the next, and most important, chapter, Compiler Design, because this project, or study, is an exercise in applying the principles of the Software Design subject.
Chapter Three
Compiler Design
In this chapter, we show how to design the compiler using the object-oriented strategy, together with the standard notations and diagrams of UML (Unified Modeling Language), which help the programmers implement the system.
Using the object-oriented design approach, we now show the subsystems of our compiler and how these subsystems are interconnected with each other; that is, we establish the architectural design of our compiler.
Figure 3.1
Figure 3.1 identifies these subsystems and the successive stages of the compiler, from receiving the source code through the lexical, syntax, and semantic analyzers and the later phases, until the target file is produced. The lines linking these subsystems represent the relations between them.
We can easily find the following models in our compiler:
• Repository Model: found in the symbol table manager, which is connected with all the other subsystems in order to check, update, and save each subsystem's data in the symbol table (see Figure 3.2). From this model we can say that our compiler has one subsystem that shares data with all the others, so all of our subsystems use a common, flexible interface when getting or setting data through the symbol table manager. The repository model also appears in the error handler subsystem, which communicates with all the other subsystems to report the errors found at any stage.
• Abstract Machine (Layered Model): this is the main model on which the compiler depends, because it handles the main function of the compiler. The compilation process runs to completion through the layered model. At the beginning, the lexical analyzer receives the source code from the editor screen, if the language has a UI, or from the file the source code was written to. When the lexical analyzer finishes its job of separating the tokens that the source contains, it saves the stream of tokens in the symbol table with the help of the symbol table manager. The syntax analyzer then receives this stream of tokens, performs the parsing operation on them, and stores the resulting parse tree in the symbol table, again with the help of the symbol table manager. Next, the semantic analyzer receives this parse tree, checks for logical errors, and produces the abstract syntax tree. After that, the intermediate representation code generator receives the abstract syntax tree and performs the lower-level work, taking account of the machine-dependent instructions and the instruction set architecture, to produce the intermediate representation code. This intermediate representation code is then received by the I.R. code optimizer, which puts additional effort into optimizing the code, for example reducing the CPU time needed to execute the program and the amount of memory needed to complete the execution; this is done using many low-level techniques that address problems such as loop optimization, parallel optimization, and so on. The output of this stage is the optimized code, which is received by the code generator. The code generator works with the optimized code and the contents of the symbol table to perform operations such as controlling run-time errors, transforming the logical addresses of the program being compiled into physical addresses in memory, and allocating registers, to finally produce the target code. This layered organization of our system is illustrated in Figure 3.3.
Figure 3.3
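Expressed as code rather than as a diagram, the layered organization amounts to a pipeline in which each subsystem consumes the previous subsystem's output, while the symbol table and the error handler are shared in the repository style. The sketch below is only a structural outline under those assumptions; the phase functions are placeholders, not the project's actual modules.

    # Structural outline of the layered model: each phase consumes the previous
    # phase's output; the symbol table and error handler are shared (repository model).
    def compile_source(source, phases, symbol_table, error_handler):
        data = source
        for phase in phases:                       # lexical -> syntax -> semantic -> ...
            try:
                data = phase(data, symbol_table)   # each phase may read/update the table
            except Exception as error:
                error_handler.append((phase.__name__, error))
                break
        return data, error_handler

    # Placeholder phases, just to show the shape of the pipeline.
    def lexical_analyzer(text, table):   return text.split()
    def syntax_analyzer(tokens, table):  return ("program", tokens)
    def code_generator(tree, table):     return f"; target code for {tree[0]}"

    target, errors = compile_source("print x",
                                    [lexical_analyzer, syntax_analyzer, code_generator],
                                    {}, [])
    print(target, errors)   # ; target code for program []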
After considering the alternatives, we can say that the control of our compiler is event-based control, because timing in our system is important: the compilation process does not finish instantly, it takes some time, and control passes from one subsystem to another depending on the output of that subsystem, so there is a defined ordering. More precisely, we can describe the control model of our system as the broadcast model, because each subsystem that wants to do its job first monitors for the event, broadcast to all subsystems, that tells it when to start. This is an efficient way of integrating subsystems, although a subsystem does not know when an event will be handled.
We start with the lexical analyzer use case diagram.
Figure 3.4
Figure 3.5
Figure 3.6
Figure 3.7
Figure 3.8
Figure 3.9
Figure 3.10
Figure 3.11
the error that occurred. This is shown in Figure 3.12.
Figure 3.12
Figure 3.13
Figure 3.14
Figure 3.15
Figure 3.16
Figure 3.17
Figure 3.18
Figure 3.19
In determining these classes we also extracted the supporting classes that our main classes need in order to complete their functions. The following large Figure 3.20 shows the classes of our compiler.
Figure 3.20
Figure 3.21
Figure 3.22
Figure 3.23
Figure 3.24
Figure 3.25
Figure 3.26
Figure 3.27
Glossary
Compiler
A compiler is a special type of computer program that translates a human readable text file into a
form that the computer can more easily understand.
Interpreter
An interpreter reads the source code one instruction or line at a time, converts this line into
machine code and executes it.
Object-oriented strategy
Is a development strategy where system analysts, designers, and programmers think in terms of
'things' instead of operations or functions.
Lexical Analyzer
Lexical analyzer is the part where the stream of characters making up the source program is read
from left‐to‐right and grouped into tokens.
Tokens
Tokens are sequences of characters with a collective meaning. There are usually only a small
number of tokens for a programming language: constants (integer, double, char, string, etc.),
operators (arithmetic, relational, logical), punctuation, and reserved words.
A lexeme
Is the actual character sequence forming a token.
Regular Expression
A regular expression describes a regular set: a set of strings for which there exists some finite
automaton that accepts that set. That is, if R is a regular set, then R = L(M) for some finite
automaton M. Similarly, if M is a finite automaton, then L(M) is always a regular set.
Syntax Analyzer
The syntax analyzer is the part of a compiler that verifies whether or not the tokens generated by
the lexical analyzer are grouped according to the syntactic rules of the language.
The terminals
The terminals are nothing more than tokens of the language, used to form the language
constructs.
Nonterminals
The nonterminals define the sets of strings that are used to define the language generated by the
grammar.
Derivation
Derivation refers to replacing an instance of a given string's nonterminal, by the right‐hand side
of the production rule, whose left‐hand side contains the nonterminal to be replaced.
Semantic Analyzer
Is the part of the compiler that delves deeper into the program to check whether the statements
form a sensible set of instructions in the programming language.
Type Checking
Type checking is the process of verifying that each operation executed in a program respects the type system
of the language.
A scope
A scope is a section of program text enclosed by basic program delimiters.
Scope Checking
Is the determination of whether an identifier is accessible at a given point in the program.
Intermediate Code Generator
Is the part of the compilation process that converts the compiler's internal representation of the source
code into an intermediate form that can later be turned into code readily executed by a machine.
Code Optimizer
Refers to techniques a compiler can employ in order to produce an improved object code for a given source
program.
Code Generator
It's the last phase of the compiler's operation; it converts the optimized code into machine-dependent
code.
Error Handling
One of the important tasks that a compiler must perform is the detection of and recovery from errors.
Symbol Table
A symbol table is a data structure used by a compiler to keep track of scope/ binding information
about names. This information is used in the source program to identify the various program
elements, like variables, constants, procedures, and the labels of statements.