Compiler
From Wikipedia, the free encyclopedia
Contents
• 1 History
o 1.1 Compilers in education
• 2 Compiler output
o 2.1 Compiled versus interpreted languages
o 2.2 Hardware compilation
• 3 Compiler design
o 3.1 One-pass versus multi-pass compilers
o 3.2 Front end
o 3.3 Back end
• 4 Related techniques
• 5 See also
• 6 Notes
• 7 References
• 8 External links
History
For many years, software for early computers was written exclusively in assembly language. Higher-level programming languages were not invented until the benefits of being able to reuse software on different kinds of CPUs began to outweigh the cost of writing a compiler. The very limited memory capacity of early computers also created many technical problems when implementing a compiler.
The output of some compilers may target hardware at a very low level, for example a field-programmable gate array (FPGA) or structured application-specific integrated circuit (ASIC). Such compilers are said to be hardware compilers or synthesis tools because the programs they compile effectively control the final configuration of the hardware and how it operates; the output of the compilation is not a sequence of instructions to be executed but an interconnection of transistors or lookup tables. For example, XST is the Xilinx Synthesis Tool used for configuring FPGAs. Similar tools are available from Altera, Synplicity, Synopsys and other vendors.
All but the smallest of compilers have more than two phases. However, these phases are usually regarded as being part of the front end or the back end. The point at which these two ends meet is open to debate. The front end is generally considered to be where syntactic and semantic processing takes place, along with translation to a lower level of representation than source code.
The back end takes the output of the front end (or of a middle end, where one is distinguished). It may perform further analysis, transformations and optimizations that are specific to a particular computer. Then it generates code for a particular processor and operating system.
While the typical multi-pass compiler outputs machine code from its final pass, there are several other types of compiler output.
Lexical analysis
From Wikipedia, the free encyclopedia
Contents
• 1 Lexical grammar
• 2 Token
• 3 Scanner
• 4 Tokenizer
• 5 Lexer generator
• 6 Lexical analyzer generators
• 7 See also
• 8 References
Token
A token is a categorized block of text. The block of text corresponding to the token is known as a lexeme. A lexical analyzer processes lexemes to categorize them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything that is a useful part of the structured text: an English word, an operator symbol, or some other string of characters. Consider, for example, the expression:
sum=3+2;
Lexeme      Token category
sum         IDENT
=           ASSIGN_OP
3           NUMBER
+           ADD_OP
2           NUMBER
;           SEMICOLON
As another example, consider the expression:
46 - number_of(cows);
The lexemes here might be "46", "-", "number_of", "(", "cows", ")" and ";". The lexical analyzer will categorize the lexeme "46" as a number, "-" as an operator character and "number_of" as a separate token (an identifier). Even the lexeme ";" has a special meaning in some languages, such as C.
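A minimal sketch of how such categorization might be implemented, here in Python using the standard re module; the token-category names (NUMBER, IDENT and so on) and the regular expressions are illustrative assumptions for this example, not the rules of any particular language:

import re

# Illustrative token categories; the names and patterns are assumptions for this sketch.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),           # e.g. "46"
    ("IDENT",     r"[A-Za-z_]\w*"),  # e.g. "number_of", "cows"
    ("OPERATOR",  r"[-+*/=]"),       # e.g. "-"
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),           # whitespace is discarded
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Yield (category, lexeme) pairs for the input string."""
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(tokenize("46 - number_of(cows);")))
# [('NUMBER', '46'), ('OPERATOR', '-'), ('IDENT', 'number_of'),
#  ('LPAREN', '('), ('IDENT', 'cows'), ('RPAREN', ')'), ('SEMICOLON', ';')]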
Scanner
The first stage, the scanner, is usually based on a finite state machine. It
has encoded within it information on the possible sequences of
characters that can be contained within any of the tokens it handles
(individual instances of these character sequences are known as
lexemes). For instance, an integer token may contain any sequence of
numerical digit characters. In many cases, the first non-whitespace
character can be used to deduce the kind of token that follows and
subsequent input characters are then processed one at a time until
reaching a character that is not in the set of characters acceptable for that
token (this is known as the maximal munch rule). In some languages the
lexeme creation rules are more complicated and may involve
backtracking over previously read characters.
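The following Python sketch illustrates the maximal munch rule for one token kind only: starting at the first digit, characters are consumed one at a time until a character outside the acceptable set is reached. The function name and behaviour are assumptions made for this illustration, not part of any particular scanner:

def scan_integer(text, pos):
    """Scan an integer token starting at text[pos] using maximal munch.

    Consumes digit characters one at a time until a non-digit is reached,
    then returns the lexeme and the position just past it.
    """
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    return text[start:pos], pos

lexeme, next_pos = scan_integer("1234+5", 0)
print(lexeme, next_pos)   # "1234" 4, the longest possible digit sequence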
Tokenizer
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input. For example, the sentence "The quick brown fox jumps over the lazy dog" might be tokenized into word tokens, represented here in XML:
<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>
NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON
Regular expressions and the finite state machines they generate are not
powerful enough to handle recursive patterns, such as "n opening
parentheses, followed by a statement, followed by n closing
parentheses." They are not capable of keeping count, and verifying that
n is the same on both sides — unless you have a finite set of permissible
values for n. It takes a full-fledged parser to recognize such patterns in
their full generality. A parser can push parentheses on a stack and then
try to pop them off and see if the stack is empty at the end.
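A minimal sketch of that stack-based check, in Python; a real parser tracks far more than parentheses, so this only illustrates the counting ability that finite automata lack:

def parentheses_balanced(tokens):
    """Return True if every '(' is matched by a later ')'.

    A finite automaton cannot verify this for arbitrary nesting depth,
    but a stack can.
    """
    stack = []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)          # push each opening parenthesis
        elif tok == ")":
            if not stack:              # a ')' with nothing left to match
                return False
            stack.pop()                # pop the matching '('
    return not stack                   # an empty stack means all were matched

print(parentheses_balanced(list("((1+2)*3)")))   # True
print(parentheses_balanced(list("((1+2)*3")))    # False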
The Lex programming tool and its compiler are designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. They are not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection uses hand-written lexers.
Lexer generator
Lexical analysis can often be performed in a single pass if reading is
done a character at a time. Single-pass lexers can be generated by tools
such as the classic flex.
Parsing
From Wikipedia, the free encyclopedia
Contents
• 1 Human languages
• 2 Programming languages
o 2.1 Overview of process
• 3 Types of parsers
• 4 Examples of parsers
o 4.1 Top-down parsers
o 4.2 Bottom-up parsers
• 5 References
• 6 See also
o 6.1 Parsing concepts
o 6.2 Parser development software
6.2.1 Wikimedia
Most modern parsers are at least partly statistical; that is, they rely on a
corpus of training data which has already been annotated (parsed by
hand). This approach allows the system to gather information about the
frequency with which various constructions occur in specific contexts.
(See machine learning.) Approaches which have been used include
straightforward PCFGs (probabilistic context free grammars), maximum
entropy, and neural nets. Most of the more successful systems use
lexical statistics (that is, they consider the identities of the words
involved, as well as their part of speech). However, such systems are vulnerable to overfitting and require some kind of smoothing to be effective.
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties, as can manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. (See chart parsing.) However, some systems trade speed for accuracy using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking, in which the parser proposes some large number of analyses and a more complex system selects the best option.
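As an illustration of the chart-based processing such algorithms perform, here is a minimal CKY recognizer in Python for a toy grammar in Chomsky normal form; the grammar, lexicon and sentences are invented for the example, and no probabilities, pruning or reranking are shown:

# Toy grammar in Chomsky normal form (invented for this example).
# BINARY maps a pair of child categories to the parent categories they can form.
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cky_recognize(words):
    """Return True if the word sequence can be derived from the start symbol S."""
    n = len(words)
    # chart[i][j] holds the set of categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] |= LEXICAL.get(w, set())
    for span in range(2, n + 1):              # width of the span
        for i in range(n - span + 1):         # left edge
            j = i + span                      # right edge
            for k in range(i + 1, j):         # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= BINARY.get((left, right), set())
    return "S" in chart[0][n]

print(cky_recognize("the dog chased the cat".split()))  # True
print(cky_recognize("chased the the dog".split()))      # False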
The first stage is token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
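A compact sketch of that splitting step in Python, using a regular expression; the token pattern is an assumption chosen to cover just this one example:

import re

# Numbers, or single operator and parenthesis characters; whitespace would be ignored.
TOKEN_PATTERN = re.compile(r"\d+|[*+^()]")

print(TOKEN_PATTERN.findall("12*(3+4)^2"))
# ['12', '*', '(', '3', '+', '4', ')', '^', '2']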
The next stage is parsing or syntactic analysis, which is checking that the
tokens form an allowable expression. This is usually done with reference
to a context-free grammar which recursively defines components that
can make up an expression and the order in which they must appear.
However, not all rules defining programming languages can be
expressed by context-free grammars alone, for example type validity and
proper declaration of identifiers. These rules can be formally expressed
with attribute grammars.
The final phase is semantic parsing or analysis, which is working out the
implications of the expression just validated and taking the appropriate
action. In the case of a calculator or interpreter, the action is to evaluate
the expression or program; a compiler, on the other hand, would
generate some kind of code. Attribute grammars can also be used to
define these actions.
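To make the three stages concrete, the following Python sketch tokenizes, parses and evaluates expressions such as "12*(3+4)^2" by recursive descent; the grammar, token pattern and function names are assumptions chosen to fit the example rather than a definitive calculator design:

import re

TOKEN_PATTERN = re.compile(r"\d+|[-+*/^()]")

def evaluate(text):
    """Tokenize, parse and evaluate an arithmetic expression by recursive descent.

    Grammar (an assumption chosen for this sketch):
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := base ('^' factor)?          # '^' is right-associative
        base   := NUMBER | '(' expr ')'
    """
    tokens = TOKEN_PATTERN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():                               # sums and differences
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if eat() == "+" else value - term()
        return value

    def term():                               # products and quotients
        value = factor()
        while peek() in ("*", "/"):
            value = value * factor() if eat() == "*" else value / factor()
        return value

    def factor():                             # exponentiation, right-associative
        value = base()
        if peek() == "^":
            eat("^")
            value = value ** factor()
        return value

    def base():                               # numbers and parenthesized expressions
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        return int(eat())

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(evaluate("12*(3+4)^2"))   # prints 588

A compiler would follow the same structure, but each rule would emit code or build a syntax-tree node instead of computing a value directly.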
Types of parsers
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways: top-down parsing, which begins with the start symbol and attempts to derive the input from it, and bottom-up parsing, which begins with the input and attempts to reduce it back to the start symbol.
The term back end is sometimes confused with code generator because of the overlapping functionality of generating assembly code. Some literature uses middle end to distinguish the generic analysis and optimization phases in the back end from the machine-dependent code generators.
Due to the extra time and space needed for compiler analysis and
optimizations, some compilers skip them by default. Users have to use
compilation options to explicitly tell the compiler which optimizations
should be enabled.
A program that translates from a low level language to a higher level one
is a decompiler.