Lexical Analyzer 1
Attributes for Tokens
• Tokens have at most one associated attribute.
• Information about an identifier (token id), e.g., its lexeme, its type, and
the location at which it is first found, is kept in the symbol table.
• The attribute value for an identifier is therefore a pointer to its
symbol-table entry.
Example
• The token names and associated attribute values for the Fortran statement
  E = M * C ** 2
  are written below as a sequence of pairs:
• <id, pointer to symbol-table entry for E>
• <assign_op>
• <id, pointer to symbol-table entry for M>
• <mult_op>
• <id, pointer to symbol-table entry for C>
• <exp_op>
• <number, integer value 2>
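The pairs above can be reproduced with a small sketch. It assumes the lexemes are separated by whitespace, which holds for this statement but not for Fortran in general, and it models the symbol table as a Python list whose indices stand in for pointers. The names `tokenize`, `OPERATORS`, and `symbol_table` are illustrative and do not come from the slides.

```python
# Operator lexemes in the example statement and their token names.
OPERATORS = {"=": "assign_op", "*": "mult_op", "**": "exp_op"}


def tokenize(statement, symbol_table):
    """Return (token_name, attribute) pairs for a whitespace-separated statement.

    symbol_table is a list of identifier lexemes; an identifier's attribute
    is its index in that list (a stand-in for a pointer to its entry).
    """
    tokens = []
    for lexeme in statement.split():
        if lexeme in OPERATORS:
            tokens.append((OPERATORS[lexeme], None))  # operators need no attribute
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))    # attribute: integer value
        elif lexeme.isidentifier():
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)           # first occurrence: new entry
            tokens.append(("id", symbol_table.index(lexeme)))
        else:
            raise ValueError(f"unrecognized lexeme: {lexeme!r}")
    return tokens


symbol_table = []
for token in tokenize("E = M * C ** 2", symbol_table):
    print(token)
```

Each identifier is entered in the table once, so a second occurrence of E would yield the same attribute value as the first.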
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme
patterns.
• Some Terminology:
Alphabet: a finite set of symbols, such as letters, digits, and punctuation.
The set {0, 1} is the binary alphabet; the set of ASCII characters is also an
alphabet.
String:
– A finite sequence of symbols drawn from an alphabet
– The terms sentence and word are often used as synonyms for string
– ε is the empty string
– |s| is the length of string s
Language: any set of strings over some fixed alphabet
– ∅, the empty set, is a language
– {ε}, the set containing only the empty string, is a language
– The set of well-formed C programs is a language
– The set of all possible identifiers is a language
Terms for Parts of Strings
The following string-related terms are commonly used:
A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban, banana, and ε are prefixes of banana.
A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana, banana, and ε are suffixes of
banana.
A substring of s is obtained by deleting any prefix and any suffix from s. For
instance, banana, nan, and ε are substrings of banana.
The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are neither ε nor equal to s itself.
A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s. For example, baan is a subsequence of banana.
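These definitions translate directly into set comprehensions. The following is a minimal sketch; the function names are illustrative, and the empty Python string plays the role of ε.

```python
def prefixes(s):
    """All prefixes of s: remove zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s: remove zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All substrings of s: delete any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the parts that are neither the empty string nor s itself."""
    return {p for p in parts if p not in ("", s)}


def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of t must appear in s in the same order.
    return all(symbol in remaining for symbol in t)


print(sorted(prefixes("banana"), key=len))
print(sorted(proper(suffixes("banana"), "banana"), key=len))
print(is_subsequence("baan", "banana"))
```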
Operators on Strings
Concatenation
• If x and y are strings, then the concatenation of x and y, denoted xy,
is the string formed by appending y to x.
• Example, if x = dog and y = house, then xy = doghouse.
• The empty string is the identity under concatenation; that is, for
any string s, εs = sε = s.
Exponentiation
• Define s⁰ to be ε, and for all i > 0, define sⁱ to be sⁱ⁻¹s. Since εs =
s, it follows that s¹ = s. Then s² = ss, s³ = sss, and so on.
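The recursive definition of exponentiation can be written out as a short sketch. The function name `power` is illustrative; in practice Python's built-in `s * i` gives the same result.

```python
def power(s, i):
    """Return s to the power i: s^0 is the empty string, s^i is s^(i-1) followed by s."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return ""              # ε, the empty string
    return power(s, i - 1) + s


x, y = "dog", "house"
print(x + y)            # concatenation xy
print(power("ab", 3))   # ab repeated three times
```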
Operations on Languages
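• Concatenation:
– L₁L₂ = { s₁s₂ | s₁ ∈ L₁ and s₂ ∈ L₂ }
• Union:
– L₁ ∪ L₂ = { s | s ∈ L₁ or s ∈ L₂ }
• Exponentiation:
– L⁰ = {ε}, L¹ = L, L² = LL, and in general Lⁱ = Lⁱ⁻¹L for i > 0
• Kleene closure:
– L* = L⁰ ∪ L¹ ∪ L² ∪ ... (the union of Lⁱ over all i ≥ 0)
• Positive closure:
– L⁺ = L¹ ∪ L² ∪ L³ ∪ ... (the union of Lⁱ over all i ≥ 1)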
Example
• L₁ = {a, b, c, d}, L₂ = {1, 2}
• L₁L₂ = {a1, a2, b1, b2, c1, c2, d1, d2}
• L₁ ∪ L₂ = {a, b, c, d, 1, 2}
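For finite languages, these operations can be tried out with Python sets of strings. This is a sketch under one stated assumption: the closures are infinite in general, so `closure` only builds the union up to a chosen power. All function names are illustrative.

```python
from itertools import product


def concat(l1, l2):
    """L1L2: every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1, s2 in product(l1, l2)}


def union(l1, l2):
    """L1 ∪ L2: strings that are in l1 or in l2."""
    return l1 | l2


def power(lang, i):
    """L^i, with L^0 = {ε} and L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result


def closure(lang, max_power, positive=False):
    """Finite approximation of L* (or L+): union of L^i for i up to max_power."""
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= power(lang, i)
    return result


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(concat(L1, L2)))
print(sorted(union(L1, L2)))
print(sorted(closure(L2, 2), key=lambda s: (len(s), s)))
```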
Example
Let L = {A, B, ..., Z, a, b, ..., z} and D = {0, 1, ..., 9}.
• L ∪ D is the set of letters and digits; strictly speaking, it is the language
of 62 strings of length one, each consisting of one letter or one digit.
• LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
• L⁴ is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a
letter.
• D⁺ is the set of all strings of one or more digits.
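The counts quoted above can be checked mechanically. The sketch below builds L and D from Python's `string` module and, since L* and the other closures are infinite, expresses them as membership tests rather than sets. The helper names are illustrative.

```python
import string
from itertools import product

L = set(string.ascii_letters)   # the 52 one-letter strings
D = set(string.digits)          # the 10 one-digit strings

LD = {letter + digit for letter, digit in product(L, D)}

print(len(L | D))     # size of L ∪ D
print(len(LD))        # size of LD
print(len(L) ** 4)    # size of L^4, computed rather than enumerated


def starts_with_letter(s):
    """Membership test for L(L ∪ D)*: a letter followed by letters or digits."""
    return s != "" and s[0] in L and all(c in L | D for c in s[1:])


def all_digits(s):
    """Membership test for D+: one or more digits."""
    return s != "" and all(c in D for c in s)
```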
Regular Expressions
• We use regular expressions to describe tokens of a programming
language.
• A regular expression is built up of simpler regular expressions (using
defining rules)
• Each regular expression denotes a language.
• A regular expression r denotes a language L(r), which is also defined
recursively from the languages denoted by r's subexpressions.
• A language denoted by a regular expression is called a regular set.
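Ex: The regular expression for C identifiers is
• letter_ ( letter_ | digit )*
where letter_ stands for any letter or the underscore and digit for any digit.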
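Induction: Larger regular expressions are built from smaller ones. If r and s
are regular expressions denoting the languages L(r) and L(s), then:
• (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• (r)(s) is a regular expression denoting L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)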
Regular Expressions (Rules)
Regular expressions over alphabet Σ
• (r)+ = (r)(r)*
• (r)? = (r) | ε
Regular Expressions (cont.)
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
• Ex:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε ,0,00,000,0000,....}
– (0|1)* => all strings of 0s and 1s, including the empty string
Example
Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the
alphabet Σ. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa,
aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is,
all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for
the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a
and all strings consisting of zero or more a's and ending in b.
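The languages in this example can be explored by brute force. The sketch below uses Python's `re` module, whose syntax for |, *, and parentheses happens to agree with the notation used here, and lists every string over Σ up to a given length that an expression matches in full. The helper `language` is illustrative. Agreement up to a bounded length is evidence that two expressions denote the same language, not a proof.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """Strings over alphabet, up to max_len symbols, that the pattern matches fully."""
    regex = re.compile(pattern)
    matched = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if regex.fullmatch(s):
                matched.add(s)
    return matched


SIGMA = "ab"
print(sorted(language(r"a|b", SIGMA, 3)))
print(sorted(language(r"(a|b)(a|b)", SIGMA, 3)))
print(sorted(language(r"a|a*b", SIGMA, 3), key=lambda s: (len(s), s)))
# Items 2 and 4 each give two expressions for one language; compare them.
print(language(r"(a|b)(a|b)", SIGMA, 4) == language(r"aa|ab|ba|bb", SIGMA, 4))
print(language(r"(a|b)*", SIGMA, 4) == language(r"(a*b*)*", SIGMA, 4))
```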
Algebraic laws for regular expressions
Regular Definitions
• Writing a regular expression for some languages can be difficult because
the expression becomes quite complex. In those cases, we may use regular
definitions.
• A regular definition gives names to regular expressions, and those names
can then be used as symbols in later regular expressions.
Regular Definitions (cont.)
• Ex: Identifiers in C
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit ) *
– If we try to write the regular expression representing identifiers without using
regular definitions, that regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
• Ex: Unsigned numbers in C
digit → 0 | 1 | ... | 9
digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent
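One way to mirror regular definitions in code is to build each pattern from previously named ones. The sketch below does this with Python raw f-strings and the `re` module; the variable names follow the definitions above (with underscores in place of hyphens), and the test strings are illustrative.

```python
import re

# Each name is defined only in terms of names introduced before it.
digit = r"[0-9]"
digits = rf"(?:{digit})+"
opt_fraction = rf"(?:\.{digits})?"
opt_exponent = rf"(?:E[+-]?{digits})?"
unsigned_num = digits + opt_fraction + opt_exponent

letter = r"[A-Za-z]"
ident = rf"{letter}(?:{letter}|{digit})*"

UNSIGNED_NUM = re.compile(unsigned_num)
IDENT = re.compile(ident)

for text in ["5280", "39.37", "6.336E4", "1.894E-4", "1.", "E4"]:
    print(text, bool(UNSIGNED_NUM.fullmatch(text)))
```

The last two strings are rejected: a fraction needs at least one digit after the point, and a number must begin with digits.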
Extensions of Regular Expressions
One or more instances. The unary postfix operator + means "one or more
occurrences."
i. r* = r+ | ε
ii. r+ = rr* = r*r
Zero or one instance. The unary postfix operator ? means "zero or one
occurrence."
i. r? is equivalent to r|ε, or
ii. L(r?) = L(r) ∪ {ε}.
Character classes.
i. a₁|a₂|...|aₙ can be written [a₁a₂...aₙ]
ii. a|b|c can be written [abc]
iii. a|b|c|...|z can be written [a-z]
Example (Shorthands)
• Ex: Identifiers in C
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
• Ex: Unsigned numbers in C
digit → [0-9]
digits → digit +
number → digits ( . digits ) ? ( E (+|-)? digits ) ?
Recognition of Tokens
Patterns for tokens
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
ws → ( blank | tab | newline )+
Tokens, their patterns, and attribute values
LEXEMES      TOKEN NAME   ATTRIBUTE VALUE
Any ws       –            –
if           if           –
then         then         –
else         else         –
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
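The table can be turned into a small lexer. The following is a sketch rather than a definitive implementation, and it makes three assumptions that are not in the slides: keywords are recognized by matching the id pattern and then looking the lexeme up in a keyword set, a single Python list serves as the table for both identifiers and numbers, and an index into that list stands in for a pointer.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOP_ATTR = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# Python's | picks the first alternative that matches, not the longest, so
# two-character operators are listed before their one-character prefixes.
TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<number>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<relop><=|<>|>=|<|>|=)
""", re.VERBOSE)


def tokens(source, table):
    """Yield (token_name, attribute) pairs; table collects id and number lexemes."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace yields no token
        if kind == "id" and lexeme in KEYWORDS:
            yield (lexeme, None)          # if / then / else carry no attribute
        elif kind == "relop":
            yield ("relop", RELOP_ATTR[lexeme])
        else:
            if lexeme not in table:
                table.append(lexeme)
            yield (kind, table.index(lexeme))  # index stands in for a pointer


table = []
print(list(tokens("if x1 <= 42 then y <> 3.5E-2 else z", table)))
print(table)
```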
Transition diagram for relop
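A transition diagram for relop can be coded as a loop over states. The sketch below is derived from the relop pattern and attribute values given earlier, not from the slide's figure, so the state numbers are hypothetical. When a one-character operator is followed by some other character, the function returns a position that does not consume that character, which corresponds to retracting the input pointer.

```python
def get_relop(source, pos):
    """Recognize a relop starting at source[pos].

    Returns (attribute, next_pos), or None if no relop starts at pos.
    State numbers are illustrative; they are not taken from the slide's figure.
    """
    state = 0
    while True:
        c = source[pos] if pos < len(source) else ""   # "" marks end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", pos + 1)
            elif c == ">":
                state = 2
            else:
                return None                 # not a relop
            pos += 1
        elif state == 1:                    # seen "<"
            if c == "=":
                return ("LE", pos + 1)
            if c == ">":
                return ("NE", pos + 1)
            return ("LT", pos)              # other character: retract, leave it unread
        elif state == 2:                    # seen ">"
            if c == "=":
                return ("GE", pos + 1)
            return ("GT", pos)              # other character: retract


for text in ["<", "<=", "<>", "=", ">", ">=", "<x"]:
    print(repr(text), get_relop(text, 0))
```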