
Lecture 3


CS327 - Compilers

Lexical Analysis

Abhishek Bichhawat 12/01/2024

Lexical Analysis - Lexemes
● Divide code into lexical units
○ Partition input into lexemes (syntactic category)

if x == 0 then { y = 1 ; } else { z = 2 ; }

Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes as per the role
○ Keywords, identifiers, numbers, parentheses, semicolons, whitespace, etc.
○ Classes correspond to sets of strings
■ E.g., identifiers are alphanumeric strings starting with a letter
■ Numbers are strings consisting of digits
■ Keywords are specific words

if x0 == 0 then { x1 = 1; } else { x2 = 2; }
Lexical Analysis - Tokens
● Divide code into lexical units
● Classify lexemes as per the role
● Pass the tokens to the parser, which relies on this classification

x = 1

Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes as per the role
● Pass the tokens to the parser, which relies on this classification
● How many tokens are in each class for the following program?

if (x0==01) then {y1=10;} else {z2=20;}

keyword = identifier =
number = operator =
whitespace = other =
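
One way to check a hand count is to classify the lexemes mechanically. The sketch below uses Python's `re` module. The class names and patterns are assumptions made for illustration (in particular, `01` is counted as a number, and brackets and semicolons fall into `other`); they are not definitions from the lecture.

```python
import re
from collections import Counter

# Hypothetical token classes. Python tries alternatives left to right,
# so keyword is listed before identifier.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:if|then|else)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number",     r"[0-9]+"),
    ("operator",   r"==|="),
    ("whitespace", r"\s+"),
    ("other",      r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def count_token_classes(source):
    """Return a Counter mapping each token class to its number of lexemes."""
    return Counter(m.lastgroup for m in MASTER.finditer(source))

counts = count_token_classes("if (x0==01) then {y1=10;} else {z2=20;}")
for name, _ in TOKEN_SPEC:
    print(name, "=", counts[name])
```

Changing the assumed patterns (for example, rejecting numbers with a leading zero) changes the counts, which is why the classification has to be fixed before the question has a single answer.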
Lexical Analysis - Challenges
● Recognizing tokens
○ Example - FORTRAN
■ FORTRAN disregards whitespace, so DO 5 I = 1.5 is the same as DO5I=1.5
■ DO 5 I = 1,5 is a loop, whereas DO 5 I = 1.5 is an ordinary assignment
○ May require reading ahead before deciding on the tokens
■ if (x==0) then {y=1;} else {z=2;}
○ Also a problem in modern languages like C++
■ A<B<C>>
■ Should >> be treated as the stream operator, or as two closing template brackets, making the snippet valid C++?
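
Reading ahead can be as small as peeking at one character. The sketch below is a minimal illustration, not a full lexer: it only looks for `=` and `==`, and the token names `EQ` and `ASSIGN` are made up for this example.

```python
def scan_equals(text):
    """Return the '=' and '==' tokens in text, using one character of lookahead."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == "=":
            # Peek at the next character before committing to a token.
            if i + 1 < len(text) and text[i + 1] == "=":
                tokens.append(("EQ", "=="))
                i += 2
            else:
                tokens.append(("ASSIGN", "="))
                i += 1
        else:
            i += 1  # every other character is skipped in this sketch
    return tokens

print(scan_equals("if (x==0) then {y=1;} else {z=2;}"))
# [('EQ', '=='), ('ASSIGN', '='), ('ASSIGN', '=')]
```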
Regular Expressions
● To define which strings belong to a token class, we use regular
expressions, which in turn denote regular languages (sets of strings)
● Alphabet is a set of characters (e.g., ASCII)
● Expression R (over some alphabet) :
○ R = 𝜖
| c
| R1R2
| R1|R2
| R*
Regular Languages
● Expression R (over some alphabet) denotes language L(R):
○ L(𝜖) = L(“”) = {“”}
○ L(c) = {“c”}
○ L(R1R2) = {x1x2 | x1 ∈ L(R1), x2 ∈ L(R2)}
○ L(R1|R2) = L(R1) ∪ L(R2)
○ L(R*) = L(𝜖) ∪ L(R) ∪ L(RR) ∪ …
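
The five defining equations translate almost directly into code. The sketch below enumerates the strings of L(R) up to a length bound, since L(R*) is infinite in general. The tuple encoding of expressions (`"eps"`, `"chr"`, `"cat"`, `"alt"`, `"star"`) and the function name `language` are assumptions of this example.

```python
def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len.

    r is a tuple: ("eps",), ("chr", c), ("cat", r1, r2),
    ("alt", r1, r2) or ("star", r1).
    """
    kind = r[0]
    if kind == "eps":
        return {""}
    if kind == "chr":
        return {r[1]} if max_len >= 1 else set()
    if kind == "cat":
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "alt":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "star":
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        # Keep appending strings of L(r1) until nothing new fits in the bound.
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")

zero, one = ("chr", "0"), ("chr", "1")
print(sorted(language(("star", zero), 3)))                    # ['', '0', '00', '000']
print(sorted(language(("cat", ("alt", zero, one), one), 2)))  # ['01', '11']
```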
Regular Languages - Example
Consider the alphabet {0,1}
1. What is the language of 0*?
a. {“”, “0”, “00”, “000”, …}
2. What is the language of (0|1)1?
a. {“01”, “11”}
3. What is the language of (0*|1*)?
a. Strings consisting only of 0s or only of 1s, including the empty string
4. What is the language of (0|1)*? (Is it the same as 3?)
a. All strings of 0s and 1s, including the empty string (not the same as 3: “01” is in this language but not in that of 3)
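
These answers can be spot-checked with Python's `re` module, whose `fullmatch` tests whether an entire string is in the language of a pattern. The helper name `in_language` is an assumption of this sketch.

```python
import re

def in_language(pattern, s):
    """True if the entire string s belongs to L(pattern)."""
    return re.fullmatch(pattern, s) is not None

print(in_language(r"0*", "000"))      # True
print(in_language(r"(0|1)1", "11"))   # True
print(in_language(r"(0|1)1", "10"))   # False
print(in_language(r"(0*|1*)", "01"))  # False: mixes 0s and 1s
print(in_language(r"(0|1)*", "01"))   # True
```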
Regular Languages
● Some other implementation-specific shorthand notations using ?, +, -, ^
○ Option: 'a'|𝜖 ⇔ a?
○ One or more occurrences: 'a'|'aa'|'aaa'|... ⇔ a+
○ Range: 'a'|'b'|'c'|...|'z' ⇔ [a-z]
○ Excluded range: complement of [a-z] ⇔ [^a-z]
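
Each shorthand can be compared against its expanded form on a few sample strings. The sketch below does this with Python's `re`; the sample strings are arbitrary, and a shorter range `[a-c]` stands in for `[a-z]` to keep the expansion readable.

```python
import re

# (shorthand, expanded form, sample strings)
CASES = [
    (r"a?",    r"(a|)",    ["", "a", "aa"]),
    (r"a+",    r"aa*",     ["", "a", "aaa"]),
    (r"[a-c]", r"(a|b|c)", ["a", "c", "d"]),
]
for short, expanded, samples in CASES:
    for s in samples:
        # Both forms must agree on whether s is in the language.
        assert bool(re.fullmatch(short, s)) == bool(re.fullmatch(expanded, s))

# [^a-z] accepts exactly one character outside the range a-z.
print(bool(re.fullmatch(r"[^a-z]", "Q")))  # True
print(bool(re.fullmatch(r"[^a-z]", "q")))  # False
```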
Regular Languages - Example
Find a simpler regular expression denoting the same language:

1. (0 | 1)*(10 | 11 | 1)(0 | 1)*
a. (0 | 1)*1(0 | 1)*
2. (01 | 11)*(0 | 1)*
a. (0 | 1)*
3. (0 | 1)*(0 | 1)(0 | 1)*
a. (0 | 1)+
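
A brute-force comparison over all short strings gives evidence (not a proof) that two expressions denote the same language. The function name `find_disagreement` and the length bound are assumptions of this sketch.

```python
import re
from itertools import product

def find_disagreement(p, q, alphabet="01", max_len=8):
    """Return a string on which p and q disagree, or None if they agree
    on every string over `alphabet` of length <= max_len."""
    rp, rq = re.compile(p), re.compile(q)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(rp.fullmatch(s)) != bool(rq.fullmatch(s)):
                return s
    return None

PAIRS = [
    (r"(0|1)*(10|11|1)(0|1)*", r"(0|1)*1(0|1)*"),
    (r"(01|11)*(0|1)*",        r"(0|1)*"),
    (r"(0|1)*(0|1)(0|1)*",     r"(0|1)+"),
]
for p, q in PAIRS:
    print(p, "vs", q, "->", find_disagreement(p, q))
```

For a pair that is not equivalent, the function returns a witness: `find_disagreement(r"(0*|1*)", r"(0|1)*")` returns `"01"`.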
Regular Languages - Example
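Describe in words the language denoted by each regular expression:

1. (0|1)*0
a. Binary strings ending in 0
2. b*(abb*)*(a|𝜖)
a. Strings of a's and b's with no two consecutive a's
3. (a|b)*aa(a|b)*
a. Strings of a's and b's containing aa as a substring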
Regular Expressions
1. Keywords in Java? 'if' | 'else' | 'void' | ...
2. Numbers in Java? 0 | [1-9][0-9]*
3. Identifiers in Java? [_a-zA-Z][_a-zA-Z0-9]*
4. Whitespace in Java? (' ' | '\n' | '\t' | '\r')+
(\s is the regex for whitespace in Java)
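
The patterns above can be tried out directly. The sketch below uses Python's `re` rather than Java's regex library, and the keyword pattern contains only the three keywords listed on the slide; the constant and function names are assumptions of this example.

```python
import re

KEYWORD    = re.compile(r"if|else|void")  # only the keywords listed above
NUMBER     = re.compile(r"0|[1-9][0-9]*")
IDENTIFIER = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")
WHITESPACE = re.compile(r"[ \n\t\r]+")

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None

print(matches(NUMBER, "120"))         # True
print(matches(NUMBER, "012"))         # False: this pattern rejects a leading zero
print(matches(IDENTIFIER, "_x1"))     # True
print(matches(IDENTIFIER, "1x"))      # False: must not start with a digit
print(matches(WHITESPACE, " \t\n"))   # True
```

Note that `else` matches both `KEYWORD` and `IDENTIFIER`, so a lexer needs a rule for choosing between classes.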
Lexical Specifications
● Given a string s, determine if the string is in the set of strings
constituting the language L(R)
● Break the input into tokens to pass on to the next phase
Lexical Specifications
1. Regex for all token classes
a. Number = 0|[1-9][0-9]*
b. Keywords = “if” | “else” | “then”
c. Identifiers = [a-zA-Z_][a-zA-Z_0-9]*
d. …
2. Construct R matching lexemes
a. R = Keyword | Identifier | Number | …
3. Let x1..xn be the input
For 1 ≤ i ≤ n, check whether x1..xi ∈ L(R)
If so, remove x1..xi from the input and repeat step 3
Maximal Munch!
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Always take the longest match!
Priority Ordering
● How do we resolve ambiguities?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Give priority to the rule that appears earliest in the rule set (e.g., with Keyword listed before Identifier, if is a keyword)!
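
The two disambiguation rules can be combined in a small tokenizer. The sketch below is one possible implementation, not the lecture's: the rule names, the patterns, and the helper `longest_prefix` are assumptions. Because Python's `|` picks the first alternative that matches rather than the longest, the code searches for the longest matching prefix explicitly, which is simple but slow on long inputs.

```python
import re

# Rules in priority order: an earlier rule wins when two rules match the same lexeme.
RULES = [
    ("KEYWORD",    re.compile(r"if|else|then")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("NUMBER",     re.compile(r"0|[1-9][0-9]*")),
    ("OPERATOR",   re.compile(r"==|=")),
    ("WHITESPACE", re.compile(r"\s+")),
    ("PUNCT",      re.compile(r"[(){};]")),
]

def longest_prefix(pattern, text, pos):
    """Length of the longest prefix of text[pos:] in L(pattern), or 0 if none."""
    for end in range(len(text), pos, -1):
        if pattern.fullmatch(text, pos, end):
            return end - pos
    return 0

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_len, best_name = 0, None
        for name, pattern in RULES:
            n = longest_prefix(pattern, text, pos)
            if n > best_len:  # strict '>' keeps the earlier rule on ties
                best_len, best_name = n, name
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("ifx==if"))
# [('IDENTIFIER', 'ifx'), ('OPERATOR', '=='), ('KEYWORD', 'if')]
```

In the usage line, maximal munch makes `ifx` a single identifier and `==` a single operator, while priority ordering makes the final `if` a keyword even though it also matches the identifier rule.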
No Rule Match?
If x1..xi ∉ L(R) for every 1 ≤ i ≤ n:

Include an ERROR rule s.t.

ERROR = {all strings not in the lexical specification}
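
A common way to approximate such an ERROR rule is a catch-all pattern placed last, so that it only fires when no other rule matches. The sketch below assumes that approach and uses a small, made-up rule set; the ERROR rule here consumes one offending character at a time so that lexing can continue.

```python
import re

TOKEN = re.compile(r"""
    (?P<NUMBER>0|[1-9][0-9]*)
  | (?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)
  | (?P<WHITESPACE>\s+)
  | (?P<ERROR>.)   # last alternative, tried only when the others fail
""", re.VERBOSE | re.DOTALL)

def tokenize_with_errors(text):
    """Return (class, lexeme) pairs; unmatched characters become ERROR tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize_with_errors("x1 # 42"))
# [('IDENTIFIER', 'x1'), ('WHITESPACE', ' '), ('ERROR', '#'), ('WHITESPACE', ' '), ('NUMBER', '42')]
```

Reporting the error as a token lets the next phase decide how to recover instead of aborting inside the lexer.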
Language of Email Addresses
Alphabet = {letters, digits, “.”, “@”}

letter = [a-z]
digit = [0-9]
net = letter letter+
id_dom = letter (letter|digit)+
email = id_dom “@” id_dom “.” net
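
The definitions above compose into a single pattern. The sketch below builds it step by step in Python, mirroring the names on the slide; the function name `is_email` is an assumption of this example.

```python
import re

LETTER = r"[a-z]"
DIGIT  = r"[0-9]"
NET    = rf"{LETTER}{LETTER}+"               # at least two letters
ID_DOM = rf"{LETTER}(?:{LETTER}|{DIGIT})+"   # a letter, then one or more letters or digits
EMAIL  = re.compile(rf"{ID_DOM}@{ID_DOM}\.{NET}")

def is_email(s):
    """True if s is in the language of email addresses defined above."""
    return EMAIL.fullmatch(s) is not None

print(is_email("abc@xyz.in"))       # True
print(is_email("a@xyz.in"))         # False: id_dom needs at least two characters
print(is_email("abc@mail.xyz.in"))  # False: id_dom cannot contain "."
```

The last case shows that this language is deliberately small: it admits only a single domain label followed by one suffix.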
