Lecture 3
Lexical Analysis
if x == 0 then { y = 1 ; } else { z = 2 ; }
Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes according to their role
○ Keywords, identifiers, numbers, parentheses, semicolons, whitespace, etc.
○ Classes correspond to sets of strings
■ E.g. Identifiers are alphanumeric strings starting with a letter
Numbers are strings consisting of digits
Keywords are specific words
if x0 == 0 then { x1 = 1; } else { x2 = 2; }
Lexical Analysis - Tokens
● Divide code into lexical units
● Classify lexemes according to their role
● Pass the tokens to the parser, which relies on this classification
x = 1
<identifier,“x”>,<operator,“=”>,<number,“1”>
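The token stream above can be reproduced with a few lines of Python. This is only a minimal sketch: the names TOKEN_SPEC, MASTER and tokenize are made up for illustration, and the patterns cover just the three classes shown on the slide plus whitespace, which is skipped.

```python
import re

# Hypothetical token classes for this sketch (not a full language).
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("operator", r"="),
    ("whitespace", r"\s+"),
]

# One named group per class, joined into a single alternation.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "whitespace":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("x = 1"))
# [('identifier', 'x'), ('operator', '='), ('number', '1')]
```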
Lexical Analysis - Token Classes
● Divide code into lexical units
● Classify lexemes according to their role
● Pass the tokens to the parser, which relies on this classification
● Number of tokens in each class for the following program?
Regular Languages - Example
Plain-English description of each regular language:
1. (0|1)*0
a. Even numbers in binary form (binary strings ending in 0)
2. b*(abb*)*(a|𝜖)
a. Strings of a's and b's with no two consecutive a's
3. (a|b)*aa(a|b)*
a. Strings of a's and b's containing at least two consecutive a's
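One way to sanity-check the three descriptions is brute force: enumerate every string up to some length and compare the regex with a direct test of the description. The sketch below does this in Python. The helper names are made up for illustration, 𝜖 is written as the empty alternative in (a|), and "even number in binary" is read as "ends in 0". Agreement up to length 8 is evidence, not a proof.

```python
import re
from itertools import product


def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


# (regex, alphabet, description written as a Python predicate)
CASES = [
    (r"(0|1)*0", "01", lambda s: s.endswith("0")),
    (r"b*(abb*)*(a|)", "ab", lambda s: "aa" not in s),
    (r"(a|b)*aa(a|b)*", "ab", lambda s: "aa" in s),
]


def agrees(pattern, alphabet, predicate, max_len=8):
    """True if regex membership and the predicate agree on all short strings."""
    compiled = re.compile(pattern)
    return all(
        (compiled.fullmatch(s) is not None) == predicate(s)
        for s in all_strings(alphabet, max_len)
    )


for pattern, alphabet, predicate in CASES:
    print(pattern, agrees(pattern, alphabet, predicate))
```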
Regular Expressions
1. Keywords in Java?
2. Numbers in Java?
3. Identifiers in Java?
4. Whitespaces in Java?
Regular Expressions
1. Keywords in Java? ‘if’ | ‘else’ | ‘void’ | ...
2. Numbers in Java? 0 | [1-9][0-9]*
3. Identifiers in Java? [_a-zA-Z][_a-zA-Z0-9]*
4. Whitespaces in Java? (‘ ’ | ‘\n’ | ‘\t’ | ‘\r’)+
(\s is the regex for whitespace in Java)
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
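The four patterns can be tried out on sample lexemes. The sketch below uses Python's re module rather than java.util.regex, since the constructs involved (character classes, |, *, +) are written the same way in both. The dictionary PATTERNS and the function classes_of are names invented for this example, and the keyword list contains only the three keywords shown on the slide.

```python
import re

# Token-class patterns from the slide; the keyword list is deliberately incomplete.
PATTERNS = {
    "keyword": r"if|else|void",
    "number": r"0|[1-9][0-9]*",
    "identifier": r"[_a-zA-Z][_a-zA-Z0-9]*",
    "whitespace": r"( |\n|\t|\r)+",
}


def classes_of(lexeme):
    """Return the names of all classes whose pattern matches the whole lexeme."""
    return [name for name, pattern in PATTERNS.items() if re.fullmatch(pattern, lexeme)]


print(classes_of("if"))     # ['keyword', 'identifier']
print(classes_of("_tmp1"))  # ['identifier']
print(classes_of("0"))      # ['number']
print(classes_of("007"))    # []
print(classes_of(" \t\n"))  # ['whitespace']
```

Note that "if" belongs to two classes at once, and "007" belongs to none because the number pattern forbids leading zeros.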
Lexical Specifications
● Given a string s, determine whether s is in L(R), the set of strings that make up the language of R
● Break the input into tokens to pass on to the next phase
Lexical Specifications
1. Regex for all token classes
a. Number = 0|[1-9][0-9]*
b. Keywords = “if” | “else” | “then”
c. Identifiers = [a-zA-Z_][a-zA-Z_0-9]*
d. …
2. Construct R matching lexemes
a. R = Keyword | Identifier | Number | …
3. Let x1..xn be the input
For 1 ≤ i ≤ n, check whether x1..xi ∈ L(R)
If so, remove x1..xi from the input and repeat step 3
Question?
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Which one to choose?
Maximal Munch!
● How do we resolve ambiguities?
○ x1..xi ∈ L(R) and x1..xj ∈ L(R) s.t. i ≠ j
○ Always take the longer one!
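Maximal munch can be sketched by testing every prefix x1..xi against R and keeping the longest one that matches. The combined pattern R below is a made-up example, and maximal_munch is an illustrative name. Each prefix is checked explicitly because Python's | takes the first alternative that succeeds, not the longest.

```python
import re

# Hypothetical combined pattern R: operators, identifiers, numbers.
# "=" is listed before "==" on purpose: R.match("==0") would stop after one "=".
R = re.compile(r"=|==|[a-zA-Z_][a-zA-Z_0-9]*|0|[1-9][0-9]*")


def maximal_munch(text):
    """Return the longest prefix of text that is in L(R), or None if there is none."""
    best = None
    for i in range(1, len(text) + 1):
        if R.fullmatch(text[:i]):
            best = text[:i]  # a longer matching prefix replaces a shorter one
    return best


print(maximal_munch("==0"))   # ==
print(maximal_munch("x1=2"))  # x1
print(maximal_munch("ifx"))   # ifx
print(maximal_munch("@"))     # None
```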
Question?
● How do we resolve ambiguities?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Which one to choose?
Priority Ordering
● How do we resolve ambiguities?
○ x1..xi ∈ L(Rj) and x1..xi ∈ L(Rk) s.t. j ≠ k
○ Give priority to the rule that appears earlier in the rule set!
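Both tie-breaking rules fit into one small lexer loop: try the longest prefix first, and among the rules that match that prefix pick the one listed first. The following is a sketch under assumptions, not a production lexer. The RULES list, the extra whitespace class and the function names are all chosen for illustration.

```python
import re

# Rules in priority order: keywords come before identifiers.
RULES = [
    ("keyword", re.compile(r"if|else|then")),
    ("identifier", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("number", re.compile(r"0|[1-9][0-9]*")),
    ("whitespace", re.compile(r"\s+")),  # assumed extra class, skipped in the output
]


def next_token(text, start):
    """Longest prefix first (maximal munch), then the earliest matching rule."""
    for end in range(len(text), start, -1):
        for name, pattern in RULES:
            if pattern.fullmatch(text, start, end):
                return name, text[start:end]
    return None


def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        token = next_token(text, pos)
        if token is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, lexeme = token
        if name != "whitespace":
            tokens.append(token)
        pos += len(lexeme)
    return tokens


print(lex("if iffy then 42"))
# [('keyword', 'if'), ('identifier', 'iffy'), ('keyword', 'then'), ('number', '42')]
```

Here "if" matches both the keyword and the identifier rule and is classified as a keyword because that rule comes first, while "iffy" is an identifier because the longer match wins over the keyword prefix "if".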
No Rule Match?
If x1..xi ∉ L(R):
letter = [a-z]
digit = [0-9]
net = letter letter+
id_dom = letter (letter|digit)+
email = id_dom “@” id_dom “.” net
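The named definitions above compose like ordinary string building. As a rough sketch, they can be transcribed into Python's re syntax, with the quoted “.” written as an escaped dot so that it matches only a literal period. The sample addresses are invented for the example.

```python
import re

# Direct transcription of the definitions above.
letter = r"[a-z]"
digit = r"[0-9]"
net = letter + letter + "+"                        # two or more letters
id_dom = letter + "(" + letter + "|" + digit + ")+"  # a letter, then one or more letters or digits
email = re.compile(id_dom + "@" + id_dom + r"\." + net)

print(bool(email.fullmatch("ab1@cs22.edu")))  # True
print(bool(email.fullmatch("a@b.com")))       # False: id_dom needs at least two characters
print(bool(email.fullmatch("ab@cd.e")))       # False: net needs at least two letters
print(bool(email.fullmatch("Ab@cd.edu")))     # False: letter is lowercase only
```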