Topics:
 Lexical Analyzer
 Specification of Tokens
 Recognition of Tokens
 Data Structures Involved in Lexical Analysis

Lexical Analysis involves:
 The Role of the Lexical Analyzer
 Tokens, Patterns, and Lexemes
 Attributes for Tokens
Role of the Lexical Analyzer

 As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a token for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well: when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
 The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
Sometimes, lexical analyzers are divided into a cascade of two processes:
 a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
 b) Lexical analysis proper is the more complex portion, which produces the sequence of tokens as output.
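The getNextToken interaction described above can be sketched as a small interface between the two phases. This is an illustrative sketch, not code from the slides: the names Token, Lexer, and the token buffer are assumptions, and the real analyzer would read from a character stream rather than a prepared list.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Minimal sketch of the lexer-parser interface: the parser repeatedly
// calls getNextToken() until it receives an end-of-input token.
struct Token {
    std::string name;       // abstract token name, e.g. "id", "number"
    std::string attribute;  // optional attribute value, e.g. the lexeme
};

class Lexer {
    std::vector<Token> pending;  // stand-in for reading a character stream
    size_t next = 0;
public:
    explicit Lexer(std::vector<Token> toks) : pending(std::move(toks)) {}

    // Called by the parser: returns the next token, or "eof" when done.
    Token getNextToken() {
        if (next < pending.size()) return pending[next++];
        return {"eof", ""};
    }
};
```

A parser would loop on getNextToken() until the "eof" token appears, consuming one token per call.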
Tokens

When discussing lexical analysis, we use three related but distinct terms:
 A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface, and we will often refer to a token by its token name.
Patterns

 A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

Typical classes of tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. One token representing all identifiers.
3. One or more tokens representing constants, such as numbers and literal strings.
4. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Lexemes

 A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
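The token/lexeme distinction can be made concrete with a tiny classifier: given a lexeme, return the name of the token it is an instance of. This is an illustrative sketch; the keyword set and the function name tokenNameFor are assumptions, not part of the slides.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Given a lexeme (a character sequence from the source), return the
// token name it is an instance of: its own keyword token, "number"
// for constants, or "id" for identifiers.
std::string tokenNameFor(const std::string& lexeme) {
    static const std::set<std::string> keywords = {"if", "then", "else", "while"};
    if (keywords.count(lexeme)) return lexeme;            // one token per keyword
    if (!lexeme.empty() && std::isdigit((unsigned char)lexeme[0]))
        return "number";                                  // constant
    return "id";                                          // all identifiers share one token
}
```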
Specification of Tokens

 Strings and Languages
 Operations on Languages
 Regular Expressions
 Regular Definitions
 Extensions of Regular Expressions

 Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
 We shall see how these expressions are used in a lexical-analyzer generator, and how to build the lexical analyzer by converting regular expressions to automata that perform the recognition of the specified token patterns.

Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another important example of an alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
 The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six.
 The empty string, denoted ϵ, is the string of length zero.
 A language is any countable set of strings over some fixed alphabet.

Terms for Parts of Strings


A prefix of string S is any string obtained
by removing zero or more symbols from
the end of s. For example, ban, banana,
and E are prefixes of banana.



A suffix of string s is any string obtained
by removing zero or more symbols from
the beginning of s. For example, nana,
banana, and E are suffixes of banana.
A substring of s is obtained by deleting any
prefix and any suffix from s. For instance,
banana, nan, and E are substrings of
banana.
 The
proper prefixes, suffixes, and
substrings of a string s are those, prefixes,
suffixes, and substrings, respectively, of s
that are not E or not equal to s itself.
 A subsequence of s is any string formed
by deleting zero or more not necessarily
consecutive positions of s. For example,
baan is a subsequence of banana.

If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation; that is, for any string s, ϵs = sϵ = s.
 The definition of a language is very broad. Abstract languages like Ø, the empty set, or {ϵ}, the set containing only the empty string, are languages under this definition.

Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally below. Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.

 The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times. Note that L⁰, the "concatenation of L zero times," is defined to be {ϵ}, and inductively, Lⁱ is Lⁱ⁻¹L.
 Finally, the positive closure, denoted L⁺, is the same as the Kleene closure, but without the term L⁰. That is, ϵ will not be in L⁺ unless it is in L itself.
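For finite languages these operations can be computed directly. The sketch below is illustrative (the names Lang, unite, concat, and power are assumptions); it implements union, concatenation, and the bounded pieces Lⁱ of the Kleene closure exactly as defined above.

```cpp
#include <cassert>
#include <set>
#include <string>

using Lang = std::set<std::string>;  // a finite language over some alphabet

// Union of two languages: the familiar set union.
Lang unite(const Lang& a, const Lang& b) {
    Lang r = a;
    r.insert(b.begin(), b.end());
    return r;
}

// Concatenation: every string of a followed by every string of b,
// in all possible ways.
Lang concat(const Lang& a, const Lang& b) {
    Lang r;
    for (const auto& x : a)
        for (const auto& y : b) r.insert(x + y);
    return r;
}

// L^i: L^0 = {epsilon}, and inductively L^i = L^(i-1) L.
Lang power(const Lang& l, int i) {
    Lang r = {""};                  // {epsilon}, the empty string
    for (int k = 0; k < i; ++k) r = concat(r, l);
    return r;
}
```

The full Kleene closure L* would be the union of power(L, i) for all i ≥ 0, which is infinite unless L ⊆ {ϵ}.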

Regular Languages

 The set of regular languages over an alphabet Σ is defined recursively as below. Any language belonging to this set is a regular language over Σ.

Definition of the Set of Regular Languages:
 Basis Clause: Ø, {ϵ}, and {a} for any symbol a ∈ Σ are regular languages.
 Inductive Clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs, and Lr* are regular languages.
 Extremal Clause: Nothing is a regular language unless it is obtained from the above two clauses.
Regular Expressions

 Regular expressions are used to denote regular languages. They can represent regular languages and operations on them succinctly. The set of regular expressions over an alphabet Σ is defined recursively as below. Any element of that set is a regular expression.

 Basis Clause: Ø, ϵ, and a are regular expressions corresponding to the languages Ø, {ϵ}, and {a}, respectively, where a is an element of Σ.
 Inductive Clause: If r and s are regular expressions corresponding to languages Lr and Ls, then (r + s), (rs), and (r*) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs, and Lr*, respectively.
 Extremal Clause: Nothing is a regular expression unless it is obtained from the above two clauses.
Rules

 ϵ is a regular expression that denotes {ϵ}, the set containing the empty string.
 If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
 Suppose r and s are regular expressions denoting the languages L(r) and L(s); then:
 (r)|(s) is a regular expression denoting L(r) ∪ L(s).
 (r)(s) is a regular expression denoting L(r)L(s).
 (r)* is a regular expression denoting (L(r))*.
 (r) is a regular expression denoting L(r).
Example of Regular Expressions

Example: Let Σ = {a, b}.
 The regular expression a|b denotes the set {a, b}.
 The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another regular expression for this same set is aa|ab|ba|bb.
 The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ϵ, a, aa, aaa, aaaa, ...}.
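The second example above can be checked mechanically with C++'s std::regex, whose basic |, concatenation, and grouping behave as defined here (the function name matchesLengthTwo is illustrative):

```cpp
#include <cassert>
#include <regex>
#include <string>

// (a|b)(a|b) matches exactly the strings of a's and b's of length two.
bool matchesLengthTwo(const std::string& s) {
    static const std::regex re("(a|b)(a|b)");
    return std::regex_match(s, re);   // regex_match requires a full match
}
```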
Precedence Conventions

 The unary operator * has the highest precedence and is left associative.
 Concatenation has the second highest precedence and is left associative.
 | has the lowest precedence and is left associative.
 Under these conventions, (a)|((b)*(c)) may be written as a|b*c.

 The regular expression (a|b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's. Another regular expression for this set is (a*b*)*.
 The regular expression a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.
Regular Definitions

 These look like the productions of a context-free grammar we saw previously, but there are differences. Let Σ be an alphabet; then a regular definition is a sequence of definitions

d1 → r1
d2 → r2
...
dn → rn

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, di-1}.
 Note that each di can depend on all the previous d's.
 Note also that each di cannot depend on following d's. This is an important difference between regular definitions and productions.
Example: C identifiers can be described by the following regular definition:
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
CId → letter_ ( letter_ | digit )*
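The C-identifier definition above translates directly into std::regex syntax, where the character class [A-Za-z_] plays the role of letter_ (the function name isCIdentifier is illustrative):

```cpp
#include <cassert>
#include <regex>
#include <string>

// CId = letter_ ( letter_ | digit )* as a std::regex:
// one letter or underscore, then any number of letters, underscores, or digits.
bool isCIdentifier(const std::string& s) {
    static const std::regex re("[A-Za-z_][A-Za-z_0-9]*");
    return std::regex_match(s, re);
}
```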

Example: Unsigned numbers
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ϵ
optional_exponent → ( E ( + | - | ϵ ) digits ) | ϵ
num → digits optional_fraction optional_exponent
Extensions of Regular Expressions

 Many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers.

 One or more instances. The unary, postfix operator + represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language (L(r))+. The operator + has the same precedence and associativity as the operator *. Two useful algebraic laws, r* = r+ | ϵ and r+ = rr* = r*r, relate the Kleene closure and positive closure.
 Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r | ϵ, or put another way, L(r?) = L(r) ∪ {ϵ}. The ? operator has the same precedence and associativity as * and +.
Example

digit → 0 | 1 | 2 | … | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
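The refined definition of num above, with its + and ? operators, also carries over to std::regex (the function name isNum is illustrative):

```cpp
#include <cassert>
#include <regex>
#include <string>

// num = digits ( . digits )? ( E ( + | - )? digits )? as a std::regex.
bool isNum(const std::string& s) {
    static const std::regex re(R"([0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?)");
    return std::regex_match(s, re);
}
```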
 Character classes. A regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]. More importantly, when a1, a2, ..., an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|···|z.

Example
[A-Za-z][A-Za-z0-9]*


Lexical Analyzer

 Also called a scanner or tokenizer
 Converts a stream of characters into a stream of tokens

 Tokens are:
 Keywords such as for, while, and class
 Special characters such as +, -, (, and <
 Variable name occurrences
 Constant occurrences such as 1, 0, true
Comparison of Lexer and Parser Phases

Phase  | Input                  | Output
-------|------------------------|-------------------
Lexer  | Sequence of characters | Sequence of tokens
Parser | Sequence of tokens     | Parse tree
The role of the lexical analyzer

(Figure: the lexical analyzer reads the source program and, on each getNextToken request from the parser, returns a token; both the lexical analyzer and the parser consult the symbol table; the parser's output continues to semantic analysis.)
What Are Tokens For?

 The parser relies on token classification.
 e.g., how should reserved keywords be handled? As identifiers, or with a separate token for each keyword?
 The output of the lexer is a stream of tokens, which is the input to the parser.
 The lexer usually discards "uninteresting" tokens that do not contribute to parsing.
 Examples: whitespace, comments
Recognition of Tokens

Tasks of token recognition in a lexical analyzer:
 Isolate the lexeme for the next token in the input buffer.
 Produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page.
Recognition of Tokens

Translation table used in token recognition:

Regular expression | Token | Attribute value
-------------------|-------|-----------------------
if                 | if    | -
id                 | id    | Pointer to table entry
<                  | relop | LT
Recognition of Tokens

 The starting point is the language grammar, to understand the tokens:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ϵ
expr → term relop term
     | term
term → id
     | number
Recognition of Tokens (cont.)

 The next step is to formalize the patterns:

digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z_]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>

 We also need to handle whitespace:

ws → (blank | tab | newline)+

 The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as used by the lexical analyzer.
 The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws.

(Table: Tokens, their patterns, and attribute values.)
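The pattern table above can be prototyped with std::regex instead of hand-built automata. This is an illustrative sketch under several assumptions: the names tokenize and Tok are hypothetical, keywords are matched as identifiers and then reclassified (the symbol-table method discussed later), and ws is discarded rather than returned.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

struct Tok { std::string name, lexeme; };

// Match token patterns at the current position, in priority order,
// emitting <token name, lexeme> pairs and discarding whitespace.
std::vector<Tok> tokenize(const std::string& src) {
    static const std::vector<std::pair<std::string, std::regex>> spec = {
        {"ws",     std::regex(R"([ \t\n]+)")},
        {"number", std::regex(R"([0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?)")},
        {"id",     std::regex(R"([A-Za-z_][A-Za-z0-9_]*)")},
        {"relop",  std::regex(R"(<=|>=|<>|<|>|=)")},
    };
    std::vector<Tok> out;
    size_t pos = 0;
    while (pos < src.size()) {
        std::smatch m;
        bool matched = false;
        for (const auto& [name, re] : spec) {
            std::string rest = src.substr(pos);
            // match_continuous anchors the match at the current position
            if (std::regex_search(rest, m, re, std::regex_constants::match_continuous)) {
                std::string lex = m.str();
                if (name == "id" && (lex == "if" || lex == "then" || lex == "else"))
                    out.push_back({lex, lex});     // reclassify reserved words
                else if (name != "ws")             // ws is stripped out
                    out.push_back({name, lex});
                pos += lex.size();
                matched = true;
                break;
            }
        }
        if (!matched) { out.push_back({"error", std::string(1, src[pos])}); ++pos; }
    }
    return out;
}
```

Production lexer generators compile all the patterns into a single automaton instead of trying them one by one, but the token/pattern/attribute relationship is the same.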

Compiler Construction
LEXICAL ANALYSIS
Recognition of Tokens
Method to recognize the token: use a transition diagram.

Diagrams for Tokens

 Transition Diagrams (TD) are used to represent the tokens.
 Each transition diagram has:
 States: represented by circles
 Actions: represented by arrows between states
 Start state: beginning of a pattern (arrowhead)
 Final state(s): end of a pattern (concentric circles)
 Deterministic: no need to choose between two different actions
Recognition of Tokens

Transition diagram (stylized flowchart): depicts the actions that take place when a lexical analyzer is called by the parser to get the next token.

(Figure: part of the transition diagram for relop. From start state 0, input > leads to state 6; from state 6, input = reaches accepting state 7, which executes return(relop, GE); any other input reaches accepting state 8, marked *, which executes return(relop, GT).)

Notes: Here we use '*' to indicate states on which input retraction must take place. If it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we additionally place a * near that accepting state.

Ex: RELOP = < | <= | = | <> | > | >=
We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=.
Recognition of Identifiers

 Ex2: ID = letter ( letter | digit )*

(Transition diagram: from start state 9, a letter leads to state 10; letters and digits loop in state 10; any other input leads to accepting state 11, marked #, which executes return(id). The # indicates input retraction.)
Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in the figure.
When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.
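The reserved-word scheme just described can be sketched as follows. The table layout (a map from lexeme to token name) and the SymTable name are illustrative; a real symbol table would store richer entries and return pointers to them rather than token names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct SymTable {
    std::unordered_map<std::string, std::string> entries;  // lexeme -> token name

    SymTable() {
        // Install the reserved words in the symbol table initially;
        // their entries record the keyword token they represent.
        for (const char* kw : {"if", "then", "else"}) entries[kw] = kw;
    }

    // Returns the token for a lexeme: the keyword's own token if the
    // lexeme is reserved, otherwise id (installing the identifier if new).
    std::string installID(const std::string& lexeme) {
        auto it = entries.find(lexeme);
        if (it != entries.end()) return it->second;
        entries[lexeme] = "id";
        return "id";
    }
};
```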
 So far we have seen the transition diagrams for identifiers and the relational operators.
 What remains are:
 Whitespace
 Numbers
Recognizing whitespace:
We also want the lexer to remove whitespace, so we define a new token:
ws → ( blank | tab | newline )+
where blank, tab, and newline are symbols used to represent the corresponding ASCII characters.

Transition diagram for whitespace:
 The "delim" in the diagram represents any of the whitespace characters: space, tab, and newline.
 The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends; this character begins the next token.
 There is no action performed at the accepting state. Indeed, the lexer does not return to the parser, but starts again from its beginning, as it still must find the next token.

Transition Diagram for Numbers:

(Figure: the diagram has accepting states for an integer, e.g. 12; a float, e.g. 12.31; and a float with exponent, e.g. 12.31E4.)
Explanation with example:

 Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit or a dot, we have seen a number in the form of an integer; 12 is an example.
 If we instead see a dot in state 13, then we have an "optional fraction." State 14 is entered, and we look for one or more additional digits; state 15 is used for that purpose.
 If we see an E, then we have an "optional exponent," whose recognition is the job of states 16 through 19.
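The walk through states 12-19 can be simulated directly. The sketch below is illustrative: classifyNumber is a hypothetical name, a string index stands in for the forward pointer, the digit-loop states are collapsed into while loops, and (per the num definition) an exponent directly after the integer part also yields "float-exp".

```cpp
#include <cctype>
#include <string>

// Classify the number at the start of s: "integer", "float",
// "float-exp", or "" if s does not begin with a digit.
std::string classifyNumber(const std::string& s) {
    size_t i = 0;
    while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;  // states 12-13
    if (i == 0) return "";                                          // no digits: not a number
    std::string kind = "integer";
    if (i + 1 < s.size() && s[i] == '.' && std::isdigit((unsigned char)s[i + 1])) {
        ++i;                                                        // state 14: consume '.'
        while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;  // state 15
        kind = "float";
    }
    size_t save = i;                                // remember position for retraction
    if (i < s.size() && s[i] == 'E') {              // state 16
        ++i;
        if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;      // state 17
        size_t d = i;
        while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;  // state 18
        if (i > d) kind = "float-exp";
        else i = save;                              // no digits after E: retract
    }
    return kind;
}
```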
Architecture of a Transition-Diagram-Based Lexical Analyzer:

 Each state is represented by a piece of code. We may imagine a variable state holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.
Transition diagram for relop:
Sketch of the implementation of the relop transition diagram:
 getRelop(), a C++ function whose job is to simulate the transition diagram and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value.
 getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop.
 We see the typical behavior of a state in case 0, the case where the current state is 0. A function nextChar() obtains the next character from the input and assigns it to local variable c. We then check c for the three characters we expect to find, making the state transition dictated by the transition diagram. For example, if the next input character is =, we go to state 5.
 If the next input character is not one that can begin a comparison operator, then a function fail() is called; it should reset the forward pointer to lexemeBegin.
 Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of lexeme >, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator.
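A runnable sketch of getRelop() following this description is given below. It is a simplification, not the slides' exact code: the input buffer is a plain string with a forward index, nextChar()/retract() are minimal local helpers, and fail() simply throws instead of resetting to lexemeBegin.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum Attr { LT, LE, EQ, NE, GT, GE };
struct TOKEN { std::string name; Attr attribute; };

struct RelopLexer {
    std::string buf;     // simplified input buffer
    size_t forward = 0;  // forward pointer

    char nextChar() {                       // read one character ('\0' at end)
        char c = forward < buf.size() ? buf[forward] : '\0';
        ++forward;
        return c;
    }
    void retract() { --forward; }           // put the last character back
    [[noreturn]] void fail() { throw std::runtime_error("not a relop"); }

    // Simulate the relop transition diagram, one switch case per state.
    TOKEN getRelop() {
        TOKEN retToken{"relop", LT};        // first component: token name relop
        int state = 0;
        while (true) {
            switch (state) {
            case 0: { char c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();
                break; }
            case 1: { char c = nextChar();
                if (c == '=') state = 2;
                else if (c == '>') state = 3;
                else state = 4;
                break; }
            case 2: retToken.attribute = LE; return retToken;
            case 3: retToken.attribute = NE; return retToken;
            case 4: retract(); retToken.attribute = LT; return retToken;  // state 4 bears *
            case 5: retToken.attribute = EQ; return retToken;
            case 6: { char c = nextChar();
                if (c == '=') state = 7;
                else state = 8;
                break; }
            case 7: retToken.attribute = GE; return retToken;
            case 8: retract(); retToken.attribute = GT; return retToken;  // state 8 bears *
            }
        }
    }
};
```

Note how states 4 and 8 call retract() before returning, exactly because they are the *-marked states: the character that took us there belongs to the next lexeme.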
"The symbol table is a data structure used by all phases of the compiler to keep track of user-defined symbols and keywords."

 In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source, such as its type, scope level, and sometimes its location.
 During early phases (lexical and syntax analysis) symbols are discovered and put into the symbol table.
 During later phases symbols are looked up to validate their usage.
Uses

 An object file will contain a symbol table of the identifiers it contains that are externally visible. During the linking of different object files, a linker will use these symbol tables to resolve any unresolved references.
 A symbol table may only exist during the translation process, or it may be embedded in the output of that process for later exploitation, for example, during an interactive debugging session, or as a resource for formatting a diagnostic report during or after execution of a program.
 While reverse engineering an executable, many tools refer to the symbol table to check what address has been assigned to global variables and functions. If the symbol table has been stripped or cleaned out before being converted into an executable, tools will find it harder to determine addresses or understand anything about the program.
 When accessing variables and allocating memory dynamically, a compiler must perform many operations; as such, the extended stack model requires the symbol table.
Typical symbol table activities:
 Add a new name
 Add information for a name
 Access information for a name
 Determine if a name is present in the table
 Remove a name
 Revert to a previous usage for a name (close a scope)

Many possible implementations:
 Linear list
 Sorted list
 Hash table
 Tree structure
Typical information fields:
 Print value
 Kind (e.g., reserved, type_id, var_id, func_id, etc.)
 Block number/level number
 Type
 Initial value
 Base address, etc.
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsManeerUddin
 

Kürzlich hochgeladen (20)

4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 

Lexical Analysis Guide

  • 3. Topics: Lexical Analyzer: specification of tokens; recognition of tokens; data structures involved in lexical analysis.
  • 4. Lexical Analysis. It involves: the role of the lexical analyzer; tokens, patterns, and lexemes; and attributes for tokens.
  • 5. Role of Lexical Analyzer: As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.
  • 6. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
  • 7. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
  • 8. Sometimes, lexical analyzers are divided into a cascade of two processes: a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one. b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
  • 9. Token: When discussing lexical analysis, we use three related but distinct terms. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.
  • 10. Patterns: A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
  • 11. 1. One token for each keyword; the pattern for a keyword is the same as the keyword itself. 2. One token representing all identifiers. 3. One or more tokens representing constants, such as numbers and literal strings. 4. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
  • 12. Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
  • 13. Outline: Strings and Languages; Operations on Languages; Regular Expressions; Regular Definitions; Extensions of Regular Expressions.
  • 14. Specification of Tokens: Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
  • 15. We shall see how these expressions are used in a lexical-analyzer generator, and how to build the lexical analyzer by converting regular expressions to automata that perform the recognition of the specified token patterns.
  • 16. Strings and Languages: An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another important example of an alphabet.
  • 17. A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ɛ, is the string of length zero. A language is any countable set of strings over some fixed alphabet.
  • 18. Terms for Parts of Strings: A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ɛ are prefixes of banana. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ɛ are suffixes of banana.
  • 19. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ɛ are substrings of banana. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ɛ and not equal to s itself. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
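The prefix, suffix, and substring definitions above can be checked mechanically. A minimal Python sketch (the function names are ours, not from the slides):

```python
# Enumerate the parts-of-strings defined on this slide.

def prefixes(s):
    """All strings obtained by removing zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All strings obtained by removing zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    """All strings obtained by deleting any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

assert "ban" in prefixes("banana")
assert "nana" in suffixes("banana")
assert "nan" in substrings("banana")
assert "" in prefixes("banana")     # ɛ is always a prefix, suffix, and substring
```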
  • 20. If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse. The empty string is the identity under concatenation. The definition of a language is very broad: abstract languages like Ø, the empty set, or {ɛ}, the set containing only the empty string, are languages under this definition.
  • 21. Operations on Languages In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally. Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways , and concatenating them.
  • 22. The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times. Note that L⁰, the "concatenation of L zero times," is defined to be {ɛ}, and inductively, Lⁱ is Lⁱ⁻¹L. Finally, the positive closure, denoted L⁺, is the same as the Kleene closure, but without the term L⁰. That is, ɛ will not be in L⁺ unless it is in L itself.
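The three operations can be sketched on small finite languages. Since L* is infinite in general, the closure below is bounded at a maximum string length (this bound is our device for the demonstration, not part of the definition):

```python
# Union, concatenation, and (length-bounded) Kleene closure on finite
# languages represented as Python sets of strings.

def concat(L1, L2):
    """All strings xy with x in L1 and y in L2."""
    return {x + y for x in L1 for y in L2}

def closure(L, max_len):
    """Approximate L*: the union of L^0, L^1, ... keeping strings up to max_len."""
    result = {""}               # L^0 = {ɛ}
    current = {""}
    while True:
        current = {s for s in concat(current, L) if len(s) <= max_len}
        if not current or current <= result:
            break
        result |= current       # add L^i and continue with L^(i+1) = L^i L
    return result

L = {"a", "b"}
print(sorted(concat(L, L)))         # ['aa', 'ab', 'ba', 'bb']
print(sorted(closure({"a"}, 3)))    # ['', 'a', 'aa', 'aaa']
```

Union needs no helper: it is ordinary set union (`L1 | L2`).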
  • 23. Regular Language: The set of regular languages over an alphabet ∑ is defined recursively as below. Any language belonging to this set is a regular language over ∑.
  • 24. Definition of the Set of Regular Languages: Basis clause: Ø, {ɛ}, and {a} for any symbol a ∈ ∑ are regular languages. Inductive clause: If Lr and Ls are regular languages, then Lr ∪ Ls, LrLs, and Lr* are regular languages. Extremal clause: Nothing is a regular language unless it is obtained from the above two clauses.
  • 25. Regular expression: Regular expressions are used to denote regular languages; they can represent regular languages and operations on them succinctly. The set of regular expressions over an alphabet ∑ is defined recursively as below. Any element of that set is a regular expression. Basis clause: Ø, ɛ, and a are regular expressions corresponding to the languages Ø, {ɛ}, and {a}, respectively, where a is an element of ∑.
  • 26. Inductive clause: If r and s are regular expressions corresponding to the languages Lr and Ls, then (r + s), (rs), and (r*) are regular expressions corresponding to the languages Lr ∪ Ls, LrLs, and Lr*, respectively. Extremal clause: Nothing is a regular expression unless it is obtained from the above two clauses.
  • 27. Rules: ɛ is a regular expression that denotes {ɛ}, the set containing the empty string. If a is a symbol in ∑, then a is a regular expression that denotes {a}, the set containing the string a. Suppose r and s are regular expressions denoting the languages L(r) and L(s); then (r)|(s) is a regular expression denoting L(r) ∪ L(s); (r)(s) is a regular expression denoting L(r)L(s); (r)* is a regular expression denoting (L(r))*; and (r) is a regular expression denoting L(r).
  • 28. Example of Regular Expressions: Let ∑ = {a, b}. The regular expression a|b denotes the set {a, b}. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two; another regular expression for this same set is aa|ab|ba|bb. The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ɛ, a, aa, aaa, aaaa, ...}.
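These examples can be verified directly. A quick check using Python's re module (not part of the slides), whose syntax for |, concatenation, and * closely matches the textbook notation:

```python
import re

# fullmatch() anchors the pattern to the whole string, which corresponds
# to asking whether the string is in the denoted language.
assert re.fullmatch(r"a|b", "a")
assert re.fullmatch(r"a|b", "b")
assert not re.fullmatch(r"a|b", "ab")              # length-one strings only
assert re.fullmatch(r"(a|b)(a|b)", "ba")           # any length-two string of a's and b's
assert not re.fullmatch(r"(a|b)(a|b)", "abb")
for s in ["", "a", "aa", "aaaa"]:
    assert re.fullmatch(r"a*", s)                  # zero or more a's, including ɛ
```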
  • 29. Precedence Conventions: The unary operator * has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative. | has the lowest precedence and is left associative. Under these conventions, (a)|((b)*(c)) and a|b*c denote the same language.
  • 30. The regular expression (a|b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's; another regular expression for this set is (a*b*)*. The regular expression a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.
  • 32. where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ ∪ {d1, d2, ..., di-1}. Note that each di can depend on all the previous d's, but not on any following d's. This is an important difference between regular definitions and productions.
  • 33. Regular Definitions: These look like the productions of a context-free grammar we saw previously, but there are differences. Let ∑ be an alphabet; then a regular definition is a sequence of definitions d1 → r1, d2 → r2, ..., dn → rn.
  • 34. Example: C identifiers can be described by the regular definition letter_ → A | B | ... | Z | a | b | ... | z | _ ; digit → 0 | 1 | ... | 9 ; CId → letter_ ( letter_ | digit )*. Example: unsigned numbers: digit → 0 | 1 | 2 | ... | 9 ; digits → digit digit* ; optional_fraction → . digits | ϵ ; optional_exponent → ( E ( + | - | ϵ ) digits ) | ϵ ; num → digits optional_fraction optional_exponent.
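The C-identifier definition collapses into a single pattern once the names letter_ and digit are substituted in. A sketch using Python's regex syntax:

```python
import re

# letter_ ( letter_ | digit )* with the names substituted away:
# one character class for letter_, one for letter_ | digit.
c_id = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

assert c_id.fullmatch("_count1")
assert c_id.fullmatch("x")
assert c_id.fullmatch("MAX_LEN")
assert not c_id.fullmatch("1abc")    # may not begin with a digit
assert not c_id.fullmatch("a-b")     # hyphen is not letter_ or digit
```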
  • 35. Extensions of Regular Expressions: Many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers. One or more instances: the unary, postfix operator + represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language (L(r))+. The operator + has the same precedence and associativity as the operator *. Two useful algebraic laws, r* = r+ | ϵ and r+ = rr* = r*r, relate the Kleene closure and the positive closure.
  • 36. Zero or one instance: the unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r | ϵ, or put another way, L(r?) = L(r) ∪ {ϵ}. The ? operator has the same precedence and associativity as * and +. Example: digit → 0 | 1 | 2 | ... | 9 ; digits → digit+ ; optional_fraction → ( . digits )? ; optional_exponent → ( E ( + | - )? digits )? ; num → digits optional_fraction optional_exponent.
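With + and ?, the whole unsigned-number definition becomes one compact pattern. A sketch of num as a single regex:

```python
import re

# num -> digits ( . digits )? ( E ( + | - )? digits )?
# written with the + and ? extensions from this slide.
num = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ["12", "12.31", "12.31E4", "0.5E-2"]:
    assert num.fullmatch(s)
for s in ["12.", ".5", "E4"]:        # fraction needs digits on both sides,
    assert not num.fullmatch(s)      # and the integer part is mandatory
```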
  • 37. Character classes: A regular expression a1|a2|...|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2...an]. More importantly, when a1, a2, ..., an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|...|z. Example: [A-Za-z][A-Za-z0-9]*
  • 39. Lexical analyzer: Also called a scanner or tokenizer. Converts a stream of characters into a stream of tokens. Tokens include: keywords such as for, while, and class; special characters such as +, -, (, and <; variable name occurrences; and constant occurrences such as 1, 0, true.
  • 40. Comparison with the parsing phase: the lexer takes a sequence of characters as input and produces a sequence of tokens as output; the parser takes a sequence of tokens as input and produces a parse tree as output.
  • 41. The role of the lexical analyzer: the lexical analyzer reads the source program and, on each getNextToken call from the parser, returns the next token; both phases consult the symbol table, and the parser's output goes on to semantic analysis.
  • 42. What are Tokens For? The parser relies on token classification, e.g., how should reserved keywords be handled: as identifiers, or with a separate keyword token for each? The output of the lexer is a stream of tokens, which is the input to the parser. The lexer usually discards "uninteresting" tokens that do not contribute to parsing, for example white space and comments.
  • 43. Recognition of Tokens: The task of token recognition in a lexical analyzer is to isolate the lexeme for the next token in the input buffer and to produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page.
  • 44. Recognition of Tokens: regular expression < has token relop with attribute value LT; regular expression if has token if; regular expression id has token id with attribute value a pointer to the table entry.
  • 45. Recognition of tokens  Starting point is the language grammar to understand the tokens: stmt -> if expr then stmt | if expr then stmt else stmt |Ɛ expr -> term relop term | term term -> id | number
  • 46. Recognition of tokens (cont.): The next step is to formalize the patterns: digit -> [0-9], digits -> digit+, number -> digits(.digits)?(E[+-]?digits)?, letter -> [A-Za-z_], id -> letter(letter|digit)*, if -> if, then -> then, else -> else, relop -> < | > | <= | >= | = | <>. We also need to handle whitespace: ws -> (blank | tab | newline)+. The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as used by the lexical analyzer. The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws.
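One way to apply these patterns is the combined-regex technique (a sketch, not the book's implementation): all patterns are joined into one regex with named groups, keywords are listed before id so that "if" is not tokenized as an identifier, and ws matches are discarded.

```python
import re

# Token patterns from the slide, as named groups in a single master regex.
TOKEN_SPEC = [
    ("ws",      r"[ \t\n]+"),
    ("keyword", r"(?:if|then|else)\b"),
    ("number",  r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",      r"[A-Za-z_][A-Za-z_0-9]*"),
    ("relop",   r"<=|>=|<>|<|>|="),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    """Return (token-name, lexeme) pairs, stripping the ws 'token'."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src)
            if m.lastgroup != "ws"]

print(tokenize("if x1 <= 12.31E4 then y"))
# [('keyword', 'if'), ('id', 'x1'), ('relop', '<='), ('number', '12.31E4'),
#  ('keyword', 'then'), ('id', 'y')]
```

Note that finditer() silently skips characters no pattern matches; a real scanner would report a lexical error there instead.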
  • 47. Tokens, their patterns, and attribute values Compiler Construction
  • 48. Lexical Analysis: Recognition of Tokens. Method to recognize the tokens: use transition diagrams.
  • 49. Diagrams for Tokens: Transition diagrams (TDs) are used to represent the tokens. Each transition diagram has: states, represented by circles; actions, represented by arrows between states; a start state, marking the beginning of a pattern (indicated by an arrowhead); and final state(s), marking the end of a pattern (represented by concentric circles). The diagrams are deterministic: there is never a choice between two different actions.
  • 50. Recognition of Tokens: A transition diagram (a stylized flowchart) depicts the actions that take place when a lexical analyzer is called by the parser to get the next token. For the > operators: from start state 0, on > go to state 6; from state 6, on = go to state 7 and return(relop, GE); on any other character go to state 8* and return(relop, GT). Note: here we use '*' to indicate states on which input retraction must take place.
  • 51. If it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
  • 52. Ex: RELOP = < | <= | = | <> | > | >=. We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=.
  • 53. Recognition of Identifiers. Ex2: ID = letter(letter | digit)*. Transition diagram: from start state 9, on a letter go to state 10; state 10 loops on letter or digit; on any other character go to state 11* and return(id). The # marks the input retraction: the extra character is pushed back onto the input.
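The identifier diagram (states 9-11) can be simulated directly. A sketch, treating _ as a letter the way C does; the function name and (lexeme, position) return shape are our choices:

```python
def get_id(src, pos):
    """Return (lexeme, new_pos) for an identifier starting at pos, or None."""
    # State 9: the first character must be a letter (or underscore).
    if pos >= len(src) or not (src[pos].isalpha() or src[pos] == "_"):
        return None
    forward = pos + 1
    # State 10: loop while we keep seeing letters or digits.
    while forward < len(src) and (src[forward].isalnum() or src[forward] == "_"):
        forward += 1
    # State 11 (*): the character at 'forward' is NOT consumed -- this is
    # the input retraction marked # in the diagram.
    return src[pos:forward], forward

print(get_id("count1 = 0", 0))   # ('count1', 6)
```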
  • 54. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in the figure. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id.
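The reserved-word technique can be sketched with a dictionary as the symbol table; install_id below is a stand-in for the slide's installID (returning the token name rather than a pointer, for simplicity):

```python
# Pre-install the reserved words so they are never ordinary identifiers.
symbol_table = {}
for kw in ("if", "then", "else"):
    symbol_table[kw] = {"token": kw}      # the token field names the keyword

def install_id(lexeme):
    """Enter lexeme if new; return its token name (a keyword name or 'id')."""
    entry = symbol_table.setdefault(lexeme, {"token": "id"})
    return entry["token"]

assert install_id("if") == "if"           # reserved word, found pre-installed
assert install_id("count") == "id"        # new identifier, now in the table
assert "count" in symbol_table
```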
  • 56. So far we have seen the transition diagrams for identifiers and the relational operators. What remain are whitespace and numbers.
  • 57. Recognizing white space: We also want the lexer to remove whitespace, so we define a new token ws → ( blank | tab | newline )+, where blank, tab, and newline are symbols used to represent the corresponding ASCII characters.
  • 58. Transition diagram for white space:
  • 59. The "delim" in the diagram represents any of the whitespace characters, say space, tab, and newline. The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends and this character begins the next token. There is no action performed at the accepting state; indeed, the lexer does not return to the parser, but starts again from its beginning, as it still must find the next token.
  • 60. Transition diagram for numbers, with accepting states for an integer (e.g. 12), a float (e.g. 12.31), and a float with an exponent (e.g. 12.31E4).
  • 61. Explanation with an example: Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit or a dot, we have seen a number in the form of an integer; 12 is an example.
  • 62. If we instead see a dot in state 13, then we have an "optional fraction." State 14 is entered, and we look for one or more additional digits; state 15 is used for that purpose.
  • 63. If we see an E, then we have an "optional exponent," whose recognition is the job of states 16 through 19.
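The walk through states 12-19 can be sketched as code. This is our simulation, not the book's; the state numbers appear as comments, and the accepting states with a * retract by simply not advancing past the last accepted position:

```python
def get_number(src, pos):
    """Return (lexeme, new_pos) for the longest number at pos, or None."""
    i = pos
    if i >= len(src) or not src[i].isdigit():
        return None                               # state 12 requires a digit
    while i < len(src) and src[i].isdigit():      # state 13: integer digits
        i += 1
    end = i                                       # accept as integer so far
    if i < len(src) and src[i] == ".":            # state 14: optional fraction
        j = i + 1
        while j < len(src) and src[j].isdigit():  # state 15: fraction digits
            j += 1
        if j > i + 1:                             # need at least one digit
            i = end = j
    if i < len(src) and src[i] == "E":            # state 16: optional exponent
        j = i + 1
        if j < len(src) and src[j] in "+-":       # state 17: optional sign
            j += 1
        k = j
        while k < len(src) and src[k].isdigit():  # states 18-19
            k += 1
        if k > j:
            end = k
    return src[pos:end], end

print(get_number("12.31E4+x", 0))   # ('12.31E4', 7)
print(get_number("12.x", 0))        # ('12', 2) -- the dot is retracted
```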
  • 64. Architecture of a Transition-Diagram-Based Lexical Analyzer: Each state is represented by a piece of code. We may imagine a variable state holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.
  • 66. Sketch of implementation of relop transition diagram
  • 67. getRelop() is a C++ function whose job is to simulate the transition diagram and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value. getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop.
  • 68. We see the typical behavior of a state in case 0, the case where the current state is 0. A function nextChar() obtains the next character from the input and assigns it to local variable c. We then check c for the three characters we expect to find, making the state transition dictated by the transition diagram. For example, if the next input character is =, we go to state 5.
  • 69. If the next input character is not one that can begin a comparison operator, then a function fail() is called, and it should reset the forward pointer to lexemeBegin.
  • 70. Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of lexeme >, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator.
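The behavior getRelop() is described as having can be sketched compactly (a Python stand-in, not the book's C++; token codes are plain strings, and the retraction at the *-states is expressed by not advancing past the one-character lexeme):

```python
def get_relop(src, pos):
    """Return ((token, attribute), new_pos) for a relop at pos, or None."""
    if pos >= len(src):
        return None
    c = src[pos]                                     # nextChar()
    if c == "<":                                     # states for <, <=, <>
        if pos + 1 < len(src) and src[pos + 1] == "=":
            return ("relop", "LE"), pos + 2
        if pos + 1 < len(src) and src[pos + 1] == ">":
            return ("relop", "NE"), pos + 2
        return ("relop", "LT"), pos + 1              # *-state: retract lookahead
    if c == "=":                                     # state 5
        return ("relop", "EQ"), pos + 1
    if c == ">":                                     # states 6-8
        if pos + 1 < len(src) and src[pos + 1] == "=":
            return ("relop", "GE"), pos + 2          # state 7
        return ("relop", "GT"), pos + 1              # state 8*: retract()
    return None                                      # fail(): not a relop

assert get_relop(">=y", 0) == (("relop", "GE"), 2)
assert get_relop(">y", 0) == (("relop", "GT"), 1)
```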
  • 72. “The symbol table is a data structure used by all phases of the compiler to keep track of user defined symbols and keywords.”
  • 73.  In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source, such as its type, scope level and sometimes its location.
  • 74. During the early phases (lexical and syntax analysis) symbols are discovered and put into the symbol table. During later phases symbols are looked up to validate their usage.
  • 75. Uses: An object file will contain a symbol table of the identifiers it contains that are externally visible. During the linking of different object files, a linker will use these symbol tables to resolve any unresolved references. A symbol table may exist only during the translation process, or it may be embedded in the output of that process for later exploitation, for example during an interactive debugging session, or as a resource for formatting a diagnostic report during or after execution of a program.
  • 76. While reverse engineering an executable, many tools refer to the symbol table to check what addresses have been assigned to global variables and functions. If the symbol table has been stripped or cleaned out before the program was converted into an executable, tools will find it harder to determine addresses or understand anything about the program. When accessing variables and allocating memory dynamically, a compiler must perform much work, and as such the extended stack model requires the symbol table.
  • 77. Typical symbol table activities: add a new name; add information for a name; access information for a name; determine if a name is present in the table; remove a name; revert to a previous usage for a name (close a scope).
  • 78. Many possible implementations: linear list, sorted list, hash table, tree structure.
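The hash-table option combines naturally with the scope-closing activity from the previous slide: a stack of dicts, where popping a scope reverts every name to its previous usage. A minimal sketch (the class and method names are ours):

```python
class SymbolTable:
    """Hash-table symbol table with nested scopes as a stack of dicts."""

    def __init__(self):
        self.scopes = [{}]                   # the global scope

    def open_scope(self):
        self.scopes.append({})

    def close_scope(self):
        self.scopes.pop()                    # revert names to outer bindings

    def add(self, name, info):
        self.scopes[-1][name] = info         # declare in the current scope

    def lookup(self, name):
        for scope in reversed(self.scopes):  # innermost declaration wins
            if name in scope:
                return scope[name]
        return None

st = SymbolTable()
st.add("x", {"type": "int"})
st.open_scope()
st.add("x", {"type": "float"})                # shadows the outer x
assert st.lookup("x")["type"] == "float"
st.close_scope()
assert st.lookup("x")["type"] == "int"        # outer binding restored
```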
  • 79. Typical information fields: print value; kind (e.g. reserved, type_id, var_id, func_id, etc.); block number/level number; type; initial value; base address; etc.