2. Lexical Analysis
• 1st phase of the compiler.
• Part of the front end of the compiler.
• It is usually implemented as a subroutine or co-routine of the parser.
3. Lexical Analysis
• Reads the source program as a file of characters and divides it up into tokens.
• Tokens are recognized using Finite Automata or Transition Diagrams.
4. Lexical Analysis
• The vital functions of the Lexical Analyzer are:
– Separate tokens from the program.
– Insert tokens into the symbol table.
– Eliminate whitespace, comments, newline characters, etc. from the string.
– Give tokens to the parser (pass their integer number/code to the parser).
– Keep track of line numbers (for error handling).
– Pre-processing of macros.
5. Lexical Analysis
• It reads characters from the input and groups them into lexemes.
• Then, it passes the tokens formed by the lexemes, together with their attribute values, to the parser.
• In some situations, it has to read some characters ahead before it can decide on the token to be returned to the parser.
6. Lexical Analysis
• Ex: a lexical analyzer for C must read ahead after it sees the character ‘>’.
• If the next character is ‘=’, then the character sequence >= is the lexeme forming the token for the “greater than or equal to” operator.
• Otherwise, > is the lexeme forming the “greater than” operator, and the lexical analyzer has read one character too many.
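The >/>= decision above can be sketched as a small Python helper (`scan_relop` is a hypothetical name, not from the slides; it assumes the input is a plain string indexed by position):

```python
def scan_relop(text, pos):
    """Return (token_name, lexeme, next_pos) for a '>' at text[pos].

    Reads one character ahead; if that character is not '=', it is
    effectively "pushed back" by not advancing next_pos past it.
    """
    assert text[pos] == '>'
    if pos + 1 < len(text) and text[pos + 1] == '=':
        return ('GE', '>=', pos + 2)   # "greater than or equal to"
    return ('GT', '>', pos + 1)        # '>' alone; lookahead char stays in input

print(scan_relop('a >= b', 2))  # ('GE', '>=', 4)
print(scan_relop('a > b', 2))   # ('GT', '>', 3)
```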
7. Lexical Analysis
• The extra character has to be pushed back onto the input, because it can be the beginning of the next lexeme in the input.
• The lexical analyzer and parser form a producer-consumer pair.
• Produced tokens can be held in a token buffer until they are consumed.
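The producer-consumer pairing can be modeled with a Python generator, where the parser pulls one token at a time on demand (the token classes and the whitespace-split lexing below are simplifications for illustration):

```python
def tokens(source):
    """Producer: yield (token_name, lexeme) pairs one at a time."""
    for word in source.split():          # crude lexeme separation, for illustration only
        if word.isdigit():
            yield ('NUM', word)
        elif word.isidentifier():
            yield ('ID', word)
        else:
            yield ('OP', word)

# Consumer: the "parser" requests tokens on demand.
stream = tokens('x = y + 42')
print(next(stream))  # ('ID', 'x')
print(next(stream))  # ('OP', '=')
```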
8. Input Buffering
• The interaction between the two is constrained only by the size of the buffer, because:
– the lexical analyzer cannot proceed when the buffer is full.
– the parser cannot proceed when the buffer is empty.
• Commonly, the buffer holds just one token.
• The Lexical Analyzer can be a procedure called by the parser, returning tokens on demand.
9. Input Buffering
• The implementation of reading and pushing back
character is usually done by setting up an input buffer.
• A block of character is read into the buffer at a time.
• A pointer keeps track of the portion of the input that
has been analysed.
• Pushing back a character is implemented by moving
the pointer.
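Pushback by moving a pointer can be sketched as a minimal buffer class (names are illustrative, not from the slides; a real buffer would also refill itself from the file in blocks):

```python
class InputBuffer:
    def __init__(self, block):
        self.buf = block      # one block of characters read in at a time
        self.pos = 0          # pointer into the analyzed portion

    def next_char(self):
        ch = self.buf[self.pos]
        self.pos += 1
        return ch

    def push_back(self):
        self.pos -= 1         # "unread" the last character by moving the pointer

b = InputBuffer('>a')
b.next_char()        # '>'
b.next_char()        # 'a' -- one character too many
b.push_back()        # put 'a' back on the input
print(b.next_char())  # 'a' is read again as the start of the next lexeme
```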
10. Input Buffering
• The source program is read from the hard disk.
• The LA would have to access secondary memory every time it needs to identify tokens.
• This can be time consuming, hence an input buffer is used.
• The LA scans the input string from left to right, one character at a time, to identify tokens.
11. Input Buffering
• It uses 2 pointers:
– Begin Pointer (bptr) – points to the beginning of the current lexeme.
– Look-ahead Pointer (lptr) – moves ahead to search for the end of the token.
• Ex: int a, b;
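For int a, b; the two pointers work roughly as follows (a sketch only; bptr and lptr are modeled as indices into the string, and only identifiers/keywords and single punctuation characters are handled):

```python
def next_lexeme(s, bptr):
    """Advance lptr from bptr to the end of one token; return (lexeme, new_bptr)."""
    while bptr < len(s) and s[bptr].isspace():   # skip whitespace before the token
        bptr += 1
    lptr = bptr
    while lptr < len(s) and (s[lptr].isalnum() or s[lptr] == '_'):
        lptr += 1                                # look ahead for end of identifier/keyword
    if lptr == bptr:                             # otherwise: one punctuation character
        lptr += 1
    return s[bptr:lptr], lptr

s = 'int a, b;'
lexeme, p = next_lexeme(s, 0)
print(lexeme)  # 'int'
```

Calling `next_lexeme` repeatedly yields the lexemes int, a, ,, b, ; in order, with bptr reset to lptr after each token.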
14. Tokens, Lexemes & Patterns
• Connected with lexical analysis are three important terms with related meanings:
• Lexeme
• Token
• Pattern
15. Tokens, Lexemes & Patterns
• A token is a pair consisting of a token name and an optional attribute value.
• Token names:
– keywords,
– operators,
– identifiers,
– constants,
– literal strings,
– punctuation symbols (such as commas, semicolons).
16. Tokens, Lexemes & Patterns
A lexeme is a sequence of characters in the
source program that matches the pattern for a
token and is identified by the lexical analyzer as
an instance of that token.
Ex: Relation Operators {<,<=,>,>=,==,<>}
17. Tokens, Lexemes & Patterns
• A pattern is a description of the form that the lexemes of a token may take.
• It gives an informal or formal description of a token. Ex: identifier.
• Patterns give a precise description/specification of tokens.
• They are used to automatically generate a lexical analyzer.
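Patterns are typically written as regular expressions, and Python's `re` module can turn such a specification directly into a small lexer (the token classes below are illustrative; generators like lex/flex automate the same idea at scale):

```python
import re

# Pattern specification: one regular expression per token class.
spec = [
    ('NUM',   r'\d+'),
    ('ID',    r'[A-Za-z_]\w*'),
    ('RELOP', r'<=|>=|==|<>|<|>'),   # the relational-operator lexemes from slide 16
    ('SKIP',  r'\s+'),               # whitespace is matched but discarded
]
master = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in spec))

def tokenize(text):
    """Return (token_name, lexeme) pairs, dropping whitespace."""
    return [(m.lastgroup, m.group()) for m in master.finditer(text)
            if m.lastgroup != 'SKIP']

print(tokenize('x >= 10'))  # [('ID', 'x'), ('RELOP', '>='), ('NUM', '10')]
```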