Course Aims
• Any program written in a programming language
must be translated before it can be executed.
• This translation is typically accomplished by a
software system called a compiler.
• This course aims to introduce the principles and
techniques used to perform this translation and the
issues that arise in the construction of a compiler.
Learning Outcomes:
• A student successfully completing this course should be able to:
• understand the principles governing all phases of the compilation
process.
• understand the role of each of the basic components of a
standard compiler.
• show awareness of the problems of each phase of the compilation
process and of the methods and techniques applied to it.
• apply standard techniques to solve basic problems that arise in
compiler construction.
• understand how the compiler can take advantage of particular
processor characteristics to generate good code.
References
• Class textbook
• Compilers: Principles, Techniques, and
Tools by Aho, Sethi, and Ullman
• Other useful books
• Advanced Compiler Design &
Implementation, Steven Muchnick
• Building an Optimizing Compiler,
Robert Morgan
• Modern Compiler Implementation in
Java, Andrew Appel
Software Categories
• System SW
• Programs written for computer systems
• Compilers, operating systems, …
• Application SW
• Programs written for computer users
• Word-processors, spreadsheets, & other application packages
A Layered View of a Computer from the Perspective of a Compiler
Layers, starting from the hardware:
• Machine with all its hardware
• System Software
• Compilers, interpreters, preprocessors, etc.
• Operating system, device drivers
• Application Programs
• Word-processors, spreadsheets, database software, IDEs, etc.
Programs
• Any program can be written in any programming language
• A programming language (PL) is
• A set of rules and symbols used to construct a computer program
• A language used to interact with the computer
Why study Compilation Technology?
• Success stories (one of the earliest branches in CS)
• Applying theory to practice (scanning, parsing, static analysis)
• Ideas from different parts of computer science are involved:
• AI: heuristic search techniques; greedy algorithms
• Algorithms: graph algorithms
• Theory: pattern matching
• Also: systems, architecture
• Compiler construction can be challenging and fun:
• new architectures always create new challenges; success requires
mastery of complex interactions; results are useful; opportunity to
achieve performance.
Principles of Compilation
The compiler must:
• preserve the meaning of the program being compiled.
• “improve” the source code in some way.
• Space (size of compiled code)
• Feedback (information provided to the user)
• Debugging
• Compilation time efficiency (fast or slow compiler?)
Compilers
• “Compilation”
• Translation of a program written in a source language
into a semantically equivalent program written in a
target language
[Figure: Source Program → Compiler → Target Program; the compiler also reports error messages. The target program then reads Input and produces Output.]
Target program: an executable machine-language program.
History
IBM developed the 704 in 1954. All programming was done in assembly
language. The cost of software development far exceeded the cost of hardware.
Productivity was low.
• Speedcoding interpreter: programs ran about 10 times slower than
hand-written assembly code
• John Backus (in 1954): proposed a program that translated high-level
expressions into native machine code. There was skepticism all around; most people
thought it was impossible
• Fortran I project (1954-1957): the
first compiler was released
Fortran I
• The first compiler had a huge impact on programming
languages and computer science. The whole new field of
compiler design was started.
• More than half of all programmers were using Fortran by 1958.
• Development time was cut in half.
• Modern compilers preserve the basic structure of the Fortran I
compiler.
Computer Languages
• Machine Language
• Uses binary code
• Machine-dependent
• Not portable
• Assembly Language
• Uses mnemonics (short words that are easy to remember)
• Machine-dependent
• Not usually portable
• High-Level Language (HLL)
• Uses English-like language
• Portable (but must be compiled for different platforms)
• Examples: Pascal, C, C++, Java, Fortran, . . .
Machine Language
• The representation of a computer program which is actually read and understood
by the computer.
• A program in machine code consists of a sequence of machine instructions.
• Instructions:
• Machine instructions are in binary code
• Instructions specify operations and memory cells involved in the operation
Example:
Operation   Address
0010        0000 0000 0100
0100        0000 0000 0101
0011        0000 0000 0110
Assembly Language
A symbolic representation of the machine language of a specific processor.
Is converted to machine code by an assembler.
Each line of assembly code produces one machine instruction (One-to-one correspondence).
Programming in assembly language is slow and error-prone but is more efficient in terms of
hardware performance.
Mnemonic representation of the instructions and data
Example:
Load Price
Add Tax
Store Cost
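The one-to-one correspondence between assembly lines and machine instructions can be sketched with a toy assembler. The opcode values and the addresses assigned to Price, Tax, and Cost are assumptions, chosen so that the output reproduces the bit patterns of the machine-language example; the slides do not state this mapping.

```python
# Toy assembler: one assembly line becomes exactly one 16-bit instruction.
# OPCODES and ADDRESSES are hypothetical; they are chosen to reproduce the
# bit patterns in the machine-language example, not taken from a real CPU.
OPCODES = {"Load": 0b0010, "Add": 0b0100, "Store": 0b0011}
ADDRESSES = {"Price": 4, "Tax": 5, "Cost": 6}

def assemble_line(line):
    """Translate 'Mnemonic Operand' into a 16-bit word (4-bit op, 12-bit address)."""
    mnemonic, operand = line.split()
    return (OPCODES[mnemonic] << 12) | ADDRESSES[operand]

def format_word(word):
    """Render a 16-bit word as four groups of four bits."""
    bits = format(word, "016b")
    return " ".join(bits[i:i + 4] for i in range(0, 16, 4))

program = ["Load Price", "Add Tax", "Store Cost"]
machine_code = [format_word(assemble_line(line)) for line in program]
for text in machine_code:
    print(text)
```

Each input line yields exactly one output word, which is what distinguishes an assembler from a compiler.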
High-level language
• A programming language that uses statements consisting of English-like keywords
such as "FOR", "PRINT", or "IF".
• Each statement corresponds to several machine language instructions (one-to-many
correspondence).
• Much easier to program than in assembly language.
• Operations can be described using familiar symbols
• Example:
Cost = Price + Tax
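The one-to-many correspondence can be illustrated with a minimal sketch that turns one statement of the form target = left + right into three assembly-like instructions. The function name and the restriction to this single statement form are assumptions made for illustration; the Load/Add/Store mnemonics follow the assembly example.

```python
# Sketch: translate one statement of the form "target = left + right"
# into three assembly-like instructions (one-to-many correspondence).
# The function name and the restricted statement form are illustrative only.
def translate_assignment(statement):
    target, expression = [part.strip() for part in statement.split("=")]
    left, right = [part.strip() for part in expression.split("+")]
    return [f"Load {left}", f"Add {right}", f"Store {target}"]

instructions = translate_assignment("Cost = Price + Tax")
print(instructions)
```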
Editors, Preprocessors, Linkers & Loaders
• Editors
• Compilers are often bundled together with an editor and other programs into an integrated
development environment (IDE)
• The editor may perform some operations of a compiler, such as reporting some errors
• Preprocessors
• Delete comments, include other files, and perform macro substitutions
• Linkers
• Collect separate object files into a directly executable file
• Connect an object program to the code for standard library functions and to resources supplied by the OS
• Loaders
• Resolve all relocatable addresses relative to a given base
• Make executable code more flexible
Compiling and running C programs
[Figure: Editor → source code (file.c) → Compiler → object code (file.obj) → Linker, which also takes Libraries → executable code (file.exe)]
Debuggers
• Used to find execution errors in a compiled program
• Keep track of most or all of the source code information
• Stop execution at pre-specified locations called breakpoints
Interpreters
• Execute the source program immediately rather than generating
object code
• Examples: BASIC, LISP, used often in educational or development
situations
• Speed of execution is slower than compiled code
• Share many of their operations with compilers
How to translate?
• Direct translation is difficult. Why?
• Source code and machine code mismatch in level of abstraction
– Variables vs. memory locations/registers
– Functions vs. jump/return
– Parameter passing
– structs
• Some languages are farther from machine code than others
– For example, languages supporting the object-oriented paradigm
How to translate easily?
• Translate in steps. Each step handles a reasonably simple, logical, and
well-defined task
• Design a series of program representations
• Intermediate representations should be amenable to program
manipulation of various kinds (type checking, optimization, code
generation, etc.)
• Representations become more machine specific and less language
specific as the translation proceeds
The first few steps
• The first few steps can be understood by analogy to how humans
comprehend a natural language
• The first step is recognizing the alphabet of a language. For
example
– English text consists of lower- and upper-case letters, digits,
punctuation, and white space
– Written programs consist of characters from the ASCII character set
(normally 9-13, 32-126)
The first few steps
• The next step in understanding a sentence is recognizing words
– How to recognize English words?
– Words are found in standard dictionaries
– Dictionaries are updated regularly
The first few steps
• How to recognize words in a programming language?
– a dictionary (of keywords etc.)
– rules for constructing words (identifiers, numbers etc.)
• This is called lexical analysis
• Recognizing words is not completely trivial.
For example: w hat ist his se nte nce?
Lexical Analysis: Challenges
• We must know what the word separators are
• The language must define rules for breaking a sentence into a
sequence of words.
• Normally white space and punctuation are word separators in
languages.
Lexical Analysis: Challenges
• In programming languages a character from a different class may also
be treated as a word separator.
• The lexical analyzer breaks a sentence into a sequence of words or
tokens:
– if a == b then a = 1 ; else a = 2 ;
– Sequence of words (total 14 words)
if a == b then a = 1 ; else a = 2 ;
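The 14-word split can be reproduced with a small regular-expression-based tokenizer. This is a minimal sketch: the token categories, the keyword set, and the function name are assumptions for this toy language, not part of the slides.

```python
import re

# Hypothetical token rules for the toy language in the example sentence.
KEYWORDS = {"if", "then", "else"}
TOKEN_PATTERN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(==|=|;))")

def tokenize(text):
    """Split text into (kind, lexeme) pairs; raise on an unknown character."""
    tokens = []
    position = 0
    text = text.rstrip()
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise SyntaxError(f"unexpected character at position {position}")
        number, word, symbol = match.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif word is not None:
            kind = "KEYWORD" if word in KEYWORDS else "IDENTIFIER"
            tokens.append((kind, word))
        else:
            tokens.append(("SYMBOL", symbol))
        position = match.end()
    return tokens

tokens = tokenize("if a == b then a = 1 ; else a = 2 ;")
print(len(tokens))
print([lexeme for _, lexeme in tokens])
```

Because "==" is listed before "=" in the pattern, the longer operator wins, so `a==b` is split into three tokens even without spaces. This is the sense in which a character from a different class acts as a word separator.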
The next step
• Once the words are understood, the next step is to understand the
structure of the sentence
• The process is known as syntax checking or parsing
Parsing
Parsing a program is exactly the same process as shown in
the previous slide.
• Consider an expression
if x == y then z = 1 else z = 2
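The structure recovered by parsing this sentence can be sketched as a nested tuple. The grammar in the comment is an assumption made for illustration, since the slide gives only the example sentence, and the tokens are assumed to be separated by spaces.

```python
# Sketch of a recursive-descent parser for the single statement form
#   statement  -> "if" condition "then" assignment "else" assignment
#   condition  -> operand "==" operand
#   assignment -> operand "=" operand
# This grammar is an assumption; tokens are assumed to be space-separated.
class Parser:
    def __init__(self, text):
        self.tokens = text.split()
        self.position = 0

    def expect(self, expected=None):
        """Consume the next token, optionally checking that it equals `expected`."""
        if self.position >= len(self.tokens):
            raise SyntaxError("unexpected end of input")
        token = self.tokens[self.position]
        if expected is not None and token != expected:
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.position += 1
        return token

    def parse_binary(self, operator):
        left = self.expect()
        self.expect(operator)
        right = self.expect()
        return (operator, left, right)

    def parse_statement(self):
        self.expect("if")
        condition = self.parse_binary("==")
        self.expect("then")
        then_branch = self.parse_binary("=")
        self.expect("else")
        else_branch = self.parse_binary("=")
        if self.position != len(self.tokens):
            raise SyntaxError("unexpected tokens after statement")
        return ("if", condition, then_branch, else_branch)

tree = Parser("if x == y then z = 1 else z = 2").parse_statement()
print(tree)
```

The resulting tuple has one node for the whole statement and one child for each of its three parts, which is the shape a parse tree for this sentence would take.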
Understanding the meaning
• Once the sentence structure is understood we try to
understand the meaning of the sentence (semantic
analysis)
• A challenging task
• Example: Prateek said Nitin left his assignment at home
• What does his refer to? Prateek or Nitin?
Understanding the meaning
• Worse case: Amit said Amit left his assignment at
home
• Even worse: Amit said Amit left Amit’s assignment at
home
• How many Amits are there? Which one left the
assignment? Whose assignment got left?
Semantic Analysis
• Too hard for compilers.
They do not have capabilities similar to human understanding
• However, compilers do perform analysis to understand the meaning
and catch inconsistencies
• Programming languages define strict rules to avoid such ambiguities
{
    int Amit = 3;
    {
        int Amit = 4;
        cout << Amit;   // refers to the innermost Amit
    }
}
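The rule that resolves the two declarations of Amit can be sketched as a stack of scopes, where a name is looked up from the innermost scope outward. The class and method names here are illustrative assumptions, not something the slides define.

```python
# Sketch of nested scopes as a stack of dictionaries: a name is resolved in
# the innermost scope that declares it. Class and method names are
# illustrative assumptions.
class ScopedSymbolTable:
    def __init__(self):
        self.scopes = [{}]          # the outermost scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, value):
        if name in self.scopes[-1]:
            raise NameError(f"{name} is already declared in this scope")
        self.scopes[-1][name] = value

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name} is not declared")

table = ScopedSymbolTable()
table.declare("Amit", 3)        # { int Amit = 3;
table.enter_scope()             #   {
table.declare("Amit", 4)        #     int Amit = 4;
inner = table.lookup("Amit")    #     cout << Amit;
table.exit_scope()              #   }
outer = table.lookup("Amit")    # }
print(inner, outer)
```

The lookup made inside the inner block sees the inner declaration, while the same lookup after the block closes sees the outer one again.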
More on Semantic Analysis
• Compilers perform many other checks besides variable
bindings
• Type checking: Amit left her work at home
• There is a type mismatch between her and Amit. Presumably
Amit is male, so her and Amit cannot refer to the same person.
Code Optimization
• No strong counterpart in English, but it is similar to
editing/précis writing
• Automatically modify programs so that they
– Run faster
– Use fewer resources (memory, registers, space, fewer fetches,
etc.)
Code Optimization
• Some common optimizations
– Common sub-expression elimination
– Copy propagation
– Dead code elimination
– Code motion
– Strength reduction
– Constant folding
• Example: x = 15 * 3 is transformed to x = 45
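Constant folding, the last optimization in the list, can be sketched on a tiny expression tree. The tuple representation (operator, left, right) is an assumption chosen for brevity; a real compiler would work on its own intermediate representation.

```python
import operator

# Sketch of constant folding. An expression is an int, a variable name (str),
# or a tuple (operator, left, right). This representation is an assumption.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expression):
    """Replace every sub-expression whose operands are all constants by its value."""
    if not isinstance(expression, tuple):
        return expression
    symbol, left, right = expression
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPERATORS[symbol](left, right)
    return (symbol, left, right)

folded_constant = fold(("*", 15, 3))
folded_mixed = fold(("+", "y", ("*", 15, 3)))
print(folded_constant)
print(folded_mixed)
```

Sub-expressions that involve a variable are left in place, so only the part that is known at compile time is evaluated.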
Parts of Compilers
Analysis (Front End)
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
Synthesis (Back End)
4. Code Generation
5. Optimization
Compilers
• Analysis
• Front End
• Split the source code into its
constituent pieces (tokens).
• Arrange the pieces according to
grammatical rules (parse).
• Report errors.
• Synthesis
• Back End
• Produce intermediate code
• Optimize intermediate code
• Generate target
code (machine language
code)
Structure of a Compiler
• Front end: analysis
• Read source program and understand its structure and meaning
• Back end: synthesis
• Generate equivalent target language program
[Figure: Source → Front End → Back End → Target]
Phases of a Compiler
[Figure: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program. The Symbol Table Manager and the Error Handler interact with every phase.]
The Structure of a Compiler
[Figure: Source Program (character stream) → Scanner → tokens → Parser → syntactic structure → Semantic Routines → intermediate representation → Optimizer → Code Generator → target machine code. Symbol and attribute tables are used by all phases of the compiler.]
Analysis phase
Synthesis phase
by Neng-Fa Zhou
Analysis of the source program
[Figure: source program → lexical analyzer → tokens → syntax analyzer → parse trees → semantic analyzer → parse trees]
Scanner (Lexical Analyzer)
The scanner begins the analysis of the source program by reading the
input, character by character, and grouping characters into individual
words and symbols (tokens)
Puts information about identifiers into the symbol table.
Regular expressions are used to describe tokens (lexical
constructs).
A (Deterministic) Finite State Automaton can be used in the
implementation of a lexical analyzer.
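A table-driven deterministic finite automaton for two token classes might look like the following minimal sketch. The state names, character classes, and the choice of identifiers and unsigned integers as the recognized tokens are assumptions made for illustration.

```python
# Sketch of a table-driven DFA that classifies a whole lexeme as an
# identifier or an unsigned integer. State names and character classes are
# illustrative assumptions.
def char_class(ch):
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

TRANSITIONS = {
    ("start", "letter"): "identifier",
    ("start", "digit"): "number",
    ("identifier", "letter"): "identifier",
    ("identifier", "digit"): "identifier",
    ("number", "digit"): "number",
}
ACCEPTING = {"identifier", "number"}

def classify(lexeme):
    """Run the DFA over lexeme; return the accepting state name or None."""
    state = "start"
    for ch in lexeme:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:          # no transition: the DFA rejects
            return None
    return state if state in ACCEPTING else None

print(classify("oldval"), classify("12"), classify("9lives"), classify(""))
```

The transition table corresponds to the regular expressions letter (letter | digit)* for identifiers and digit+ for numbers; the automaton has exactly one next state for each (state, character class) pair, or none at all.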
Parser (Syntax Analyzer)
Given a formal syntax specification (typically a context-free grammar [CFG]),
the parser reads tokens and groups them into units as specified by the productions
of the CFG being used.
As syntactic structure is recognized, the parser either calls corresponding semantic
routines directly or builds a syntax tree.
CFG (Context-Free Grammar)
BNF (Backus-Naur Form)
GAA (Grammar Analysis Algorithms)
Parser (Syntax Analyzer)
• A Syntax Analyzer creates the syntactic structure (generally a
parse tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.
parse tree
Parser (Syntax Analyzer (CFG))
• The syntax of a language is specified by a context-free grammar
(CFG).
• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules
implied by a CFG or not.
• If it satisfies, the syntax analyzer creates a parse tree for the given program.
• Ex: We use BNF (Backus Naur Form) to specify a CFG
assgstmt -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression
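A parser for this grammar can be sketched as follows. Because the production expression -> expression + expression is left-recursive and ambiguous, the sketch parses the equivalent iterative form expression -> operand { "+" operand } and builds a left-associative tree. That rewrite, the tuple tree format, and all function names are choices of this sketch, not something the slides prescribe.

```python
import re

# Sketch of a recursive-descent parser for the grammar above, using the
# iterative form  expression -> operand { "+" operand }  (an assumption
# of this sketch, made to avoid left recursion).
TOKEN = re.compile(r"\s*(:=|\+|\d+|[A-Za-z_]\w*)")

def tokenize(text):
    tokens, position = [], 0
    text = text.rstrip()
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None:
            raise SyntaxError(f"bad character at position {position}")
        tokens.append(match.group(1))
        position = match.end()
    return tokens

def is_identifier(token):
    return re.fullmatch(r"[A-Za-z_]\w*", token) is not None

def parse_operand(tokens, position):
    if position < len(tokens):
        token = tokens[position]
        if is_identifier(token):
            return ("identifier", token), position + 1
        if token.isdigit():
            return ("number", token), position + 1
    raise SyntaxError("expected identifier or number")

def parse_expression(tokens, position):
    tree, position = parse_operand(tokens, position)
    while position < len(tokens) and tokens[position] == "+":
        right, position = parse_operand(tokens, position + 1)
        tree = ("+", tree, right)      # left-associative
    return tree, position

def parse_assignment(text):
    tokens = tokenize(text)
    if len(tokens) < 2 or not is_identifier(tokens[0]) or tokens[1] != ":=":
        raise SyntaxError("expected 'identifier :='")
    expression, position = parse_expression(tokens, 2)
    if position != len(tokens):
        raise SyntaxError("unexpected token after expression")
    return (":=", ("identifier", tokens[0]), expression)

tree = parse_assignment("newval := oldval + 12")
print(tree)
```

If the input does not satisfy the grammar, for example `x := + 1`, the parser raises SyntaxError instead of producing a tree.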
Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by the
lexical analyzer, and which ones by the syntax analyzer?
• Both of them do similar things, but the lexical analyzer deals with
simple non-recursive constructs of the language.
• The syntax analyzer deals with recursive constructs of the language.
• The lexical analyzer simplifies the job of the syntax analyzer.
• The lexical analyzer recognizes the smallest meaningful units (tokens) in
a source program.
• The syntax analyzer works on the smallest meaningful units (tokens) in
a source program to recognize meaningful structures in our
programming language.
Semantic Routines
Perform two functions:
• Check the static semantics of each construct
• Do the actual translation
The heart of a compiler
Result is: Syntax-Directed Translation
Semantic Processing Techniques
Ex:
newval = oldval + 12
The type of the identifier newval must match the type of the expression (oldval + 12)
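The static check in this example can be sketched as below. The symbol table contents, the hypothetical variable ratio, and the rule that both operands of "+" must have the same type are assumptions made for illustration; real languages also define implicit conversions.

```python
# Sketch of a static type check for "newval = oldval + 12". The declared
# types and the strict "same type on both sides" rule are assumptions.
SYMBOL_TYPES = {"newval": "int", "oldval": "int", "ratio": "float"}

def type_of(expression):
    """Return the type of a literal, a variable name, or ("+", left, right)."""
    if isinstance(expression, int):
        return "int"
    if isinstance(expression, float):
        return "float"
    if isinstance(expression, str):
        if expression not in SYMBOL_TYPES:
            raise TypeError(f"undeclared identifier {expression}")
        return SYMBOL_TYPES[expression]
    symbol, left, right = expression
    left_type, right_type = type_of(left), type_of(right)
    if left_type != right_type:
        raise TypeError(f"operands of {symbol} have types {left_type} and {right_type}")
    return left_type

def check_assignment(target, expression):
    """Check that the target's declared type matches the expression's type."""
    target_type, value_type = type_of(target), type_of(expression)
    if target_type != value_type:
        raise TypeError(f"cannot assign {value_type} to {target} of type {target_type}")
    return target_type

result_type = check_assignment("newval", ("+", "oldval", 12))
print(result_type)
```

With these declarations the assignment is accepted, whereas assigning `ratio + 1.5` to newval would be rejected with a TypeError.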
Symbol Table
• There is a record for each identifier
• The attributes include name, type, location, etc.
Synthesis of Object Code
[Figure: parse tree & symbol table → intermediate code generator → intermediate code → code optimizer → optimized intermediate code → code generator → target program]
Intermediate Code Generation
A compiler may produce explicit intermediate code representing the
source program.
This intermediate code is generally machine (architecture) independent,
but its level is close to the level of machine code.
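One common form of intermediate code is three-address code, in which each instruction applies at most one operator. The following minimal sketch lowers an expression tree to that form. The tuple tree format, the temporary names t1, t2, and the variable rate are assumptions of this sketch, not taken from the slides.

```python
# Sketch: lower an expression tree to three-address code, one operator per
# instruction, using generated temporaries t1, t2, ... The tree format and
# the temporary naming scheme are assumptions.
def lower(expression, code, counter):
    """Append instructions for expression to code; return the name holding its value."""
    if not isinstance(expression, tuple):
        return str(expression)
    symbol, left, right = expression
    left_name = lower(left, code, counter)
    right_name = lower(right, code, counter)
    counter[0] += 1
    temporary = f"t{counter[0]}"
    code.append(f"{temporary} = {left_name} {symbol} {right_name}")
    return temporary

def lower_assignment(target, expression):
    code = []
    result = lower(expression, code, [0])
    code.append(f"{target} = {result}")
    return code

intermediate = lower_assignment("newval", ("+", "oldval", ("*", "rate", 12)))
for line in intermediate:
    print(line)
```

The output names no registers or machine instructions, which keeps it architecture independent, yet each line is simple enough to map onto a few machine instructions.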
Optimizer
The IR code generated by the semantic routines is analyzed
and transformed into functionally equivalent but improved IR
code.
This phase can be very complex and slow.
Typical techniques: peephole optimization,
loop optimization, register allocation, code scheduling
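A single peephole rule can be sketched on accumulator-style code in the Load/Add/Store notation of the assembly example. The rule assumes a one-accumulator machine and that no jump lands on the removed instruction; the names Fee and Total are hypothetical.

```python
# Sketch of one peephole rule: a "Load X" that immediately follows
# "Store X" is redundant, because the accumulator already holds X.
# Assumes a single accumulator and no jump targeting the Load.
def peephole(instructions):
    optimized = []
    for instruction in instructions:
        if optimized:
            previous_op, previous_arg = optimized[-1].split()
            current_op, current_arg = instruction.split()
            if (previous_op, current_op) == ("Store", "Load") and previous_arg == current_arg:
                continue            # drop the redundant Load
        optimized.append(instruction)
    return optimized

before = ["Load Price", "Add Tax", "Store Cost", "Load Cost", "Add Fee", "Store Total"]
after = peephole(before)
print(after)
```

The rule inspects only a two-instruction window at a time, which is what makes it a peephole optimization as opposed to a whole-program analysis.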
Code Generator
Produces the target language for a specific
architecture.
The target program is normally a relocatable object
file containing the machine code.