Yes, we're going to look at file parsing. Sounds a bit boring, right? Wrong.
In this talk, just for fun, we'll find out how to parse a file. We'll look at simple, hand-crafted parsers. We'll finally figure out just how lex and yacc work. And we'll pick apart structured parsers that build abstract syntax trees as you type - ReSharper style. How is an IDE's parser different from a compiler's? How do you handle sensible error recovery? What about significant whitespace?
Everything you always wanted to know about parsing a file, but were too afraid to ask.
3. @citizenmatt
Why would we write a parser?
• Speed, efficiency
• Reduce dependencies
• Custom or simple formats
• Things that aren’t files - DSLs
Command line options, HTTP headers, stdout, natural language commands
E.g. YouTrack queries
• When we’re just as interested in the structure of a file as its contents
15. @citizenmatt
What is a lexer (aka scanner)?
• Performs lexical analysis
Lexical - relating to the words or vocabulary of a language
• Converts a string into a stream of tokens
Identifier, comment, string literal, braces, parentheses, whitespace, etc.
• Tokens are lightweight - typically integer values
(ReSharper uses singleton object instances)
• Parser pattern matches over tokens
Integer or object reference comparisons
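To make the lexer's job concrete, here is a minimal Python sketch, not ReSharper's implementation. The token set (WHITESPACE, IDENT, INT, LBRACE, RBRACE, UNKNOWN) and the `Token` shape are assumptions made for illustration. Token types are plain integers, so a parser can compare them cheaply.

```python
import re
from typing import Iterator, NamedTuple

# Hypothetical token types, encoded as plain integers.
WHITESPACE, IDENT, INT, LBRACE, RBRACE, UNKNOWN = range(6)

class Token(NamedTuple):
    type: int
    start: int
    end: int

# One regex per rule; earlier rules win when several could match.
RULES = [
    (WHITESPACE, r"\s+"),
    (IDENT, r"[A-Za-z_]\w*"),
    (INT, r"[0-9]+"),
    (LBRACE, r"\{"),
    (RBRACE, r"\}"),
    (UNKNOWN, r"."),
]
# Combine all rules into one pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<T{t}>{rx})" for t, rx in RULES))

def lex(text: str) -> Iterator[Token]:
    """Convert a string into a stream of tokens (type plus offsets)."""
    for m in MASTER.finditer(text):
        yield Token(int(m.lastgroup[1:]), m.start(), m.end())

types = [tok.type for tok in lex("foo { 42 }")]
```

The parser then pattern matches on `tok.type` and only looks at the text (via the offsets) when it needs to.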
17. @citizenmatt
Lexers are a solved problem
Use a lexer generator
lex (1975), flex, CsLex, FsLex, JFLex, etc.
18. @citizenmatt
Anatomy of a lexer input file
User code (e.g. using directives)
%%
directives
set up namespaces, class names, interfaces
declare regex macros
declare states
%%
rules and actions
<state> rule { action }
20. @citizenmatt
How does it work?
• Lexer generator emits source code for the lexer
• Rules (regexes) converted into single Finite State Machine
All regexes combined, matched at same time
• Encoded in state transition tables
• Lookup based on state and input char
• Very fast
• Not very maintainable
Seriously
22. @citizenmatt
Rule: a(b|c)d*e+
State  ‘a’    ‘b’    ‘c’    ‘d’    ‘e’    other
0      m(1)   E      E      E      E      E
1      E      m(2)   m(2)   E      E      E
2      E      E      E      m(2)   m(3)   E
3      a      a      a      a      m(3)   a
m(x) - match, move to state x
a - accept
E - error
Pete Jinks - http://www.cs.man.ac.uk/~pjj/cs211/ho/node6.html
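The table for a(b|c)d*e+ can be run directly. The following Python sketch is one possible encoding, assuming `None` stands for E (error), the string `"accept"` stands for a, and end of input is treated as the "other" column. The function name `match_length` is made up for this example.

```python
# Columns: 'a', 'b', 'c', 'd', 'e', other.
COLUMNS = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
OTHER = 5
ACCEPT = "accept"

# An int means "match, move to that state"; None means error.
TABLE = [
    [1, None, None, None, None, None],             # state 0
    [None, 2, 2, None, None, None],                # state 1
    [None, None, None, 2, 3, None],                # state 2
    [ACCEPT, ACCEPT, ACCEPT, ACCEPT, 3, ACCEPT],   # state 3
]

def match_length(text, start=0):
    """Return the length of the token matched at `start`, or None on error."""
    state, pos = 0, start
    while True:
        # Assumption: running off the end of the input counts as 'other'.
        col = COLUMNS.get(text[pos], OTHER) if pos < len(text) else OTHER
        action = TABLE[state][col]
        if action is None:
            return None
        if action == ACCEPT:
            return pos - start
        state, pos = action, pos + 1
```

Each step is a single lookup on state and input character, which is why generated lexers are fast and why the generated tables are hard to read.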
36. Rules: a(b|c)d*e+ and [0-9]+
(State diagram: new state 4, entered from state 0 on [0-9], looping on [0-9])
State  ‘a’    ‘b’    ‘c’    ‘d’    ‘e’    [0-9]  other
0      m(1)   E      E      E      E      m(4)   E
1      E      m(2)   m(2)   E      E      E      E
2      E      E      E      m(2)   m(3)   E      E
3      a      a      a      a      m(3)   a      a
4      a      a      a      a      a      m(4)   a
44. @citizenmatt
What is a parser?
• Performs syntactic analysis
Verifies and matches syntax of a file
• Pattern matching on stream of tokens from lexer
Can look at token offsets and text, too
• Syntax is described by a grammar
• Grammar is represented as a recursive hierarchy of rules
Top level is the whole file, composing down to structures and tokens
47. @citizenmatt
Types of parsers
• Top down/recursive descent
Match the root of the tree, recursively split up into child elements
• Bottom up/recursive ascent
Start with matching the leaves of the tree, combine into larger constructs as you go
50. @citizenmatt
Building a parser
• Hand rolled
Mechanical process to build. Easy to understand
Usually top down/recursive descent
Can use grammar to build syntax tree classes
• Parser generators
yacc/bison, ANTLR, etc.
Usually bottom up. Can be hard to debug - table driven
• ReSharper mostly uses top-down procedural parsers
Generated and hand rolled
Mainly historical. Easier to maintain, easier error recovery, etc.
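A hand-rolled recursive descent parser really is mechanical: one method per grammar rule. The Python sketch below assumes a made-up grammar for nested integer lists and tokens given as (type, text) pairs; none of the names come from ReSharper.

```python
class Parser:
    """Recursive descent parser for a hypothetical grammar:

    list: LBRACKET (item (COMMA item)*)? RBRACKET
    item: INT | list
    """

    def __init__(self, tokens):
        self.tokens = tokens   # list of (type, text) pairs
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, type_):
        if self.peek() != type_:
            raise SyntaxError(f"expected {type_}, got {self.peek()}")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_list(self):
        self.expect("LBRACKET")
        items = []
        if self.peek() != "RBRACKET":
            items.append(self.parse_item())
            while self.peek() == "COMMA":
                self.expect("COMMA")
                items.append(self.parse_item())
        self.expect("RBRACKET")
        return ("list", items)

    def parse_item(self):
        # One token of lookahead decides which rule to descend into.
        if self.peek() == "LBRACKET":
            return self.parse_list()
        return ("int", int(self.expect("INT")[1]))
```

Each method mirrors its grammar rule, which is what makes this style easy to step through in a debugger.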
58. @citizenmatt
Which doesn’t match the grammar
Grammar:
shaderBlock:
SHADER_KEYWORD
STRING_LITERAL
LBRACE
…
RBRACE
;
Token stream:
SHADER_KEYWORD
NEW_LINE
WHITESPACE
STRING_LITERAL
NEW_LINE
WHITESPACE
NEW_LINE
COMMENT
NEW_LINE
LBRACE
WHITESPACE
…
WHITESPACE
RBRACE
59. @citizenmatt
Filtering lexers
• Filter whitespace and comments from the stream of tokens
ReSharper’s tokens have IsFiltered property
• Decorator pattern
Wrap original lexer, swallow filtered tokens
Lexer → Filtering lexer → Parser → Program structure
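The decorator idea fits in a few lines. This Python sketch assumes tokens are (type, text) pairs and uses a fixed set of filtered types, where ReSharper instead asks each token via its IsFiltered property.

```python
from typing import Iterable, Iterator, NamedTuple

class Token(NamedTuple):
    type: str
    text: str

# Assumed set of token types that the parser never wants to see.
FILTERED = {"WHITESPACE", "NEW_LINE", "COMMENT"}

def filtering_lexer(tokens: Iterable[Token]) -> Iterator[Token]:
    """Wrap another lexer's token stream and swallow filtered tokens."""
    for tok in tokens:
        if tok.type not in FILTERED:
            yield tok
```

The parser only ever sees the wrapped stream, so the grammar does not need to mention whitespace or comments.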
61. @citizenmatt
IDE requirements, Part 1
• Code editor features
Syntax highlighting, code folding, etc.
• Syntax error highlighting
• Inspections
• Refactoring
• Formatting
• Etc.
62. @citizenmatt
IDE requirements, Part 1
• Need to work with the contents and structure of a file
• Contents give us semantic information
• Structure allows us to report inspections, refactor, etc.
Map the semantics back to the file
• Need to represent the structure of the file
• Syntax tree is obvious choice
Inspections walk the tree, refactorings rewrite the tree
66. @citizenmatt
Back to Filtering Lexers
• If we filter tokens out, we have to add them back again
• We need a Missing Tokens Inserter to add whitespace and comments back into parse tree
Lexer → Filtering lexer → Parser → Missing tokens inserter → Concrete parse tree
67. @citizenmatt
Missing Tokens Inserter
• Walk leaf elements of tree
Tokens
• Advances (cached) lexer for each leaf element
• Check current lexer token has same offset as leaf element
• If not, create leaf element and insert into tree
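The walk described above can be sketched in Python. The `Leaf` and `Node` classes are hypothetical stand-ins for real tree elements, and `all_tokens` plays the role of the cached, unfiltered lexer output. The sketch assumes every leaf's offset appears in `all_tokens` and ignores filtered tokens that trail the last leaf.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    type: str
    offset: int

@dataclass
class Node:
    children: list = field(default_factory=list)

def insert_missing(node, all_tokens, pos=0):
    """Re-insert filtered tokens; returns the next index into all_tokens."""
    i = 0
    while i < len(node.children):
        child = node.children[i]
        if isinstance(child, Node):
            pos = insert_missing(child, all_tokens, pos)
        else:
            # Lexer tokens that sit before this leaf were filtered out,
            # so create leaves for them and insert them into the tree.
            while all_tokens[pos].offset != child.offset:
                node.children.insert(i, all_tokens[pos])
                pos += 1
                i += 1
            pos += 1  # step over the token that matches this leaf
        i += 1
    return pos
```

In this simplified version a missing token is always attached to the node that owns the following leaf; a real implementation has more say over where whitespace and comments end up.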
69. @citizenmatt
How do we parse this?
There are no end of scope markers!
And we’ve filtered out the whitespace!
let ArraySample() =
    let numLetters = 26
    let results = Array.create numLetters 0
    let data = "The quick brown fox"
    for i = 0 to data.Length - 1 do
        let c = data.Chars(i)
        let c = Char.ToUpper(c)
        if c >= 'A' && c <= 'Z' then
            let i = Char.code c - Char.code 'A'
            results.[i] <- results.[i] + 1
    printf "done!\n"
70. @citizenmatt
Insert zero-width tokens
• Another lexer decorator
• Keeps track of whitespace before it’s filtered
• Inserts “invisible” tokens into token stream indicating indent/outdent or block start/end
Possibly also token to indicate invalid indentation
• Token is zero-width. Doesn’t affect parse tree
• Parser can match these invisible tokens in grammar
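A rough Python sketch of such a decorator follows. It is a simplification in the style of Python's own INDENT/DEDENT handling, not F#'s actual offside rule, and the token names are assumptions. It does not emit a token for invalid indentation.

```python
def with_indent_tokens(tokens):
    """Drop whitespace tokens, but yield zero-width INDENT/DEDENT tokens.

    `tokens` is an iterable of (type, text) pairs.
    """
    stack = [0]            # indentation levels that are currently open
    at_line_start = True
    indent = 0
    for type_, text in tokens:
        if type_ == "NEW_LINE":
            at_line_start, indent = True, 0
        elif type_ == "WHITESPACE":
            if at_line_start:
                indent += len(text)   # remember it before it is filtered
        else:
            if at_line_start:
                if indent > stack[-1]:
                    stack.append(indent)
                    yield ("INDENT", "")
                while indent < stack[-1]:
                    stack.pop()
                    yield ("DEDENT", "")
                at_line_start = False
            yield (type_, text)
    while len(stack) > 1:  # close blocks still open at end of input
        stack.pop()
        yield ("DEDENT", "")
```

The grammar can then treat INDENT and DEDENT like an opening and closing brace, even though they have no text.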
72. @citizenmatt
Altering tokens
• F# example: 2. and [2..0] ambiguous
• Original lexer matches 2. as FLOAT
and 2.. as INT_DOT_DOT
• Another lexer decorator
Augment generated rules with custom code
• Decorator recognises INT_DOT_DOT
Splits into two tokens for parser
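Splitting a token in a decorator might look like the Python sketch below. INT_DOT_DOT comes from the slide; the resulting INT and DOT_DOT token names, and the (type, text) token shape, are assumptions.

```python
def split_int_dot_dot(tokens):
    """Replace each INT_DOT_DOT token with an INT and a DOT_DOT token."""
    for type_, text in tokens:
        if type_ == "INT_DOT_DOT":
            yield ("INT", text[:-2])   # everything before the trailing ".."
            yield ("DOT_DOT", "..")
        else:
            yield (type_, text)
```

The generated lexer rules stay simple, and the parser never has to know that INT_DOT_DOT existed.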
73. @citizenmatt
When regexes aren’t enough
• ShaderLab nested comments
• Not possible to match with regex
Don’t even try
• Rule to match start of comment - /*
Finish lexing by hand, counting start and end comment chars
Ignore START_COMMENT and return different token - COMMENT
• It doesn’t have to be completely machine generated
/* This /* is */ valid */
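Finishing the comment by hand is just a depth counter. In this Python sketch the function name is made up, and an unterminated comment is assumed to run to the end of the input.

```python
def lex_nested_comment(text, start):
    """Return the end offset of the nested comment that starts at `start`."""
    if not text.startswith("/*", start):
        raise ValueError("not at the start of a comment")
    depth, pos = 0, start
    while pos < len(text):
        if text.startswith("/*", pos):
            depth += 1
            pos += 2
        elif text.startswith("*/", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                return pos   # the whole range becomes one COMMENT token
        else:
            pos += 1
    return len(text)         # unterminated: consume the rest of the input

source = "/* This /* is */ valid */ x"
end = lex_nested_comment(source, 0)
```

The generated rule only has to recognise `/*`; the action then calls code like this and returns a single COMMENT token.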
75. @citizenmatt
Pre-processor tokens
• Pre-processor tokens can appear anywhere
• How do you add them to the grammar/parser?
• ShaderLab has CGPROGRAM and CGINCLUDE which are essentially pre-processor tokens
• (Also nested language - Cg)
76. @citizenmatt
Parsing pre-processor tokens
• Two pass parsing
• First pass parses pre-processor tokens
• Filtering lexer strips pre-processor tokens
• Parse normally
• Parsed pre-processor tree nodes inserted as missing tokens
79. @citizenmatt
IDE Requirements, Part 2
• Error highlighting
The code is broken every time you type
• Incremental lexing + parsing
Performance
• Version tolerance
E.g. multiple versions of C#
• Nested/composable languages
83. @citizenmatt
What happens when there’s an error?
• The parser adds an error element into the tree
• Error element spans whatever has been parsed so far
Might just be unexpected token, or incorrect element construct
• Highlighting the error in the editor is trivial
Inspection simply looks for error element, adds highlight
84. @citizenmatt
How do we find an error?
• Error start is obvious
mismatched rule, unexpected token
• Where does the error stop?
Off by one token could affect rest of file
• IDE must try to recover
How?
85. @citizenmatt
Error recovery
• Panic mode
Eat tokens until the parser finds a “follows” token
• Token insertion/removal/substitution
• Error rules in grammar
89. @citizenmatt
Error production rules
• Create a rule that anticipates an error
• E.g. consume any tokens that shouldn’t be there
emptyBlock:
LBRACE
errorElementWithoutRBrace*
RBRACE
;
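The emptyBlock rule can be sketched as a hand-written parse function in Python. Tokens are plain type names here, and the ("error", …) tuples are a made-up stand-in for error elements in the tree.

```python
def parse_empty_block(tokens, pos=0):
    """emptyBlock: LBRACE errorElementWithoutRBrace* RBRACE

    Returns (tree, next_position).
    """
    if pos >= len(tokens) or tokens[pos] != "LBRACE":
        raise SyntaxError("expected LBRACE")
    children = ["LBRACE"]
    pos += 1
    # Consume anything that should not be there, up to the RBRACE.
    skipped = []
    while pos < len(tokens) and tokens[pos] != "RBRACE":
        skipped.append(tokens[pos])
        pos += 1
    if skipped:
        children.append(("error", skipped))
    if pos < len(tokens):
        children.append("RBRACE")
        pos += 1
    else:
        children.append(("error", []))   # missing RBRACE
    return ("emptyBlock", children), pos
```

The unexpected tokens end up inside an error element, and parsing carries on normally after the closing brace.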
91. @citizenmatt
What’s the problem?
• Don’t parse entire file on every change
• Only reparse smallest subtree that encloses change
Block nodes (method bodies, classes, etc.), not if, for, etc.
• Avoid re-lexing the entire file, too
92. @citizenmatt
Incremental lexing
• Requires a cache of the original token stream
Token type, offsets and state of lexer (int)
• Copy cached tokens up to change position
• Restart lexer at change position with known state from cache
• Lex until we can match tail of cached tokens
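The three steps above can be sketched in Python. This sketch assumes a toy, stateless regex lexer and tokens stored as (start, end) ranges; a real lexer would also cache and compare the lexer state, as the slide notes. The `relex` signature is invented for this example.

```python
import re

TOKEN_RE = re.compile(r"\s+|\w+|.", re.S)   # toy lexer: space, word, other

def lex(text, start=0):
    """Yield (start, end) token ranges from `start` onwards."""
    for m in TOKEN_RE.finditer(text, start):
        yield (m.start(), m.end())

def relex(old_tokens, new_text, change_start, old_end, delta):
    """Re-lex after old text [change_start, old_end) was replaced.

    delta = length of inserted text - length of removed text.
    """
    # 1. Copy cached tokens that end before the change position.
    head = [t for t in old_tokens if t[1] < change_start]
    # Cached tokens that start after the change, shifted to new offsets.
    tail = [(s + delta, e + delta) for s, e in old_tokens if s > old_end]
    tail_index = {s: i for i, (s, _) in enumerate(tail)}
    # 2. Restart the lexer where the copied tokens stop.
    restart = head[-1][1] if head else 0
    middle = []
    for tok in lex(new_text, restart):
        # 3. Stop as soon as a new token lines up with the cached tail.
        if tok[0] in tail_index:
            return head + middle + tail[tail_index[tok[0]]:]
        middle.append(tok)
    return head + middle
```

Only the tokens around the edit are produced by the lexer; everything else is reused from the cache with adjusted offsets.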
93. @citizenmatt
Incremental parsing
• Walk up syntax tree, find nearest element that can reparse and that encompasses change
E.g. method/class body
• Find start of block
E.g. opening LBRACE ‘{’
• Use updated cached lexer to find end of block
E.g. closing RBRACE ‘}’
• Parse block, add new element into tree
Uses custom entry point into parser
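The first step, walking up the tree, can be sketched in Python. The `Node` class, the node kinds and the `REPARSEABLE` set are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    kind: str
    start: int
    end: int
    parent: Optional["Node"] = None

# Assumed set of node kinds that know how to reparse themselves.
REPARSEABLE = {"method_body", "class_body"}

def find_reparse_root(leaf, change_start, change_end):
    """Walk up from `leaf` to the nearest reparseable node enclosing the change."""
    node = leaf
    while node is not None:
        if (node.kind in REPARSEABLE
                and node.start <= change_start
                and change_end <= node.end):
            return node
        node = node.parent
    return None   # nothing suitable: fall back to reparsing the whole file
```

Once the node is found, its text range is re-lexed, parsed through the matching entry point, and the resulting element replaces the old one in the tree.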
95. @citizenmatt
Three types
• Injected languages
E.g. self-contained islands in a string literal (regex)
• Inherited languages
E.g. TypeScript is a superset of JavaScript
• Nested languages
E.g. JavaScript/CSS nested inside HTML. Razor and C#
96. @citizenmatt
Injected languages
• Build a parse tree for the contents of another node
E.g. ShaderLab CG_PROGRAM, regular expressions, …
• Provides syntax highlighting, code completion, etc.
• Attaches a new parse tree to the node of another tree
• Changes to injected tree persisted to string and pushed as change to the owning tree
• Changes to owning tree cause full reparse of injected language
97. @citizenmatt
Inherited languages
• E.g. TypeScript is a superset of JavaScript
• TypeScriptParser derives from JavaScriptParser
Share a lexer
• Custom hand rolled parsers
Recursive descent
• Easier to inherit and override key methods
Gang of Four Template Method pattern
• Also XamlParser, MSBuildParser, WebConfigParser
Custom XML parsers
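The inherit-and-override idea can be sketched in Python. The class names echo the slide, but the methods, the token shape (a flat list of strings) and the tiny "var" grammar are invented for illustration and do not reflect ReSharper's real parsers.

```python
class JavaScriptParser:
    def parse_statement(self, tokens, pos):
        # Template method: the overall algorithm is fixed here...
        if tokens[pos] == "VAR":
            return self.parse_variable(tokens, pos)
        return ("expression", tokens[pos]), pos + 1

    def parse_variable(self, tokens, pos):
        # ...while individual steps like this one can be overridden.
        name = tokens[pos + 1]
        return ("var", name, None), pos + 2

class TypeScriptParser(JavaScriptParser):
    def parse_variable(self, tokens, pos):
        # Reuse the JavaScript rule, then accept an optional type annotation.
        node, pos = super().parse_variable(tokens, pos)
        type_name = None
        if pos + 1 < len(tokens) and tokens[pos] == "COLON":
            type_name = tokens[pos + 1]
            pos += 2
        return ("var", node[1], type_name), pos
```

Because a recursive descent parser has one method per rule, the derived parser only overrides the rules where the superset language differs.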
98. @citizenmatt
Nested languages
• E.g. .aspx, .cshtml - HTML superset, with C# “islands”
• ReSharper parses .aspx/.cshtml file
Builds parse tree for ASPX/Razor syntax
• HTML superset requires lexer superset
• HtmlCompoundLexer lexes “outer” language’s tokens
When encounters HTML, switches to standard HTML lexer
• How to handle C# islands?
99. @citizenmatt
Secondary documents
• ASPX/Razor - C# islands
• Create secondary in-memory C# file
Mirrors what gets generated when .aspx file is compiled
• Maps C# islands in .aspx to in-memory C# file
• Inspections, code completion, etc. work through the mapping
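The mapping itself can be as simple as a sorted list of ranges. In this Python sketch the (generated_start, source_start, length) tuple layout and the function name are assumptions; it maps an offset in the in-memory C# file back to the .aspx file.

```python
from bisect import bisect_right

def to_source_offset(ranges, generated_offset):
    """Map an offset in the generated file back to the source file.

    `ranges` is a list of (generated_start, source_start, length) tuples,
    sorted by generated_start. Returns None for generated code that has
    no counterpart in the source file.
    """
    starts = [r[0] for r in ranges]
    i = bisect_right(starts, generated_offset) - 1
    if i < 0:
        return None
    gen_start, src_start, length = ranges[i]
    if generated_offset >= gen_start + length:
        return None
    return src_start + (generated_offset - gen_start)
```

An inspection that finds a problem in the generated C# can use a lookup like this to place the highlight on the matching island in the original file.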