Parsers. We might not think about them, but anyone who writes code uses parsers every day. Best of all, they are useful not only for compiler design but also for implementing things like custom search queries and DSLs, and for parsing log files and data.
Writing a parser, a prerequisite for implementing such features, might seem scary at first (it did to me!), but in reality it is not that complicated.
In this talk, I will explain a bit of the theory behind parsers and show how they can be written by hand or with tools such as ANTLR.
2. What do these have in common?
• Parse text logs (or any structured data) to make it searchable
• Parse custom (and complex!) configuration format
• Allow users to query your data
• Adjust or refactor incoming structured user queries
• Implement a DSL
• Parse a custom data format (no, not with REGEX!!)
14. LL Parser
• Predict: based on the current token and lookahead, decide which rule to try to apply
• Match: apply the grammar rule and add the results to the AST
• Top-down parsing!
• Backtrack if the predict step was wrong
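The predict and match steps can be sketched as a small hand-written recursive-descent parser. This is a minimal illustration, not how any particular tool does it: the grammar (nested lists of integers) and the names `tokenize`, `Parser` and `parse` are made up for this example. Because one token of lookahead is always enough for this grammar, the sketch never needs to backtrack.

```python
import re

# One token is either a run of digits or one of the characters [ ] ,
TOKEN_RE = re.compile(r"\s*(\d+|[\[\],])")


def tokenize(text):
    """Split the input into number and punctuation tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """Hypothetical grammar:
         value : list | NUMBER
         list  : '[' (value (',' value)*)? ']'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Lookahead: the current token, or None at end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        # Match: consume exactly the token the rule requires.
        tok = self.peek()
        if tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1

    def value(self):
        # Predict: the current token alone decides which rule to apply.
        tok = self.peek()
        if tok == "[":
            return self.list_()
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def list_(self):
        self.match("[")
        items = []
        if self.peek() != "]":
            items.append(self.value())
            while self.peek() == ",":
                self.match(",")
                items.append(self.value())
        self.match("]")
        return items


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.value()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree
```

Each grammar rule becomes one method, and the call stack mirrors the top-down walk through the grammar. Here the "AST" is simply nested Python lists.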
15. LR Parsers
• Shift: put the next token into the buffer
• Reduce: apply a grammar rule to the tokens in the buffer
• Bottom-up parsing!
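The shift and reduce steps can be sketched by hand for a tiny grammar. This is a simplified illustration under stated assumptions: the grammar (`E -> E '+' NUM | NUM`) and the function name `parse_sum` are made up, and the reduce decisions are hard-coded pattern checks on the buffer, whereas a real LR parser drives them from a generated state table.

```python
def parse_sum(tokens):
    """Shift-reduce sketch for the hypothetical grammar
         E -> E '+' NUM | NUM
    `tokens` is a list of ints and "+" strings.
    Returns (tree, steps), where steps records every shift and reduce.
    """
    stack = []            # the buffer: (symbol, value) pairs
    queue = list(tokens)  # input not yet shifted
    steps = []
    while True:
        symbols = [sym for sym, _ in stack]
        if symbols[-3:] == ["E", "+", "NUM"]:
            # Reduce: replace E '+' NUM on top of the buffer with a single E.
            right = stack.pop()[1]
            stack.pop()
            left = stack.pop()[1]
            stack.append(("E", ("+", left, right)))
            steps.append("reduce E -> E + NUM")
        elif symbols == ["NUM"]:
            # Reduce: the very first number becomes an E.
            stack.append(("E", stack.pop()[1]))
            steps.append("reduce E -> NUM")
        elif queue:
            # Shift: move the next input token into the buffer.
            tok = queue.pop(0)
            if tok == "+":
                stack.append(("+", tok))
            elif isinstance(tok, int):
                stack.append(("NUM", tok))
            else:
                raise SyntaxError(f"unexpected token {tok!r}")
            steps.append(f"shift {tok}")
        else:
            break
    if [sym for sym, _ in stack] != ["E"]:
        raise SyntaxError("input is not a valid sum")
    return stack[0][1], steps
```

For the input `[1, "+", 2, "+", 3]` the tree is built from the leaves upward: the numbers are recognized first and the enclosing expressions are assembled around them, which is what makes the approach bottom-up.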
16. LL Parsers vs LR Parsers
• Ambiguity-resolving capabilities differ
• Error handling is better in LL (better error context!)
• LL implementations are easier to understand
• Pretty much equal in results
• In most cases, performance is not an issue!
Note: other types of parsers are less useful in practice
https://core.ac.uk/download/pdf/62921535.pdf
17. Why ANTLR?
• LL(*) parser generator
• Can generate parsers in MANY languages
• (C#, Java, C++, JavaScript and more)
• Can parse pretty much any useful grammar
• Can handle regular and context-free languages in the Chomsky hierarchy
• Resolves ambiguities with programmatic (semantic) predicates
18. ANTLR4 Grammar
• Combined grammar (lexer + parser in the same file)
• Separated grammar (lexer and parser in different files)
• Can span multiple files via "includes" (grammar imports)
20. Lexer
• “Rules” define how to parse each token
• “Rule” definitions are a variant of regex
• Can define “fragments” – composable and re-usable definitions
• Lexing is “greedy”
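The lexer ideas above (regex-like rules, reusable fragments, greedy matching) can be sketched in plain Python. This is only an approximation of the behaviour, not ANTLR's implementation: the rule names, the fragments `DIGIT` and `LETTER`, and the function `lex` are made up for illustration. "Greedy" is modelled as: at each position take the longest match, and on a tie prefer the rule listed first.

```python
import re

# "Fragments": reusable pieces that never become tokens on their own.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z_]"

# Token rules, in priority order (earlier rules win ties).
RULES = [
    ("IF", r"if"),
    ("FLOAT", rf"{DIGIT}+\.{DIGIT}+"),
    ("INT", rf"{DIGIT}+"),
    ("ID", rf"{LETTER}({LETTER}|{DIGIT})*"),
    ("EQ", r"=="),
    ("ASSIGN", r"="),
    ("WS", r"\s+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def lex(text):
    """Return a list of (rule_name, matched_text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these rules, `if` becomes an `IF` token because it ties with `ID` and `IF` is listed first, while `iffy` becomes a single `ID` because the longer match wins. Likewise `==` is one `EQ` token rather than two `ASSIGN` tokens.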
21. Parsing
• Parsing “rules” represent a state machine
• Parsing “rules” may use either tokens or other rules
• ANTLR4 supports left-recursion
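Left-recursion support matters because a naive top-down parser would loop forever on a rule such as `expr : expr '-' expr | NUM`: to parse an `expr` it would first have to parse an `expr`. A hand-written parser usually sidesteps this by turning the recursion into a loop. The sketch below shows that manual rewrite using precedence climbing; it is not ANTLR's generated code, and the names `PRECEDENCE`, `parse_expr` and `parse` are made up for this example.

```python
# Higher number binds tighter; all four operators are left-associative.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse_expr(tokens, pos=0, min_prec=1):
    """Parse tokens[pos:] (ints and operator strings).

    Returns (tree, next_pos). The left-recursive rule is handled by the
    while loop: each iteration extends the tree built so far to the left.
    """
    if pos >= len(tokens) or not isinstance(tokens[pos], int):
        raise SyntaxError(f"expected a number at token {pos}")
    left, pos = tokens[pos], pos + 1
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # The right operand may only contain operators that bind tighter,
        # which is what makes the result left-associative.
        right, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, pos


def parse(tokens):
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For example, `parse([8, "-", 3, "-", 2])` groups as `(8 - 3) - 2`, and `parse([1, "+", 2, "*", 3])` groups the multiplication first. With a tool that accepts left-recursive rules, the grammar can stay in its natural form and this rewrite does not have to be done by hand.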
22. Interpreting: ANTLR4 Visitor vs Listener
Visitor
• Needs explicit Visit() calls
• Call Visit() for each rule
Listener
• Methods called by ANTLR
• Traversal “events”
• EnterXYZ()
• VisitTerminal() / VisitErrorNode()
• ExitXYZ()
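The difference between the two styles can be sketched without the ANTLR runtime. Everything here is a stand-in: `Node`, `EvalVisitor`, `Walker` and `TraceListener` are hypothetical classes written for this example, not ANTLR's generated types. The point is only who drives the traversal: the visitor calls `visit()` on children itself, while the listener just receives enter and exit events from a walker.

```python
class Node:
    """A minimal parse-tree node: a rule name, children, optional text."""

    def __init__(self, rule, children=(), text=None):
        self.rule = rule
        self.children = list(children)
        self.text = text


class EvalVisitor:
    """Visitor style: our code decides when (and whether) to descend."""

    def visit(self, node):
        # Dispatch to the method for this rule, e.g. visit_add or visit_num.
        return getattr(self, "visit_" + node.rule)(node)

    def visit_add(self, node):
        # Explicit visit() calls on both children; skipping one would
        # simply leave that subtree unvisited.
        return self.visit(node.children[0]) + self.visit(node.children[1])

    def visit_num(self, node):
        return int(node.text)


class Walker:
    """Listener style: the walker owns the traversal and fires events."""

    def walk(self, listener, node):
        getattr(listener, "enter_" + node.rule, lambda n: None)(node)
        for child in node.children:
            self.walk(listener, child)
        getattr(listener, "exit_" + node.rule, lambda n: None)(node)


class TraceListener:
    """Only reacts to events; it never calls the traversal itself."""

    def __init__(self):
        self.events = []

    def enter_add(self, node):
        self.events.append("enter add")

    def exit_add(self, node):
        self.events.append("exit add")

    def enter_num(self, node):
        self.events.append("num " + node.text)


# Tree for 1 + (2 + 3)
tree = Node("add", [
    Node("num", text="1"),
    Node("add", [Node("num", text="2"), Node("num", text="3")]),
])
result = EvalVisitor().visit(tree)
listener = TraceListener()
Walker().walk(listener, tree)
```

Visitors are convenient when each rule needs to return a value (here, the sum), because results flow back through the explicit calls. Listeners fit better when you only need to react to parts of the tree and are happy to keep any state in the listener object.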
24. What we didn't cover
• Syntax ambiguity
• Parser performance (multiple ways to write the same thing!)
• Error handling (ANTLR4)
• Island grammars (ANTLR4)
• Actions and attributes (ANTLR4)
• Semantic predicates (ANTLR4)