The document discusses parsing and Scala parser combinators. It provides an example of using parser combinators to define a parser that parses a line of text into a WordFreq case class with a string field and an integer field. The parser combinators approach lets you define small parsing functions and combine them to parse more complex structures, yielding parsers that are robust yet easier to write than alternatives such as hand-written parsers.
2. Why? How?
- DSLs Everywhere
- Also parsing in general
- Internal vs. External
Lots of options for parsing
- String.split
- RegEx
- Hand-Written
- Parser Generators
From simple and fragile to
robust and complex
Goal: Robust, but easier to create (and in Scala…)
3. Parser Combinators
The “functional way” of writing a parser
A parsing function:
Character Stream → Result[Character Stream, T]
Combine functions into more complicated patterns
- Sequence
- Choice
- ...
Not only in Scala; but we’ll focus on the standard Scala Parser Combinators.
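To make the function type above concrete, here is a minimal hand-rolled sketch (not the actual Scala Parser Combinators API, and all names here are invented for illustration): a parser is a function from the remaining input to an optional result plus the rest of the input, and sequence and choice are ordinary higher-order functions over such parsers.

```scala
object MiniCombinators {
  // A parser consumes a prefix of the input and returns the parsed
  // value plus the remaining input, or None on failure.
  type P[T] = String => Option[(T, String)]

  // Match a literal string at the start of the input.
  def lit(s: String): P[String] =
    in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None

  // Sequence: run p, then run q on the leftover input.
  def seq[A, B](p: P[A], q: P[B]): P[(A, B)] =
    in => p(in).flatMap { case (a, rest) =>
      q(rest).map { case (b, rest2) => ((a, b), rest2) }
    }

  // Choice: try p; if it fails, try q on the same input.
  def choice[A](p: P[A], q: P[A]): P[A] =
    in => p(in).orElse(q(in))

  def main(args: Array[String]): Unit = {
    val ab = seq(lit("a"), lit("b"))
    println(ab("abc"))    // Some(((a,b),c))
    val aOrB = choice(lit("a"), lit("b"))
    println(aOrB("bz"))   // Some((b,z))
  }
}
```

The real library wraps the same idea in a `Parser[T]` class with operators like `~` (sequence) and `|` (choice), plus error reporting instead of a bare `Option`.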
4. Scala Parser Combinators - Simple Example
Input: A line consisting of a word (string) and its count (number), e.g. “johnny 12”
case class WordFreq(word: String, count: Int) {
override def toString = "Word <" + word + "> " +
"occurs with frequency " + count
}
And we’d like to use it in a program:
object TestSimpleParser extends SimpleParser {
def main(args: Array[String]): Unit = {
parse(freq, "johnny 12") match {
case Success(matched,_) => println(matched)
case Failure(msg,_) => println("FAILURE: " + msg)
case Error(msg,_) => println("ERROR: " + msg)
}
}
}
5. Scala Parser Combinators - Simple Example
Then a possible parser is:
class SimpleParser extends RegexParsers {
def word: Parser[String] = """[a-z]+""".r ^^ { _.toString }
def number: Parser[Int] = """(0|[1-9]\d*)""".r ^^ { _.toInt }
def freq: Parser[WordFreq] = word ~ number ^^ { case wd ~ fr => WordFreq(wd,fr) }
}
The basic pattern:
- Defining functions for parsing simple strings
- Map matched strings into more meaningful object model
- Combine results into more complex structures
6. Actimize Profiling Language
Context:
- A data profiling engine
- Aggregations, functions, metadata
- Highly customizable by clients,
professional services, etc.
- Existing interface: XML
configuration files
10. Summary
Where? Parsing – DSLs and Beyond
Why? Building a robust parser is complicated
How? Scala Parser Combinators
Editor's Notes
Whether we know it or not, we use DSLs in all sorts of places. For example, the SBT tool has a DSL for the domain of building code, rule engines often have DSLs, etc.
This is not so much the topic of this talk, but I am pointing it out to clarify the motivation.
Also, parsing in general is useful beyond the context of some user-facing DSL. We sometimes need to parse messages, for example, that are not necessarily in some well known format.
One prominent example of this that I know of is parsing financial protocols, but there are more.
One important distinction we need to make in this context is the distinction between Internal and External DSLs.
Internal DSLs are basically DSLs that are defined in the syntax of some host language. Meaning: syntactically, they are a subset of some other, usually more general purpose language.
The language used to define build in SBT is one such example.
Ruby and Scala are popular choices as hosts for such languages, given their flexible syntax.
External DSLs are languages defined in a way that’s decoupled from any host language - the syntax is usually defined from scratch, and needs to be parsed on its own.
The pros and cons of each approach are a debate in their own right, which we don't have time to dive into right now; I'm sure you can think of advantages to each.
In this talk I'm focusing on external DSLs - those that require specialized parsing.
----
Parsing is of course the problem of turning a stream of characters into something more meaningful to the program at hand.
We have all sorts of ways to parse strings, as you can see here (not an exhaustive list).
Some of these ways are fairly simple to write, but then less robust.
The more robust options, e.g. ANTLR or hand-written parsers, are significantly more complex to write and/or integrate with our system.
Note: this isn't to say these methods can't or shouldn't be used - they are all good in some circumstances.
What I would like to present here is another way to do this, one that's naturally available in Scala as well, and that I believe provides a fair tradeoff between robustness and complexity.
The idea of parser combinators isn’t actually very new or unique to Scala, or very complicated.
The idea is fairly simple: take a function that identifies a certain string (parses it), and combine it with other functions for other strings, to create a more complete parser.
As I said, this isn't unique to Scala, but it was part of the Scala standard library until 2.11, when it was moved into a separate module. Here we'll focus on this library.
This is a very simple example.
Our task here is basically to parse a file of simple one-line entries, where each line is a word and a number - the word and how many times it appears somewhere.
We have a simple model class here - WordFreq. Basically a tuple of the word and its count.
And we see here how this parser is used - that’s the bold part here.
The “matched” variable here is bound to an instance of “WordFreq”.
And this is how the parser is actually implemented using parser combinators.
Each parser function is defined as a function in the parser class.
We define here two simple functions, parsing a word and a number using simple regular expressions.
The third method - 'freq' - is created using a sequence combinator, essentially building a new parser function out of the other two, matching when their patterns appear in sequence in the input.
Note that the output type of that method is in fact the model class we defined earlier.
This very simple parser already illustrates the basic pattern:
Defining functions for parsing strings
Map the matched strings into a more meaningful object model (String, Int, WordFreq in our case).
Combine the results of each function into more complex structures.
Note how in this case the ‘fr’ is in fact of type Int - the result of parsing, as defined by the ‘number’ function.
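The talk's example parses a single line, but the task is a whole file of such lines. The library covers this with a repetition combinator (`rep`); as a stdlib-only sketch of the same idea (the `WordFreqLines` object and its helpers are invented here, not the talk's code), we can apply the one-line rule repeatedly and collect typed results:

```scala
object WordFreqLines {
  case class WordFreq(word: String, count: Int)

  // One line: a lowercase word, whitespace, then a number
  // (same shape as the talk's 'word' and 'number' regexes).
  private val Line = """([a-z]+)\s+(0|[1-9]\d*)""".r

  // Repetition in miniature: apply the single-line rule to every
  // line, collecting WordFreq values; a malformed line yields a Left.
  def parseAll(input: String): Either[String, List[WordFreq]] = {
    val results = input.linesIterator.filter(_.nonEmpty).map {
      case Line(w, n) => Right(WordFreq(w, n.toInt))
      case bad        => Left(s"malformed line: $bad")
    }.toList
    results.collectFirst { case Left(msg) => msg }
      .toLeft(results.collect { case Right(wf) => wf })
  }

  def main(args: Array[String]): Unit =
    println(parseAll("johnny 12\nmary 3"))
}
```

With the actual library, the equivalent would be roughly `def freqs: Parser[List[WordFreq]] = rep(freq)`, keeping the typed-result property: each combinator's output feeds the next as real model objects, not strings.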
And now to a more interesting case.
Just to set the context: at Actimize we deal with a lot of data profiling.
As a result, we developed a fairly robust profiling engine that allows us to express rather complicated profiles, including customization by clients, etc.
The engine works pretty well. The problem is that its interface isn’t great - it’s basically huge XML files.
We wanted to achieve something like this, where in ~32 lines of code we define the same profile, also in a way that’s a lot more convenient to read and write.
And this is an example of the model classes used in defining the profile.
This should be the result of the parsing, like the ‘WordFreq’ class in the previous example. And from these we can do more interesting stuff, for example generating the necessary SQL statements.
This has classes for the different metadata elements, defining mappings, filters, etc.
In our initial implementation we simply generate the same XML file and let the existing engine work as-is, but we can of course skip that step later.
And this is the actual parse code.
The whole parser is roughly 200 lines of code.
We can see here that it’s a simple class, extending one of the Scala Parser Combinators classes, and adding the definitions for the different parsers.
We start with simple keyword definitions, but then move on to combine them into complete statements, and map them into the concrete parsed results.
Given this, the compiler implementation is fairly straightforward - just serialize these classes into XML; in this case I used JAXB with an existing schema to generate the XML.
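The talk's compiler used JAXB against the existing schema, which isn't shown here. As a hypothetical illustration of the "parse to model classes, then serialize" step, here is a toy model and a string-building serializer (all class names, fields, and the XML shape are invented, not Actimize's actual schema):

```scala
object ProfileXml {
  // Hypothetical, simplified model classes standing in for the
  // talk's real profile model produced by the parser.
  case class Aggregation(name: String, function: String, field: String)
  case class Profile(name: String, aggregations: List[Aggregation])

  // A minimal "compiler": serialize the parsed model back to XML.
  // (The real implementation used JAXB with an existing schema.)
  def toXml(p: Profile): String = {
    val aggs = p.aggregations.map { a =>
      s"""  <aggregation name="${a.name}" function="${a.function}" field="${a.field}"/>"""
    }.mkString("\n")
    s"""<profile name="${p.name}">\n$aggs\n</profile>"""
  }

  def main(args: Array[String]): Unit =
    println(toXml(Profile("txnStats", List(Aggregation("total", "sum", "amount")))))
}
```

Because the parser already yields typed model objects, swapping this XML back-end for direct SQL generation (or anything else) would not touch the parser at all.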