Talk given at LRUG, May 2009, about Treetop, a Ruby parsing expression grammar library. It should convince you that parsers fit better than regular expressions in quite a few cases.
Treetop - I'd rather have one problem
1. Some people, when faced with a problem think,
“I know, I’ll use regular expressions”.
Now they have two problems.
I’d rather have one problem.
Treetop • Roland Swingler • LRUG May 2009
Tuesday, 19 May 2009
This quotation is used a lot in presentations, normally before the presenter delves into some
gnarly regexps. I’m looking for a better way.
3.
I run a film listing site: http://filmli.st. All the data is scraped from other sites - getting the
data is easy with net/http or httparty or similar and then parsing the html with nokogiri or
hpricot, but...
4. <span>
Fri/Sun-Tue 10.45 12.30 (Tue) 12.40 (not Tue)
4.00 7.00 9.30; Wed 3.00 7.30 9.00
</span>
... you still need to turn a text string like this into a list of Times so you can do interesting
things with it. Regexps? No. That way lies madness.
6.
Chatroom bots need to be able to distinguish between messages that they should take
actions on and those which they should ignore. How should we define what messages they
should listen out for?
10. Scenario: producing human-readable tests
Given I have non-technical stakeholders
When I write some integration tests
Then they should be understandable by everyone
Wouldn’t it be great if someone had written a library like this?
11.
They have! Cucumber. Cucumber’s implementation got me started looking into...
12.
Treetop. A ruby Parsing Expression Grammar. Basically a parser generator, but really simple.
13. What is a parser?
A parser determines whether strings are syntactically valid according to a set of rules known
as a grammar.
14. Yes / No
From a theoretical viewpoint, parsers just say true or false, depending on whether the string
is valid or not.
15. Syntax Tree
Not so useful, so instead we get back a syntax tree we can do useful things with.
16. whereis <person> [on <day>]
Let’s try building a tree for this example. You can consider a string to be a list of characters,
but to start getting meaning from it, you need a tree.
17. words words
whereis <person> [on <day>]
We have some words...
18. words variable words variable
whereis <person> [on <day>]
variables...
19. optional part
words variable words variable
whereis <person> [on <day>]
an optional part of an expression (enclosed with square brackets)
20. expression
optional part
words variable words variable
whereis <person> [on <day>]
and a root node for the whole expression
21. grammar Message
end
Let’s build that up in Treetop. Each of those four types of node in the tree is going to have a
rule. We write these rules in a grammar - you can think of it like a Ruby module.
22. grammar Message
      rule expression
        (words / variable / optional_part)+
      end
    end
The first rule for the whole expression. Lots of things should be familiar from regular
expressions - ‘+’ for one or more, brackets for grouping, and ‘/’ is like the regexp ‘|’ for
alternation. So this says an expression is one or more words, variables or optional parts, in
any order.
23. grammar Message
      rule expression
        (words / variable / optional_part)+
      end
      rule words
        [^><\[\]]+
      end
    end
words - character classes, just like regexps
24. grammar Message
      rule expression
        (words / variable / optional_part)+
      end
      rule words
        [^><\[\]]+
      end
      rule variable
        '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9]* ) '>'
      end
    end
variables are enclosed with angle brackets, can be any valid ruby identifier string, and are
labeled so we can use part of the text later.
25. grammar Message
      rule expression
        (words / variable / optional_part)+
      end
      rule words
        [^><\[\]]+
      end
      rule variable
        '<' identifier:( [a-zA-Z_] [a-zA-Z_0-9]* ) '>'
      end
      rule optional_part
        "[" expression "]"
      end
    end
optional parts are enclosed with square brackets. Here we see that rules can be recursive -
which makes the parser significantly more powerful than regular expressions.
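To see why recursion matters, here is a minimal hand-written sketch of the same four rules in plain Ruby, using only StringScanner from the standard library. It is not the code Treetop generates; the class name MessageSketch and the nested-array node format are assumptions made for illustration. Each method mirrors one rule, and optional_part calls back into expression.

```ruby
require 'strscan'

# Hand-written stand-in for the Message grammar (hypothetical, not Treetop output).
# Each private method mirrors one rule and returns a node, or nil on failure.
class MessageSketch
  def parse(text)
    @s = StringScanner.new(text)
    tree = expression
    # Only succeed if the whole input was consumed.
    tree if tree && @s.eos?
  end

  private

  # expression <- (words / variable / optional_part)+
  def expression
    children = []
    while (node = words || variable || optional_part)
      children << node
    end
    children.empty? ? nil : [:expression, *children]
  end

  # words <- [^><\[\]]+
  def words
    text = @s.scan(/[^><\[\]]+/)
    text && [:words, text]
  end

  # variable <- '<' identifier '>'
  def variable
    text = @s.scan(/<[a-zA-Z_][a-zA-Z_0-9]*>/)
    text && [:variable, text[1..-2]]
  end

  # optional_part <- '[' expression ']' (recursive: calls expression again)
  def optional_part
    start = @s.pos
    if @s.scan(/\[/) && (inner = expression) && @s.scan(/\]/)
      [:optional_part, inner]
    else
      @s.pos = start # backtrack so another alternative can try from here
      nil
    end
  end
end

tree = MessageSketch.new.parse("whereis <person> [on <day>]")
```

Because optional_part re-enters expression, nested input such as "a [b [c]]" parses to any depth, which a single regular expression cannot express in general.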
26. $ tt message.treetop
We compile the grammar with the tt command-line tool - you can also load grammars
dynamically with Treetop.load.
27. require 'treetop'
    require 'message'
    parser = MessageParser.new
    tree = parser.parse("whereis <person>...")
this gives us a parser we can call from ruby code
28. require 'treetop'
    require 'message'
    parser = MessageParser.new
    tree = parser.parse("whereis <person>...")
    tree.elements[0].text_value
    #=> "whereis "
    tree.elements[1].identifier.text_value
    #=> "person"
each node knows about its children and its text_value. The label we defined earlier provides
sugar methods to access particular subnodes.
29. Fri/Sun-Tue 4.00 7.00
Another example. This time we’ll think about the tree in a top down fashion rather than
bottom up. This is closer to how treetop will actually evaluate an expression.
30. expression
Fri/Sun-Tue 4.00 7.00
31. expression
days times
Fri/Sun-Tue 4.00 7.00
32. expression
days times
day day range time time
Fri / Sun-Tue 4.00 7.00
33. expression
days times
day day range time time
day day hrs mins hrs mins
Fri / Sun - Tue 4 . 00 7 . 00
35. rule times
      time (" " time)+
    end
    rule time
      hours "." minutes
    end
    rule hours
      "1" [0-2] / [0-9]
    end
    rule minutes
      [0-5] [0-9]
    end
36. rule days
      (day !"-" / day_range) ("/" days)?
    end
    rule day_range
      day "-" day
    end
    rule day
      "Mon"/"Tue"/"Wed"/"Thu"/"Fri"/"Sat"/"Sun"
    end
The !"-" after day is a negative lookahead assertion. We need this because Treetop
evaluates alternatives from left to right - if we didn’t have the assertion then Sun-Tue would
match Sun as a day, not a day_range, and we’d be left with “-Tue” which isn’t valid.
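The effect of the lookahead can be reproduced without Treetop. The sketch below is a plain-Ruby approximation of the days rule using StringScanner; the method name parse_days and the lookahead: switch are hypothetical, and the rule's recursion on "/" is written as a loop. With the switch off, the first alternative commits to "Sun" and the parse fails on the leftover "-Tue".

```ruby
require 'strscan'

# Pattern source for a single day name, shared by the alternatives below.
DAY = "(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"

# Approximates: days <- (day !"-" / day_range) ("/" days)?
# Alternatives are tried in order and the first success wins.
def parse_days(text, lookahead: true)
  s = StringScanner.new(text)
  result = []
  loop do
    # With lookahead, a day only matches when it is NOT followed by "-".
    day_pattern = lookahead ? /#{DAY}(?!-)/ : /#{DAY}/
    if (day = s.scan(day_pattern))
      result << day
    elsif (range = s.scan(/#{DAY}-#{DAY}/))
      result << range
    else
      return nil
    end
    break unless s.scan(%r{/})
  end
  # The whole string must be consumed for the parse to count.
  s.eos? ? result : nil
end

with_assertion    = parse_days("Fri/Sun-Tue")
without_assertion = parse_days("Fri/Sun-Tue", lookahead: false)
```

Here with_assertion holds the two entries "Fri" and "Sun-Tue", while without_assertion is nil because "-Tue" is left unconsumed.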
38. rule time
      hours "." minutes
    end
    irb> aTimeNode.text_value #=> "9.00"
    irb> aTimeNode.elements.size #=> 3
    irb> aTimeNode.hours.text_value #=> "9"
39. rule time
      hours "." minutes {
        def to_seconds
          # hours and minutes are syntax nodes, so convert their text
          hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
        end
      }
    end
    irb> aTimeNode.text_value #=> "9.00"
    irb> aTimeNode.to_seconds #=> 32400
We can add in methods inline in the grammar. This is just like a module scope, and we can
do any ruby we like in here.
40. # in film_time.treetop
    rule time
      hours "." minutes <TimeNode>
    end
    # in another .rb file
    class TimeNode < Treetop::Runtime::SyntaxNode
      def to_seconds
        # hours and minutes are syntax nodes, so convert their text
        hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
      end
    end
Cleaner in my mind to split these out into actual subclasses of SyntaxNode - keeps the
grammar more readable. In some cases you need to have modules rather than subclasses.
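The arithmetic in to_seconds can be checked without Treetop. In this sketch, Node and TimeNodeSketch are hypothetical stand-ins: Node only provides text_value, and the constructor splits the string by hand instead of using a parser. The point it illustrates is that the child nodes are objects, so the conversion has to go through their text.

```ruby
# Hypothetical stand-in for a syntax node: just enough to expose text_value.
Node = Struct.new(:text_value)

# Stand-in for the TimeNode class; splits "h.mm" by hand instead of parsing.
class TimeNodeSketch
  attr_reader :hours, :minutes

  def initialize(text)
    h, m = text.split(".")
    @hours = Node.new(h)
    @minutes = Node.new(m)
  end

  # Seconds since midnight, computed from the text of the child nodes.
  def to_seconds
    hours.text_value.to_i * 60 * 60 + minutes.text_value.to_i * 60
  end
end

puts TimeNodeSketch.new("9.00").to_seconds # 9 * 3600 + 0 * 60 = 32400
```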
41. Interpretation & Compilation
We’re going to build up a regular expression for the bot example. Each node will be
responsible for building a different part of the regexp.
42. expression
optional part
words variable words variable
whereis <person> [on <day>]
/^whereis (.+?)(?:\s+on (.+?))?$/
47. Interpreter Pattern
The name is confusing - it comes from the GoF book, but what we’re actually doing here is compilation. Each node gets
an interpret method - you treat the syntax tree as a composite.
48. # expression
    def interpret
      children = elements.map {|node| node.interpret }
      Regexp.compile("^" + children.join + "$")
    end
49. # words
    def interpret
      Regexp.escape(text_value)
    end
50. # variable
    def interpret
      "(.+?)"
    end
51. # optional_part
    def interpret
      children = elements.map {|node| node.interpret }
      # single quotes keep the backslash in \s for the regexp engine
      '(?:\s+' + children.join + ')?'
    end
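Put together, the four interpret methods can be exercised on a hand-built tree. The sketch below uses plain Ruby Structs (Words, Variable, OptionalPart and Expression are hypothetical stand-ins for the Treetop node classes) and assumes the optional part holds its child nodes directly, with the space before the bracket already moved inside it.

```ruby
# Hypothetical plain-Ruby nodes standing in for the Treetop syntax nodes.
Words = Struct.new(:text_value) do
  def interpret
    Regexp.escape(text_value) # literal text, with regexp metacharacters escaped
  end
end

Variable = Struct.new(:name) do
  def interpret
    "(.+?)" # each variable becomes a lazy capture group
  end
end

OptionalPart = Struct.new(:elements) do
  def interpret
    '(?:\s+' + elements.map(&:interpret).join + ')?'
  end
end

Expression = Struct.new(:elements) do
  def interpret
    Regexp.compile("^" + elements.map(&:interpret).join + "$")
  end
end

# Hand-built tree for: whereis <person> [on <day>]
tree = Expression.new([
  Words.new("whereis "),
  Variable.new("person"),
  OptionalPart.new([Words.new("on "), Variable.new("day")])
])

matcher = tree.interpret
p matcher.match("whereis roland on Tuesday").captures
```

Each node only knows how to produce its own fragment; the root joins the fragments and anchors the result, which is the composite structure the notes describe.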
52. Adding context
For anything more than a simple language, you’ll need to pass around context as you
interpret the tree.
53. # expression
    def interpret(context=[])
      children = elements.map do |node|
        node.interpret(context)
      end
      matcher = Regexp.new("^" + children.join + "$")
      ...
In our case we just want to record the list of variable names, so an Array will suffice. Each
interpret method now needs to take this context.
54. # variable
    def interpret(context)
      context << identifier.text_value.to_sym
      "(.+?)"
    end
55. # expression
    def interpret(context=[])
      children = elements.map do |node|
        node.interpret(context)
      end
      matcher = Regexp.new("^" + children.join + "$")
      # define_singleton_method takes a block, so context stays in scope
      # (a class << matcher body would not be able to see it)
      matcher.define_singleton_method(:variables) { context }
      matcher
    end
We decorate the regular expression with a list of the variables. In the real code, the returned
match objects are also decorated so you have methods for each variable and don’t have to
remember the captured groups by position.
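As a rough illustration of the decoration step, the sketch below attaches a variables method to a single Regexp instance and then pairs the names with the captures. The helper name build_matcher is hypothetical, and the zip-based pairing is only one possible way to get named values; it is not taken from the real bot code.

```ruby
# Builds a Regexp and decorates that one instance with a variables method.
# Regexp.new is used rather than a literal, since literals may be frozen
# in newer Rubies and frozen objects cannot take singleton methods.
def build_matcher(source, variables)
  matcher = Regexp.new(source)
  matcher.define_singleton_method(:variables) { variables }
  matcher
end

matcher = build_matcher('^whereis (.+?)(?:\s+on (.+?))?$', [:person, :day])
match = matcher.match("whereis roland on Tuesday")

# Pair each variable name with the capture group at the same position.
bindings = matcher.variables.zip(match.captures).to_h
p bindings
```

Other Regexp objects are unaffected, because the method lives on this one matcher only.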
56. Other Options
You can also build external interpreters / compilers that use the tree
58. # We want to write:
hello [world]
# We actually mean:
hello[ world]
Whitespace shuffling. In the real code, the grammar is more complicated - most of the
complication comes from dealing with edge cases here.
59. # We should optimize:
hello [[[world]]]
# To this:
hello [world]
This isn’t done in the real code, but should be.
60. # Left recursion without consuming input BAD:
    rule infinity_and_beyond
      infinity_and_beyond / "foo"
    end
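A small plain-Ruby sketch shows why this rule is a problem for a top-down parser. The class LeftRecursionSketch and its MAX_DEPTH guard are hypothetical; the guard exists only so the demonstration stops with a readable error instead of exhausting the stack. The first alternative calls the rule again at the same position, so the "foo" alternative is never reached.

```ruby
# Naive top-down evaluation of: infinity_and_beyond <- infinity_and_beyond / "foo"
class LeftRecursionSketch
  MAX_DEPTH = 50 # arbitrary guard for the demonstration

  attr_reader :pos

  def initialize(input)
    @input = input
    @pos = 0
  end

  def infinity_and_beyond(depth = 0)
    if depth >= MAX_DEPTH
      raise "still at position #{@pos} after #{depth} nested calls"
    end
    # First alternative: the rule itself, tried before any input is consumed.
    infinity_and_beyond(depth + 1) || literal("foo")
  end

  private

  # Matches a literal string at the current position and advances past it.
  def literal(text)
    return nil unless @input[@pos, text.length] == text
    @pos += text.length
    text
  end
end

begin
  LeftRecursionSketch.new("foo").infinity_and_beyond
rescue RuntimeError => e
  puts e.message
end
```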
62. Other libraries
Racc - accepts yacc grammars. Racc runtime is part of the ruby std dist. so once you’ve built
your parser there is no dependency. Ragel - used by mongrel/thin.
63. Thanks!
Twitter: @knaveofdiamonds
XMPP bot:
http://github.com/knaveofdiamonds/harken
Film listings for London’s indie cinemas:
http://filmli.st
Treetop:
http://github.com/nathansobo/treetop
http://treetop.rubyforge.org