SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
Stop overusing regular expressions!
Franklin Chen
http://franklinchen.com/
Pittsburgh Tech Fest 2013
June 1, 2013
Famous quote
In 1997, Jamie Zawinski famously wrote:
Some people, when confronted with a problem, think,
“I know, I’ll use regular expressions.”
Now they have two problems.
Purpose of this talk
Assumption: you already have experience using regexes
Goals:
Change how you think about and use regexes
Introduce you to advantages of using parser combinators
Show a smooth way to transition from regexes to parsers
Discuss practical tradeoffs
Show only tested, running code:
https://github.com/franklinchen/
talk-on-overusing-regular-expressions
Non-goals:
Will not discuss computer science theory
Will not discuss parser generators such as yacc and ANTLR
Example code in which languages?
Considerations:
This is a polyglot conference
Time limitations
Decision: focus primarily on two representative languages.
Ruby: dynamic typing
Perl, Python, JavaScript, Clojure, etc.
Scala: static typing:
Java, C++, C#, F#, ML, Haskell, etc.
GitHub repository also has similar-looking code for
Perl
JavaScript
An infamous regex for email
The reason for my talk!
A big Perl regex for email address based on RFC 822 grammar:
/(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[] 000-031]+
)+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:r
rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[]
?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]
t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@
31]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[
](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:
(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^
(?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-031]+(?:(?:
|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[
?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:".[] 0
rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|
t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-
?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*
)*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-03
t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*]
Personal story: my email address fiasco
To track and prevent spam: FranklinChen+spam@cmu.edu
Some sites wrongly claimed invalid (because of +)
Other sites did allow registration
I caught spam
Unsubscribing failed!
Some wrong claimed invalid (!?!)
Some silently failed to unsubscribe
Had to set up spam filter
Problem: different regexes for email?
Examples: which one is better?
/A[^@]+@[^@]+z/
vs.
/A[a-zA-Z]+@([^@.]+.)+[^@.]+z/
Readability: first example
Use x for readability!
/A # match string begin
[^@]+ # local part: 1+ chars of not @
@ # @
[^@]+ # domain: 1+ chars of not @
z/x # match string end
matches
FranklinChen+spam
@
cmu.edu
Advice: please write regexes in this formatted style!
Readability: second example
/A # match string begin
[a-zA-Z]+ # local part: 1+ of Roman alphabetic
@ # @
( # 1+ of this group
[^@.]+ # 1+ of not (@ or dot)
. # dot
)+
[^@.]+ # 1+ of not (@ or dot)
z/x # match string end
does not match
FranklinChen+spam
@
cmu.edu
Don’t Do It Yourself: find libraries
Infamous regex revisited: was automatically generated into the Perl
module Mail::RFC822::Address based on the RFC 822 spec.
If you use Perl and need to validate email addresses:
# Install the library
$ cpanm Mail::RFC822::Address
Other Perl regexes: Regexp::Common
Regex match success vs. failure
Parentheses for captured grouping:
if ($address =~ /A # match string begin
([a-zA-Z]+) # $1: local part: 1+ Roman alphabetic
@ # @
( # $2: entire domain
(?: # 1+ of this non-capturing group
[^@.]+ # 1+ of not (@ or dot)
. # dot
)+
([^@.]+) # $3: 1+ of not (@ or dot)
)
z/x) { # match string end
print "local part: $1; domain: $2; top level: $3n";
}
else {
print "Regex match failed!n";
}
Regexes do not report errors usefully
Success:
$ perl/extract_parts.pl ’prez@whitehouse.gov’
local part: prez; domain: whitehouse.gov; top level: gov
But failure:
$ perl/extract_parts.pl ’FranklinChen+spam@cmu.edu’
Regex match failed!
We have a dilemma:
Bug in the regex?
The data really doesn’t match?
Better error reporting: how?
Would prefer something at least minimally like:
[1.13] failure: ‘@’ expected but ‘+’ found
FranklinChen+spam@cmu.edu
^
Features of parser libraries:
Extraction of line, column information of error
Automatically generated error explanation
Hooks for customizing error explanation
Hooks for error recovery
Moving from regexes to grammar parsers
Only discuss parser combinators, not generators
Use the second email regex as a starting point
Use modularization
Modularized regex: Ruby
Ruby allows interpolation of regexes into other regexes:
class EmailValidator::PrettyRegexMatcher
LOCAL_PART = /[a-zA-Z]+/x
DOMAIN_CHAR = /[^@.]/x
SUB_DOMAIN = /#{DOMAIN_CHAR}+/x
AT = /@/x
DOT = /./x
DOMAIN = /( #{SUB_DOMAIN} #{DOT} )+ #{SUB_DOMAIN}/x
EMAIL = /A #{LOCAL_PART} #{AT} #{DOMAIN} z/x
def match(s)
s =~ EMAIL
end
end
Modularized regex: Scala
object PrettyRegexMatcher {
val user = """[a-zA-Z]+"""
val domainChar = """[^@.]"""
val domainSegment = raw"""(?x) $domainChar+"""
val at = """@"""
val dot = """."""
val domain = raw"""(?x)
($domainSegment $dot)+ $domainSegment"""
val email = raw"""(?x) A $user $at $domain z"""
val emailRegex = email.r
def matches(s: String): Boolean =
(emailRegex findFirstMatchIn s).nonEmpty
end
Triple quotes are for raw strings (no backslash interpretation)
raw allows raw variable interpolation
.r is a method converting String to Regex
Email parser: Ruby
Ruby does not have a standard parser combinator library. One that
is popular is Parslet.
require ’parslet’
class EmailValidator::Parser < Parslet::Parser
rule(:local_part) { match[’a-zA-Z’].repeat(1) }
rule(:domain_char) { match[’^@.’] }
rule(:at) { str(’@’) }
rule(:dot) { str(’.’) }
rule(:sub_domain) { domain_char.repeat(1) }
rule(:domain) { (sub_domain >> dot).repeat(1) >>
sub_domain }
rule(:email) { local_part >> at >> domain }
end
Email parser: Scala
Scala comes with standard parser combinator library.
import scala.util.parsing.combinator.RegexParsers
object EmailParsers extends RegexParsers {
override def skipWhitespace = false
def localPart: Parser[String] = """[a-zA-Z]+""".r
def domainChar = """[^@.]""".r
def at = "@"
def dot = "."
def subDomain = rep1(domainChar)
def domain = rep1(subDomain ~ dot) ~ subDomain
def email = localPart ~ at ~ domain
}
Inheriting from RegexParsers allows the implicit conversions from
regexes into parsers.
Running and reporting errors: Ruby
parser = EmailValidator::Parser.new
begin
parser.email.parse(address)
puts "Successfully parsed #{address}"
rescue Parslet::ParseFailed => error
puts "Parse failed, and here’s why:"
puts error.cause.ascii_tree
end
$ validate_email ’FranklinChen+spam@cmu.edu’
Parse failed, and here’s why:
Failed to match sequence (LOCAL_PART AT DOMAIN) at
line 1 char 13.
‘- Expected "@", but got "+" at line 1 char 13.
Running and reporting errors: Scala
def runParser(address: String) {
import EmailParsers._
// parseAll is method inherited from RegexParsers
parseAll(email, address) match {
case Success(_, _) =>
println(s"Successfully parsed $address")
case failure: NoSuccess =>
println("Parse failed, and here’s why:")
println(failure)
}
}
Parse failed, and here’s why:
[1.13] failure: ‘@’ expected but ‘+’ found
FranklinChen+spam@cmu.edu
^
Infamous email regex revisited: conversion!
Mail::RFC822::Address Perl regex: actually manually
back-converted from a parser!
Original module used Perl parser library Parse::RecDescent
Regex author turned the grammar rules into a modularized
regex
Reason: speed
Perl modularized regex source code excerpt:
# ...
my $localpart = "$word(?:.$lwsp*$word)*";
my $sub_domain = "(?:$atom|$domain_literal)";
my $domain = "$sub_domain(?:.$lwsp*$sub_domain)*";
my $addr_spec = "$localpart@$lwsp*$domain";
# ...
Why validate email address anyway?
Software development: always look at the bigger picture
This guy advocates simply sending a user an activation email
Engineering tradeoff: the email sending and receiving
programs need to handle the email address anyway
Email example wrapup
It is possible to write a regex
But first try to use someone else’s tested regex
If you do write your own regex, modularize it
For error reporting, use a parser: convert from modularized
regex
Do you even need to solve the problem?
More complex parsing: toy JSON
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
Trick question: could you use a regex to parse this?
Recursive regular expressions?
2007: Perl 5.10 introduced “recursive regular expressions” that can
parse context-free grammars that have rules embedding a reference
to themselves.
$regex = qr/
( # match an open paren (
( # followed by
[^()]+ # 1+ non-paren characters
| # OR
(??{$regex}) # the regex itself
)* # repeated 0+ times
) # followed by a close paren )
/x;
Not recommended!
But, gateway to context-free grammars
Toy JSON parser: Ruby
To save time:
No more Ruby code in this presentation
Continuing with Scala since most concise
Remember: concepts are language-independent!
Full code for a toy JSON parser is available in the Parslet web site
at https://github.com/kschiess/parslet/blob/master/
example/json.rb
Toy JSON parser: Scala
object ToyJSONParsers extends JavaTokenParsers {
def value: Parser[Any] = obj | arr |
stringLiteral | floatingPointNumber |
"null" | "true" | "false"
def obj = "{" ~ repsep(member, ",") ~ "}"
def arr = "[" ~ repsep(value, ",") ~ "]"
def member = stringLiteral ~ ":" ~ value
}
Inherit from JavaTokenParsers: reuse stringLiteral and
floatingPointNumber parsers
~ is overloaded operator to mean sequencing of parsers
value parser returns Any (equivalent to Java Object)
because we have not yet refined the parser with our own
domain model
Fancier toy JSON parser: domain modeling
Want to actually query the data upon parsing, to use.
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
We may want to traverse the JSON to address and then to
the second neighbor, to check whether it is "Y"
Pseudocode after storing in domain model:
data.address.neighbors[1] == "Y"
Domain modeling as objects
{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}
A natural domain model:
JObject(
Map(distance -> JFloat(5.6),
address -> JObject(
Map(street -> JString(0 Nowhere Road),
neighbors ->
JArray(List(JString(X), JString(Y))),
garage -> JNull))))
Domain modeling: Scala
// sealed means: can’t extend outside the source file
sealed abstract class ToyJSON
case class JObject(map: Map[String, ToyJSON]) extends
ToyJSON
case class JArray(list: List[ToyJSON]) extends ToyJSON
case class JString(string: String) extends ToyJSON
case class JFloat(float: Float) extends ToyJSON
case object JNull extends ToyJSON
case class JBoolean(boolean: Boolean) extends ToyJSON
Fancier toy JSON parser: Scala
def value: Parser[ToyJSON] = obj | arr |
stringLiteralStripped ^^ { s => JString(s) } |
floatingPointNumber ^^ { s => JFloat(s.toFloat) } |
"null" ^^^ JNull | "true" ^^^ JBoolean(true) |
"false" ^^^ JBoolean(false)
def stringLiteralStripped: Parser[String] =
stringLiteral ^^ { s => s.substring(1, s.length-1) }
def obj: Parser[JObject] = "{" ~> (repsep(member, ",") ^^
{ aList => JObject(aList.toMap) }) <~ "}"
def arr: Parser[JArray] = "[" ~> (repsep(value, ",") ^^
{ aList => JArray(aList) }) <~ "]"
def member: Parser[(String, ToyJSON)] =
((stringLiteralStripped <~ ":") ~ value) ^^
{ case s ~ v => s -> v }
Running the fancy toy JSON parser
val j = """{
"distance" : 5.6,
"address" : {
"street" : "0 Nowhere Road",
"neighbors" : ["X", "Y"],
"garage" : null
}
}"""
parseAll(value, j) match {
case Success(result, _) =>
println(result)
case failure: NoSuccess =>
println("Parse failed, and here’s why:")
println(failure)
}
Output: what we proposed when data modeling!
JSON: reminder not to reinvent
Use a standard JSON parsing library!
Scala has one.
All languages have a standard JSON parsing library.
Shop around: alternate libraries have different tradeoffs.
Other standard formats: HTML, XML, CSV, etc.
JSON wrapup
Parsing can be simple
Domain modeling is trickier but still fit on one page
Use an existing JSON parsing library already!
Final regex example
These slides use the Python library Pygments for syntax
highlighting of all code (highly recommended)
Experienced bugs in the highlighting of regexes in Perl code
Found out why, in source code for class PerlLexer
#TODO: give this to a perl guy who knows how to parse perl
tokens = {
’balanced-regex’: [
(r’/(|[^]|[^/])*/[egimosx]*’, String.Regex, ’#p
(r’!(|[^]|[^!])*![egimosx]*’, String.Regex, ’#p
(r’(|[^])*[egimosx]*’, String.Regex, ’#pop’),
(r’{(|[^]|[^}])*}[egimosx]*’, String.Regex, ’#p
(r’<(|[^]|[^>])*>[egimosx]*’, String.Regex, ’#p
(r’[(|[^]|[^]])*][egimosx]*’, String.Regex,
(r’((|[^]|[^)])*)[egimosx]*’, String.Regex,
(r’@(|[^]|[^@])*@[egimosx]*’, String.Regex, ’#
(r’%(|[^]|[^%])*%[egimosx]*’, String.Regex, ’#
(r’$(|[^]|[^$])*$[egimosx]*’, String.Regex,
Conclusion
Regular expressions
No error reporting
Flat data
More general grammars
Composable structure (using combinator parsers)
Hierarchical, nested data
Avoid reinventing
Concepts are language-independent: find and use a parser
library for your language
All materials for this talk available at https://github.com/
franklinchen/talk-on-overusing-regular-expressions.
The hyperlinks on the slide PDFs are clickable.

Weitere ähnliche Inhalte

Was ist angesagt?

Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl TechniquesDave Cross
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and YouJames Armes
 
Ruby for Java Developers
Ruby for Java DevelopersRuby for Java Developers
Ruby for Java DevelopersRobert Reiz
 
Recursive descent parsing
Recursive descent parsingRecursive descent parsing
Recursive descent parsingBoy Baukema
 
Programming in Computational Biology
Programming in Computational BiologyProgramming in Computational Biology
Programming in Computational BiologyAtreyiB
 
Tutorial on Regular Expression in Perl (perldoc Perlretut)
Tutorial on Regular Expression in Perl (perldoc Perlretut)Tutorial on Regular Expression in Perl (perldoc Perlretut)
Tutorial on Regular Expression in Perl (perldoc Perlretut)FrescatiStory
 
Practical approach to perl day1
Practical approach to perl day1Practical approach to perl day1
Practical approach to perl day1Rakesh Mukundan
 
Modern Perl for Non-Perl Programmers
Modern Perl for Non-Perl ProgrammersModern Perl for Non-Perl Programmers
Modern Perl for Non-Perl ProgrammersDave Cross
 
Introduction to Modern Perl
Introduction to Modern PerlIntroduction to Modern Perl
Introduction to Modern PerlDave Cross
 
Unit 1-introduction to perl
Unit 1-introduction to perlUnit 1-introduction to perl
Unit 1-introduction to perlsana mateen
 
New Stuff In Php 5.3
New Stuff In Php 5.3New Stuff In Php 5.3
New Stuff In Php 5.3Chris Chubb
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAAiman Hud
 

Was ist angesagt? (20)

Advanced Perl Techniques
Advanced Perl TechniquesAdvanced Perl Techniques
Advanced Perl Techniques
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
 
Ruby for Java Developers
Ruby for Java DevelopersRuby for Java Developers
Ruby for Java Developers
 
Php extensions
Php extensionsPhp extensions
Php extensions
 
Introduction to Perl
Introduction to PerlIntroduction to Perl
Introduction to Perl
 
Perl Programming - 02 Regular Expression
Perl Programming - 02 Regular ExpressionPerl Programming - 02 Regular Expression
Perl Programming - 02 Regular Expression
 
Perl
PerlPerl
Perl
 
Recursive descent parsing
Recursive descent parsingRecursive descent parsing
Recursive descent parsing
 
Control structure
Control structureControl structure
Control structure
 
Programming in Computational Biology
Programming in Computational BiologyProgramming in Computational Biology
Programming in Computational Biology
 
Tutorial on Regular Expression in Perl (perldoc Perlretut)
Tutorial on Regular Expression in Perl (perldoc Perlretut)Tutorial on Regular Expression in Perl (perldoc Perlretut)
Tutorial on Regular Expression in Perl (perldoc Perlretut)
 
Practical approach to perl day1
Practical approach to perl day1Practical approach to perl day1
Practical approach to perl day1
 
Cs3430 lecture 15
Cs3430 lecture 15Cs3430 lecture 15
Cs3430 lecture 15
 
Modern Perl for Non-Perl Programmers
Modern Perl for Non-Perl ProgrammersModern Perl for Non-Perl Programmers
Modern Perl for Non-Perl Programmers
 
Introduction to Modern Perl
Introduction to Modern PerlIntroduction to Modern Perl
Introduction to Modern Perl
 
Perl_Part1
Perl_Part1Perl_Part1
Perl_Part1
 
Pearl
PearlPearl
Pearl
 
Unit 1-introduction to perl
Unit 1-introduction to perlUnit 1-introduction to perl
Unit 1-introduction to perl
 
New Stuff In Php 5.3
New Stuff In Php 5.3New Stuff In Php 5.3
New Stuff In Php 5.3
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIA
 

Andere mochten auch

Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2mshenry
 
Basicgrammar1
Basicgrammar1Basicgrammar1
Basicgrammar1Vicky
 
Presentación1
Presentación1Presentación1
Presentación1Vicky
 
オールドエコノミーを喰いつくせ
オールドエコノミーを喰いつくせオールドエコノミーを喰いつくせ
オールドエコノミーを喰いつくせpgcafe
 
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...Integrating Customary and Statutory Systems. The struggle to develop a Legal ...
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...Verina Ingram
 
All YWCA Docs
All YWCA DocsAll YWCA Docs
All YWCA Docsjmingma
 
G48 53011810075
G48 53011810075G48 53011810075
G48 53011810075BenjamasS
 
F:\Itag48 (53011810065)
F:\Itag48 (53011810065)F:\Itag48 (53011810065)
F:\Itag48 (53011810065)BenjamasS
 
L.L. Bean Maine Hunting Boot
L.L. Bean Maine Hunting BootL.L. Bean Maine Hunting Boot
L.L. Bean Maine Hunting Bootjessradio
 
Real Estate / Special Assets Seminar
Real Estate / Special Assets SeminarReal Estate / Special Assets Seminar
Real Estate / Special Assets SeminarStites & Harbison
 
Nme magazine presentation slideshare
Nme magazine presentation slideshareNme magazine presentation slideshare
Nme magazine presentation slideshareBecca McPartland
 
#ForoEGovAR | Casos de PSC y su adaptación
 #ForoEGovAR | Casos de PSC y su adaptación #ForoEGovAR | Casos de PSC y su adaptación
#ForoEGovAR | Casos de PSC y su adaptaciónCESSI ArgenTIna
 

Andere mochten auch (20)

Halifax Index 2012 Presentation
Halifax Index 2012 PresentationHalifax Index 2012 Presentation
Halifax Index 2012 Presentation
 
Appcode
AppcodeAppcode
Appcode
 
Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2
 
Dentist appointment
Dentist appointmentDentist appointment
Dentist appointment
 
Basicgrammar1
Basicgrammar1Basicgrammar1
Basicgrammar1
 
Presentación1
Presentación1Presentación1
Presentación1
 
Pres
PresPres
Pres
 
Ludmi y mika
Ludmi y mikaLudmi y mika
Ludmi y mika
 
オールドエコノミーを喰いつくせ
オールドエコノミーを喰いつくせオールドエコノミーを喰いつくせ
オールドエコノミーを喰いつくせ
 
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...Integrating Customary and Statutory Systems. The struggle to develop a Legal ...
Integrating Customary and Statutory Systems. The struggle to develop a Legal ...
 
All YWCA Docs
All YWCA DocsAll YWCA Docs
All YWCA Docs
 
G48 53011810075
G48 53011810075G48 53011810075
G48 53011810075
 
F:\Itag48 (53011810065)
F:\Itag48 (53011810065)F:\Itag48 (53011810065)
F:\Itag48 (53011810065)
 
L.L. Bean Maine Hunting Boot
L.L. Bean Maine Hunting BootL.L. Bean Maine Hunting Boot
L.L. Bean Maine Hunting Boot
 
Real Estate / Special Assets Seminar
Real Estate / Special Assets SeminarReal Estate / Special Assets Seminar
Real Estate / Special Assets Seminar
 
Flat plans
Flat plansFlat plans
Flat plans
 
Nme magazine presentation slideshare
Nme magazine presentation slideshareNme magazine presentation slideshare
Nme magazine presentation slideshare
 
Pres
PresPres
Pres
 
Gic2012 aula2-ingles
Gic2012 aula2-inglesGic2012 aula2-ingles
Gic2012 aula2-ingles
 
#ForoEGovAR | Casos de PSC y su adaptación
 #ForoEGovAR | Casos de PSC y su adaptación #ForoEGovAR | Casos de PSC y su adaptación
#ForoEGovAR | Casos de PSC y su adaptación
 

Ähnlich wie Stop overusing regular expressions!

Scala Language Intro - Inspired by the Love Game
Scala Language Intro - Inspired by the Love GameScala Language Intro - Inspired by the Love Game
Scala Language Intro - Inspired by the Love GameAntony Stubbs
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.Poznań Ruby User Group
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionskeeyre
 
name name2 n
name name2 nname name2 n
name name2 ncallroom
 
name name2 n
name name2 nname name2 n
name name2 ncallroom
 
name name2 n
name name2 nname name2 n
name name2 ncallroom
 
name name2 n2.ppt
name name2 n2.pptname name2 n2.ppt
name name2 n2.pptcallroom
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmersamiable_indian
 
name name2 n2
name name2 n2name name2 n2
name name2 n2callroom
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/ibrettflorio
 

Ähnlich wie Stop overusing regular expressions! (20)

Scala Language Intro - Inspired by the Love Game
Scala Language Intro - Inspired by the Love GameScala Language Intro - Inspired by the Love Game
Scala Language Intro - Inspired by the Love Game
 
How to check valid Email? Find using regex.
How to check valid Email? Find using regex.How to check valid Email? Find using regex.
How to check valid Email? Find using regex.
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Ruby quick ref
Ruby quick refRuby quick ref
Ruby quick ref
 
ppt7
ppt7ppt7
ppt7
 
ppt2
ppt2ppt2
ppt2
 
name name2 n
name name2 nname name2 n
name name2 n
 
test ppt
test ppttest ppt
test ppt
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt21
ppt21ppt21
ppt21
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt17
ppt17ppt17
ppt17
 
ppt30
ppt30ppt30
ppt30
 
name name2 n2.ppt
name name2 n2.pptname name2 n2.ppt
name name2 n2.ppt
 
ppt18
ppt18ppt18
ppt18
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmers
 
ppt9
ppt9ppt9
ppt9
 
name name2 n2
name name2 n2name name2 n2
name name2 n2
 
JavaScript.pptx
JavaScript.pptxJavaScript.pptx
JavaScript.pptx
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
 

Kürzlich hochgeladen

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 

Kürzlich hochgeladen (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 

Stop overusing regular expressions!

  • 1. Stop overusing regular expressions! Franklin Chen http://franklinchen.com/ Pittsburgh Tech Fest 2013 June 1, 2013
  • 2. Famous quote In 1997, Jamie Zawinski famously wrote: Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.
  • 3. Purpose of this talk Assumption: you already have experience using regexes Goals: Change how you think about and use regexes Introduce you to advantages of using parser combinators Show a smooth way to transition from regexes to parsers Discuss practical tradeoffs Show only tested, running code: https://github.com/franklinchen/ talk-on-overusing-regular-expressions Non-goals: Will not discuss computer science theory Will not discuss parser generators such as yacc and ANTLR
  • 4. Example code in which languages? Considerations: This is a polyglot conference Time limitations Decision: focus primarily on two representative languages. Ruby: dynamic typing Perl, Python, JavaScript, Clojure, etc. Scala: static typing: Java, C++, C#, F#, ML, Haskell, etc. GitHub repository also has similar-looking code for Perl JavaScript
  • 5. An infamous regex for email The reason for my talk! A big Perl regex for email address based on RFC 822 grammar: /(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[] 000-031]+ )+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:r rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] ?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r] t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@ 31]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[ ](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;: (?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^ (?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-031]+(?:(?: |(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[ ?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:".[] 0 rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]| t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000- ?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)* )*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-03 t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*]
  • 6. Personal story: my email address fiasco To track and prevent spam: FranklinChen+spam@cmu.edu Some sites wrongly claimed invalid (because of +) Other sites did allow registration I caught spam Unsubscribing failed! Some wrong claimed invalid (!?!) Some silently failed to unsubscribe Had to set up spam filter
  • 7. Problem: different regexes for email? Examples: which one is better? /A[^@]+@[^@]+z/ vs. /A[a-zA-Z]+@([^@.]+.)+[^@.]+z/
  • 8. Readability: first example Use x for readability! /A # match string begin [^@]+ # local part: 1+ chars of not @ @ # @ [^@]+ # domain: 1+ chars of not @ z/x # match string end matches FranklinChen+spam @ cmu.edu Advice: please write regexes in this formatted style!
  • 9. Readability: second example /A # match string begin [a-zA-Z]+ # local part: 1+ of Roman alphabetic @ # @ ( # 1+ of this group [^@.]+ # 1+ of not (@ or dot) . # dot )+ [^@.]+ # 1+ of not (@ or dot) z/x # match string end does not match FranklinChen+spam @ cmu.edu
  • 10. Don’t Do It Yourself: find libraries Infamous regex revisited: was automatically generated into the Perl module Mail::RFC822::Address based on the RFC 822 spec. If you use Perl and need to validate email addresses: # Install the library $ cpanm Mail::RFC822::Address Other Perl regexes: Regexp::Common
  • 11. Regex match success vs. failure Parentheses for captured grouping: if ($address =~ /A # match string begin ([a-zA-Z]+) # $1: local part: 1+ Roman alphabetic @ # @ ( # $2: entire domain (?: # 1+ of this non-capturing group [^@.]+ # 1+ of not (@ or dot) . # dot )+ ([^@.]+) # $3: 1+ of not (@ or dot) ) z/x) { # match string end print "local part: $1; domain: $2; top level: $3n"; } else { print "Regex match failed!n"; }
  • 12. Regexes do not report errors usefully Success: $ perl/extract_parts.pl ’prez@whitehouse.gov’ local part: prez; domain: whitehouse.gov; top level: gov But failure: $ perl/extract_parts.pl ’FranklinChen+spam@cmu.edu’ Regex match failed! We have a dilemma: Bug in the regex? The data really doesn’t match?
  • 13. Better error reporting: how? Would prefer something at least minimally like: [1.13] failure: ‘@’ expected but ‘+’ found FranklinChen+spam@cmu.edu ^ Features of parser libraries: Extraction of line, column information of error Automatically generated error explanation Hooks for customizing error explanation Hooks for error recovery
  • 14. Moving from regexes to grammar parsers Only discuss parser combinators, not generators Use the second email regex as a starting point Use modularization
  • 15. Modularized regex: Ruby Ruby allows interpolation of regexes into other regexes: class EmailValidator::PrettyRegexMatcher LOCAL_PART = /[a-zA-Z]+/x DOMAIN_CHAR = /[^@.]/x SUB_DOMAIN = /#{DOMAIN_CHAR}+/x AT = /@/x DOT = /./x DOMAIN = /( #{SUB_DOMAIN} #{DOT} )+ #{SUB_DOMAIN}/x EMAIL = /A #{LOCAL_PART} #{AT} #{DOMAIN} z/x def match(s) s =~ EMAIL end end
  • 16. Modularized regex: Scala object PrettyRegexMatcher { val user = """[a-zA-Z]+""" val domainChar = """[^@.]""" val domainSegment = raw"""(?x) $domainChar+""" val at = """@""" val dot = """.""" val domain = raw"""(?x) ($domainSegment $dot)+ $domainSegment""" val email = raw"""(?x) A $user $at $domain z""" val emailRegex = email.r def matches(s: String): Boolean = (emailRegex findFirstMatchIn s).nonEmpty end Triple quotes are for raw strings (no backslash interpretation) raw allows raw variable interpolation .r is a method converting String to Regex
  • 17. Email parser: Ruby Ruby does not have a standard parser combinator library. One that is popular is Parslet. require ’parslet’ class EmailValidator::Parser < Parslet::Parser rule(:local_part) { match[’a-zA-Z’].repeat(1) } rule(:domain_char) { match[’^@.’] } rule(:at) { str(’@’) } rule(:dot) { str(’.’) } rule(:sub_domain) { domain_char.repeat(1) } rule(:domain) { (sub_domain >> dot).repeat(1) >> sub_domain } rule(:email) { local_part >> at >> domain } end
  • 18. Email parser: Scala Scala comes with standard parser combinator library. import scala.util.parsing.combinator.RegexParsers object EmailParsers extends RegexParsers { override def skipWhitespace = false def localPart: Parser[String] = """[a-zA-Z]+""".r def domainChar = """[^@.]""".r def at = "@" def dot = "." def subDomain = rep1(domainChar) def domain = rep1(subDomain ~ dot) ~ subDomain def email = localPart ~ at ~ domain } Inheriting from RegexParsers allows the implicit conversions from regexes into parsers.
  • 19. Running and reporting errors: Ruby parser = EmailValidator::Parser.new begin parser.email.parse(address) puts "Successfully parsed #{address}" rescue Parslet::ParseFailed => error puts "Parse failed, and here’s why:" puts error.cause.ascii_tree end $ validate_email ’FranklinChen+spam@cmu.edu’ Parse failed, and here’s why: Failed to match sequence (LOCAL_PART AT DOMAIN) at line 1 char 13. ‘- Expected "@", but got "+" at line 1 char 13.
  • 20. Running and reporting errors: Scala def runParser(address: String) { import EmailParsers._ // parseAll is method inherited from RegexParsers parseAll(email, address) match { case Success(_, _) => println(s"Successfully parsed $address") case failure: NoSuccess => println("Parse failed, and here’s why:") println(failure) } } Parse failed, and here’s why: [1.13] failure: ‘@’ expected but ‘+’ found FranklinChen+spam@cmu.edu ^
  • 21. Infamous email regex revisited: conversion! Mail::RFC822::Address Perl regex: actually manually back-converted from a parser! Original module used Perl parser library Parse::RecDescent Regex author turned the grammar rules into a modularized regex Reason: speed Perl modularized regex source code excerpt: # ... my $localpart = "$word(?:.$lwsp*$word)*"; my $sub_domain = "(?:$atom|$domain_literal)"; my $domain = "$sub_domain(?:.$lwsp*$sub_domain)*"; my $addr_spec = "$localpart@$lwsp*$domain"; # ...
  • 22. Why validate email address anyway? Software development: always look at the bigger picture This guy advocates simply sending a user an activation email Engineering tradeoff: the email sending and receiving programs need to handle the email address anyway
  • 23. Email example wrapup It is possible to write a regex But first try to use someone else’s tested regex If you do write your own regex, modularize it For error reporting, use a parser: convert from modularized regex Do you even need to solve the problem?
  • 24. More complex parsing: toy JSON { "distance" : 5.6, "address" : { "street" : "0 Nowhere Road", "neighbors" : ["X", "Y"], "garage" : null } } Trick question: could you use a regex to parse this?
  • 25. Recursive regular expressions? 2007: Perl 5.10 introduced “recursive regular expressions” that can parse context-free grammars that have rules embedding a reference to themselves. $regex = qr/ ( # match an open paren ( ( # followed by [^()]+ # 1+ non-paren characters | # OR (??{$regex}) # the regex itself )* # repeated 0+ times ) # followed by a close paren ) /x; Not recommended! But, gateway to context-free grammars
  • 26. Toy JSON parser: Ruby To save time: No more Ruby code in this presentation Continuing with Scala since most concise Remember: concepts are language-independent! Full code for a toy JSON parser is available in the Parslet web site at https://github.com/kschiess/parslet/blob/master/ example/json.rb
  • 27. Toy JSON parser: Scala object ToyJSONParsers extends JavaTokenParsers { def value: Parser[Any] = obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false" def obj = "{" ~ repsep(member, ",") ~ "}" def arr = "[" ~ repsep(value, ",") ~ "]" def member = stringLiteral ~ ":" ~ value } Inherit from JavaTokenParsers: reuse stringLiteral and floatingPointNumber parsers ~ is overloaded operator to mean sequencing of parsers value parser returns Any (equivalent to Java Object) because we have not yet refined the parser with our own domain model
  • 28. Fancier toy JSON parser: domain modeling Want to actually query the data upon parsing, to use. { "distance" : 5.6, "address" : { "street" : "0 Nowhere Road", "neighbors" : ["X", "Y"], "garage" : null } } We may want to traverse the JSON to address and then to the second neighbor, to check whether it is "Y" Pseudocode after storing in domain model: data.address.neighbors[1] == "Y"
  • 29. Domain modeling as objects { "distance" : 5.6, "address" : { "street" : "0 Nowhere Road", "neighbors" : ["X", "Y"], "garage" : null } } A natural domain model: JObject( Map(distance -> JFloat(5.6), address -> JObject( Map(street -> JString(0 Nowhere Road), neighbors -> JArray(List(JString(X), JString(Y))), garage -> JNull))))
  • 30. Domain modeling: Scala // sealed means: can’t extend outside the source file sealed abstract class ToyJSON case class JObject(map: Map[String, ToyJSON]) extends ToyJSON case class JArray(list: List[ToyJSON]) extends ToyJSON case class JString(string: String) extends ToyJSON case class JFloat(float: Float) extends ToyJSON case object JNull extends ToyJSON case class JBoolean(boolean: Boolean) extends ToyJSON
  • 31. Fancier toy JSON parser: Scala def value: Parser[ToyJSON] = obj | arr | stringLiteralStripped ^^ { s => JString(s) } | floatingPointNumber ^^ { s => JFloat(s.toFloat) } | "null" ^^^ JNull | "true" ^^^ JBoolean(true) | "false" ^^^ JBoolean(false) def stringLiteralStripped: Parser[String] = stringLiteral ^^ { s => s.substring(1, s.length-1) } def obj: Parser[JObject] = "{" ~> (repsep(member, ",") ^^ { aList => JObject(aList.toMap) }) <~ "}" def arr: Parser[JArray] = "[" ~> (repsep(value, ",") ^^ { aList => JArray(aList) }) <~ "]" def member: Parser[(String, ToyJSON)] = ((stringLiteralStripped <~ ":") ~ value) ^^ { case s ~ v => s -> v }
  • 32. Running the fancy toy JSON parser val j = """{ "distance" : 5.6, "address" : { "street" : "0 Nowhere Road", "neighbors" : ["X", "Y"], "garage" : null } }""" parseAll(value, j) match { case Success(result, _) => println(result) case failure: NoSuccess => println("Parse failed, and here’s why:") println(failure) } Output: what we proposed when data modeling!
  • 33. JSON: reminder not to reinvent Use a standard JSON parsing library! Scala has one. All languages have a standard JSON parsing library. Shop around: alternate libraries have different tradeoffs. Other standard formats: HTML, XML, CSV, etc.
  • 34. JSON wrapup Parsing can be simple Domain modeling is trickier but still fit on one page Use an existing JSON parsing library already!
  • 35. Final regex example These slides use the Python library Pygments for syntax highlighting of all code (highly recommended) Experienced bugs in the highlighting of regexes in Perl code Found out why, in source code for class PerlLexer #TODO: give this to a perl guy who knows how to parse perl tokens = { ’balanced-regex’: [ (r’/(|[^]|[^/])*/[egimosx]*’, String.Regex, ’#p (r’!(|[^]|[^!])*![egimosx]*’, String.Regex, ’#p (r’(|[^])*[egimosx]*’, String.Regex, ’#pop’), (r’{(|[^]|[^}])*}[egimosx]*’, String.Regex, ’#p (r’<(|[^]|[^>])*>[egimosx]*’, String.Regex, ’#p (r’[(|[^]|[^]])*][egimosx]*’, String.Regex, (r’((|[^]|[^)])*)[egimosx]*’, String.Regex, (r’@(|[^]|[^@])*@[egimosx]*’, String.Regex, ’# (r’%(|[^]|[^%])*%[egimosx]*’, String.Regex, ’# (r’$(|[^]|[^$])*$[egimosx]*’, String.Regex,
  • 36. Conclusion Regular expressions No error reporting Flat data More general grammars Composable structure (using combinator parsers) Hierarchical, nested data Avoid reinventing Concepts are language-independent: find and use a parser library for your language All materials for this talk available at https://github.com/ franklinchen/talk-on-overusing-regular-expressions. The hyperlinks on the slide PDFs are clickable.