1. Deeper down the rabbit hole
Advanced Regular Expressions
Jakob Westhoff <jakob@php.net>
@jakobwesthoff
PHPBarcamp.at
May 3, 2010
http://westhoffswelt.de jakob@westhoffswelt.de slide: 1 / 26
2. About Me
Jakob Westhoff
PHP developer for several years
Computer science student at the TU Dortmund
Co-Founder of the PHP Usergroup Dortmund
Active in different Open Source projects
3. Asking the audience
Who already works with regular expressions?
Regular expressions like this:
/[a-zA-Z]+/
Or like this:
(?P<image>(?:none|inherit)|(?:url\(\s*(?:'|")?(?:['")]|[^'")]|[^'")])*(?:'|")?\s*\)))
6. Goals of this session
Learn advanced techniques to use in (PCRE) regular
expressions
Assertions
Once only subpatterns
Conditional subpatterns
Pattern recursion
...
Learn how to handle Unicode in your regular expressions
9. What Regular Expressions are...
In theoretical computer science:
Express regular languages
Languages which can be described by deterministic finite state
automata
Type 3 grammars in the Chomsky hierarchy
12. What Regular Expressions are...
In practical day to day usage:
“[...] regular expressions provide concise and flexible means for
identifying strings of text of interest, such as particular characters,
words, or patterns of characters.”
– Wikipedia [1]
17. Building Blocks of a Regular Expression
Basic structure of every regular expression
/[a-z]+/im
Delimiter
The same character at both ends, chosen freely (must be escaped
when used inside the expression)
May also be a bracket pair such as ( and ) in PHP's PCRE functions
Expression
Modifier
A sequence of characters providing processing instructions
(here: i = case-insensitive, m = multiline)
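A minimal PHP sketch of the delimiter and modifier rules on this slide. The subject strings and variable names are made up for illustration.

```php
<?php
// Hypothetical subject: two lines, mixed case.
$subject = "Hello\nWorld";

// Same expression and modifiers (i = case-insensitive, m = multiline),
// written with three different delimiters.
$slash   = preg_match_all('/^[a-z]+$/im', $subject, $withSlash);
$hash    = preg_match_all('#^[a-z]+$#im', $subject, $withHash);
$bracket = preg_match_all('(^[a-z]+$)im', $subject, $withBracket);

// A delimiter character used literally must be escaped ...
$escaped = preg_match('/^\/usr\/bin$/', '/usr/bin');
// ... or avoided by choosing another delimiter.
$readable = preg_match('#^/usr/bin$#', '/usr/bin');
```

All three spellings should find both lines. Choosing a delimiter that does not occur in the expression usually keeps path-like patterns easier to read.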
24. Getting everybody up to speed
., .*, .+, .?, .{1,2} - Arbitrary characters and repetitions
^, $ - Start and end of subject (or line in multiline mode)
foo|bar - Logical Or
(foo)(bar) - Subpattern grouping
/(foo|bar)baz(\1)/ - Backreferences
[a-z], [^a-z] - Character classes
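To make the backreference entry concrete, here is a small PHP sketch. It assumes the slide's pattern is meant to read /(foo|bar)baz(\1)/; the subject strings are illustrative.

```php
<?php
// \1 must repeat exactly the text that the first group captured.
$pattern = '/(foo|bar)baz(\1)/';

// Group 1 captures "foo", so \1 needs another "foo".
$same = preg_match($pattern, 'foobazfoo', $matches);

// Group 1 captures "foo", but the subject continues with "bar".
$different = preg_match($pattern, 'foobazbar');
```

A backreference repeats the captured text, not the subpattern, which is why the second subject is rejected even though bar is one of the alternatives.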
35. Grouping Without Subpattern Creation
Grouping might be needed without creating a subpattern
/(?:foobar)*/
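A short PHP comparison of a capturing and a non-capturing group; the subject string is an assumption chosen for illustration.

```php
<?php
// Capturing group: the last repetition ends up in $capturing[1].
preg_match('/(foobar)*/', 'foobarfoobar', $capturing);

// Non-capturing group: same match, but no extra entry is created.
preg_match('/(?:foobar)*/', 'foobarfoobar', $nonCapturing);
```

Both patterns match the same text; only the shape of the result array differs.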
37. Subpattern identification
Subpatterns are numbered by their opening parenthesis
/(foo(bar)(baz))/
1 foobarbaz
2 bar
3 baz
Matches available from within PHP
$matches = array(
    0 => "foobarbaz",
    1 => "foobarbaz",
    2 => "bar",
    3 => "baz",
)
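The array above can be reproduced with preg_match. In this sketch the surrounding x characters in the subject are an assumption, added to show that index 0 holds only the matched part.

```php
<?php
// Groups are numbered in the order of their opening parentheses:
// 1 = (foo(bar)(baz)), 2 = (bar), 3 = (baz)
preg_match('/(foo(bar)(baz))/', 'xfoobarbazx', $matches);
```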
43. Subpattern Naming
PCRE allows custom naming
/(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/
Result with input Jakob Westhoff
array(
    0 => 'Jakob Westhoff',
    'firstname' => 'Jakob',
    1 => 'Jakob',
    'lastname' => 'Westhoff',
    2 => 'Westhoff',
)
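A brief PHP usage sketch for the named subpatterns above. The $fullName variable is a hypothetical addition showing why names can be more convenient than numeric indexes.

```php
<?php
$pattern = '/(?P<firstname>[A-Za-z]+) (?P<lastname>[A-Za-z]+)/';
preg_match($pattern, 'Jakob Westhoff', $matches);

// Each named group is available by name and by number.
$fullName = $matches['lastname'] . ', ' . $matches['firstname'];
```

Accessing groups by name keeps the calling code stable if further groups are inserted into the pattern later.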
45. Assertions
Formulate assertions about the text following a match without
consuming it
Example
/foo(?=foo)/
Input
foofoofoo
Match
foo at offset 0 (and again at offset 3); the asserted foo is not
part of the match
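The effect of the lookahead can be checked with preg_match_all and offsets. This is a small sketch using the slide's pattern and input.

```php
<?php
// PREG_OFFSET_CAPTURE records where each match starts.
preg_match_all('/foo(?=foo)/', 'foofoofoo', $matches, PREG_OFFSET_CAPTURE);

// Each entry is [matched text, offset]; the text is always just "foo".
$found = $matches[0];
```

The third foo is never returned: it only serves as the asserted text for the second match and has no following foo of its own.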
54. Negative Assertions
Negative assertions are possible
foo not followed by another foo
/foo(?!foo)/
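A short PHP sketch of the negative assertion, run against the same foofoofoo input as on the previous slide (that reuse is an assumption for illustration).

```php
<?php
// Only a "foo" that is NOT followed by another "foo" may match.
preg_match_all('/foo(?!foo)/', 'foofoofoo', $matches, PREG_OFFSET_CAPTURE);

$found = $matches[0];
```

Only the last foo qualifies, since the first two are each followed by another foo.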
56. Backward Assertions
bar preceded by foo
/(?=foo)bar/ ? (does not work: the lookahead inspects the same
position at which bar has to match)
Backward assertion
/(?<=foo)bar/
Negative backward assertion
bar not preceded by foo
/(?<!foo)bar/
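The three patterns on this slide can be compared directly in PHP. The subjects foobar and bazbar are illustrative assumptions.

```php
<?php
// Lookbehind: "bar" matches only if "foo" stands directly before it.
$preceded = preg_match('/(?<=foo)bar/', 'foobar', $match, PREG_OFFSET_CAPTURE);

// Negative lookbehind: rejected in "foobar", accepted in "bazbar".
$rejected = preg_match('/(?<!foo)bar/', 'foobar');
$accepted = preg_match('/(?<!foo)bar/', 'bazbar');

// A lookahead cannot do this job: "foo" and "bar" would have to
// start at the same position, so the pattern never matches.
$lookahead = preg_match('/(?=foo)bar/', 'foobar');
```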
60. Inner workings of the PCRE matcher
PCRE uses backtracking to find matches
Pattern: /\d+foo/
Subject: 123456789bar
1 Eat up all the digits: 123456789
2 Try to match foo
3 Backtrack one digit and try to match foo again
4 Repeat step 3 until a match is found or the subject's beginning
is reached
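The four steps can be imitated by hand in plain PHP. This is only a simplified simulation of the attempt starting at offset 0, not how PCRE is implemented; a real engine also retries from later offsets and applies its own optimizations. All variable names are hypothetical.

```php
<?php
$subject = '123456789bar';
$needle  = 'foo';

// Step 1: eat up all leading digits (what a greedy \d+ does first).
$digits = strspn($subject, '0123456789');

// Steps 2 to 4: try to match "foo", give back one digit, try again.
$attempts = 0;
$matched  = false;
for ($length = $digits; $length >= 1; $length--) {
    $attempts++;
    if (substr($subject, $length, strlen($needle)) === $needle) {
        $matched = true;
        break;
    }
}
```

For this subject the loop makes one attempt per digit before giving up, which is the work a once-only subpattern is meant to avoid.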
66. Once only subpattern
Once-only subpatterns prevent backtracking once a certain
pattern has acquired a match.
Applying a once-only pattern to the shown example
/(?>\d+)foo/
After matching the digits and determining that the following
string is not foo, the matcher gives up on this attempt instead of
backtracking
123456789bar
Can massively improve regex speed if used correctly
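A hedged PHP sketch of what "no backtracking" means in practice. The \d+\d patterns are an assumption added to make the difference visible in the result, since the slide's own example fails to match either way.

```php
<?php
// Slide example: no "foo" follows the digits, so there is no match.
$slideExample = preg_match('/(?>\d+)foo/', '123456789bar');

// Greedy \d+ gives back one digit so the trailing \d can match.
$backtracking = preg_match('/\d+\d/', '12345');

// The once-only group keeps all digits; nothing is left for \d.
$atomic = preg_match('/(?>\d+)\d/', '12345');

// Possessive quantifier: shorthand for the same behaviour.
$possessive = preg_match('/\d++\d/', '12345');
```

A once-only subpattern can therefore change which subjects match, not only how fast the answer is found, so it is worth checking that nothing after the group depends on backtracking into it.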
70. Conditional subpattern
If-statement equivalent in PCRE
/(?(condition)yes-pattern|no-pattern)/
Conditions can be references to a captured subpattern or assertions
Numbers need to be followed by foo, while everything else
needs to be followed by bar
/(?(?=\d)\d+foo|\D+bar)/
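Two small PHP sketches of conditional subpatterns, one with an assertion as condition and one with a reference to a captured group. The anchors, the bracket pattern and all subject strings are assumptions chosen for illustration.

```php
<?php
// Assertion as condition: digits must be followed by "foo",
// anything else by "bar".
$byAssertion = '/^(?(?=\d)\d+foo|\D+bar)$/';
$digitsFoo = preg_match($byAssertion, '123foo');
$lettersBar = preg_match($byAssertion, 'abcbar');
$digitsBar = preg_match($byAssertion, '123bar');

// Group reference as condition: a closing ">" is required only if
// the optional opening "<" (group 1) took part in the match.
$byReference = '/^(<)?\w+(?(1)>)$/';
$bracketed = preg_match($byReference, '<tag>');
$plain = preg_match($byReference, 'tag');
$unclosed = preg_match($byReference, '<tag');
```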
77. Unicode: Character, code points and graphemes
Unicode consists of different code points
The letter a: U+0061
The combining grave accent mark `: U+0300
One character might consist of multiple code points
The letter a with the mark ` (à): U+0061 U+0300
Some of these combinations exist as single code points
The letter à: U+00E0
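The two spellings of à can be inspected as raw UTF-8 bytes in PHP. This sketch assumes PHP 7 or later for the "\u{...}" string escape; the variable names are hypothetical.

```php
<?php
$decomposed  = "a\u{0300}"; // U+0061 followed by U+0300
$precomposed = "\u{00E0}";  // U+00E0

// Both render as the same character, yet the byte strings differ.
$identical = ($decomposed === $precomposed);

// UTF-8 bytes of each spelling as hexadecimal.
$decomposedHex  = bin2hex($decomposed);
$precomposedHex = bin2hex($precomposed);
```

Because the byte sequences differ, plain string comparison and byte-oriented patterns treat the two spellings as different text.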
84. Unicode: Pattern matching
Unicode processing is enabled using the u modifier
PCRE works on UTF-8 encoded strings
Each code point is handled as one character
Match any Unicode code point by its number: \x{FFFF}
Remember the letter a with the mark ` (à)
/\x{0061}\x{0300}/u
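A PHP sketch of code point matching with the u modifier, assuming the slide's pattern is meant to be /\x{0061}\x{0300}/u. The anchors and test strings are illustrative additions.

```php
<?php
$decomposed  = "a\u{0300}"; // a + combining grave accent
$precomposed = "\u{00E0}";  // single code point

// Matches exactly the two code point spelling.
$pattern = '/^\x{0061}\x{0300}$/u';
$matchesDecomposed  = preg_match($pattern, $decomposed);
$matchesPrecomposed = preg_match($pattern, $precomposed);

// With u the dot consumes one code point, without it one byte.
$codePoints = preg_match_all('/./u', $precomposed);
$bytes      = preg_match_all('/./', $precomposed);
```

Note the lowercase u: the uppercase U modifier exists as well, but it inverts quantifier greediness and has nothing to do with Unicode.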
90. Unicode: Extended unicode sequences
How to match both the single and the multi code point character?
Remember: à = U+0061 U+0300 or U+00E0
Using the escape for extended Unicode sequences: \X
\X is equivalent to (?>\P{M}\p{M}*)
Wait. What? → Unicode character properties
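A PHP sketch showing that \X accepts both spellings of à. Newer PCRE releases define \X in terms of extended grapheme clusters, so the equivalence on the slide is only approximate there; for this input both forms should agree.

```php
<?php
$decomposed  = "a\u{0300}"; // two code points
$precomposed = "\u{00E0}";  // one code point

// \X consumes a base character plus any following marks.
$extendedDecomposed  = preg_match('/^\X$/u', $decomposed);
$extendedPrecomposed = preg_match('/^\X$/u', $precomposed);

// The spelled-out form from the slide behaves the same here.
$spelledOut = preg_match('/^(?>\P{M}\p{M}*)$/u', $decomposed);

// A single dot covers only one code point of the decomposed form.
$singleDot = preg_match('/^.$/u', $decomposed);
```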
95. Unicode: Character properties
Every Unicode code point has a certain property assigned
Characters may be matched by these properties
Escapes \p and \P are used for this:
\p{xx}: All code points with the property xx
\P{xx}: All code points without the property xx
Possible properties:
L: Letter
M: Mark
P: Punctuation
Sc: Currency symbol
...
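A short PHP sketch of property matching. The sample text is made up and written with "\u{...}" escapes (PHP 7 or later) so that it does not depend on the encoding of the source file.

```php
<?php
// "Größe: 5 €!" spelled with explicit code points.
$text = "Gr\u{00F6}\u{00DF}e: 5 \u{20AC}!";

// Runs of letters, including the non-ASCII ones.
preg_match_all('/\p{L}+/u', $text, $letters);

// Currency symbols.
preg_match_all('/\p{Sc}/u', $text, $currency);

// Punctuation characters.
preg_match_all('/\p{P}/u', $text, $punctuation);
```

Unlike [a-z], the letter property also covers ö and ß, which is usually the intent when validating names or words.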
99. Pattern Recursion
Recursion in regular expressions?
Possible with PCRE
Validate BB-Code using PCRE
[b]Hello [i]World[/i]![/b]
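One possible recursive pattern for the BB-Code example, offered as a sketch and not necessarily the one used in the talk. It assumes only the tags b and i exist; the group name content is hypothetical.

```php
<?php
// content = any mix of plain text and correctly nested [b] / [i]
// pairs; (?&content) calls the named group recursively.
$pattern = '~^(?<content>(?:[^\[\]]++'
         . '|\[b\](?&content)\[/b\]'
         . '|\[i\](?&content)\[/i\])*)$~';

$nested   = preg_match($pattern, '[b]Hello [i]World[/i]![/b]');
$crossed  = preg_match($pattern, '[b]Hello [i]World[/b][/i]');
$unclosed = preg_match($pattern, '[b]Hello');
```

The pattern only answers whether the nesting is valid; it does not tell which tag is wrong or where.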
108. Do NOT Parse Using Regular Expressions
Even though this is possible you do NOT want to do it
It is not maintainable
It is nearly impossible to find errors
Useful information extraction (building an AST) is not possible
Use regular expressions for
Matching patterns (not recursive structures)
Tokenizing strings
Validating tightly restricted input values
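A sketch of the division of labour recommended here: the regular expression only tokenizes, and ordinary PHP code with a stack checks the nesting. The token format and variable names are assumptions for illustration; the foreach destructuring needs PHP 7.1 or later.

```php
<?php
$input = '[b]Hello [i]World[/i]![/b]';

// Tokenize: either a tag (optional slash + name) or a run of text.
preg_match_all('~\[(/?)([a-z]+)\]|[^\[\]]+~', $input, $found, PREG_SET_ORDER);

$tokens = [];
foreach ($found as $m) {
    if (isset($m[2]) && $m[2] !== '') {
        $tokens[] = [$m[1] === '/' ? 'close' : 'open', $m[2]];
    } else {
        $tokens[] = ['text', $m[0]];
    }
}

// Parse: every closing tag must match the most recent open tag.
$stack = [];
$valid = true;
foreach ($tokens as [$type, $value]) {
    if ($type === 'open') {
        $stack[] = $value;
    } elseif ($type === 'close' && array_pop($stack) !== $value) {
        $valid = false;
        break;
    }
}
$valid = $valid && $stack === [];
```

Because the nesting logic lives in normal code, it can report the offending tag or build a tree, which a single recursive pattern cannot do.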
116. Thanks for listening
Questions, comments or annotations?
Slides: http://westhoffswelt.de/portfolio.htm
Contact: Jakob Westhoff <jakob@php.net>
Twitter: @jakobwesthoff
Please leave comments and vote at: http://joind.in/1620