3. Introduction
• A regular expression (RE or regex) is a text
string that describes a search pattern
• Similar in spirit to wildcards (e.g. *.txt)
• Platform/language independent in concept, with
minor syntax differences between implementations
• Available in, among others, C#, Perl, Python, Java,
JavaScript, Visual Studio, Notepad++ and Linux
command line tools
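To make the wildcard comparison concrete, here is a minimal sketch in Python (chosen only for illustration; the file names are made up). It filters the same list once with the wildcard *.txt and once with a regex that is assumed to be an equivalent pattern.

```python
import fnmatch
import re

# Hypothetical file names, used only for this illustration.
filenames = ["notes.txt", "notes.txt.bak", "README", "a.txt"]

# Shell-style wildcard: '*' means "any run of characters".
wildcard_hits = [name for name in filenames
                 if fnmatch.fnmatchcase(name, "*.txt")]

# Regex counterpart: '.*' is "any run of characters", '\.' is a literal
# dot, and '^' / '$' anchor the pattern to the whole string.
pattern = re.compile(r"^.*\.txt$")
regex_hits = [name for name in filenames if pattern.search(name)]

print(wildcard_hits)
print(regex_hits)
```

Both lists contain the same two names; the regex is longer but can express things a wildcard cannot, such as "ends in a digit".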
4. Applications
• Syntax highlighting
• Find and replace
– Visual Studio (as usual: different syntax)
– Notepad++
• Text searching
– Unix tools: grep, sed, find
• Programming
– Pattern matching, filtering, replacing
6. Constructs: word groups
• \d is shorthand for Digits [0-9]
• \w is shorthand for Word characters [a-zA-Z0-9_]
• \s is shorthand for whiteSpace [ \t\r\n]
• \D is shorthand for non-Digits [^0-9]
• \W is shorthand for non-Word characters [^a-zA-Z0-9_]
• \S is shorthand for non-whiteSpace [^ \t\r\n]
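The shorthand classes can be tried out with a short sketch. It assumes Python's re module and a made-up sample string; the re.ASCII flag is passed so that the classes behave like the ASCII sets listed on the slide (without it, Python's \w and \d also accept non-ASCII letters and digits).

```python
import re

# Made-up sample text, used only for this illustration.
text = "Order 66: 3 items\tshipped_ok"

digits = re.findall(r"\d", text, re.ASCII)    # single digits
words = re.findall(r"\w+", text, re.ASCII)    # runs of word characters
spaces = re.findall(r"\s", text, re.ASCII)    # single whitespace characters
non_digits = re.findall(r"\D+", "a1b22c", re.ASCII)  # runs of non-digits

print(digits)      # ['6', '6', '3']
print(words)       # ['Order', '66', '3', 'items', 'shipped_ok']
print(spaces)      # [' ', ' ', ' ', '\t']
print(non_digits)  # ['a', 'b', 'c']
```

Note that "shipped_ok" stays in one piece because the underscore counts as a word character.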
7. Constructs: multiple character
Regex        Strings that match        Strings that don’t match
abc          “abcde”, “rabco”          “bac”, “ABC”, “bcde”
^abc         “abc”, “abcde”            “rabco”, “ABC”, “bcde”
E\d          “ME3”, “E85”, “EBE5”      “ACME”, “E”, “E 8”
E\d+$        “ME35”, “EBE5”            “E85x”, “E3EB”, “E 8”
foo|bar      “foot”, “bart”            “ooba”
a(b|c)d      “Tabd”, “acdc”            “abcd”, “aed”, “ad”
^foo|bar     “foo1”, “Harbar”          “toofoo”
^(foo|bar)   “foo1”, “bart”            “toofoo”, “Harbar”
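The rows of this table can be checked mechanically. The sketch below assumes Python's re module and a hypothetical helper named check_row; it searches each example string for the pattern and compares the outcome with the column the string is listed in.

```python
import re

# (pattern, strings that should match, strings that should not match)
rows = [
    (r"abc", ["abcde", "rabco"], ["bac", "ABC", "bcde"]),
    (r"^abc", ["abc", "abcde"], ["rabco", "ABC", "bcde"]),
    (r"E\d", ["ME3", "E85", "EBE5"], ["ACME", "E", "E 8"]),
    (r"E\d+$", ["ME35", "EBE5"], ["E85x", "E3EB", "E 8"]),
    (r"foo|bar", ["foot", "bart"], ["ooba"]),
    (r"a(b|c)d", ["Tabd", "acdc"], ["abcd", "aed", "ad"]),
    (r"^foo|bar", ["foo1", "Harbar"], ["toofoo"]),
    (r"^(foo|bar)", ["foo1", "bart"], ["toofoo", "Harbar"]),
]


def check_row(pattern, should_match, should_not_match):
    """Return True when every example behaves as the table claims."""
    regex = re.compile(pattern)
    return (all(regex.search(s) for s in should_match)
            and not any(regex.search(s) for s in should_not_match))


results = {pattern: check_row(pattern, yes, no) for pattern, yes, no in rows}
for pattern, ok in results.items():
    print(pattern, "ok" if ok else "MISMATCH")
```

The last two rows show why the parentheses matter: in ^foo|bar the anchor belongs to foo only, so “Harbar” matches through its unanchored bar.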
9. Constructs: advanced
         Regex    String                           First match
Greedy   <.+>     “This is a <B>first</B> test”    “<B>first</B>”
Lazy     <.+?>    “This is a <B>first</B> test”    “<B>”
Greedy/lazy repetition
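A minimal sketch of the greedy/lazy difference, assuming Python's re module (the quantifier syntax is the same as in the table):

```python
import re

text = "This is a <B>first</B> test"

# Greedy: '.+' grabs as much as possible, then backs off to the last '>'.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: '.+?' grabs as little as possible and stops at the first '>'.
lazy = re.search(r"<.+?>", text).group(0)

# With the lazy form, each tag is found separately.
all_tags = re.findall(r"<.+?>", text)

print(greedy)    # <B>first</B>
print(lazy)      # <B>
print(all_tags)  # ['<B>', '</B>']
```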
Regex                 Strings that match   Groups (for “acd”)
^(a|b)c(d)$           “acd”, “bcd”         0:“acd” 1:“a” 2:“d”
^(?:a|b)c(d)$         same as above        0:“acd” 1:“d”
^(?<name>a|b)c(d)$    same as above        0:“acd” 1:“d” name:“a”
Grouping
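The three grouping forms can be compared in a short sketch. It assumes Python's re module, which differs from the table in two ways: named groups are written (?P<name>...) instead of (?<name>...), and Python numbers named groups in order of appearance, whereas the numbering in the table (unnamed groups first) follows .NET.

```python
import re

# Two capturing groups: group 1 is (a|b), group 2 is (d).
plain = re.search(r"^(a|b)c(d)$", "acd")
plain_groups = plain.group(0, 1, 2)

# (?:...) groups without capturing, so (d) becomes group 1.
non_capturing = re.search(r"^(?:a|b)c(d)$", "acd")
non_capturing_groups = non_capturing.group(0, 1)

# Named group, Python spelling. Here it is also reachable as group 1.
named = re.search(r"^(?P<name>a|b)c(d)$", "acd")
named_value = named.group("name")

print(plain_groups)          # ('acd', 'a', 'd')
print(non_capturing_groups)  # ('acd', 'd')
print(named_value)           # a
```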
10. Constructs: advanced
Regex                       Strings that match      Strings that don’t match
^([a-c])x\1x\1              “axaxa”, “bxbxbyyyy”    “axaxb”, “bxaxc”
<(b)><(i)>.*?</\2></\1>     “<b><i>bla</i></b>”     “<b><i>bla</b></i>”
Backreferences: inside regex
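A backreference such as \1 matches whatever text group 1 captured, not the group's pattern again. A minimal sketch, assuming Python's re module and the two patterns from the table:

```python
import re

# \1 must repeat the exact letter captured by ([a-c]).
repeated_letter = re.compile(r"^([a-c])x\1x\1")

# \2 and \1 force the closing tags to mirror the opening tags.
nested_tags = re.compile(r"<(b)><(i)>.*?</\2></\1>")

same_letter = bool(repeated_letter.search("axaxa"))       # True
mixed_letters = bool(repeated_letter.search("axaxb"))     # False
well_nested = bool(nested_tags.search("<b><i>bla</i></b>"))    # True
badly_nested = bool(nested_tags.search("<b><i>bla</b></i>"))   # False

print(same_letter, mixed_letters, well_nested, badly_nested)
```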
Regex           Strings that match   Replace pattern   Result strings
^(var)(1|2)$    “var1”, “var2”       \1iable           “variable”
^(a|b)c(d|e)    “acd”, “bcd”         \2XXX\1           “dXXXa”, “dXXXb”
Backreferences: find and replace
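The find-and-replace rows can be reproduced with a small sketch, assuming Python's re.sub, which accepts \1 and \2 in the replacement pattern. Other tools spell this differently; .NET's Regex.Replace, for instance, writes the same references as $1 and $2.

```python
import re

# Row 1: keep group 1 ("var") and append "iable".
variable_1 = re.sub(r"^(var)(1|2)$", r"\1iable", "var1")
variable_2 = re.sub(r"^(var)(1|2)$", r"\1iable", "var2")

# Row 2: swap the two captured letters around a fixed "XXX".
swapped_a = re.sub(r"^(a|b)c(d|e)", r"\2XXX\1", "acd")
swapped_b = re.sub(r"^(a|b)c(d|e)", r"\2XXX\1", "bcd")

print(variable_1, variable_2)  # variable variable
print(swapped_a, swapped_b)    # dXXXa dXXXb
```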
11. Constructs: advanced
Regex             Strings that match        Strings that don’t match
([a-b](?=x))      “blaax”, “bxa”            “ab”, “bacx”
((?<=x)[a-b])     “yyxa”, “bxb”             “ab”, “aax”
([a-b](?!x))      “bla”, “a”, “bxa”         “bxc”, “ax”
((?<!x)[a-b])     “ral”, “dbx”, “bxa”       “xa”, “lxb”
Look ahead/behind
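Lookarounds test the text next to a match without consuming it. The sketch below assumes Python's re module; it first checks the table rows (with the outer capturing parentheses dropped, which does not change what matches) and then shows where selected matches land.

```python
import re

# (pattern, strings that should match, strings that should not match)
rows = [
    (r"[a-b](?=x)", ["blaax", "bxa"], ["ab", "bacx"]),      # lookahead
    (r"(?<=x)[a-b]", ["yyxa", "bxb"], ["ab", "aax"]),       # lookbehind
    (r"[a-b](?!x)", ["bla", "a", "bxa"], ["bxc", "ax"]),    # negative lookahead
    (r"(?<!x)[a-b]", ["ral", "dbx", "bxa"], ["xa", "lxb"]), # negative lookbehind
]

table_holds = all(
    all(re.search(pattern, s) for s in yes)
    and not any(re.search(pattern, s) for s in no)
    for pattern, yes, no in rows
)

# The 'x' is required but is not part of the matched text.
ahead = re.search(r"[a-b](?=x)", "blaax")
ahead_text, ahead_span = ahead.group(0), ahead.span()

# In "bxa" the 'b' is rejected (an 'x' follows), so the final 'a' matches.
not_ahead_start = re.search(r"[a-b](?!x)", "bxa").start()

print(table_holds)             # True
print(ahead_text, ahead_span)  # a (3, 4)
print(not_ahead_start)         # 2
```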
12. Regexes in C#
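• using System.Text.RegularExpressions;
• Regex reg = new Regex("a(b|c)d");
– reg.IsMatch("abd");              // true
– reg.Matches("abdEacd");          // 2 matches ("abd" and "acd")
– reg.Match("abdEabdC").Groups;    // 2 groups per match (0: whole match, 1: "b")
– reg.Split("abdEabdC");           // "", "b", "E", "b", "C": the pieces "E" and "C", plus a leading empty string and each captured "b"
– reg.Replace("abdEabdC", "X");    // "XEXC"
• Singleline/Multiline options: decide whether a line break (\n) is
treated as a special character (Singleline: . also matches \n;
Multiline: ^ and $ also match at each line break)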
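The effect of the Singleline/Multiline options can be sketched as follows. This example uses Python rather than C#, on the assumption that Python's re.DOTALL and re.MULTILINE flags are the closest counterparts of .NET's RegexOptions.Singleline and RegexOptions.Multiline; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# By default '.' does not match a line break ...
dot_default = re.search(r"line.second", text) is not None
# ... with DOTALL (Singleline in .NET) it does.
dot_singleline = re.search(r"line.second", text, re.DOTALL) is not None

# By default '^' matches only at the start of the whole string ...
starts_default = re.findall(r"^\w+", text)
# ... with MULTILINE it also matches right after each line break.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

print(dot_default, dot_singleline)  # False True
print(starts_default)               # ['first']
print(starts_multiline)             # ['first', 'second']
```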
16. Pros and cons
• Advantages
– Very flexible
– Fast processing
– Language independent
– A lot of work in a single line of code
– Often simpler than a ‘substring + indexes’ approach
• Disadvantages
– Hard to read: for example, ‘?’ has three meanings depending on context
– Hard to debug: no information is given when there is no match
– Patterns are compiled only at run time, so syntax errors surface late
– Typos are easily made (e.g. a forgotten escape character)
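Two of these disadvantages can be illustrated with a short sketch, assuming Python's re module; try_compile is a hypothetical helper and the version-number patterns are made up. An invalid pattern is only reported when the compiling line actually runs, and a forgotten escape produces a valid pattern that silently matches too much.

```python
import re


def try_compile(pattern):
    """Return the compiled regex, or an error message for an invalid pattern."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        return f"invalid pattern: {exc}"


# Missing ')': nothing complains until this call is executed.
broken = try_compile(r"a(b|c")

# Forgotten escape: '.' matches any character, so "1x5" is accepted too.
loose_accepts_typo = re.search(r"^1.5$", "1x5") is not None
# With the escape, only a literal dot is accepted.
strict_accepts_typo = re.search(r"^1\.5$", "1x5") is not None
strict_accepts_real = re.search(r"^1\.5$", "1.5") is not None

print(broken)
print(loose_accepts_typo, strict_accepts_typo, strict_accepts_real)
```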
17. Conclusions
“Some people, when confronted with a
problem, think ‘I know, I'll use regular
expressions.’ Now they have two problems.”
Jamie Zawinski
Don’t overuse it!
18. Conclusions
• Very handy tool for string matching and replacing
• Built-in support in most programming languages
• Support in/for multiple applications
• More info
– http://www.regular-expressions.info/
– http://msdn.microsoft.com/en-us/library/az24scfc%28v=vs.110%29.aspx
• Fun
– http://regex.alf.nu/
– http://www.i-programmer.info/news/144-graphics-and-games/5450can-you-do-the-regular-expression-crossword.html