This document provides an overview of regular expressions (regexes), including their history, uses, syntax, and implementation in languages like JavaScript and .NET. It describes common regex constructs like character classes, quantifiers, groups, anchors, and lookarounds. It also discusses optimizing regex performance through interpreted vs compiled matching and instance vs static method calls. Tools for designing and testing regexes are also listed.
4. • We have a problem.
• Let’s use regexes!
• Now we have two problems.
5. What about you?
• Can you read regexes?
^[0-9]\w*$
• Can you really read regexes?
^[^)(]*((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!)))[^)(]*$
6. Language overview
• Character classes
  • \w (word character)
  • \d (decimal digit)
  • \s (whitespace)
  • \W (not \w)
  • \D (not \d)
  • \S (not \s)
  • . (wildcard: any character except newline)
  • Character group: [abc]
  • Negation: [^a1]
  • Range: [C-F] or [2-6A-D]
  • Subtraction: [A-Z-[B]]
• Anchors
  • ^ (beginning of string or line)
  • $ (end of string or line)
  • \b (word boundary)
  • \B (not \b)
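As a rough illustration of the classes and anchors above, here is a minimal C# sketch. The class and method names (CharacterClassDemo, IsDigitThenWord, and so on) are made up for this example; only Regex.IsMatch, Regex.Matches and Regex.Escape come from System.Text.RegularExpressions.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CharacterClassDemo
{
    // ^ and $ anchor the pattern: one digit, then zero or more word characters.
    public static bool IsDigitThenWord(string input) =>
        Regex.IsMatch(input, @"^[0-9]\w*$");

    // Character class subtraction: any of A-Z except B.
    public static bool IsUpperExceptB(string input) =>
        Regex.IsMatch(input, @"^[A-Z-[B]]$");

    // \b matches only at a word boundary, so a word embedded in a longer word is skipped.
    public static int CountWholeWord(string input, string word) =>
        Regex.Matches(input, @"\b" + Regex.Escape(word) + @"\b").Count;
}
```

For instance, CountWholeWord("cat concatenate cat", "cat") counts the two standalone occurrences and ignores the "cat" inside "concatenate".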
7. Language overview
• Quantifiers
  • Range: {n,m} , {n,}
  • Zero or more: * (can be written {0,})
  • One or more: + (can be written {1,})
  • Zero or one: ? (can be written {0,1})
• Greedy vs lazy
  • Greedy: matches as much as possible (the default)
  • Lazy: matches as little as possible
    • *? , +? , ?? , {n,m}?
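The difference between greedy and lazy quantifiers is easiest to see on the same input. A small sketch, with hypothetical names (QuantifierDemo, GreedyMatch, LazyMatch):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class QuantifierDemo
{
    // Greedy: .+ grabs as much as it can, then backtracks to the last '>'.
    public static string GreedyMatch(string input) =>
        Regex.Match(input, "<.+>").Value;

    // Lazy: .+? grows one character at a time and stops at the first '>'.
    public static string LazyMatch(string input) =>
        Regex.Match(input, "<.+?>").Value;
}
```

On the input "<b>bold</b>", the greedy version returns the whole string, while the lazy version returns only "<b>".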
8. Language overview
• Grouping constructs
  • Capturing group: (subexpression)
  • Named group: (?<group_name>subexpression)
  • Non-capturing group: (?:subexpression)
  • Balancing group: (?<name1-name2>subexpression)
• Lookaround assertions (zero length)
  • Positive lookahead: (?=subexpression)
  • Negative lookahead: (?!subexpression)
  • Positive lookbehind: (?<=subexpression)
  • Negative lookbehind: (?<!subexpression)
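The grouping and lookaround constructs above can be sketched as follows. All class and method names are hypothetical. The balanced-parentheses pattern uses a .NET balancing group with a hypothetical stack name p: each "(" pushes onto p, each ")" pops from it, and the final conditional fails the match if anything is left on the stack.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class GroupingDemo
{
    // Named group captures the year; the month sits in a non-capturing group.
    public static string ExtractYear(string input)
    {
        Match m = Regex.Match(input, @"(?<year>\d{4})-(?:\d{2})");
        return m.Success ? m.Groups["year"].Value : "";
    }

    // Positive lookahead: the digits must be followed by "px", which is not consumed.
    public static string PixelValue(string input) =>
        Regex.Match(input, @"\d+(?=px)").Value;

    // Positive lookbehind: the digits must be preceded by a literal '$'.
    public static string DollarAmount(string input) =>
        Regex.Match(input, @"(?<=\$)\d+").Value;

    // Balancing group: push on '(', pop on ')', reject if the stack is not empty at the end.
    public static bool IsBalanced(string input) =>
        Regex.IsMatch(input, @"^[^)(]*((?>[^()]+|\((?<p>)|\)(?<-p>))*(?(p)(?!)))[^)(]*$");
}
```

Lookarounds are zero length, so PixelValue returns only the digits and leaves "px" out of the match value.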
9. Language overview
• Backreference constructs
  • \groupnumber or \k<groupname>
• Alternation constructs
  • (expression1|..|expressionN)
  • (?(expression)yes|no)
  • (?(referenced group)yes|no)
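A short sketch of a named backreference and of a conditional on a referenced group. The names (BackreferenceDemo, FirstRepeatedWord, IsOptionallyParenthesized) and the group names word and open are chosen for this example only.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class BackreferenceDemo
{
    // \k<word> must repeat exactly what the group "word" captured.
    // Returns the doubled word, or "" when there is none.
    public static string FirstRepeatedWord(string input)
    {
        Match m = Regex.Match(input, @"\b(?<word>\w+)\s+\k<word>\b");
        return m.Success ? m.Groups["word"].Value : "";
    }

    // (?(open)\)) requires a closing parenthesis only if the group "open" matched.
    public static bool IsOptionallyParenthesized(string input) =>
        Regex.IsMatch(input, @"^(?<open>\()?\d+(?(open)\))$");
}
```

With this pattern, "42" and "(42)" are accepted, while "(42" and "42)" are rejected.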
10. Format/Comment your code
Just as you do when you write code…
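using System.Reflection;
using System.Text.RegularExpressions;

// Precompiles one pattern into a standalone assembly.
// Regex.CompileToAssembly is only supported on .NET Framework.
// an: assembly name, pn: pattern, n: regex type name, nn: namespace.
public static void C(string an, string pn, string n, string nn)
{
    RegexCompilationInfo[] re =
    {
        new RegexCompilationInfo(pn, RegexOptions.Compiled, n, nn, true)
    };
    AssemblyName asn = new AssemblyName();
    asn.Name = an;
    Regex.CompileToAssembly(re, asn);
}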
Regexes can have inline comments:
(?#comment)
They can also be written across multiple lines, with # end-of-line comments, as long as the RegexOptions.IgnorePatternWhitespace option is set.
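A multi-line, commented pattern could look like the sketch below. The date pattern and the names (CommentedPatternDemo, IsoDate, IsIsoDate) are assumptions made for the example; it checks only the shape of the string, not whether the date is valid.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CommentedPatternDemo
{
    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is ignored
    // and '#' starts a comment that runs to the end of the line.
    private static readonly Regex IsoDate = new Regex(@"
        ^
        (?<year>\d{4})   # four-digit year
        -
        (?<month>\d{2})  # two-digit month
        -
        (?<day>\d{2})    # two-digit day
        $",
        RegexOptions.IgnorePatternWhitespace);

    public static bool IsIsoDate(string input) => IsoDate.IsMatch(input);
}
```

To match a literal space or '#' in this mode, escape it (for example "\ " or "\#") or put it inside a character class.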
12. In .NET / C#
• A class to know: System.Text.RegularExpressions.Regex
• Represents the Regex engine
• A pattern is tightly coupled to the regex engine
• All regular expressions must be compiled (sooner or later)
• Initialization can be an expensive process
14. Instance or static method calls?
• Both provide the same matching/replacing methods
• Static method calls cache the parsed patterns (15 entries by default)
• Manage the cache size using Regex.CacheSize
• Only static calls use caching (since .NET 2.0)
15. Instance or static method calls?
• new Regex(pattern).IsMatch(email)
vs
• Regex.IsMatch(email, pattern)
Data from:
http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
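The two call styles compared above can be sketched side by side. The e-mail pattern here is a deliberately simplified placeholder, and the names (EmailChecks, ViaInstance, ViaStatic, SetCacheSize) are hypothetical. A common way to avoid re-parsing with instance calls is to build the Regex once and keep it in a static field.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class EmailChecks
{
    // Simplified placeholder pattern, not a real address validator.
    private const string Pattern = @"^[^@\s]+@[^@\s]+\.[^@\s]+$";

    // Instance style: the pattern is parsed once and the object is reused.
    private static readonly Regex Cached = new Regex(Pattern);

    public static bool ViaInstance(string email) => Cached.IsMatch(email);

    // Static style: the engine looks the pattern up in its internal cache.
    public static bool ViaStatic(string email) => Regex.IsMatch(email, Pattern);

    // The static cache size can be changed when many distinct patterns are in use.
    public static void SetCacheSize(int size) => Regex.CacheSize = size;
}
```

Calling new Regex(pattern) inside a loop, as on the slide, repeats the parsing work on every iteration, which is presumably what the cited measurements illustrate.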
16. Interpreted or compiled
• Interpreted (the default):
  • The pattern is converted to opcodes on initialization (static or instance).
  • The opcodes are interpreted by the regex engine when a matching method is called.
  • Startup time is reduced, but execution is slower.
• Compiled (RegexOptions.Compiled):
  • Opcodes are created on initialization (static or instance).
  • The regex is converted to MSIL code.
  • The MSIL code is compiled by the JIT and executed when the method is called.
  • Execution time is reduced, but startup is slower.
• Compiled at design time:
  • Regex.CompileToAssembly
  • The regex is fixed and used only in instance calls.
  • Startup and execution time are reduced at run time, but the compilation must be done at design time.
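Switching between the first two modes is only a matter of options; the matching API is identical. A minimal sketch with a made-up pattern and hypothetical names (CompiledDemo, CountInterpreted, CountCompiled):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CompiledDemo
{
    private const string Pattern = @"\b\d{3}-\d{4}\b";

    // Default: cheap to construct, slower per match.
    private static readonly Regex Interpreted = new Regex(Pattern);

    // RegexOptions.Compiled: more expensive to construct, faster per match.
    private static readonly Regex Compiled = new Regex(Pattern, RegexOptions.Compiled);

    public static int CountInterpreted(string text) => Interpreted.Matches(text).Count;

    public static int CountCompiled(string text) => Compiled.Matches(text).Count;
}
```

Both variants return the same results. RegexOptions.Compiled tends to pay off only for patterns that are created once and used many times, so it is worth measuring on the real workload.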
17. Interpreted or compiled
Data from:
http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx
18. Tools
• Regex Design
• Expresso
• The Regex Coach
• RegexBuddy (not free)
• Rex (Microsoft Research)
• Visual Studio
19. Bonus
• Mail::RFC822::Address: regexp-based address validation
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
• A regular expression to check for prime numbers (it matches unary strings whose length is not prime):
^1?$|^(11+?)\1+$
http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
• RegEx match open tags except XHTML self-contained tags (Stack Overflow)
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
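The prime-number pattern in the list above works on unary input: the number n is written as n copies of "1", and the pattern matches when n is 0, 1, or a product of two factors greater than 1. A small sketch (PrimeDemo and IsPrime are hypothetical names; this is a curiosity, not an efficient primality test):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PrimeDemo
{
    // (11+?) tries a divisor of 2 or more, and \1+ requires it to repeat to the end.
    // A match means "not prime", so the result is negated.
    public static bool IsPrime(int n) =>
        n >= 0 && !Regex.IsMatch(new string('1', n), @"^1?$|^(11+?)\1+$");
}
```

The backreference \1 is what makes this work, which is also why the pattern backtracks heavily on larger inputs.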
20. Regex optimization
• Time-out
• Consider the input source
• Capture only when necessary
• Factorization
• Backtracking
“In general, a Nondeterministic Finite Automaton (NFA) engine like the .NET
Framework regular expression engine places the responsibility for crafting
efficient, fast regular expressions on the developer.”
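For the time-out item, .NET 4.5 and later accept a match time-out so that a pattern that backtracks excessively on untrusted input is abandoned instead of hanging. A minimal sketch, assuming a hypothetical wrapper named TryIsMatch:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TimeoutDemo
{
    // Returns the match result, or false with timedOut set to true
    // when the engine exceeds the given time budget.
    public static bool TryIsMatch(string input, string pattern, TimeSpan timeout, out bool timedOut)
    {
        timedOut = false;
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            timedOut = true;
            return false;
        }
    }
}
```

A time-out is a safety net rather than a fix. Nested quantifiers such as (a+)+ are a classic source of excessive backtracking, and rewriting them (for example with an atomic group (?>...)) usually addresses the cause.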