Presentation I gave at the local http://www.cocoacoder.org/ meeting on using regular expressions in Cocoa code (although much of it applies to other languages as well).
Using Regular Expressions and Staying Sane
1. Regular Expressions
How not to turn one problem into two.
Carl Brown
CarlB@PDAgent.com
2. “Common Wisdom”
“Some people, when confronted
with a problem, think ‘I know, I'll
use regular expressions.’ Now
they have two problems.”
*See http://regex.info/blog/2006-09-15/247 for source.
3. What is a ‘Regular Expression’?
“...a concise and flexible means for
‘matching’ (specifying and recognizing) strings
of text, such as particular characters, words,
or patterns of characters” (So says Wikipedia)
“... a way of extracting substrings from text in a
‘usefully fuzzy’ way” (So says me)
4. ...so for example?
Pull out the host from a URL string:
http://([^/]*)/
Find the date in a string:
([0-9][0-9]*[-/][0-9][0-9]*[-/][0-9][0-9]*)
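A minimal sketch of those two patterns in use, written in Python rather than Cocoa since the talk notes that most of this carries over to other languages. The patterns are copied from the slide; the function names and sample strings are made up for illustration.

```python
import re

# Patterns copied from the slide.
HOST_PATTERN = re.compile(r"http://([^/]*)/")
DATE_PATTERN = re.compile(r"([0-9][0-9]*[-/][0-9][0-9]*[-/][0-9][0-9]*)")


def extract_host(url):
    """Return the host of an http:// URL, or None if the pattern does not match."""
    match = HOST_PATTERN.search(url)
    return match.group(1) if match else None


def find_date(text):
    """Return the first thing that looks like a date, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_host("http://www.example.com/index.html"))  # www.example.com
print(find_date("Meeting moved to 2011-03-10 at noon"))   # 2011-03-10
```

Note that the date pattern is "usefully fuzzy": it accepts any three runs of digits separated by - or /, so it will not reject something like 99/99/99.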
6. Two Kinds of (OOish) Languages
Some languages, like Perl or Ruby, have Regex built into their strings, so they get used often.
Most others, like Cocoa, Java, and Python, have Regular Expression Objects that are complicated and a Pain in the Ass.
16. But what about?
(?<!(=)|(="")|(='))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-\.@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?!.*/a>)
*That* guy has two problems
Well, actually, he has n! problems, where n is the number of hyperlinks in the input string
17. How to keep that from happening (my advice)
Limit yourself to only the basic metacharacters.
Favor clarity over brevity.
Take more, smaller bites.
Beware of greedy matching.
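One way to read "take more, smaller bites" is to run two simple patterns in sequence instead of one clever one. The sketch below is in Python, not the talk's Cocoa, and assumes double-quoted href attributes. The href pattern and the function name are illustrative assumptions, not something from the slides.

```python
import re

# Bite 1 (assumed pattern): pull out each double-quoted href value.
HREF_PATTERN = re.compile(r'href="([^"]*)"', re.IGNORECASE)
# Bite 2: pull the host out of a single URL, using only basic metacharacters.
HOST_PATTERN = re.compile(r"http://([^/][^/]*)/")


def hosts_in_html(html):
    """Return the host of every http:// link found in href attributes."""
    hosts = []
    for url in HREF_PATTERN.findall(html):
        match = HOST_PATTERN.search(url)
        if match:
            hosts.append(match.group(1))
    return hosts


sample = ('<a href="http://1.example.com/">one</a> and '
          '<A HREF="http://2.example.com/page">two</A>')
print(hosts_in_html(sample))  # ['1.example.com', '2.example.com']
```

Each pattern is short enough to read aloud, and each step can be tested on its own.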
21. PhraseBook pt 1
^.*
“the junk to the left of what I want”
This breaks down as ^ (the beginning of the string) followed by .* (any number of any character).
.*$
“the junk to the right of what I want”
This breaks down as .* (any number of any character) followed by $ (the end of the string).
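As a rough illustration of the two phrases, here is a Python sketch that throws away the junk on each side of a value. The log-line format, the user= marker, and the function name are invented for the example.

```python
import re


def user_from_log_line(line):
    """Return the value that follows 'user=' on a single log line."""
    # "^.*user=" is the junk to the left of what I want.
    without_left = re.sub(r"^.*user=", "", line)
    # " .*$" is the junk to the right: the first space and everything after it.
    return re.sub(r" .*$", "", without_left)


print(user_from_log_line("2011-03-10 12:00:01 user=carl action=login"))  # carl
```

In Python the dot does not match a newline by default, so this is meant for one line at a time.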
23. PhraseBook pt 2
[0-9][0-9]*
“a number with at least one digit”
The brackets ([ and ]) mean “any of the characters contained within the brackets”. So this means one character from 0-9 (so 0 1 2 3 4 5 6 7 8 or 9) followed by zero or more characters from that same set.
[^A-Za-z]
“any character that’s not a letter”
The ^ as the first character inside the brackets means “not”, so instead of meaning “any letter” it means “anything not a letter”.
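A small Python sketch of both bracket phrases. The sample strings are made up; the patterns are the ones on the slide.

```python
import re

NUMBER = re.compile(r"[0-9][0-9]*")      # a number with at least one digit
NOT_A_LETTER = re.compile(r"[^A-Za-z]")  # any character that's not a letter

numbers = NUMBER.findall("Room 12, floor 3, building 1024")
print(numbers)  # ['12', '3', '1024']

letters_only = NOT_A_LETTER.sub("", "R2-D2 & C-3PO!")
print(letters_only)  # RDCPO
```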
27. PhraseBook pt 3
\.
“a literal period” (e.g. to match the dot in .com)
\*
“a literal *” (e.g. to match an asterisk)
\( \) or \[ \]
“literal parentheses/brackets” (in Cocoa, at least)
( …stuff… )
“stuff I want to refer to later as $1” (in Cocoa, at least)
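The same phrases behave this way in Python, sketched below with made-up sample strings. One difference from Cocoa to keep in mind: Python's replacement strings spell the first captured group \1 rather than $1.

```python
import re

# \. is a literal dot; the parentheses capture the part to reuse.
match = re.search(r"([a-z][a-z]*)\.com", "Visit example.com today")
domain = match.group(1)
print(domain)  # example

# \( and \) are literal parentheses; the inner ( ) is the captured group.
bracketed = re.sub(r"\(([0-9][0-9]*)\)", r"[\1]", "call (42) now")
print(bracketed)  # call [42] now

# \* is a literal asterisk.
plain = re.sub(r"\*", "", "*bold*")
print(plain)  # bold
```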
45. Beware Greedy Matching
Remember this?
NSString *domainName = [myHTMLString
    stringByReplacingRegexPattern:@"^.*href=[\"']http://(.*)/.*$"
                       withString:@"$1"
                  caseInsensitive:YES];
What does it do if given:
<a href="http://1.example.com/">This is a link</a> but <a href="http://2.example.com/">This is a link, too.</a>
48. What you meant was:
After ‘http://’ up to but not including the next ‘/’
Which is:
http://([^/][^/]*)/
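To see the difference concretely, here is a Python sketch that runs the greedy pattern and the "what you meant" pattern over the two-link input. It uses Python's re module as a stand-in for the Cocoa call, so treat the exact greedy output as what Python produces rather than a statement about any particular Cocoa regex library.

```python
import re

html = ('<a href="http://1.example.com/">This is a link</a> but '
        '<a href="http://2.example.com/">This is a link, too.</a>')

# The greedy pattern: each .* grabs as much as it can before giving any back.
greedy = re.sub(r"""^.*href=["']http://(.*)/.*$""", r"\1", html,
                flags=re.IGNORECASE)
print(greedy)  # 2.example.com/">This is a link, too.<

# What you meant: after 'http://' up to but not including the next '/'.
hosts = re.findall(r"http://([^/][^/]*)/", html)
print(hosts)   # ['1.example.com', '2.example.com']
```

The negated character class cannot run past a slash, so there is nothing for the engine to over-grab.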
49. Remember this?
(?<!(=)|(="")|(='))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-\.@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?!.*/a>)
Well, actually, he has n! problems, where n is the number of hyperlinks in the input string
50. So if you had
<p>Today’s Links:</p>
<UL>
<LI><A HREF="http://example.com/1">Link 1</A></LI>
<LI><A HREF="http://example.com/2">Link 2</A></LI>
<LI><A HREF="http://example.com/3">Link 3</A></LI>
<LI><A HREF="http://example.com/4">Link 4</A></LI>
<LI><A HREF="http://example.com/5">Link 5</A></LI>
<LI><A HREF="http://example.com/6">Link 6</A></LI>
</UL>
51. And tried to use:
(?<!(=)|(="")|(='))(((http|ftp|https)://)|(www\.))+[\w]+(\.[\w]+)([\w\-\.@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?!.*/a>)
52. It would have to:
<p>Today’s Links:</p>
<UL>
<LI><A HREF="http://example.com/1">Link 1</A></LI>
<LI><A HREF="http://example.com/2">Link 2</A></LI>
<LI><A HREF="http://example.com/3">Link 3</A></LI>
<LI><A HREF="http://example.com/4">Link 4</A></LI>
<LI><A HREF="http://example.com/5">Link 5</A></LI>
<LI><A HREF="http://example.com/6">Link 6</A></LI>
</UL>
59. And so on:
<p>Today’s Links:</p>
<UL>
<LI><A HREF="http://example.com/1">Link 1</A></LI>
<LI><A HREF="http://example.com/2">Link 2</A></LI>
<LI><A HREF="http://example.com/3">Link 3</A></LI>
<LI><A HREF="http://example.com/4">Link 4</A></LI>
<LI><A HREF="http://example.com/5">Link 5</A></LI>
<LI><A HREF="http://example.com/6">Link 6</A></LI>
</UL>
64. But what are they good for?
Encoding/decoding metadata from image file names.
Renaming files on the command line (@2x?)
Grabbing the user’s first name from a Full Name string (careful of Locales*)
Stripping crap I don’t want out of user input (trailing spaces, anyone?)
//.*\[.* *release *\] *;
*See http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
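That last pattern reads as "a // comment that contains a bracketed release call". Below is a Python sketch of what it picks up, assuming the brackets in the pattern are escaped so they match literally. The Objective-C lines are invented sample input.

```python
import re

# A // comment, then [ ... release ] ; with optional spaces around the pieces.
COMMENTED_RELEASE = re.compile(r"//.*\[.* *release *\] *;")

source_lines = [
    "[name release];",          # live code: no // in front, so no match
    "// [oldName release];",    # commented-out release
    "//[cache   release ] ;",   # same, with odd spacing
    "// nothing to see here",   # comment without a release call
]

hits = [line for line in source_lines if COMMENTED_RELEASE.search(line)]
print(hits)  # ['// [oldName release];', '//[cache   release ] ;']
```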
This is not a talk about every possible thing you can do with regular expressions. In fact, it’s exactly the opposite. This is about how to do a useful thing and do it without going crazy.
So before I get too far, how many of you know what a regular expression is? How many have used them before? How many feel comfortable with them?
So here’s a quick example, just so those of you who haven’t touched them have an idea what I’m talking about between now and when we dig into examples later on.
Well, it depends...you see...
I’m saying OOish because I have issues with Perl’s OO, but that’s another talk. I went from Basic to Pascal to C to Perl (to C++ to Lisp to Java to Ruby to Objective-C). I started learning Perl in 1989 or so, and it was exactly what I needed at the time: a language that was really good at exactly what C made very painful, string handling. I have better alternatives than Perl now, but it taught me regexes.
This is an example of a usage in a language where a Regex is a first-class citizen.
This is a WTF. And it brings to mind a bunch of questions...
and the most often asked question in Cocoa Regex...
This is better (but you have to do the #import).
re.match in Python implicitly anchors you to the beginning of a string. This is hideous.
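A short Python sketch of that anchoring behavior, using the host pattern from earlier in the talk on a made-up string:

```python
import re

text = "see http://www.example.com/ for details"
pattern = r"http://([^/]*)/"

anchored = re.match(pattern, text)   # only tries to match at position 0
anywhere = re.search(pattern, text)  # scans forward for the first match

print(anchored)           # None
print(anywhere.group(1))  # www.example.com
```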
Well, I’d say no. I use them all the time.
This is an actual regex I found in a program I was once asked to find the performance problem in.
This is unmaintainable, and worse...
We’ll come back to this one later.
Let me do a quick phrasebook first.
You can (and should) put whatever characters you are looking for in square brackets. If you omit the first [0-9] you might match nothing. Likewise, in the second part [^0-9] means “anything that isn’t a number”.
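The "might match nothing" point is easy to miss, so here is a small Python sketch of it. The sample string is made up.

```python
import re

text = "abc123"

# Without the leading [0-9], zero digits is a perfectly good match,
# so the search succeeds immediately at position 0 with an empty string.
empty = re.search(r"[0-9]*", text).group(0)
print(repr(empty))   # ''

# With it, the match has to contain at least one digit.
digits = re.search(r"[0-9][0-9]*", text).group(0)
print(repr(digits))  # '123'
```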
Anything else that you see that’s special (like ‘^’ or ‘\\’) gets matched with a ‘\’ in front of it, too.
I mean it, I’m done.
But there’s all these other characters...
Can you tell the difference between ‘w’ and ‘W’ every time, without looking? Can you promise you’ll never get confused about whether ‘w’ means ‘word’ or ‘whitespace’?
Maximize the utility of your investment. There is a ‘+’ operator that *sometimes* means “one or more” (so :+ is like ::*). + works in Cocoa, but not in grep. If you stick to the ones that are the same everywhere, you will get more use out of them and be less confused. Same with .*? to handle greedy matching.
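A Python sketch of the point: on an engine that supports both spellings, the portable form and the shorthand find the same things, so sticking to the portable form costs nothing. The sample strings are made up.

```python
import re

samples = ["a:b", "a::b", "a:::::b", "ab"]

portable = re.compile(r"::*")  # one colon, then zero or more colons
shorthand = re.compile(r":+")  # one or more colons, where + is supported

same_everywhere = all(portable.findall(s) == shorthand.findall(s)
                      for s in samples)
print(same_everywhere)  # True

# Likewise, a negated class does the job of the non-greedy .*? here.
url = "http://www.example.com/a/b"
with_class = re.search(r"http://([^/]*)/", url).group(1)
with_lazy = re.search(r"http://(.*?)/", url).group(1)
print(with_class, with_lazy)  # www.example.com www.example.com
```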
Note: regexes don’t parse HTML/XML “correctly”, so be careful.
You get the HTML between the links, don’t you?
Although you can use .*? at least on some platforms.
This code was used in production on a project I was asked to consult on, in a Content Management System (of sorts), to detect links that should be clickable on a web page, but weren’t, and make them clickable.
And the customer fed that Content Management System a big list of links.
Note it’s looking at http followed by :// followed by stuff, then anything, then /A.
The regex library grabs the longest string it can, first, to see if that’s a match (because it’s supposed to be greedy),
then, when that doesn’t match, the next longest string,
and so on,
and then, when it’s exhausted the shortest string for that beginning match,
it does it again for the next beginning match it finds,
and so on there. BAD IDEA.
When I’m doing Core Data on the iPhone, the images go in a directory (NEVER in the DB!!), and I put info I might need (like when I should refresh it) in the image name, so I can do maintenance without having to ask the DB.
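A sketch of the file-name idea in Python. The naming scheme here (an id, then _refresh- and a YYYYMMDD date, then .png) is a hypothetical one chosen for the example; the talk does not say what scheme was actually used.

```python
import re

# Hypothetical naming scheme (not from the talk): <id>_refresh-<YYYYMMDD>.png
NAME_PATTERN = re.compile(r"^([0-9][0-9]*)_refresh-([0-9][0-9]*)\.png$")


def encode_name(image_id, refresh_date):
    """Build a file name that carries the id and the refresh date."""
    return "{}_refresh-{}.png".format(image_id, refresh_date)


def decode_name(filename):
    """Return (image_id, refresh_date), or None if the name does not fit."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


name = encode_name(42, "20110310")
print(name)               # 42_refresh-20110310.png
print(decode_name(name))  # (42, '20110310')
```

With something like this, a maintenance pass can list the directory and decide what to refresh from the names alone.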
And coming up next, my current favorite to use in Xcode’s search project box...
Which, of course, means the price just went up 25%. Once you get comfortable with them, you start to see chances to use them everywhere.