Regular Expressions are highly technical. This training covers the basics of RegEx and also gives examples of how to use it.
Take some time to go through each example and try to figure it out on your own.
1. Regular Expressions for SEO
The Coolest Pattern Matching Search Language...
Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant
For Powered by Search Internal | October 2013
2. We're in business because we believe that great brands need
both voice and visibility in order to connect people with what
matters.
A boutique, full-service digital marketing agency in Toronto,
Powered by Search is a PROFIT HOT 50-ranked agency that
delivers search engine optimization, pay per click advertising,
local search, social media marketing, and online reputation
management services.
7. RegEx Basics
Use Sublime Text
This is the sexiest text editor / IDE you'll ever use. It's lightweight, too.
It's the text editor you'll fall in love with.
8. RegEx Basics
Literal Matching
Text
I want to match this.
RegEx
match this
RegEx matches literal strings. This is like running a normal search in Word.
Pretty cool, huh?
9. RegEx Basics
Anchors
Text
I want this, I want that, I want I want I want
RegEx
^I want
Text
I want this, I want that, I want I want I want
RegEx
I want$
There are a couple of special characters called "anchors." The caret (^) represents
the beginning of a line. The dollar sign ($) represents the end of a line.
You see these a lot in .htaccess files.
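The two anchors can be tried out in a few lines of Python. This is only a sketch: the deck works in a text editor, and using Python's standard `re` module here is an assumption made for illustration.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the pattern to the start of the line, so only the first "I want" matches.
start = re.search(r"^I want", text)

# $ pins the pattern to the end of the line, so only the last "I want" matches.
end = re.search(r"I want$", text)

print(start.span())             # position of the match at the start
print(end.span())               # position of the match at the end
print(end.end() == len(text))   # the $ match finishes exactly at the end of the text
```

Without an anchor, the same pattern would match every "I want" in the line.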
10. RegEx Basics
Special Characters
There are also a series of other special characters. These are:
• [ - Starts a character class (more later).
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.
To match any of these as a literal character, put a backslash in front of it.
This also applies to ?+*^$ which we've talked about or will get to later.
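As a rough illustration of escaping, the sketch below compares an unescaped dot with an escaped one. The sample sentence is made up for this example, and Python's `re` module is an assumption (the deck itself uses a text editor).

```python
import re

# Hypothetical sample text, not from the deck.
text = "Is lett.me the same as lettXme? (No.)"

# Unescaped, the dot is a wildcard, so it also matches the "X".
loose = re.findall(r"lett.me", text)

# With a backslash in front, the dot only matches a literal dot.
strict = re.findall(r"lett\.me", text)

# re.escape adds the backslashes for you when the literal text comes from elsewhere.
found = re.search(re.escape("(No.)"), text)

print(loose)
print(strict)
print(found.group())
```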
11. RegEx Basics
Quantifiers
A quantifier tells the expression how many times to match the
expression before it.
Text
Ahhhhhhhhhhh. A spider.
RegEx
A[h]+
• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times
12. RegEx Basics
Greedy vs. Non-Greedy
Quantifiers are greedy by default. A greedy quantifier repeats the
expression before it as many times as possible before giving up and continuing
with the rest of the RegEx.
This leads to unexpected matches. To make the quantifiers *, +, ? and {} non-greedy (lazy), just add a question mark after them.
Text
<p>test</p>
RegEx (Greedy)
<.+>
Text
<p>test</p>
RegEx (Lazy)
<.+?>
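The difference is easiest to see by running both patterns on the same text. A small sketch, assuming Python's `re` module:

```python
import re

html = "<p>test</p>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" it can, so each tag is matched on its own.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<p>test</p>']
print(lazy)    # ['<p>', '</p>']
```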
13. RegEx Basics
Variations / Character Classes []
A variation is a set of literal characters that can possibly fill a space. For
example:
Text
Well then I'm better than you.
RegEx
th[ea]n
The characters in the variation aren't a GROUP. What the following
RegEx is telling the computer is, "Find any one of: a t, an h, an e, an n, a pipe, a t,
an h, an a, or an n." That's not what we want.
Text
Well then I'm better than you.
RegEx
[then|than]
14. RegEx Basics
Groups ()
In the case above, we could use a group to solve our problem.
Text
Well then I'm better than you.
RegEx
(then|than)
A group isn't the best answer here (th[ea]n is shorter). Groups are for alternation and/or quantification.
Text
I like redredred apples.
RegEx
(blue|green|red)+
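The three patterns from the last two slides can be compared side by side. A sketch under the assumption that Python's `re` module stands in for the text editor; `finditer` with `group(0)` is used so that the whole match is shown rather than only the captured group.

```python
import re

text = "Well then I'm better than you."

# Character class: one position that may be "e" or "a".
with_class = re.findall(r"th[ea]n", text)

# The mistaken class: every character inside [] is a single-character option,
# so this matches lone letters (and a literal pipe), not whole words.
mistaken = re.findall(r"[then|than]", text)

# Group with alternation: whole-word options.
with_group = [m.group(0) for m in re.finditer(r"(then|than)", text)]

# Group with a quantifier: the whole group is repeated.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.").group(0)

print(with_class)   # ['then', 'than']
print(mistaken)     # a list of single characters
print(with_group)   # ['then', 'than']
print(repeated)     # redredred
```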
15. RegEx Basics
Variables / Captured Groups $1
When you use a group, it captures the information in a numbered
variable. They count up from $1. You can use the variable when doing a
find-replace.
Text
https://www.searchersforbeerfridges.com/?vote_number=9001
RegEx Find
.+?//(.*?)/.*
RegEx Replace
$1
New Text
www.searchersforbeerfridges.com
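The same find-replace can be sketched in Python. Note the assumption: Python's `re.sub` writes the captured group as `\1`, whereas the editor in the deck writes it as `$1`.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 lazily captures everything between "//" and the next "/".
# The pattern consumes the whole URL, so replacing it with \1 leaves only the host.
host = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(host)  # www.searchersforbeerfridges.com
```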
17. Practical SEO Uses
Google Analytics - Branded Organic
In Analytics I often want to find branded organic search traffic. Let's
look at the GWT data in Analytics for our fictional client, Lett.Me.
Lett Me has a ton of common mis-typings and variations. They
get traffic from lm, lm.com, let me, lettme.com, letme.com,
let.me, and lett.me.
What's the regular expression that captures all of that?
18. Practical SEO Uses
Google Analytics - Branded Organic
Here's the regular expression I came up with. It matches some funky
cases like let me.com but that's fine:
RegEx Find
(lm|let[t]?[ ]?[.]?me)(\.com)?
You can also remove the square brackets, but I feel like it's easier to
read with them in. Without them it looks like this:
RegEx Find
(lm|let{1,2} ?\.?me)(\.com)?
Now just save this RegEx in your reporting document and you'll never
have to type out the whole thing again.
Imagine what this could do for reporting on keyword groups!
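To check the expression against the variations listed for Lett.Me, something like the following could be used. It is a sketch only: the dot before "com" is escaped here, the last query is a made-up non-branded example, and Python's `re.search` (which looks for the pattern anywhere in the query) is an assumption about how the filter is applied.

```python
import re

BRANDED = re.compile(r"(lm|let[t]?[ ]?[.]?me)(\.com)?")

queries = [
    "lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me",
    "apartments toronto",  # hypothetical non-branded query
]

# Keep the queries in which the branded pattern appears.
branded = [q for q in queries if BRANDED.search(q)]

print(branded)
```

Because the pattern is not anchored, a query that merely contains "lm" (for example "film reviews") would also count as branded. Adding ^ and $ tightens it if that matters.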
19. Practical SEO Uses
Trim To Root
Trim to Root using Find Replace. Here's the list:
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2
What's the RegEx?
20. Practical SEO Uses
Trim To Root
RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1
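Applied to the three URLs, the find-replace could look like this in Python. A sketch using the pattern `^.*?//(.*?)/.*`; Python writes the captured group as `\1` where the editor writes `$1`.

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/?pg=2",
]

# Skip the scheme, capture up to the first "/" after "//", and drop the rest.
roots = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in urls]

print(roots)  # ['www.georgebrown.com', 'blog.russian.me', 'russian.eu']
```

A URL with no slash after the host does not match the pattern at all, so it would be left unchanged.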
21. Practical SEO Uses
Fixing HTML - Nested Tags
I commonly get improperly formatted HTML. Here's an example:
<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>
I want to remove all of the empty tags.
What's the RegEx?
22. Practical SEO Uses
Fixing HTML - Nested Tags
RegEx Find
<[a-z0-9]{1,6}></[a-z0-9]{1,6}>
RegEx Replace
(leave empty)
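Replacing each empty pair with nothing can be sketched as below. The loop is an addition for illustration and not part of the slide: it repeats the replacement so that a tag which only becomes empty after its inner tags are removed is caught as well. The function name `strip_empty_tags` is hypothetical.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")

html = """<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>"""


def strip_empty_tags(markup):
    """Remove empty open/close pairs until a pass changes nothing."""
    while True:
        cleaned = EMPTY_TAG.sub("", markup)
        if cleaned == markup:
            return cleaned
        markup = cleaned


result = strip_empty_tags(html)
print(result)
```

The pattern does not check that the opening and closing names agree, so a mismatched pair such as `<b></i>` would be removed too.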
23. Practical SEO Uses
Top Level Domains
Find only .bs and .spam top level domains. Here's the list:
http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/
What's the RegEx?
24. Practical SEO Uses
Top Level Domains
RegEx Find
.*//(.*?)\.(bs|spam)/.*
RegEx Replace
$1
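A quick way to confirm which URLs in the list the expression picks out. This is a sketch with the dot before the TLD escaped (`\.`), so that it only matches a literal dot; Python's `re` module is assumed.

```python
import re

TLD = re.compile(r".*//(.*?)\.(bs|spam)/.*")

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# A URL is flagged only when ".bs" or ".spam" is immediately followed by a slash.
flagged = [url for url in urls if TLD.match(url)]

print(flagged)
```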
25. Practical SEO Uses
Finding Substrings in Domains
Does the domain contain the words "directory" or "article"? The list:
http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com
What's the RegEx? (If you can match bonus articles without the trailing
slash, I salute you!)
26. Practical SEO Uses
Finding Substrings in Domains
RegEx Find
^.*?//.*(directory|article).*?(/|\..{2,3}$).*
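The expression can be tested against the five URLs as follows. A sketch, assuming Python's `re` module and a literal dot (`\.`) in front of the two-or-three-character ending; that ending is what lets the last URL match without a trailing slash.

```python
import re

IN_DOMAIN = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

# The keyword must be followed, somewhere later, by a slash or by a dot plus a
# short ending at the very end of the line, which puts it before the path.
hits = [url for url in urls if IN_DOMAIN.match(url)]

print(hits)
```

It is a heuristic: a path such as /link-directory.html also ends in a dot plus a short ending and would slip through.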
27. Practical SEO Uses
Merging Lists
Does the list of URLs contain domains we've already disavowed?
Say we're doing a reconsideration request and we don't want to
consider any of the links we've already disavowed. So, we have List A,
new links with some old links mixed in, that we want cleansed of any of
the domains in List B. It's a whole process. What do you think it is?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/
List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
28. Practical SEO Uses
Merging Lists
First I'd use one of the tricks we learned already to format List B in an
easier to manipulate way: keep only the host part of each URL (bolded on the original slide).
What do you think the RegEx F/R is to get that?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/
List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
29. Practical SEO Uses
Merging Lists
RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1
30. Practical SEO Uses
Merging Lists
Great. Now, we've learned how to search for substrings (string is a
substring of substrings, if that isn't confusing). How might we turn List
B into a set of variations of substrings that we can search through List A
with? A tip: \n is the newline character and you need it.
What's the RegEx?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/
List B
directorylinks.com
spam.com
mafia-wars.com
192.233.111
31. Practical SEO Uses
Merging Lists
RegEx Find
\n
RegEx Replace
|
32. Practical SEO Uses
Merging Lists
If you did it right, you should have what I've currently listed under List
B. What's the final step we need to be able to search List A with the
substrings in List B?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/
List B
directorylinks.com|spam.com|mafia-wars.com|192.233.111
33. Practical SEO Uses
Merging Lists
Wrap the joined List B in a group, put .* on either side, and search List A with it:
RegEx Find
.*(directorylinks.com|spam.com|mafia-wars.com|192.233.111).*
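The whole Merging Lists process can be sketched end to end. Assumptions: Python's `re` module replaces the editor's find-replace, the variable names are made up, and the dots in the host names are left unescaped as on the slide (passing each host through `re.escape` would make the match stricter).

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim every List B URL down to its host.
hosts = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in list_b]

# Step 2: turn the newline-separated hosts into one pipe-separated string.
alternation = re.sub(r"\n", "|", "\n".join(hosts))

# Step 3: wrap the alternatives in a group and search List A with it.
disavowed = re.compile(".*(" + alternation + ").*")
clean = [url for url in list_a if not disavowed.match(url)]

print(alternation)
print(clean)
```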
34. Practical SEO Uses
Finding Client Anchor in HTML
Screaming Frog lets you use Regular Expressions in your searches. One
use of this feature is finding out whether or not someone is actually
linking to your website, because all legitimate anchors share the
same format.
<a (any or no attributes) href="any variation of your URL" (any or no
attributes)>(possible other tags)anchor text(possible other tags)</a>
In the attached HTML document, find all 3 links to Mooz.com.
Bonus: Find only the 2 links to Mooz.com that contain the anchor text,
"Cow Melk" or "Milk."
35. Practical SEO Uses
Finding Client Anchor in HTML
RegEx Find
<a.{0,100}href=.{0,100}mooz\.com
<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
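The attached HTML document is not reproduced here, so the sketch below runs both expressions on a few made-up lines instead. Everything in `html_lines` is hypothetical, the dot in mooz.com is escaped, and Python's `re` module is an assumption.

```python
import re

LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")
LINK_WITH_ANCHOR = re.compile(
    r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)"
)

# Hypothetical sample lines standing in for the attached document.
html_lines = [
    '<a class="x" href="http://www.mooz.com/">Cow Melk</a>',
    '<a href="https://mooz.com/dairy"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<a href="http://example.com/">Milk</a>',
    '<p>mooz.com is mentioned here without a link.</p>',
]

# Any link whose href points at mooz.com.
links = [line for line in html_lines if LINK.search(line)]

# Only the links that also carry one of the two anchor texts.
bonus = [line for line in html_lines if LINK_WITH_ANCHOR.search(line)]

print(len(links), len(bonus))
```

The {0,100} limits keep a match from wandering far past the tag it started in, but on a page with several links on one line they can still bridge from one tag to the next.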
37. RegEx Puzzles for Homework
Resources
Sample HTML
https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing
Sample URLs
https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing
38. RegEx Puzzles for Homework
Puzzles
Some Puzzles:
• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.
39. RegEx Puzzles for Homework
No Sub-Domains
Show only the domain, no sub-domain, with a find-replace.
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/
http://screw.you.regex.net/
What's the RegEx?
40. RegEx Puzzles for Homework
No Sub-Domains
RegEx Find
^.*?//(.*\.)*(.*)\.(.{2,3})/.*
RegEx Replace
$2.$3
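Run against the four URLs, the find-replace could be sketched like this. Assumptions: the dots that separate the parts of the host are escaped (`\.`), and Python's `\2.\3` stands in for the editor's `$2.$3`.

```python
import re

NO_SUBDOMAIN = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 soaks up any sub-domains, group 2 is the domain name,
# and group 3 is a two- or three-character ending.
domains = [NO_SUBDOMAIN.sub(r"\2.\3", url) for url in urls]

print(domains)  # ['georgebrown.com', 'russian.me', 'russian.eu', 'regex.net']
```

Two-part endings such as .co.uk are outside what this pattern handles.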
41. RegEx Puzzles for Homework
Blog or RSS
In the attached sample-urls.txt, find all links that are obviously from a
blog or RSS feed.
What's the RegEx?
42. RegEx Puzzles for Homework
Blog or RSS
RegEx Find
.*(/blog|/article|feed\.|/feed).*
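Since sample-urls.txt is not reproduced here, this sketch tries the expression on a handful of URLs instead. The first three appear elsewhere in the deck; the two example.com URLs are hypothetical. The dot after "feed" is escaped, and Python's `re` module is assumed.

```python
import re

BLOG_OR_FEED = re.compile(r".*(/blog|/article|feed\.|/feed).*")

urls = [
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cio.com/article/738249/",
    "http://www.cansinmert.com/",
    "http://feed.example.com/latest",  # hypothetical feed sub-domain
    "http://example.com/feed",         # hypothetical feed path
]

# Keep any URL that contains one of the blog or feed markers.
bloggy = [url for url in urls if BLOG_OR_FEED.match(url)]

print(bloggy)
```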
43. RegEx Puzzles for Homework
Comma Separated Domains
Format a list of links as domains in a comma separated list. The links:
http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/
Should be:
www.domain.com, www.domain2.com, etc.
What's the RegEx?
44. RegEx Puzzles for Homework
Comma Separated Domains
RegEx Find
(|\n).*//(.*?)/.*
Replace With
$2,
Then delete the trailing comma.
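The comma separated list can be produced in one substitution. This is a sketch under stated assumptions: the host group is lazy, `(.*?)`, so that URLs with several slashes are cut at the first one; the replacement adds a space after the comma to match the "Should be" format; and Python's `\2` stands in for the editor's `$2`.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# Group 1 swallows the newline in front of every line after the first,
# group 2 is the host; each line is replaced by "host, ".
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma (and the space after it).
result = joined.rstrip(", ")

print(result)
```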