Regular Expressions (RegEx) for SEO

The Coolest Pattern Matching Search Language...

Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant

For Powered by Search Internal | October 2013

We’re in business because we believe that great brands need
both voice and visibility in order to connect people with what
matters.
A boutique, full-service digital marketing agency in Toronto,
Powered by Search is a PROFIT HOT 50-ranked agency that
delivers search engine optimization, pay per click advertising,
local search, social media marketing, and online reputation
management services.

RegEx Basics
Practical SEO Uses
RegEx Puzzles for Homework

RegEx Basics
Use Sublime Text

This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too.
It’s the text editor you’ll fall in love with.

RegEx Basics
Literal Matching
Text
I want to match this.
RegEx
match this

RegEx matches literal strings. This is like running a normal search in Word.
Pretty cool, huh?

RegEx Basics
Anchors
Text
I want this, I want that, I want I want I want
RegEx (start of line)
^I want
RegEx (end of line)
I want$

There are a couple of special characters called “Anchors.” The caret (^) represents
the beginning of a line. The dollar sign ($) represents the end of a line.
You see these a lot in .htaccess files.
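Here's a minimal sketch of both anchors in Python's re module (an assumption; the slides use Sublime Text). It runs the two patterns against the sample text and records where each match starts.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the start of the line, so only the first "I want" counts.
start_matches = [m.start() for m in re.finditer(r"^I want", text)]

# $ pins the match to the end of the line, so only the last "I want" counts.
end_matches = [m.start() for m in re.finditer(r"I want$", text)]

# Without anchors, every occurrence matches.
all_matches = re.findall(r"I want", text)

print(start_matches, end_matches, len(all_matches))
```

Each anchored pattern finds exactly one match even though the phrase appears several times.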

RegEx Basics
Special Characters
There is also a series of other special characters. These are:
• [ - Starts a Character Class (More Later)
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.
To match any of these literal characters, put a backslash in front of it.
This also applies to ?+*^$ which we’ve talked about or will get to later.
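A quick sketch of why the backslash matters, using Python's re module (an assumption; any RegEx flavour behaves the same way here). The string lettXme is a made-up test value.

```python
import re

# Unescaped, the dot is a wildcard, so the pattern also accepts "lettXme".
wildcard_hit = re.fullmatch(r"lett.me", "lettXme") is not None

# Escaped, the dot only matches a literal dot.
escaped_hit = re.fullmatch(r"lett\.me", "lettXme") is not None
literal_hit = re.fullmatch(r"lett\.me", "lett.me") is not None

# re.escape adds the backslashes for you when the text comes from a list.
escaped_pattern = re.escape("lett.me")
auto_hit = re.fullmatch(escaped_pattern, "lett.me") is not None

print(wildcard_hit, escaped_hit, literal_hit, auto_hit)
```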

RegEx Basics
Quantifiers
A quantifier tells the expression how many times to match the
expression before it.
Text
Ahhhhhhhhhhh. A spider.
RegEx
A[h]+
• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times
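To see the quantifiers side by side, here's a small Python sketch (Python's re module is an assumption, not the tool used in the slides). The patterns drop the square brackets around h, which are optional for a single character.

```python
import re

text = "Ahhhhhhhhhhh. A spider."
scream = text.split(".")[0]  # the full "Ahhh..." run before the first period

one_or_more = re.findall(r"Ah+", text)      # needs at least one h
zero_or_more = re.findall(r"Ah*", text)     # also accepts the bare "A" in "A spider"
zero_or_one = re.findall(r"Ah?", text)      # takes one h at most
exactly_three = re.findall(r"Ah{3}", text)  # exactly three h's
two_to_four = re.findall(r"Ah{2,4}", text)  # greedy, so it takes four

print(one_or_more, zero_or_more, zero_or_one, exactly_three, two_to_four)
```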

RegEx Basics
Greedy vs. Non-Greedy
Quantifiers are greedy by default. A quantifier repeats the expression
before it as many times as possible before giving up and continuing
with the rest of the RegEx.
This leads to unexpected matches. To make the quantifiers *, + and {} non-greedy (lazy), just add a question mark after them.
Text
<p>test</p>
RegEx (Greedy)
<.+>
RegEx (Lazy)
<.+?>
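The difference is easiest to see by running both patterns on the sample text. A minimal sketch in Python (the re module is an assumption; the behaviour is the same in Sublime Text):

```python
import re

html = "<p>test</p>"

# Greedy: .+ runs to the last > it can find, so the whole line is one match.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first > it can find, so each tag is its own match.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```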

RegEx Basics
Variations / Character Classes []
A variation is a set of literal characters that can possibly fill a space. For
example:
Text
Well then I’m better than you.
RegEx
th[ea]n
The characters in the variation aren’t a GROUP. What the following
RegEx is telling the computer is, “Find any one of: a t, an h, an e, an n,
a pipe, a t, an h, an a, or an n.” That’s not what we want.
RegEx
[then|than]

RegEx Basics
Groups ()
In the case above, we could use a group to solve our problem.
RegEx
(then|than)
A group isn’t the best answer here, though, because th[ea]n is shorter. Groups are for alternation and/or quantification.
Text
I like redredred apples.
RegEx
(blue|green|red)+
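Here's a short Python sketch (the re module is an assumption) that compares the character class, the mistaken class, and the group on the sample sentences. The (?:...) form is a non-capturing group, used only so findall returns whole matches.

```python
import re

text = "Well then I'm better than you."

# A character class fills ONE position with any listed character.
with_class = re.findall(r"th[ea]n", text)

# Putting whole words in a class only ever matches single characters.
mistake = re.findall(r"[then|than]", text)

# A group with | chooses between whole alternatives.
with_group = re.findall(r"(?:then|than)", text)

# A quantifier after a group repeats the whole group.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.").group(0)

print(with_class, with_group, repeated)
print(mistake)
```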

RegEx Basics
Variables / Captured Groups $1
When you use a group, it captures the text it matched in a numbered
variable. The variables count up from $1. You can use a variable when doing a
find-replace.
Text
https://www.searchersforbeerfridges.com/?vote_number=9001
RegEx Find
.+?//(.*?)/.*
RegEx Replace
$1
New Text
www.searchersforbeerfridges.com
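The same find-replace can be sketched in Python (an assumption; the slides use an editor). One difference to note: Python's re.sub writes the captured group as \1 instead of $1.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 captures everything between "//" and the next "/".
# The pattern consumes the whole URL, so the replacement leaves only the capture.
host = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(host)
```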

Practical SEO Uses
Google Analytics – Branded Organic
In Analytics I often want to find branded organic search traffic. Let’s
look at the GWT data in Analytics for our fictional client, Lett.Me.

Lett Me has a ton of common mis-typings and variations. They
get traffic from lm, lm.com, let me, lettme.com, letme.com,
let.me, and lett.me.
What’s the regular expression that captures all of that?

Practical SEO Uses
Google Analytics – Branded Organic
Here’s the regular expression I came up with. It matches some funky
cases like let me.com but that’s fine:
RegEx Find
(lm|let[t]?[ ]?[.]?me)(\.com)?
You can also remove the square brackets, but I feel like it’s easier to
read with them in. Without them the dot has to be escaped, and it looks like this:
RegEx Find
(lm|let{1,2} ?\.?me)(\.com)?
Now just save this RegEx in your reporting document and you’ll never
have to type out the whole thing again.
Imagine what this could do for reporting on keyword groups!
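Before saving a pattern like this, it's worth checking it against the variations you care about. A minimal sketch in Python (an assumption, since the slides target Analytics), with the dots escaped; the non-branded keywords are made-up test values.

```python
import re

bracketed = re.compile(r"(lm|let[t]?[ ]?[.]?me)(\.com)?")
compact = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

branded = ["lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me"]
not_branded = ["letter", "let them", "lettuce"]  # hypothetical keywords

# fullmatch requires the whole keyword to fit the pattern.
bracketed_hits = [k for k in branded + not_branded if bracketed.fullmatch(k)]
compact_hits = [k for k in branded + not_branded if compact.fullmatch(k)]

# If a tool matches anywhere in the keyword, as re.search does,
# the short "lm" alternative also fires inside words like "film".
loose_hit = bracketed.search("film") is not None

print(bracketed_hits == branded, compact_hits == branded, loose_hit)
```

If your reporting tool matches anywhere in the keyword, wrapping the pattern in ^ and $ is one way to avoid the "film" case.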

Practical SEO Uses
Trim To Root
Trim to Root using Find Replace. Here’s the list:
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2
What’s the RegEx?
RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1
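Here's the Trim To Root find-replace as a Python sketch (an assumption; in Python the replacement is written \1 instead of $1). It runs over the three URLs from the list.

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/?pg=2",
]

# Capture everything between "//" and the first "/" that follows it.
TRIM = re.compile(r"^.*?//(.*?)/.*")

roots = [TRIM.sub(r"\1", url) for url in urls]

print(roots)
```

The pattern needs a slash after the domain, so a URL without one is left unchanged.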

Practical SEO Uses
Fixing HTML – Nested Tags
I commonly get improperly formatted HTML. Here’s an example:
<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>
I want to remove all of the empty tags.
What’s the RegEx?
RegEx Find
<[a-z0-9]{1,6}></[a-z0-9]{1,6}>
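RegEx Replace
(leave empty; repeat the replace until nothing changes, because removing one empty tag can leave its parent empty)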
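A sketch of the same clean-up in Python (an assumption; the helper name strip_empty_tags is made up for illustration). It repeats the replace until the text stops changing, which handles empty tags nested inside other empty tags.

```python
import re

html = """<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>"""

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")

def strip_empty_tags(markup):
    """Remove empty tag pairs, repeating until no more are found."""
    previous = None
    while previous != markup:
        previous = markup
        markup = EMPTY_TAG.sub("", markup)
    return markup

cleaned = strip_empty_tags(html)
print(cleaned)
```

The pattern doesn't check that the opening and closing tag names agree, so treat it as a quick fix for editor debris and not as an HTML validator.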

Practical SEO Uses
Top Level Domains
Find only .bs and .spam top level domains. Here’s the list:
http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/
What’s the RegEx?
RegEx Find
.*//(.*?)\.(bs|spam)/.*
RegEx Replace
$1
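To check the pattern against the list, here's a minimal Python sketch (an assumption; the dot before the TLD is escaped so it only matches a literal dot).

```python
import re

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# The TLD is whatever sits between the last dot of the domain and the first "/".
TLD = re.compile(r".*//(.*?)\.(bs|spam)/.*")

hits = [url for url in urls if TLD.match(url)]
names = [TLD.sub(r"\1", url) for url in hits]  # \1 is Python's spelling of $1

print(hits)
print(names)
```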

Practical SEO Uses
Finding Substrings in Domains
Does the domain contain the words “directory” or “article”? The list:
http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com
What’s the RegEx? (If you can match bonus articles without the trailing
slash, I salute you!)
RegEx Find
^.*?//.*(directory|article).*?(/|\..{2,3}$).*
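Here's a Python sketch (an assumption) that runs the answer over the list, with the dot before the TLD escaped. It also tries a stricter variant of my own, not from the slides, that refuses to look past the first slash after the domain.

```python
import re

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

# The slide's answer: the keyword must be followed by a "/" or by a TLD at end of line.
ANSWER = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

# Hypothetical stricter variant: [^/]* cannot cross into the path at all.
HOST_ONLY = re.compile(r"^.*?//[^/]*(directory|article)")

answer_hits = [url for url in urls if ANSWER.match(url)]
host_only_hits = [url for url in urls if HOST_ONLY.match(url)]

print(answer_hits)
print(answer_hits == host_only_hits)
```

The two agree on this list. They would differ on a path like /link-directory/ with a trailing slash, which the slide's answer would also match.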

Practical SEO Uses
Merging Lists
Does the list of URLs contain domains we’ve already disavowed?
Say we’re doing a reconsideration request and we don’t want to
consider any of the links we’ve already disavowed. So, we have List A,
new links with some old links mixed in, that we want cleansed of any of
the domains in List B. It’s a whole process. What do you think it is?
List A
http://globeandmail.com/
http://directorylinks.com/?id=1
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

List B
http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/

Practical SEO Uses
Merging Lists
First I’d use one of the tricks we learned already to trim List B down to
bare domains, which are easier to manipulate.
What do you think the RegEx F/R is to get that?
RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1

Practical SEO Uses
Merging Lists
Great. Now, we’ve learned how to search for substrings (“string” is a
substring of “substrings,” if that isn’t confusing). How might we turn List
B into a single set of alternatives that we can search through List A
with? A tip: \n is the newline character and you need it.
What’s the RegEx?
List B
directorylinks.com
spam.com
mafia-wars.com
192.233.111
RegEx Find
\n
RegEx Replace
|

Practical SEO Uses
Merging Lists
If you did it right, you should have what I’ve currently listed under List
B. What’s the final step we need to be able to search List A with the
substrings in List B?

List B
directorylinks.com|spam.com|mafia-wars.com|192.233.111
RegEx Find
.*(directorylinks\.com|spam\.com|mafia-wars\.com|192\.233\.111).*
Matches in List A
http://directorylinks.com/?id=1
http://192.233.111/
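The whole Merging Lists process can be sketched in a few lines of Python (an assumption; the slides do it by hand in an editor). The split of URLs between List A and List B below is reconstructed from the slides, so treat the exact lists as an assumption too.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim List B to bare domains (Trim To Root; \1 is Python's $1).
domains = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in list_b]

# Step 2: escape the dots, then turn the newline-separated list into alternatives.
one_per_line = "\n".join(re.escape(d) for d in domains)
alternatives = re.sub(r"\n", "|", one_per_line)

# Step 3: wrap the alternatives in a group and search List A with it.
disavowed = re.compile(".*(" + alternatives + ").*")
already_disavowed = [url for url in list_a if disavowed.match(url)]
clean = [url for url in list_a if not disavowed.match(url)]

print(already_disavowed)
print(clean)
```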

Practical SEO Uses
Finding Client Anchor in HTML
Screaming Frog lets you use Regular Expressions in your searches. One
use of this feature is finding out whether someone is actually
linking to your website, because all legitimate anchors share the
same format.
<a (any or no attributes) href=“any variation of your URL” (any or no
attributes)>(possible other tags)anchor text(possible other tags)</a>
In the attached HTML document, find all 3 links to Mooz.com.
Bonus: Find only the 2 links to Mooz.com that contain the anchor text,
“Cow Melk” or “Milk.”
RegEx Find
<a.{0,100}href=.{0,100}mooz\.com
RegEx Find (Bonus)
<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
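Here's a Python sketch of both searches (an assumption; the slides use Screaming Frog). The attached HTML document isn't reproduced here, so the sample page below is hypothetical, with one link per line.

```python
import re

# Hypothetical sample page: three links to mooz.com, one decoy link, one plain mention.
page = """<a href="http://www.mooz.com/">Cow Melk</a>
<a class="x" href="https://mooz.com/products" rel="nofollow"><b>Milk</b></a>
<a href="http://mooz.com/about">About us</a>
<a href="http://example.com/">Milk substitutes</a>
<p>Visit mooz.com for Milk</p>"""

# Any anchor whose href points at mooz.com, within a 100-character window.
ANY_LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")

# The same, but the anchor text must contain one of the target phrases.
WITH_ANCHOR = re.compile(r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)")

lines = page.splitlines()
links = [line for line in lines if ANY_LINK.search(line)]
targeted = [line for line in lines if WITH_ANCHOR.search(line)]

print(len(links), len(targeted))
```

The {0,100} windows keep a match from wandering too far, but they are a rough limit, so spot-check the results on real pages.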

Resources
Sample HTML
https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing
Sample URLs
https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing

Puzzles
Some Puzzles:
• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.

No Sub-Domains
Show only the domain, no sub-domain, with a find-replace.
https://russian.eu/
http://screw.you.regex.net/
What’s the RegEx?
RegEx Find
^.*?//(.*\.)*(.*)\.(.{2,3})/.*
RegEx Replace
$2.$3
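The No Sub-Domains answer as a Python sketch (an assumption; Python writes the replacement as \2.\3 instead of $2.$3). The dots that separate the parts of the domain are escaped.

```python
import re

urls = [
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 swallows any sub-domains, group 2 is the domain name, group 3 is the TLD.
NO_SUBDOMAIN = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

domains = [NO_SUBDOMAIN.sub(r"\2.\3", url) for url in urls]

print(domains)
```

The {2,3} limit on the TLD is the slide's own simplification, so longer or two-part TLDs need a different pattern.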

Blog or RSS
In the attached sample-urls.txt, find all links that are obviously from a
blog or RSS feed.
What’s the RegEx?
RegEx Find
.*(/blog|/article|feed\.|/feed).*
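A small Python sketch of the Blog or RSS search (an assumption). The attached sample-urls.txt isn't reproduced here, so this list mixes URLs from the other puzzles with two hypothetical example.com feed URLs.

```python
import re

urls = [
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cio.com/article/738249/",
    "http://feed.example.com/latest",   # hypothetical
    "http://example.org/feed",          # hypothetical
    "http://www.clicktivist.org/",
    "http://www.cansinmert.com/",
]

# "feed\." catches a feed sub-domain; the other alternatives catch path segments.
BLOG_OR_RSS = re.compile(r".*(/blog|/article|feed\.|/feed).*")

blog_links = [url for url in urls if BLOG_OR_RSS.match(url)]

print(blog_links)
```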

Comma Separated Domains
Format a list of links as domains in a comma separated list. The links:
http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/
Should be:
www.domain.com, www.domain2.com, etc.
What’s the RegEx?
RegEx Find
(|\n).*//(.*?)/.*
Replace With (note the space after the comma)
$2, 
Then delete the trailing comma and space.
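Here's the Comma Separated Domains puzzle sketched in Python (an assumption; \2 is Python's spelling of $2). The domain group is lazy so that it stops at the first slash instead of running into the path.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# Group 1 eats the newline before each line (or nothing on the first line),
# group 2 is the domain; every line collapses to "domain, ".
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma and space.
result = joined.rstrip(", ")

print(result)
```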

Thanks for Hanging Out

Stay in Touch
Twitter: @troyfawkes
Google+: http://gplus.to/TroyFawkes
Email: troy@poweredbysearch.com
www.poweredbysearch.com
www.troyfawkes.com
