Regular Expressions for SEO

The Coolest Pattern Matching Search Language...

Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant

For Powered by Search Internal | October 2013
We’re in business because we believe that great brands need
both voice and visibility in order to connect people with what
matters.
A boutique, full-service digital marketing agency in Toronto,
Powered by Search is a PROFIT HOT 50-ranked agency that
delivers search engine optimization, pay per click advertising,
local search, social media marketing, and online reputation
management services.
Some of our clients...

Featured in...
RegEx Basics
Practical SEO Uses
RegEx Puzzles for Homework
Regular Expressions for SEO
http://xkcd.com/
RegEx Basics
RegEx Basics
Use Sublime Text

This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too.
It’s the text editor you’ll fall in love with.
RegEx Basics
Literal Matching
Text
I want to match this.
RegEx
match this

RegEx matches literal strings. This is like running a normal search in Word.
Pretty cool, huh?
RegEx Basics
Anchors
Text
I want this, I want that, I want I want I want

Text
I want this, I want that, I want I want I want

RegEx
^I want

RegEx
I want$

There are a couple of special characters called “Anchors.” The caret (^) represents
the beginning of a line. The dollar sign ($) represents the end of a line.
You see these a lot in .htaccess files.
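Here's a quick sketch of the two anchors in Python's built-in re module (Python is my choice for illustration; the slides use a text editor's search box). Both patterns are the ones above.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the beginning of the line.
start = re.search(r"^I want", text)

# $ pins the match to the end of the line.
end = re.search(r"I want$", text)

print(start.span())  # where the first "I want" sits
print(end.span())    # where the last "I want" sits
```

Without the anchors, the same pattern would match all five occurrences of "I want".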
RegEx Basics
Special Characters
There are also a series of other special characters. These are:
• [ - Starts a Character Class (More Later)
• \ - Escapes or modifies the character after it.
• . - Wildcard. It represents any character.
• | - OR, so (this|that|the other) means this, that, or the other.
• ( - Starts a group.
• ) - Ends a group.
To match any of these literal characters, put a backslash in front of it.
This also applies to ?+*^$ which we’ve talked about or will get to later.
RegEx Basics
Quantifiers
A quantifier tells the expression how many times to match the
expression before it.
Text
Ahhhhhhhhhhh. A spider.
RegEx
A[h]+
• ? - Zero or one time
• + - One or more times
• {exactly} - Exactly this many times
• {min,max} - Between min and max times
• * - Zero or more times
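The quantifiers above can be tried out with a small Python sketch. The variable names are made up for illustration, and the patterns are small variations on the A[h]+ example.

```python
import re

text = "Ahhhhhhhhhhh. A spider."

# + : one or more "h" after the "A"
one_or_more = re.search(r"A[h]+", text).group()

# * : zero or more, so the bare "A" in "A spider" matches too
zero_or_more = re.findall(r"Ah*", text)

# ? : zero or one "h"
optional = re.search(r"Ah?", text).group()

# {3} : exactly three, {1,2} : between one and two
exactly_three = re.search(r"Ah{3}", text).group()
one_to_two = re.search(r"Ah{1,2}", text).group()

print(one_or_more, zero_or_more, optional, exactly_three, one_to_two)
```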
RegEx Basics
Greedy vs. Non-Greedy
Quantifiers are greedy by default. A greedy quantifier repeats the
expression as many times as possible before giving up and continuing
with the rest of the RegEx.
This leads to unexpected matches. To make the quantifiers *, + and {} non-greedy (lazy), just add a question mark after them.
Text
<p>test</p>

Text
<p>test</p>

RegEx (Greedy)
<.+>

RegEx (Lazy)
<.+?>
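To see the difference, here is a minimal Python sketch running both patterns against the same text.

```python
import re

html = "<p>test</p>"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">".
lazy = re.search(r"<.+?>", html).group()

print(greedy)  # <p>test</p>
print(lazy)    # <p>
```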
RegEx Basics
Variations / Character Classes []
A variation is a set of literal characters that can possibly fill a space. For
example:
Text
Well then I’m better than you.
RegEx
th[ea]n
The characters in the variation aren’t a GROUP. What the following
RegEx is telling the computer is, “Find any of: a t, an h, an e, a pipe, a t,
an h, an a, or an n.” That’s not what we want.
Text
Well then I’m better than you.
RegEx
[then|than]
RegEx Basics
Groups ()
In the case above, we could use a group to solve our problem.
Text
Well then I’m better than you.
RegEx
(then|than)
A group isn’t the best answer here, though, since th[ea]n is shorter. Groups are for alternation and/or quantification.
Text
I like redredred apples.
RegEx
(blue|green|red)+
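The character class and group examples above, side by side, as a rough Python sketch. The variable names are only for illustration.

```python
import re

text = "Well then I'm better than you."

# Character class: one slot that may be "e" or "a".
with_class = re.findall(r"th[ea]n", text)

# Mistaken class: matches ANY single t, h, e, n, a or pipe character.
mistake = re.findall(r"[then|than]", text)

# Group with alternation: the whole word "then" or the whole word "than".
with_group = re.findall(r"(then|than)", text)

# A quantified group repeats the whole alternation.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.").group()

print(with_class, with_group, repeated)
print(mistake)  # a long list of single characters
```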
RegEx Basics
Variables / Captured Groups $1
When you use a group, it captures the information in a numbered
variable. They count up from $1. You can use the variable when doing a
find-replace.
Text
https://www.searchersforbeerfridges.com/?vote_number=9001
RegEx Find
.+?//(.*?)/.*
RegEx Replace
$1
New Text
www.searchersforbeerfridges.com
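The same find-replace can be sketched in Python. One difference to be aware of: Python's re module writes the captured group as \1 in the replacement string, where the editor above uses $1.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 captures everything between "//" and the next "/".
# The replacement keeps only that captured group.
new_text = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(new_text)  # www.searchersforbeerfridges.com
```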
Practical SEO Uses
Practical SEO Uses
Google Analytics – Branded Organic
In Analytics I often want to find branded organic search traffic. Let’s
look at the GWT data in Analytics for our fictional client, Lett.Me.

Lett Me has a ton of common mis-typings and variations. They
get traffic from lm, lm.com, let me, lettme.com, letme.com,
let.me, and lett.me.
What’s the regular expression that captures all of that?
Practical SEO Uses
Google Analytics – Branded Organic
Here’s the regular expression I came up with. It matches some funky
cases like let me.com but that’s fine:
RegEx Find
(lm|let[t]?[ ]?[.]?me)(\.com)?
You can also remove the square brackets, but I feel like it’s easier to
read with them in. Without them it looks like this:
RegEx Find
(lm|let{1,2} ?\.?me)(\.com)?
Now just save this RegEx in your reporting document and you’ll never
have to type out the whole thing again.
Imagine what this could do for reporting on keyword groups!
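Before saving a RegEx like this, it's worth checking it against the variations listed earlier. A minimal Python sketch follows, with the dots escaped so they match only a literal dot. The use of fullmatch (whole-string matching) is my assumption for testing purposes; an Analytics filter would normally match anywhere in the keyword.

```python
import re

# Bracketed and compact versions of the branded-keyword pattern.
bracketed = re.compile(r"(lm|let[t]?[ ]?[.]?me)(\.com)?")
compact = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

branded_queries = [
    "lm", "lm.com", "let me", "lettme.com",
    "letme.com", "let.me", "lett.me",
]

for query in branded_queries:
    # Every listed variation should match both versions in full.
    print(query, bool(bracketed.fullmatch(query)), bool(compact.fullmatch(query)))

# The "funky" case mentioned above matches too; an unrelated query does not.
print(bool(bracketed.fullmatch("let me.com")))
print(bool(bracketed.fullmatch("let it be")))
```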
Practical SEO Uses
Trim To Root
Trim to Root using Find Replace. Here’s the list:
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2
What’s the RegEx?
RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1
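Here's the same Trim To Root find-replace as a short Python sketch, run over the list above. Python writes the captured group as \1 rather than $1.

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/?pg=2",
]

# Keep only what sits between "//" and the first "/" after it.
roots = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in urls]

print(roots)
```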
Practical SEO Uses
Fixing HTML – Nested Tags
I commonly get improperly formatted HTML. Here’s an example:
<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src=“http://site.com/sampleimage.png” /></p>
I want to remove all of the empty tags.
What’s the RegEx?
RegEx Find
<[a-z0-9]{1,6}></[a-z0-9]{1,6}>
RegEx Replace
(leave empty)
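A small Python sketch of the empty-tag cleanup, using the sample HTML above with straight quotes. Replacing with an empty string is the Python equivalent of leaving the Replace box blank.

```python
import re

html = """<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>
<h2></h2>
<p>This is a great image!</p>
<p><img src="http://site.com/sampleimage.png" /></p>"""

# An opening tag of 1 to 6 characters immediately followed by a closing tag.
empty_tag = r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>"

cleaned = re.sub(empty_tag, "", html)
print(cleaned)
```

Two things to keep in mind: the pattern doesn't check that the opening and closing tag names are the same, and an empty tag wrapped inside another empty tag would need a second pass.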
Practical SEO Uses
Top Level Domains
Find only .bs and .spam top level domains. Here’s the list:
http://www.spam.com/bs
http://bs.com/spam
http://spam.bs.com/balls
http://remove-this.bs/test
http://www.and-this.spam/
What’s the RegEx?
RegEx Find
.*//(.*?)\.(bs|spam)/.*
RegEx Replace
$1
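To check the top level domain pattern, here is a rough Python sketch that filters the list above. It assumes the dot before (bs|spam) is escaped so it matches only a literal dot.

```python
import re

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# ".bs" or ".spam" must be followed directly by the "/" that ends the host.
tld = re.compile(r".*//(.*?)\.(bs|spam)/.*")

hits = [url for url in urls if tld.search(url)]
print(hits)
```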
Practical SEO Uses
Finding Substrings in Domains
Does the domain contain the words “directory” or “article”? The list:
http://directorylinks.com/spamspam
http://www.spammy.com/link-directory
http://shadyarticles.com/
http://newyorktimes.com/?article_id=744
https://bonusarticles.com
What’s the RegEx? (If you can match bonus articles without the trailing
slash, I salute you!)
RegEx Find
^.*?//.*(directory|article).*?(/|\..{2,3}$).*
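A quick Python sketch to confirm which URLs in the list above are caught. It assumes the escaped dot (\.) in the second group, so the end-of-line branch only fires on a real TLD like .com.

```python
import re

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

# The word must be followed, later on, by a "/" or by a TLD at the end of the line.
in_domain = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

hits = [url for url in urls if in_domain.search(url)]
print(hits)
```

On this list the pattern skips the two URLs where the word only appears in the path or query string. It is not bulletproof, though: a path such as /directory/ with a trailing slash would also match.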
Practical SEO Uses
Merging Lists
Does the list of URLs contain domains we’ve already disavowed?
Say we’re doing a reconsideration request and we don’t want to
consider any of the links we’ve already disavowed. So, we have List A,
new links with some old links mixed in, that we want cleansed of any of
the domains in List B. It’s a whole process. What do you think it is?
List A

List B

http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

http://directorylinks.com/article
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
Practical SEO Uses
Merging Lists
First I’d use one of the tricks we learned already to format List B in an
easier to manipulate way, trimming each URL down to its domain.
What do you think the RegEx F/R is to get that?

RegEx Find
^.*?//(.*?)/.*
RegEx Replace
$1

Practical SEO Uses
Merging Lists
Great. Now, we’ve learned how to search for substrings (string is a
substring of substrings, if that isn’t confusing). How might we turn List
B into a set of variations of substrings that we can search through List A
with? A tip: \n is the newline character and you need it.
What’s the RegEx?
List A

List B

http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

directorylinks.com
spam.com
mafia-wars.com
192.233.111
RegEx Find
\n
RegEx Replace
|
Practical SEO Uses
Merging Lists
If you did it right, you should have what I’ve currently listed under List
B. What’s the final step we need to be able to search List A with the
substrings in List B?

List A

List B

http://globeandmail.com/
http://directorylinks.com/?id=1
http://spam.com/article
http://mafia-wars.com/torrentz
http://192.233.111/
http://tomsdiner.net/article
https://thediner.pl/

directorylinks.com|spam.com|mafia-wars.com|192.233.111
.*(directorylinks.com|spam.com|mafia-wars.com|192.233.111).*
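The whole Merging Lists process can be sketched end to end in Python. The variable names are made up for illustration, and re.escape is my addition: it escapes the dots in each domain so they match only a literal dot, which the pattern above does not do.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim List B to its domains (the Trim To Root trick).
domains = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in list_b]

# Step 2: join the domains with "|" (same effect as replacing \n with |).
alternation = "|".join(re.escape(domain) for domain in domains)

# Step 3: wrap the alternation in .*( ... ).* and search List A with it.
disavowed = re.compile(".*(" + alternation + ").*")
clean_list_a = [url for url in list_a if not disavowed.match(url)]

print(clean_list_a)
```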
Practical SEO Uses
Finding Client Anchor in HTML
Screaming Frog lets you use Regular Expressions in your searches. One
use of this feature is finding out whether someone is actually
linking to your website, because all legitimate anchors share the
same format.
<a (any or no tags) href=“any variation of your URL” (any or no
tags)>(possible other tags)anchor text(possible other tags)</a>
In the attached HTML document, find all 3 links to Mooz.com.
Bonus: Find only the 2 links to Mooz.com that contain the anchor text,
“Cow Melk” or “Milk.”
RegEx Find
<a.{0,100}href=.{0,100}mooz.com
<a.{0,100}href=.{0,100}?mooz.com(.{0,100}?)(Cow Melk|Milk)
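Here's a rough Python sketch of both patterns. The attached HTML document isn't reproduced here, so the sample lines below are hypothetical stand-ins I made up: three links to mooz.com, one plain-text mention, and one link elsewhere.

```python
import re

# Hypothetical sample lines, NOT the attached document.
sample_lines = [
    '<a href="http://www.mooz.com/">Cow Melk</a>',
    '<a class="ref" href="https://mooz.com/blog" rel="nofollow"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<p>We love mooz.com but never link to it.</p>',
    '<a href="http://example.com/">Milk</a>',
]

# Any anchor whose href mentions mooz.com within 100 characters.
link = re.compile(r"<a.{0,100}href=.{0,100}mooz.com")

# Bonus: the same anchor, followed shortly by one of the two anchor texts.
bonus = re.compile(r"<a.{0,100}href=.{0,100}?mooz.com(.{0,100}?)(Cow Melk|Milk)")

all_links = [line for line in sample_lines if link.search(line)]
melk_links = [line for line in sample_lines if bonus.search(line)]

print(len(all_links), len(melk_links))
```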
RegEx Puzzles for Homework
RegEx Puzzles for Homework
Resources
Sample HTML
https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing
Sample URLs
https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing
RegEx Puzzles for Homework
Puzzles
Some Puzzles:
• Show only the domain, no sub-domain, with a find-replace.
• Find all links that are obviously from a blog.
• Format a list of links as domains in a comma separated list.
RegEx Puzzles for Homework
No Sub-Domains
Show only the domain, no sub-domain, with a find-replace.
http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/
http://screw.you.regex.net/
What’s the RegEx?
RegEx Find
^.*?//(.*\.)*(.*)\.(.{2,3})/.*
RegEx Replace
$2.$3
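A Python sketch of the No Sub-Domains find-replace, run over the list above. It assumes the literal dots are escaped (\.), and it writes the groups as \2 and \3 where the editor uses $2 and $3.

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 soaks up any sub-domains, group 2 is the domain name,
# group 3 is a 2 to 3 character top level domain.
no_subdomain = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

domains = [no_subdomain.sub(r"\2.\3", url) for url in urls]
print(domains)
```

The {2,3} limit means longer TLDs and two-part endings such as .co.uk are outside what this pattern handles.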
RegEx Puzzles for Homework
Blog or RSS
In the attached sample-urls.txt, find all links that are obviously from a
blog or RSS feed.
What’s the RegEx?

RegEx Find
.*(/blog|/article|feed\.|/feed).*
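A minimal Python sketch of the Blog or RSS pattern. The contents of sample-urls.txt aren't shown here, so the URLs below are hypothetical examples, and the dot after "feed" is assumed to be a literal dot.

```python
import re

# Hypothetical URLs, NOT the contents of sample-urls.txt.
sample_urls = [
    "http://example.com/blog/post-1",
    "http://feed.example.org/latest",
    "http://example.net/feed",
    "http://example.com/article/12",
    "http://example.com/products/widget",
]

blog_or_rss = re.compile(r".*(/blog|/article|feed\.|/feed).*")

hits = [url for url in sample_urls if blog_or_rss.search(url)]
print(hits)
```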
RegEx Puzzles for Homework
Comma Separated Domains
Format a list of links as domains in a comma separated list. The links:
http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/
Should be:
www.domain.com, www.domain2.com, etc.
What’s the RegEx?
RegEx Find
(|\n).*//(.*?)/.*
Replace With
$2, (a comma followed by a space)
Then delete the trailing comma.
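Here's the Comma Separated Domains puzzle as a Python sketch. Two assumptions on my part: the captured group is lazy, (.*?), so that only the host is kept and not part of the path, and the trailing comma is stripped with rstrip instead of by hand.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# (|\n) swallows the newline in front of every line after the first,
# group 2 captures the host, and the rest of the line is dropped.
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma (and the space after it).
domains = joined.rstrip(", ")
print(domains)
```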
http://www.smbc-comics.com/
Questions?
Thanks for Hanging Out

Stay in Touch
Twitter: @troyfawkes
Google+: http://gplus.to/TroyFawkes
Email: troy@poweredbysearch.com
www.poweredbysearch.com
www.troyfawkes.com

More Related Content

Recently uploaded

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Featured

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
Regular Expressions (RegEx) for SEO

  • 1. Regular Expressions for SEO The Coolest Pattern Matching Search Language... Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant For Powered by Search Internal | October 2013
  • 2. We’re in business because we believe that great brands need both voice and visibility in order to connect people with what matters. A boutique, full-service digital marketing agency in Toronto, Powered by Search is a PROFIT HOT 50-ranked agency that delivers search engine optimization, pay per click advertising, local search, social media marketing, and online reputation management services. Some of our clients... Featured in...
  • 3. RegEx Basics Practical SEO Uses RegEx Puzzles for Homework
  • 7. RegEx Basics Use Sublime Text This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too. It’s the text editor you’ll fall in love with.
  • 8. RegEx Basics Literal Matching Text: I want to match this. RegEx: match this RegEx matches literal strings. This is like running a normal search in Word. Pretty cool, huh?
  • 9. RegEx Basics Anchors Text: I want this, I want that, I want I want I want RegEx: ^I want Text: I want this, I want that, I want I want I want RegEx: I want$ There are a couple of special characters called “anchors.” The caret (^) represents the beginning of a line. The dollar sign ($) represents the end of a line. You see these a lot in .htaccess files.
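A minimal sketch of the two anchors, assuming Python's built-in `re` module stands in for the editor's search box (the slides themselves use Sublime Text). The variable names are illustrative only.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the start of the line, so only the first "I want" qualifies.
start_matches = [m.start() for m in re.finditer(r"^I want", text)]

# $ pins the match to the end of the line, so only the last "I want" qualifies.
end_matches = [m.start() for m in re.finditer(r"I want$", text)]

# Without an anchor, every occurrence matches.
all_matches = re.findall(r"I want", text)

print(start_matches, end_matches, len(all_matches))
```

The same pattern text behaves very differently once an anchor is added, which is why anchors show up so often in redirect rules.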
  • 10. RegEx Basics Special Characters There are also a series of other special characters. These are: • [ - Starts a Character Class (More Later) • \ - Escapes or modifies the character after it. • . - Wildcard. It represents any character. • | - OR, so (this|that|the other) means this, that, or the other. • ( - Starts a group. • ) - Ends a group. To match any of these characters literally, put a backslash in front of it. This also applies to ? + * ^ $, which we’ve talked about or will get to later.
  • 11. RegEx Basics Quantifiers A quantifier tells the expression how many times to match the element before it. Text: Ahhhhhhhhhhh. A spider. RegEx: A[h]+ • ? - Zero or one time • + - One or more times • {exactly} - Exactly this many times • {min,max} - Between min and max times • * - Zero or more times
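To see each quantifier side by side, here is a small sketch using Python's `re` module (an assumption; any PCRE-style engine behaves the same for these cases). It writes `Ah+` rather than `A[h]+`; the two are equivalent because a one-character class matches just that character. The `samples` list is made up for illustration.

```python
import re

text = "Ahhhhhhhhhhh. A spider."

# + means "one or more", so the match runs through every h after the A.
scream = re.search(r"Ah+", text).group()

# Hypothetical test strings to compare the quantifiers against.
samples = ["A", "Ah", "Ahh", "Ahhh"]
patterns = {
    "?": r"Ah?",          # zero or one h
    "+": r"Ah+",          # one or more h
    "*": r"Ah*",          # zero or more h
    "{2}": r"Ah{2}",      # exactly two h
    "{1,2}": r"Ah{1,2}",  # between one and two h
}

# fullmatch requires the whole sample to fit the pattern.
results = {
    name: [s for s in samples if re.fullmatch(pattern, s)]
    for name, pattern in patterns.items()
}

for name, hits in results.items():
    print(name, hits)
```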
  • 12. RegEx Basics Greedy vs. Non-Greedy Quantifiers are greedy by default: they repeat the preceding element as many times as possible before continuing with the rest of the RegEx. This leads to unexpected matches. To make the quantifiers *, +, and {} non-greedy (lazy), just add a question mark after them. Text: <p>test</p> RegEx (Greedy): <.+> Text: <p>test</p> RegEx (Lazy): <.+?>
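The difference is easiest to see by running both patterns on the same text. A short sketch, again assuming Python's `re` module:

```python
import re

text = "<p>test</p>"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">".
lazy = re.search(r"<.+?>", text).group()

print(greedy)
print(lazy)
```

The greedy version swallows the whole line, while the lazy version returns only the opening tag.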
  • 13. RegEx Basics Variations / Character Classes [] A variation is a set of literal characters, any one of which can fill a single position. For example: Text: Well then I’m better than you. RegEx: th[ea]n The characters in the variation aren’t a GROUP. What the following RegEx is telling the computer is, “Find any of: a t, an h, an e, an n, a pipe, a t, an h, an a, or an n.” That’s not what we want. Text: Well then I’m better than you. RegEx: [then|than]
  • 14. RegEx Basics Groups () In the case above, we could use a group to solve our problem. Text: Well then I’m better than you. RegEx: (then|than) A group isn’t the best answer here, since the character class already handles it. Groups are for alternation and/or quantification. Text: I like redredred apples. RegEx: (blue|green|red)+
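A quick sketch comparing the character class, the group, and the mistaken `[then|than]` class on the slide's sentence. Python's `re` module is assumed, and a straight apostrophe is used in the sample text for convenience.

```python
import re

text = "Well then I'm better than you."

# Character class: one position that may be "e" or "a".
class_hits = re.findall(r"th[ea]n", text)

# Group with alternation: either whole word.
group_hits = re.findall(r"(then|than)", text)

# The mistake: inside [] every character is a single-character option,
# so this matches one letter (or a pipe) at a time.
wrong_hits = re.findall(r"[then|than]", text)

# A quantifier after a group repeats the whole group.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.").group()

print(class_hits, group_hits, wrong_hits[:5], repeated)
```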
  • 15. RegEx Basics Variables / Captured Groups $1 When you use a group, it captures the matched text in a numbered variable. They count up from $1. You can use the variable when doing a find-replace. Text: https://www.searchersforbeerfridges.com/?vote_number=9001 RegEx Find: .+?//(.*?)/.* RegEx Replace: $1 New Text: www.searchersforbeerfridges.com
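The same find-replace can be sketched outside the editor. One assumption to flag: Python's `re.sub` writes the captured group as `\1`, where Sublime Text and many other tools write `$1`.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"

# Group 1 lazily captures everything between "//" and the next "/".
# The pattern consumes the whole string, so the replacement leaves only group 1.
host = re.sub(r".+?//(.*?)/.*", r"\1", url)

print(host)
```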
  • 17. Practical SEO Uses Google Analytics – Branded Organic In Analytics I often want to find branded organic search traffic. Let’s look at the Google Webmaster Tools (GWT) data in Analytics for our fictional client, Lett.Me. Lett.Me has a ton of common mistypings and variations. They get traffic from lm, lm.com, let me, lettme.com, letme.com, let.me, and lett.me. What’s the regular expression that captures all of that?
  • 18. Practical SEO Uses Google Analytics – Branded Organic Here’s the regular expression I came up with. It matches some funky cases like let me.com but that’s fine: RegEx Find: (lm|let[t]?[ ]?[.]?me)(\.com)? You can also remove the square brackets, but I feel like it’s easier to read with them in. Without them it looks like this: RegEx Find: (lm|let{1,2} ?\.?me)(\.com)? Now just save this RegEx in your reporting document and you’ll never have to type out the whole thing again. Imagine what this could do for reporting on keyword groups!
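Before saving a pattern like this into a report, it helps to check it against the known variants. The sketch below assumes Python's `re` module and escapes the dot before `com` so it matches a literal period. It uses `fullmatch` to keep the check strict, which is a choice made for this illustration; an Analytics filter matches anywhere in the keyword, so the short `lm` branch would also hit unrelated queries that merely contain those two letters.

```python
import re

# Bracketed form and the shorter form, both with the dots escaped.
BRAND = re.compile(r"(lm|let[t]?[ ]?[.]?me)(\.com)?")
BRAND_SHORT = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

variants = ["lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me"]

# Hypothetical non-brand queries, added only for this check.
non_brand = ["letter", "let me in", "film"]

matched = [q for q in variants if BRAND.fullmatch(q)]
matched_short = [q for q in variants if BRAND_SHORT.fullmatch(q)]
false_hits = [q for q in non_brand if BRAND.fullmatch(q)]

print(matched)
print(false_hits)
```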
  • 20. Practical SEO Uses Trim To Root Trim to Root using Find Replace. Here’s the list: http://www.georgebrown.com/www-non-www http://blog.russian.me/ https://russian.eu/?pg=2 What’s the RegEx? RegEx Find: ^.*?//(.*?)/.* RegEx Replace: $1
  • 22. Practical SEO Uses Fixing HTML – Nested Tags I commonly get improperly formatted HTML. Here’s an example: <h2><b></b><i></i>I Wrote This In Microsoft Word!</h2> <h2></h2> <p>This is a great image!</p> <p><img src=“http://site.com/sampleimage.png” /></p> I want to remove all of the empty tags. What’s the RegEx? RegEx Find: <[a-z0-9]{1,6}></[a-z0-9]{1,6}> RegEx Replace: (leave empty)
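A single pass removes tags that are empty right now, but a wrapper such as `<p><b></b></p>` only becomes empty after the inner tag is gone. The sketch below, assuming Python's `re` module and a hypothetical helper named `strip_empty_tags`, repeats the replace until nothing changes. Note that the pattern does not check that the opening and closing tag names agree.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")


def strip_empty_tags(markup: str) -> str:
    """Remove empty tag pairs, repeating until a pass changes nothing."""
    previous = None
    while previous != markup:
        previous = markup
        markup = EMPTY_TAG.sub("", markup)
    return markup


html = "<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2><h2></h2>"
cleaned = strip_empty_tags(html)
nested = strip_empty_tags("<p><b></b></p>Text")

print(cleaned)
print(nested)
```

In an editor, the equivalent is simply running Replace All more than once.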
  • 24. Practical SEO Uses Top Level Domains Find only .bs and .spam top level domains. Here’s the list: http://www.spam.com/bs http://bs.com/spam http://spam.bs.com/balls http://remove-this.bs/test http://www.and-this.spam/ What’s the RegEx? RegEx Find: .*//(.*?)\.(bs|spam)/.* RegEx Replace: $1
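Here is a small check of the top-level-domain pattern against the slide's list, assuming Python's `re` module. The dot before `(bs|spam)` is escaped so it only matches a literal period, and the slash after the group is what separates a TLD from a subdomain or a path word.

```python
import re

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

# ".bs" or ".spam" must be followed directly by "/" to count as the TLD.
TLD = re.compile(r".*//(.*?)\.(bs|spam)/.*")

flagged = [u for u in urls if TLD.match(u)]

print(flagged)
```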
  • 26. Practical SEO Uses Finding Substrings in Domains Does the domain contain the words “directory” or “article”? The list: http://directorylinks.com/spamspam http://www.spammy.com/link-directory http://shadyarticles.com/ http://newyorktimes.com/?article_id=744 https://bonusarticles.com What’s the RegEx? (If you can match bonus articles without the trailing slash, I salute you!) RegEx Find: ^.*?//.*(directory|article).*?(/|\..{2,3}$).*
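The pattern can be checked against the five sample URLs. This sketch assumes Python's `re` module and escapes the dot in the final group so that `\..{2,3}$` means "a period plus a two or three character ending". It works for this list; a URL whose path contains the keyword followed by another slash would still slip through.

```python
import re

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

# After the keyword there must be either a later "/" (keyword sits in the host)
# or a ".xx"/".xxx" ending (host with no trailing slash).
IN_DOMAIN = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

matched = [u for u in urls if IN_DOMAIN.match(u)]

for u in matched:
    print(u)
```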
  • 27. Practical SEO Uses Merging Lists Does the list of URLs contain domains we’ve already disavowed? Say we’re doing a reconsideration request and we don’t want to consider any of the links we’ve already disavowed. So, we have List A, new links with some old links mixed in, that we want cleansed of any of the domains in List B. It’s a whole process. What do you think it is? List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: http://directorylinks.com/article http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/
  • 29. Practical SEO Uses Merging Lists First I’d use one of the tricks we learned already to format List B in an easier to manipulate way. I’ve bolded it below. What do you think the RegEx F/R is to get that? RegEx Find: ^.*?//(.*?)/.* RegEx Replace: $1 List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: http://directorylinks.com/article http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/
  • 30. Practical SEO Uses Merging Lists Great. Now, we’ve learned how to search for substrings (string is a substring of substrings, if that isn’t confusing). How might we turn List B into a set of variations of substrings that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx? List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: directorylinks.com spam.com mafia-wars.com 192.233.111
  • 31. Practical SEO Uses Merging Lists Great. Now, we’ve learned how to search for substrings (string is a substring of substrings, if that isn’t confusing). How might we turn List B into a set of variations of substrings that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx? RegEx Find: \n RegEx Replace: | List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: directorylinks.com spam.com mafia-wars.com 192.233.111
  • 32. Practical SEO Uses Merging Lists If you did it right, you should have what I’ve currently listed under List B. What’s the final step we need to be able to search List A with the substrings in List B? List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: directorylinks.com|spam.com|mafia-wars.com|192.233.111
  • 33. Practical SEO Uses Merging Lists If you did it right, you should have what I’ve currently listed under List B. What’s the final step we need to be able to search List A with the substrings in List B? RegEx Find: .*(directorylinks.com|spam.com|mafia-wars.com|192.233.111).* List A: http://globeandmail.com/ http://directorylinks.com/?id=1 http://spam.com/article http://mafia-wars.com/torrentz http://192.233.111/ http://tomsdiner.net/article https://thediner.pl/ List B: directorylinks.com|spam.com|mafia-wars.com|192.233.111
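The whole merging process (trim List B to hosts, join the hosts with pipes, wrap them in a group, then search List A) can be sketched in a few lines. This assumes Python's `re` module in place of editor find-replace. It also adds one step the slides do not show: `re.escape` is applied to each host so the dots match literal periods rather than any character.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim every List B URL to its host (the Trim To Root find-replace).
TRIM = re.compile(r"^.*?//(.*?)/.*")
domains = [TRIM.sub(r"\1", url) for url in list_b]

# Step 2: join the hosts with "|" (the "\n" to "|" replace), escaping each one.
alternation = "|".join(re.escape(domain) for domain in domains)

# Step 3: wrap the alternation in a group with wildcards on both sides.
DISAVOWED = re.compile(".*(" + alternation + ").*")

# Step 4: keep only the List A links that do not mention a disavowed host.
clean = [url for url in list_a if not DISAVOWED.match(url)]

print(domains)
print(clean)
```

Because this is a substring search, a host such as spam.com would also flag any longer domain that ends with the same text.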
  • 35. Practical SEO Uses Finding Client Anchor in HTML Screaming Frog lets you use Regular Expressions in your searches. One use of this feature is finding out whether someone is actually linking to your website, because all legitimate anchors share the same format. <a (any or no tags) href=“any variation of your URL” (any or no tags)>(possible other tags)anchor text(possible other tags)</a> In the attached HTML document, find all 3 links to Mooz.com. Bonus: Find only the 2 links to Mooz.com that contain the anchor text, “Cow Melk” or “Milk.” RegEx Find (all links): <a.{0,100}href=.{0,100}mooz\.com RegEx Find (bonus): <a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
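The attached HTML document is not reproduced here, so the sketch below runs both patterns against a made-up snippet. Every line of `html` is hypothetical, written only to mirror the "3 links, 2 with the right anchor text" setup. Python's `re` module is assumed, the dot in `mooz\.com` is escaped, and case-insensitive matching is an added assumption so that "Mooz.com" would also match.

```python
import re

# Hypothetical sample markup, one element per line.
html = "\n".join([
    '<a href="http://mooz.com/">Cow Melk</a>',
    '<a class="ext" href="https://www.mooz.com/dairy" rel="nofollow"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<a href="http://example.com/">Milk</a>',
    '<p>Visit mooz.com for Milk</p>',
])

# Any anchor whose href mentions mooz.com within 100 characters.
LINK = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com", re.IGNORECASE)

# Same, but the anchor text must follow within another 100 characters.
LINK_WITH_ANCHOR = re.compile(
    r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)",
    re.IGNORECASE,
)

links = [line for line in html.splitlines() if LINK.search(line)]
anchored = [line for line in html.splitlines() if LINK_WITH_ANCHOR.search(line)]

print(len(links), len(anchored))
```

The bounded `.{0,100}` keeps a match from wandering far past the tag it started in, which an unbounded `.*` would happily do.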
  • 36. RegEx Puzzles for Homework
  • 37. RegEx Puzzles for Homework Resources Sample HTML https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing Sample URLs https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing
  • 38. RegEx Puzzles for Homework Puzzles Some Puzzles: • Show only the domain, no sub-domain, with a find-replace. • Find all links that are obviously from a blog. • Format a list of links as domains in a comma separated list.
  • 40. RegEx Puzzles for Homework No Sub-Domains Show only the domain, no sub-domain, with a find-replace. http://www.georgebrown.com/www-non-www http://blog.russian.me/ https://russian.eu/ http://screw.you.regex.net/ What’s the RegEx? RegEx Find: ^.*?//(.*\.)*(.*)\.(.{2,3})/.* RegEx Replace: $2.$3
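A sketch of this find-replace on the four sample URLs, assuming Python's `re` module (so `$2.$3` becomes `\2.\3`) and escaped dots between the groups. The `.{2,3}` ending is a known limit of the approach: endings longer than three characters, or two-part endings such as .co.uk, will not come out right, and a dot inside the path can also throw the groups off.

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 soaks up any sub-domains, group 2 is the domain name,
# and group 3 is a two or three character ending before the first path slash.
ROOT = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

roots = [ROOT.sub(r"\2.\3", url) for url in urls]

print(roots)
```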
  • 42. RegEx Puzzles for Homework Blog or RSS In the attached sample-urls.txt, find all links that are obviously from a blog or RSS feed. What’s the RegEx? RegEx Find: .*(/blog|/article|feed\.|/feed).*
  • 43. RegEx Puzzles for Homework Comma Separated Domains Format a list of links as domains in a comma separated list. The links: http://www.business2community.com/seo http://www.buzzstream.com/blog/competitive-link-building.html http://www.cansinmert.com/ http://www.canuckseo.com/index.php/2010 http://www.cio.com/article/738249/ http://www.clicktivist.org/ Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx?
  • 44. RegEx Puzzles for Homework Comma Separated Domains Format a list of links as domains in a comma separated list. The links: http://www.business2community.com/seo http://www.buzzstream.com/blog/competitive-link-building.html http://www.cansinmert.com/ http://www.canuckseo.com/index.php/2010 http://www.cio.com/article/738249/ http://www.clicktivist.org/ Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx? RegEx Find: (|\n).*//(.*?)/.* Replace With: $2, (then delete the trailing comma)
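The comma-separated puzzle can be sketched as a single substitution, assuming Python's `re` module (so `$2` becomes `\2`). The capture here is lazy, `(.*?)`, so it stops at the first slash after the host. A greedy `(.*)` would run to the last slash on the line and drag path segments such as /blog into the result.

```python
import re

links = "\n".join([
    "http://www.business2community.com/seo",
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cansinmert.com/",
    "http://www.canuckseo.com/index.php/2010",
    "http://www.cio.com/article/738249/",
    "http://www.clicktivist.org/",
])

# (|\n) matches either nothing or the newline before a line, so line breaks
# are consumed along with each URL and replaced by "host, ".
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# The last replacement leaves a trailing ", " that has to be removed.
result = joined.rstrip(", ")

print(result)
```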
  • 47. Thanks for Hanging Out Stay in Touch Twitter: @troyfawkes Google+: http://gplus.to/TroyFawkes Email: troy@poweredbysearch.com www.poweredbysearch.com www.troyfawkes.com