3. NEXT LEVEL XPATH-ING
Scrape partial matches using XPath’s “contains” function to find inexact data.
What we know:
1) The link will contain http://www.eppie.net in the href attribute
2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that
3) People who link to you don’t care about your desire for canonicalization
4. DO YOU LINK TO ME?
//a[contains(@href,'http://www.eppie.net')]
PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
5. DO YOU LINK TO ME?
Add translate() to normalize case:
//a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]
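Here is a minimal Python sketch of the same check, for anyone who wants to try it outside a spreadsheet. Python's standard library does not evaluate XPath functions like contains() or translate(), so this swaps in html.parser plus plain string matching that mirrors them. The names HrefCollector and links_to and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"
# Same character mapping as XPath translate(@href, UPPER, LOWER).
TO_LOWER = str.maketrans(UPPER, LOWER)


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to(html, target, normalize_case=True):
    """Return hrefs containing `target` (target is assumed to be lowercase)."""
    parser = HrefCollector()
    parser.feed(html)
    if normalize_case:
        # Mirrors contains(translate(@href, ...), target).
        return [h for h in parser.hrefs if target in h.translate(TO_LOWER)]
    # Mirrors the plain contains(@href, target) from the previous slide.
    return [h for h in parser.hrefs if target in h]


SAMPLE = """
<p><a href="http://www.eppie.net/about">one</a>
<a href="HTTP://WWW.EPPIE.NET/">two</a>
<a href="http://example.com/">three</a></p>
"""

case_sensitive = links_to(SAMPLE, "http://www.eppie.net", normalize_case=False)
case_insensitive = links_to(SAMPLE, "http://www.eppie.net")
print(case_sensitive)    # misses the capitalized link
print(case_insensitive)  # finds both links to the target
```

The case-sensitive call only finds the lowercase link, which is exactly the problem the translate() trick solves.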
6. DO YOU LINK TO ME?
How you can use this:
Get notified when a link is removed
+ Make contact to potentially save the dropping link (friendly reminder, buy the expiring domain, recreate the dead resource)
Integrate into your link outreach process
+ Get notified when a link goes live
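One way the notification idea might be wired up, sketched in Python under a few assumptions: you keep a record of whether each page linked to you on the last check, and you have already fetched the current HTML of each page. The function names, the page URLs, and the data shapes are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def has_link(html, target):
    """True if any link on the page contains `target`, ignoring case."""
    parser = HrefCollector()
    parser.feed(html)
    return any(target.lower() in href.lower() for href in parser.hrefs)


def link_changes(previous, pages, target):
    """Compare the last known state with freshly fetched pages.

    previous: dict of page URL -> bool (was the link there last time?)
    pages:    dict of page URL -> current HTML
    Returns (removed, gone_live), two lists of page URLs.
    """
    removed, gone_live = [], []
    for url, html in pages.items():
        before = previous.get(url, False)
        now = has_link(html, target)
        if before and not now:
            removed.append(url)      # follow up to save the link
        elif now and not before:
            gone_live.append(url)    # outreach paid off
    return removed, gone_live


previous = {
    "http://blog.example.org/post": True,
    "http://news.example.org/story": False,
}
pages = {
    "http://blog.example.org/post": "<p>No links here any more.</p>",
    "http://news.example.org/story": '<a href="HTTP://www.eppie.net/">Eppie</a>',
}
removed, gone_live = link_changes(previous, pages, "http://www.eppie.net")
print("removed:", removed)
print("live:", gone_live)
```

Sending the actual notification (email, chat message, ticket) is left out on purpose; the two lists are the hook for whatever alerting you already use.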
8. NEXT LEVEL XPATH-ING
Combine attribute selectors to more accurately target useful information.
What we know:
1) External links all contain http://
2) Internal links can also use http://
3) So we need to exclude http:// links to the current domain
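The three points above can be sketched in Python as two combined conditions on the href. This uses html.parser and string checks rather than an XPath engine. A roughly equivalent XPath might be //a[contains(@href,'http://') and not(contains(@href,'www.example.com'))], but that expression is an assumption and not taken from the slides. The domain and markup are placeholders.

```python
from html.parser import HTMLParser


class ExternalLinkCollector(HTMLParser):
    """Collect links that are absolute and do not point at our own domain."""

    def __init__(self, own_domain):
        super().__init__()
        self.own_domain = own_domain.lower()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        lowered = href.lower()
        # Condition 1: the link contains http:// (the slides only mention
        # http://; https:// would need the same treatment).
        # Condition 2: the link does not mention the current domain.
        if "http://" in lowered and self.own_domain not in lowered:
            self.external.append(href)


def external_links(html, own_domain):
    parser = ExternalLinkCollector(own_domain)
    parser.feed(html)
    return parser.external


SAMPLE = """
<a href="/about">internal, relative</a>
<a href="http://www.example.com/blog">internal, absolute</a>
<a href="http://other.example.org/page">external</a>
<a href="HTTP://Partner.example.net/">external, capitalized</a>
"""

found = external_links(SAMPLE, "www.example.com")
print(len(found), found)
```

Because this is a substring test, an external URL that happens to mention your domain (for example in a query string) would be skipped; parsing the hostname would be stricter.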
10. SCRAPE ALL EXTERNAL LINKS
How you can use this:
Identify whether a page is too spammed out to bother with by pulling external link counts
Find expired or expiring domains being linked to from authority sites. Purchase and rebuild or redirect those sites.
Broken link building automation
12. LINK TYPE IDENTIFICATION
XPath’s ancestor axis lets us leverage semantic markup to ID link types.
What we know:
A link inside a containing element with an id or class name that includes the word “comment,” “footer,” or “blogroll” is highly suggestive of the link’s type
13. LINK TYPE!
IDENTIFICATION!
"//a[@href='h,p://randfishkin.com/blog']/
ancestor::*[contains(@id|
@class,'comment')]"
ment-
Wa s Rand com
ay to
spa mming his w E
the top ? This + 0S
y...
tells the stor
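A rough Python equivalent of the ancestor check, assuming html.parser and a stack of open elements standing in for XPath's ancestor axis. It tests id and class separately, because XPath 1.0 reduces a node-set argument such as @id|@class to the string value of its first node only. The class name, function name, and sample markup are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkContextFinder(HTMLParser):
    """For each link to target_href, report keywords found on its ancestors."""

    def __init__(self, target_href, keywords):
        super().__init__()
        self.target_href = target_href
        self.keywords = keywords
        self.stack = []    # (tag, "id class" text) for every open element
        self.matches = []  # one sorted keyword list per matching link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target_href:
            # The stack holds only ancestors here; the link itself is
            # pushed afterwards, matching the ancestor axis.
            found = {kw for kw in self.keywords
                     for _, label in self.stack if kw in label}
            self.matches.append(sorted(found))
        if tag not in VOID:
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
            self.stack.append((tag, label.lower()))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def classify_links(html, target_href,
                   keywords=("comment", "footer", "blogroll")):
    finder = LinkContextFinder(target_href, keywords)
    finder.feed(html)
    return finder.matches


SAMPLE = """
<div id="content"><p><a href="http://randfishkin.com/blog">editorial</a></p></div>
<ol class="commentlist"><li id="comment-42">
  <a href="http://randfishkin.com/blog">Rand</a></li></ol>
<div id="footer"><a href="http://randfishkin.com/blog">footer link</a></div>
"""

link_types = classify_links(SAMPLE, "http://randfishkin.com/blog")
print(link_types)
```

An empty list for a link means none of the keywords appeared on any ancestor, which is a hint (not proof) that the link was editorially placed.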
14. Why you might use this:
Analyze competitors’ strategies for acquiring links
Find what types of links are being used to get good anchor text
Improve workflow: ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis
15. REGEX TO THE RESCUE
Use Case 4:
I’ve scraped some data; now I need to extract some small portion of it that XPath can’t (easily) handle on its own.
16. REGEX TO THE RESCUE
Use regular expressions to pattern match structured text.
Example:
Extract all @mentions of a specific user from a tweet or page
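A small sketch of the @mention example using Python's re module. The username and sample text are placeholders, and the pattern is one reasonable guess at what counts as a mention: it ignores case, skips longer handles that merely start with the name, and skips email addresses.

```python
import re


def find_mentions(text, username):
    """Return every @mention of `username` found in `text`."""
    pattern = re.compile(
        r"(?<!\w)"             # '@' must not follow a word character (emails)
        r"@" + re.escape(username) +
        r"(?!\w)",             # the handle must end here, not continue
        re.IGNORECASE,
    )
    return pattern.findall(text)


TEXT = ("Thanks @randfish! cc @RandFish and @randfishkin, "
        "or mail me@randfish.com")

mentions = find_mentions(TEXT, "randfish")
print(mentions)
```

Here @randfishkin and me@randfish.com are both left out, so the count reflects real mentions of the one account.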
22. REGEX TO THE RESCUE
Why you might use this:
Pull contact information from a web site (Twitter username, email address) to improve outreach efforts
Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research
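A hedged sketch of pulling those fragments out of page source with Python's re module. The patterns are assumptions about common formats (UA-style Analytics IDs, pub- plus 16 digits for AdSense, a deliberately loose email pattern, twitter.com profile links), and the sample source is made up. Expect to tune them against real pages.

```python
import re

# Assumed formats; adjust to whatever you actually see in the wild.
PATTERNS = {
    "analytics_id": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "adsense_id": re.compile(r"\bpub-\d{16}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Captures the first path segment after twitter.com/ (up to 15 chars),
    # so non-profile paths can slip through and need filtering.
    "twitter": re.compile(r"twitter\.com/(\w{1,15})", re.IGNORECASE),
}


def extract_fragments(source):
    """Return the unique matches for each pattern, sorted for stable output."""
    return {name: sorted(set(pattern.findall(source)))
            for name, pattern in PATTERNS.items()}


SOURCE = """
<script>_gaq.push(['_setAccount', 'UA-1234567-1']);</script>
<script>google_ad_client = "ca-pub-1234567890123456";</script>
<a href="mailto:hello@example.com">Email</a>
<a href="http://twitter.com/example_user">Follow</a>
"""

fragments = extract_fragments(SOURCE)
for name, values in fragments.items():
    print(name, values)
```

Matching IDs across several sites is where this gets useful for competitive research: the same Analytics or AdSense ID on two domains suggests a shared owner.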
23. BEYOND THE SPREADSHEET
Use Case 5:
I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.
24. BEYOND THE SPREADSHEET
Scraping outside the spreadsheet allows for more complex systems to be built.
PHP Scraping Overview:
1) cURL the target page
2) Convert to a DOM object
3) Run XPath queries
4) Store the data or hit an API
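The four steps above, sketched in Python rather than PHP and using only the standard library. This is an outline under assumptions, not the deck's implementation: urllib stands in for cURL, html.parser stands in for the DOM and XPath steps, and CSV output stands in for a database or API call. All function names are hypothetical, and the demo uses a stub fetcher so it runs without touching the network.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Step 1: download the target page."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkExtractor(HTMLParser):
    """Steps 2 and 3: parse the markup and pull (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def store(rows, out):
    """Step 4: write CSV; swap this for a database insert or an API call."""
    writer = csv.writer(out)
    writer.writerow(["href", "anchor_text"])
    writer.writerows(rows)


def scrape(url, out, fetcher=fetch):
    """Chain the steps; `fetcher` is injectable so the chain can be tested."""
    store(extract_links(fetcher(url)), out)


# Offline demo: a stub fetcher returns canned HTML instead of hitting the web.
def fake_fetch(url):
    return '<p><a href="http://example.com/">Example</a></p>'


buffer = io.StringIO()
scrape("http://example.com/page", buffer, fetcher=fake_fetch)
print(buffer.getvalue())
```

Keeping each step as its own function is what makes chaining possible: the fetcher can be swapped for a queue of URLs, and the store step can feed the next process.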