The SEO's Guide to Scraping Everything

eppievojt

NEXT LEVEL XPATH-ING

Use Case 1:
Does site X link to any page on eppie.net?

Scrape partial matches using XPath’s “contains” function to find inexact data.

What we know:
1) The link will contain http://www.eppie.net in the href attribute
2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that
3) People who link to you don’t care about your desire for canonicalization

DO YOU LINK TO ME?

//a[contains(@href,'http://www.eppie.net')]

PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY

Add translate() to normalize case:
//a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]
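The same check can be sketched outside a spreadsheet. This is a minimal illustration, assuming Python and its standard-library html.parser in place of an XPath engine; lowercasing the href stands in for translate(). The names LinkFinder and links_to and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect href values of <a> tags that contain a target string, ignoring case."""

    def __init__(self, target):
        super().__init__()
        self.target = target.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Lowercasing both sides plays the role of translate() in the XPath version.
        if self.target in href.lower():
            self.matches.append(href)


def links_to(html, target):
    """Return every link in html whose href contains target (case-insensitive)."""
    finder = LinkFinder(target)
    finder.feed(html)
    return finder.matches


# Hypothetical page: one capitalized link to the target site, one unrelated link.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/Blog">one</a> '
    '<a href="http://example.com/">two</a></p>'
)
print(links_to(sample, "http://www.eppie.net"))
```

An empty result on a page that used to return a match is the signal that a link was removed.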

How you can use this:
Get notified when a link is removed
+ Make contact to potentially save a dropping link (friendly reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process
+ Get a notification when a link goes live

NEXT LEVEL XPATH-ING

Use Case 2:
Find every external link from cnn.com

Combine attribute selectors to more accurately target useful information.

What we know:
1) External links all contain http://
2) Internal links can also use http://
3) So we need to exclude http:// links to the current domain

SCRAPE ALL EXTERNAL LINKS

//a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
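Here is a rough equivalent of that expression, assuming Python's standard-library html.parser rather than an XPath engine. It applies the same two conditions (the href contains http:// and does not contain the site's own domain). The names ExternalLinkCollector and external_links and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ExternalLinkCollector(HTMLParser):
    """Collect hrefs that are absolute http:// links pointing off the current domain."""

    def __init__(self, own_domain):
        super().__init__()
        self.own_domain = own_domain.lower()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        lowered = href.lower()
        # Same two conditions as the XPath: contains http://, does not contain our domain.
        if "http://" in lowered and self.own_domain not in lowered:
            self.external.append(href)


def external_links(html, own_domain):
    collector = ExternalLinkCollector(own_domain)
    collector.feed(html)
    return collector.external


# Hypothetical markup: an internal absolute link, a relative link, an external link.
sample = (
    '<a href="http://www.cnn.com/world">World</a>'
    '<a href="/politics">Politics</a>'
    '<a href="http://example.org/story">Story</a>'
)
links = external_links(sample, "cnn.com")
print(len(links), links)
```

The length of the returned list is the external link count for the page.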

How you can use this:
Identify if a page is too spammed out to bother with by pulling external link counts

Find expired or expiring domains being linked to from authority sites. Purchase and rebuild or redirect those sites.

Broken link building automation

LINK TYPE IDENTIFICATION

Use Case 3:
How are they ranking? What kind of links do they have?

XPath’s ancestor axis lets us leverage semantic markup to ID link types.

What we know:
A link inside a containing element with an id or class name including the word “comment,” “footer,” or “blogroll” is highly suggestive of type

//a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id,'comment') or contains(@class,'comment')]

Was Rand comment-spamming his way to the top? This tells the story...
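The ancestor check can also be sketched in code. This is a minimal illustration, assuming Python's standard-library html.parser and a hand-kept stack of open elements in place of XPath's ancestor axis. The names LinkTypeFinder and link_types, the keyword tuple, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
LINK_TYPES = ("comment", "footer", "blogroll")


class LinkTypeFinder(HTMLParser):
    """For each link to target_href, list the keywords found in ancestor ids/classes."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.stack = []  # (tag, "id class") for every currently open element
        self.found = []  # one list of matched keywords per target link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target_href:
            ancestors = " ".join(label for _, label in self.stack)
            self.found.append([k for k in LINK_TYPES if k in ancestors])
        if tag not in VOID_TAGS:
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}".lower()
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def link_types(html, target_href):
    finder = LinkTypeFinder(target_href)
    finder.feed(html)
    return finder.found


# Hypothetical page: one link in the main content, one inside a comment list.
sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">post</a></div>'
    '<ul class="comment-list"><li>'
    '<a href="http://randfishkin.com/blog">Rand</a></li></ul>'
)
print(link_types(sample, "http://randfishkin.com/blog"))
```

Links that come back with an empty list have no telltale container and are the better candidates for manual review.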

Why you might use this:
Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor text

Improve workflow: Ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis

REGEX TO THE RESCUE

Use Case 4:
I’ve scraped some data, now I need to extract some small portion of it that XPath can’t do on its own (easily)

Use regular expressions to pattern match structured text.

Example:
Extract all @mentions of a specific user from a tweet or page

EXTRACT @MENTIONS

/(?:^|\s)@([A-Za-z0-9_]+)/gi
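A quick way to try the pattern is a short script. This sketch assumes Python's re module; the pattern requires start-of-text or whitespace before the @, so email addresses are skipped. The function names and the sample text are hypothetical.

```python
import re

# Start of text or whitespace, then @, then a run of letters, digits, or underscores.
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def mentions(text):
    """Return every @username in text, without the leading @."""
    return MENTION.findall(text)


def mentions_of(text, user):
    """Count how often one specific user is mentioned, ignoring case."""
    return sum(1 for name in mentions(text) if name.lower() == user.lower())


sample = "@eppievojt gave a talk; thanks @jplcreative and @EppieVojt! Mail a@b.com"
print(mentions(sample))
print(mentions_of(sample, "eppievojt"))
```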

Why you might use this:
Pull contact information from a web site (Twitter username, email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research

BEYOND THE SPREADSHEET

Use Case 5:
I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.

Scraping outside the spreadsheet allows for more complex systems to be built.

PHP Scraping Overview:
1) cURL target page
2) Convert to DOM object
3) Run XPath queries
4) Store data or hit API
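The four steps above can be sketched end to end. This is not the PHP scraper class from the talk; it is a minimal illustration, assuming Python's standard library, with urllib.request in the role of cURL, html.parser in place of a DOM object plus XPath queries, and a CSV file as the data store. All function and class names and the sample markup are hypothetical.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url, timeout=10):
    """Step 1: download the page (the job cURL does in the PHP version)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class AnchorParser(HTMLParser):
    """Steps 2-3: parse the markup and pull out (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = AnchorParser()
    parser.feed(html)
    return parser.rows


def store(rows, out):
    """Step 4: write results as CSV; swap in a database insert or API call here."""
    writer = csv.writer(out)
    writer.writerow(["href", "anchor_text"])
    writer.writerows(rows)


# Offline demonstration with hypothetical markup; fetch() is not called here.
html = '<p><a href="http://www.eppie.net/">Eppie</a> and <a href="/about">About</a></p>'
buffer = io.StringIO()
store(extract_links(html), buffer)
print(buffer.getvalue())
```

For a live page, pass the result of fetch(url) to extract_links() and hand store() an open file instead of the in-memory buffer.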

Simple PHP Scraper Class:
http://www.scrapeeverything.com

SHOW SOME LOVE

I’m @eppievojt and I work for @jplcreative

eppie.net
linkdetective.com
jplcreative.com

Title slide: The SEO’s Guide to Scraping Everything, by @eppievojt, digital marketing consultant, JPL
