JR Oakes | @jroakes | #TechSEOBoost
JR Oakes
Building a Simple Crawler on a Toy Internet
About Me
Senior Director, Technical SEO Research, at @LocomotiveSEO
Passionate about:
• Development
• Learning
• Community
• Technology
About Me
• Write some and do the Twitter thing
• Share as much as I can on GitHub
• Love to organize meetups
• Always testing something
• Love the brilliant team at Locomotive
What we will learn
• Overview of Crawling Landscape
• Key Components of Crawler
• Building a Toy Internet
• Building a Crawler and Renderer
Overview of Crawling Landscape
The Web is Big
We have worked on sites with as many as a billion potential pages. Google only crawls (or knows about) a fraction of those.
• Crawled
• Want to Crawl (frontier)
• Unseen (or not wanted to be seen)
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
The Web is Big
PageRank (or another node popularity metric) is a good way to decide how deep to go. The hypothesis is that a measurement of node popularity can be used to deprioritize links from very unpopular nodes.
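As a rough illustration of the node popularity idea, here is a minimal power-iteration PageRank over a tiny link graph. The graph, the page names, the damping factor of 0.85, and the iteration count are all assumptions made for this sketch; it is not the code from the released crawler.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph, for illustration only.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about", "orphan-ish"],
    "orphan-ish": [],
}
ranks = pagerank(toy_links)
# Most popular nodes first; links found on the last entries could be deprioritized.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
```

Sorting by rank gives one simple way to decide which nodes' outlinks go to the back of the line.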
The Web is Big
Google has over 25 BILLION results in their inverted index.
What a crawler must do
• Be robust. Handle spider traps and malicious behavior.
• Be distributed. Run across many machines.
• Be scalable. Easy to add more machines.
• Be efficient. Use network and processing resources wisely.
• Prioritize. Know the quality and priority of pages.
• Operate continuously.
• Be adaptable. Easy to change with new data / web needs.
• Be a good citizen. Respect robots.txt and server load.
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
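The "good citizen" requirement is the easiest one to try out. Below is a small sketch using Python's standard-library urllib.robotparser. The robots.txt content, the "toy-crawler" user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so no network call is needed.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

USER_AGENT = "toy-crawler"  # assumed name, not from the talk


def allowed(url):
    """Return True if robots.txt lets our user agent fetch this URL."""
    return robots.can_fetch(USER_AGENT, url)


# Seconds to wait between requests to this host, if the site asks for it.
delay = robots.crawl_delay(USER_AGENT)
```

A crawler would call allowed() before every fetch and feed delay into whatever manages per-host politeness.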
Key Components of Crawler
Basic Crawl Architecture
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
My Inferred Crawl Architecture
Hard to believe Google is wasting resources to render something that has not changed in 40 years.
Key Learnings
• The frontier is split into two sections: a Front Queue that manages priority and a Back Queue that manages politeness
• All queues are FIFO
• Each host has its own Back Queue
• MinHashes (sketches) are an effective way of deduping content
• Duplicates vs. near duplicates are measured by edit distance
• Everything is cached to reduce latency
• URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/)
• There are interesting things that can happen in the DOM rather than just parsing the retrieved URL
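The front queue / back queue split can be sketched in a few lines. This is a heavily simplified, single-process take on the idea: the Frontier class, its method names, the three priority levels, and the one-second delay are assumptions for illustration, and time is passed in explicitly so the behavior is easy to follow. A production frontier refills back queues more carefully than this.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


class Frontier:
    """Toy frontier: FIFO front queues by priority, one FIFO back queue per host."""

    def __init__(self, priorities=3, delay=1.0):
        self.front = [deque() for _ in range(priorities)]  # index 0 = highest priority
        self.back = {}      # host -> deque of URLs waiting for that host
        self.next_ok = {}   # host -> earliest time the host may be hit again
        self.seen = set()
        self.delay = delay

    def add(self, base_url, href, priority):
        # Normalize relative links at parse time, e.g. /page-path/ -> https://domain/page-path/
        url = urljoin(base_url, href)
        if url not in self.seen:
            self.seen.add(url)
            self.front[priority].append(url)

    def _refill(self):
        # Drain front queues in priority order into the per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                self.back.setdefault(host, deque()).append(url)
                self.next_ok.setdefault(host, 0.0)

    def next_url(self, now):
        """Return the next URL whose host is ready, or None if every host must wait."""
        self._refill()
        for host, queue in self.back.items():
            if queue and self.next_ok[host] <= now:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None


frontier = Frontier(delay=1.0)
frontier.add("https://a.example/", "/low", priority=2)
frontier.add("https://a.example/", "/high", priority=0)
frontier.add("https://b.example/", "page", priority=1)

first = frontier.next_url(now=0.0)
second = frontier.next_url(now=0.0)   # a.example is cooling down, so b.example goes next
third = frontier.next_url(now=0.0)    # nothing is ready yet
fourth = frontier.next_url(now=1.0)   # a.example may be fetched again
```

The priority decision and the politeness decision stay separate, which is the point of having two stages.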
Building a Toy Internet
Criteria
• Build quickly with topically similar pages for each site
• Exist on separate domains
• Linked to each other, but not to any other pages on the internet
• Contain basic SEO elements like title, description, canonical, etc.
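To show what "basic SEO elements" means from the crawler's side, here is a small sketch that pulls the title, meta description, and canonical out of a page with Python's standard-library html.parser. The SeoElementParser class and the sample page are hypothetical, and real pages need more defensive handling than this.

```python
from html.parser import HTMLParser


class SeoElementParser(HTMLParser):
    """Collect title, meta description, and canonical URL from an HTML document."""

    def __init__(self):
        super().__init__()
        self.elements = {"title": "", "description": None, "canonical": None}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.elements["description"] = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.elements["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.elements["title"] += data.strip()


# Made-up page in the spirit of the toy internet.
page = """<html><head>
<title>Toy Page</title>
<meta name="description" content="A page on the toy internet.">
<link rel="canonical" href="https://example.github.io/toy-page/">
</head><body><p>Hello</p></body></html>"""

seo_parser = SeoElementParser()
seo_parser.feed(page)
seo = seo_parser.elements
```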
Solution
• GitHub Pages
• Jekyll
• Wikipedia
• Python
• search-engine-optimization-blog.github.io
• data-science-blog.github.io
• python-software.github.io
PBN Maker 3000
Building a Crawler and Renderer
Step One
I have no idea how to start. So let’s do some research.
I <3 GitHub
Step Two
I don’t want to reinvent the wheel, so let’s see what is already out there that I can use.
Step Three
A lot of coffee
… and some beer.
A little help along the way
Streamlit is the first app framework specifically for Machine Learning and Data Science teams. So you can stop spending time on frontend development and get back to what you do best.
Criteria
• Use existing libraries where possible
• Be hardy enough to crawl my toy internet
• Make it as simple and approachable as possible (e.g. I use Pandas a lot)
• Stay as true as possible to what Google is known to do
• Process linearly. No threading or extra services
• Include unit testing
• Include a Jupyter Notebook
• Include READMEs
• Include a simple indexer and search apparatus to play with results (Thanks John M.!)
Parts
• PageRank
• Chrome Headless Rendering
• Text NLP Normalization
• BERT Embeddings
• Robots
• Duplicate Content Shingling
• URL Hashing
• Document Frequency Functions (BM25 and TFIDF)
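Of the parts above, the document frequency functions are the quickest to show in isolation. Here is a minimal BM25 scorer, assuming whitespace tokenization, the common defaults k1=1.5 and b=0.75, and the non-negative IDF variant with +1 inside the log. The function name and sample documents are made up; this is a sketch of the formula, not the indexer that ships with the crawler.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-index.
docs = [
    "python crawler for a toy internet",
    "data science with pandas",
    "search engine optimization blog",
]
scores = bm25_scores("toy crawler", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

TF-IDF follows the same shape without the saturation and length normalization terms.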
Learnings
Embeddings
https://github.com/huggingface/transformers
Learnings
• Applying PageRank to similar document clusters is an effective way of picking the right one.
• Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
• Index compression techniques made my eyes glaze over.
• BERT models need all (or most) of the content.
• BERT is easily accessible.
• I made some things way simpler than they would be in real life.
• SentencePiece and BPE encoding are revolutionary for indexes and NLG.
• A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
• MinHash comparison made it easy to check rendered content against crawled content.
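The MinHash point is easier to see with a small example. The sketch below shingles text into three-word windows, builds a 64-value MinHash signature using salted BLAKE2b from hashlib, and estimates similarity as the fraction of matching signature slots. The shingle size, signature length, hash choice, and sample strings are all assumptions for illustration, not the settings used in the released crawler.

```python
import hashlib


def shingles(text, size=3):
    """Set of overlapping word windows; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}


def minhash(text, num_hashes=64):
    """Signature: for each salted hash function, keep the minimum over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Made-up example: text from the raw HTML vs. text after rendering.
crawled = "Welcome to the toy internet a small set of linked pages"
rendered_same = "Welcome to the toy   internet a small set of linked pages"
rendered_changed = crawled + " plus a widget injected by JavaScript after load"

unchanged = similarity(minhash(crawled), minhash(rendered_same))
changed = similarity(minhash(crawled), minhash(rendered_changed))
```

A score of 1.0 suggests rendering added nothing, while a lower score flags pages where the rendered DOM differs from the fetched HTML.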
Result
A crawler written in Python that we are releasing as open source.
Keep in mind:
1. This was written in a month
2. Google engineers would laugh at it
3. It probably has bugs
4. It is really fun to play around with
Result
We also built a simple UI in Streamlit so you can play around with the results and parameters.
Result
Complete with Ads!
Thank You
Start playing at the link below
https://locomotive.agency/coal-crawler-renderer-indexer-caboose
–
Find me on Twitter at: @jroakes
