JR Oakes | @jroakes | #TechSEOBoost
#TechSEOBoost | @CatalystSEM
THANK YOU TO THIS YEAR’S SPONSORS
What I Learned Building a Toy Example to
Crawl & Render like Google
JR Oakes, Locomotive
JR Oakes
Building a Simple Crawler on
a Toy Internet
About Me
Senior Director, Technical SEO Research, at
@LocomotiveSEO
Passionate about:
• Development
• Learning
• Community
• Technology
About Me
• Write some and do the Twitter thing.
• Share as much as I can on Github.
• Love to organize meetups
• Always testing something
• Love the brilliant team at Locomotive
What we will learn
• Overview of Crawling Landscape
• Key Components of Crawler
• Building a Toy Internet
• Building a Crawler and Renderer
Overview of Crawling
Landscape
The Web is Big
We have worked on sites with as many as a
billion potential pages. Google only crawls
(or knows about) a fraction of those.
• Crawled
• Want to Crawl (frontier)
• Unseen (or not wanted to be seen)
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
The Web is Big
PageRank (or another node popularity metric) is a
good way to decide how deep to crawl.
The hypothesis is that a measure of node
popularity lets the crawler deprioritize links
found on very unpopular nodes.
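To make the hypothesis concrete, here is a minimal sketch that computes PageRank by power iteration on a tiny link graph and uses the scores to order what gets crawled. The graph, the page names, the damping factor of 0.85 and the function name `pagerank` are assumptions made up for illustration; this is not the code from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {page: [outlinked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: "home" is linked from everywhere, "orphan" from nowhere.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(toy_links)
# Links found on popular pages are crawled first; unpopular pages go last.
crawl_order = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the page with no inbound links ends up at the bottom of `crawl_order`, which is the deprioritization the slide describes.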
The Web is Big
Google has over 25 BILLION results in
their inverted index.
What a crawler must do
• Be robust. Handle spider traps and malicious behavior.
• Be distributed. Run across many machines.
• Be scalable. Easy to add more machines.
• Be efficient. Use network and processing resources wisely.
• Prioritize. Know the quality and priority of pages.
• Operate continuously.
• Be adaptable. Easy to change with new data / web needs.
• Be a good citizen. Respect robots.txt and server load.
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
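The "good citizen" requirement can be sketched with Python's standard `urllib.robotparser`. The robots.txt body, the `toy-crawler` user-agent name and the helper `is_allowed` are assumptions for illustration; a real crawler would fetch robots.txt from each host before crawling it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "toy-crawler"  # assumed user-agent name


def is_allowed(url):
    """Return True when the parsed robots.txt permits fetching the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds to wait between requests to this host (None if not specified).
delay = parser.crawl_delay(USER_AGENT)
```

A crawler would call `is_allowed` before queuing a URL and use `delay` to pace requests to the same host.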
Key Components of
Crawler
Basic Crawl Architecture
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
My Inferred Crawl Architecture
Hard to believe Google is wasting
resources to render something
that has not changed in 40 years.
Key Learnings
• The frontier is broken into two sections: a front queue that manages priority and a back queue that manages politeness
• All queues are FIFO
• Each host has its own back queue
• MinHashes (sketches) are an effective way of deduplicating content
• Duplicates and near duplicates are told apart by edit distance
• Everything is cached to reduce latency
• URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/)
• Interesting things can happen in the DOM that are missed by only parsing the retrieved URL
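The front queue and back queue split, plus URL normalization at the parser, can be sketched as follows. This is a deliberately simplified toy under stated assumptions: the class name `Frontier`, the one-second politeness gap, the integer priorities and the simulated clock passed in as `now` are all hypothetical, and priority only affects the order in which URLs enter the per-host queues.

```python
import heapq
from collections import deque
from urllib.parse import urljoin, urlsplit


class Frontier:
    """Toy frontier: a prioritized front queue feeding per-host FIFO back queues."""

    def __init__(self, delay=1.0):
        self.delay = delay    # assumed politeness gap per host, in seconds
        self.front = []       # heap of (priority, insertion order, url)
        self.back = {}        # host -> deque of URLs (FIFO)
        self.ready_at = {}    # host -> earliest time of the next fetch
        self.seen = set()
        self.counter = 0

    def add(self, base_url, href, priority=1):
        # Normalize relative links against the page they were found on.
        url = urljoin(base_url, href)
        if url in self.seen:
            return
        self.seen.add(url)
        heapq.heappush(self.front, (priority, self.counter, url))
        self.counter += 1

    def _refill(self):
        # Drain the front queue, in priority order, into the host back queues.
        while self.front:
            _, _, url = heapq.heappop(self.front)
            host = urlsplit(url).netloc
            self.back.setdefault(host, deque()).append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        self._refill()
        for host, queue in self.back.items():
            if queue and self.ready_at.get(host, 0.0) <= now:
                self.ready_at[host] = now + self.delay
                return queue.popleft()
        return None


frontier = Frontier(delay=1.0)
frontier.add("https://a.example/", "/page-path/", priority=0)
frontier.add("https://a.example/", "/other/", priority=1)
frontier.add("https://b.example/", "/", priority=1)
first = frontier.next_url(now=0.0)   # highest priority URL on host a
second = frontier.next_url(now=0.0)  # host a is cooling down, so host b is served
third = frontier.next_url(now=0.0)   # nothing is polite to fetch yet
later = frontier.next_url(now=1.0)   # host a is ready again
```

Keeping one FIFO queue per host is what lets the crawler stay polite to a single server while still making progress on other hosts.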
Building a Toy Internet
Criteria
• Build quickly with topically similar pages for
each site
• Exist on separate domains
• Linked to each other, but not to any other
pages on the internet
• Contain basic SEO elements like title,
description, canonical, etc
Solution
• Github Pages
• Jekyll
• Wikipedia
• Python
• search-engine-optimization-blog.github.io
• data-science-blog.github.io
• python-software.github.io
PBN Maker 3000
Building a Crawler and
Renderer
Step One
I have no idea how to start. So
let’s do some research.
I <3 Github
Step Two
I don’t want to reinvent the wheel,
so let’s see what is already out
there that I can use.
Step Three
A lot of coffee
… and some beer.
A little help along the way
Streamlit is the first app
framework specifically for
Machine Learning and
Data Science teams.
So you can stop spending time on
frontend development and get
back to what you do best.
Criteria
• Use existing libraries where possible
• Be hardy enough to crawl my toy internet
• Make it as simple and approachable as possible (e.g. I use Pandas
a lot)
• Stay as true as possible to what is known about how Google works
• Process linearly. No threading or extra services
• Include unit testing
• Include a Jupyter Notebook
• Include READMEs
• Include a simple indexer and search apparatus to play with results
(Thanks John M.!)
Parts
• PageRank
• Chrome Headless Rendering
• Text NLP Normalization
• Bert Embeddings
• Robots
• Duplicate Content Shingling
• URL Hashing
• Document Frequency Functions (BM25 and TFIDF)
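As one example from the list above, BM25 can be sketched in a few lines. This uses the commonly cited Okapi BM25 formula with a smoothed, non-negative idf; the parameter values `k1=1.5` and `b=0.75`, the function name `bm25_scores` and the three pre-tokenized documents are assumptions for illustration and may differ from what the released crawler does.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf that never goes negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up, pre-tokenized documents for illustration.
docs = [
    "python crawler for a toy internet".split(),
    "rendering pages with headless chrome".split(),
    "a python python tutorial".split(),
]
scores = bm25_scores("python crawler".split(), docs)
best = scores.index(max(scores))
```

The first document wins here because it matches both query terms, and the rarer term "crawler" carries more weight than the more common "python".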
Learnings
Embeddings
https://github.com/huggingface/transformers
Learnings
• Applying PageRank to similar document clusters is an effective way of picking the right one.
• Deciding where to process and where (and when) to update values is hard. (e.g. canonical tags for
crawling and consolidation in HTML vs Rendered).
• Index compression techniques made my eyes glaze over.
• BERT models need all the (or most of) content.
• BERT is easily accessible.
• I made some things way simpler than they would be in real life.
• SentencePiece and BPE encoding are revolutionary for indexes and NLG
• A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
• MinHash made it easy to compare rendered content against crawled content.
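The MinHash comparison in the last point can be sketched with word shingles and seeded hashes. The shingle size of 3, the 64 hash functions, the use of SHA-256 as the hash, the function names and the sample strings are all assumptions for illustration, not the talk's implementation.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(shingle_set, num_hashes=64):
    """Build a MinHash sketch: the minimum hash value per seeded hash function."""
    sketch = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sketch.append(min(
            int.from_bytes(hashlib.sha256(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sketch


def similarity(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch slots."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Made-up strings standing in for the crawled HTML text and the rendered DOM text.
crawled = "a toy crawler that fetches pages and extracts links from html"
rendered = "a toy crawler that fetches pages and extracts links from the dom"
unrelated = "streamlit makes it quick to build small data apps in python"

same = similarity(minhash(shingles(crawled)), minhash(shingles(crawled)))
near = similarity(minhash(shingles(crawled)), minhash(shingles(rendered)))
far = similarity(minhash(shingles(crawled)), minhash(shingles(unrelated)))
```

Because only the fixed-size sketches are compared, the crawled and rendered versions of a page can be checked against each other without storing or diffing the full text.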
Result
A crawler written in Python that we are releasing as
open source.
Keep in mind:
1. This was written in a month
2. Google engineers would laugh at it
3. It probably has bugs
4. It is really fun to play around with
Result
We also built a simple UI in
Streamlit so you can play
around with the results and
parameters.
Result
Complete with Ads!
Thank You
Start playing at the link below
https://locomotive.agency/coal-crawler-renderer-indexer-caboose
–
Find me on Twitter at: @jroakes
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/

Automate, Create Tools, & Test Ideas Quickly with Google Apps Script
 
The User is The Query: The Rise of Predictive Proactive Search
The User is The Query: The Rise of Predictive Proactive SearchThe User is The Query: The Rise of Predictive Proactive Search
The User is The Query: The Rise of Predictive Proactive Search
 
The Ultimate Pagination for SEO
The Ultimate Pagination for SEOThe Ultimate Pagination for SEO
The Ultimate Pagination for SEO
 
Crawl Budget Conqueror - Take Control of Your Crawl Budget
Crawl Budget Conqueror - Take Control of Your Crawl BudgetCrawl Budget Conqueror - Take Control of Your Crawl Budget
Crawl Budget Conqueror - Take Control of Your Crawl Budget
 

What I Learned Building a Toy Example to Crawl & Render like Google

  • 1. JR Oakes | @jroakes | #TechSEOBoost | @CatalystSEM. Thank you to this year’s sponsors. What I Learned Building a Toy Example to Crawl & Render like Google. JR Oakes, Locomotive
  • 2. Building a Simple Crawler on a Toy Internet (JR Oakes)
  • 3. About Me: Senior Director, Technical SEO Research, at @LocomotiveSEO. Passionate about:
      • Development
      • Learning
      • Community
      • Technology
  • 4. About Me
      • Write some and do the Twitter thing.
      • Share as much as I can on Github.
      • Love to organize meetups
      • Always testing something
      • Love the brilliant team at Locomotive
  • 5. What we will learn
  • 6. What we will learn
      • Overview of Crawling Landscape
      • Key Components of Crawler
      • Building a Toy Internet
      • Building a Crawler and Renderer
  • 7. Overview of Crawling Landscape
  • 8. The Web is Big: We have worked on sites with as many as a billion potential pages. Google only crawls (or knows about) a fraction of those.
      • Crawled
      • Want to Crawl (frontier)
      • Unseen (or not wanted to be seen)
    Ref: Crawling and Duplicates, Chris Manning and Pandu Nayak (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
  • 9. The Web is Big: PageRank (or another node popularity metric) is a good way to decide how deep to crawl. The hypothesis is that a measure of node popularity can be used to deprioritize links from very unpopular nodes.
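To make the node popularity idea concrete, here is a minimal PageRank sketch using plain power iteration. The `toy_web` graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, not values from the talk or from the released crawler.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph; "orphan" has no inbound links.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
```

A crawler could use scores like these to push links found on low-ranked pages (here, `orphan`) toward the back of the frontier.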
  • 10. The Web is Big: Google has over 25 BILLION results in their inverted index.
  • 11. What a crawler must do
      • Be robust. Handle spider traps and malicious behavior.
      • Be distributed. Run across many machines.
      • Be scalable. Easy to add more machines.
      • Be efficient. Use network and processing resources wisely.
      • Prioritize. Know the quality and priority of pages.
      • Operate continuously.
      • Be adaptable. Easy to change with new data / web needs.
      • Be a good citizen. Respect robots.txt and server load.
    Ref: Crawling and Duplicates, Chris Manning and Pandu Nayak (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
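The "good citizen" requirement can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `toy-crawler` user agent, and the example.com URLs below are made up for illustration; a real crawler would fetch each host's robots.txt and cache it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


USER_AGENT = "toy-crawler"  # assumed name, not the real crawler's user agent
parser = build_parser(ROBOTS_TXT)

allowed = parser.can_fetch(USER_AGENT, "https://example.com/blog/post")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests
```

Checking `can_fetch` before every request and honoring `crawl_delay` covers both halves of the bullet: robots.txt and server load.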
  • 12. Key Components of Crawler
  • 13. Basic Crawl Architecture. Ref: Crawling and Duplicates, Chris Manning and Pandu Nayak (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
  • 14. My Inferred Crawl Architecture
  • 15. My Inferred Crawl Architecture: Hard to believe Google is wasting resources to render something that has not changed in 40 years.
  • 16. Key Learnings
      • The frontier is split into two sections: a Front Queue that manages priority and a Back Queue that manages politeness.
      • All queues are FIFO.
      • Each host has its own Back Queue.
      • Min Hashes (Sketches) are an effective way of deduping content.
      • Duplicates vs. near duplicates are measured by edit distance.
      • Everything is cached to reduce latency.
      • URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/).
      • Interesting things can happen in the DOM, beyond just parsing the retrieved URL.
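The front queue / back queue split can be sketched as follows. This is a simplified illustration under stated assumptions, not the architecture from the slides or the released code: the `Frontier` class, its method names, the three priority levels, and the one-second delay are all hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Front queues order URLs by priority; per-host back queues enforce politeness."""

    def __init__(self, levels=3, delay=1.0):
        self.front = [deque() for _ in range(levels)]  # FIFO per priority (0 = highest)
        self.back = {}      # host -> FIFO deque of URLs waiting for that host
        self.next_ok = {}   # host -> earliest time the host may be fetched again
        self.heap = []      # (earliest fetch time, host) for hosts with queued URLs
        self.delay = delay  # assumed politeness delay in seconds

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _drain_front(self):
        # Move URLs into host back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (self.next_ok.get(host, 0.0), host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return the next URL that may be fetched at time `now`, or None."""
        self._drain_front()
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_ok[host], host))
        else:
            del self.back[host]
        return url


frontier = Frontier(delay=1.0)
frontier.add("https://a.example/1", priority=1)
frontier.add("https://a.example/2", priority=0)
frontier.add("https://b.example/1", priority=2)

first = frontier.next_url(now=0.0)   # highest-priority URL on host a
second = frontier.next_url(now=0.0)  # host a is cooling down, so host b goes next
third = frontier.next_url(now=0.5)   # nothing is polite to fetch yet
fourth = frontier.next_url(now=1.0)  # host a is allowed again
```

The heap keyed on "earliest allowed fetch time" is what keeps a single host from being hit repeatedly, while the front queues decide which URL each host serves first.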
  • 17. Building a Toy Internet
  • 18. Criteria
      • Build quickly with topically similar pages for each site
      • Exist on separate domains
      • Linked to each other, but not to any other pages on the internet
      • Contain basic SEO elements like title, description, canonical, etc.
  • 19. Solution
      • Github Pages
      • Jekyll
      • Wikipedia
      • Python
      • search-engine-optimization-blog.github.io
      • data-science-blog.github.io
      • python-software.github.io
  • 20. PBN Maker 3000
  • 21. PBN Maker 3000
  • 22. Building a Crawler and Renderer
  • 23. Step One: I have no idea how to start. So let’s do some research. I <3 Github
  • 24. Step Two: I don’t want to reinvent the wheel, so let’s see what is already out there that I can use.
  • 25. Step Three: A lot of coffee … and some beer.
  • 26. A little help along the way: Streamlit is the first app framework specifically for Machine Learning and Data Science teams. So you can stop spending time on frontend development and get back to what you do best.
  • 27. Criteria
      • Use existing libraries where possible
      • Be hardy enough to crawl my toy internet
      • Make it as simple and approachable as possible (e.g. I use Pandas a lot)
      • Try to be as true as possible to what is known about what Google does
      • Process linearly. No threading or extra services
      • Include unit testing
      • Include a Jupyter Notebook
      • Include READMEs
      • Include a simple indexer and search apparatus to play with results (Thanks John M.!)
  • 28. Parts
      • PageRank
      • Chrome Headless Rendering
      • Text NLP Normalization
      • BERT Embeddings
      • Robots
      • Duplicate Content Shingling
      • URL Hashing
      • Document Frequency Functions (BM25 and TFIDF)
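As an illustration of the document frequency functions in the parts list, here is a small BM25 scorer. It is a sketch of the standard Okapi BM25 formula with whitespace tokenization; the function name, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions and are not taken from the released crawler.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents get less credit per occurrence.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus in the spirit of the toy sites.
docs = [
    "python crawler tutorial",
    "seo blog about crawl budget",
    "python data science blog",
]
scores = bm25_scores("python crawler", docs)
```

Rare terms (here `crawler`) contribute more than common ones (`python`), which is the same intuition TF-IDF captures.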
  • 29. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. Rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
  • 30. Learnings
  • 31. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. Rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
  • 32. Learnings: Embeddings. https://github.com/huggingface/transformers
  • 33. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. Rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
      • I made some things waaaaayy simpler than they would be in real life.
  • 34. Learnings
  • 35. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. Rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
      • I made some things way simpler than they would be in real life.
      • SentencePiece and BPE encoding are revolutionary for indexes and NLG.
      • A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
      • MinHash comparison made it easy to check rendered content against crawled content.
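The MinHash comparison of crawled versus rendered content can be sketched like this. It is an illustrative version built on word shingles and seeded SHA-1 hashes; the function names, the shingle size of 3, the 64 hash functions, and the sample texts are assumptions, and the released crawler may do this differently.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash(text, num_hashes=64, size=3):
    """Build a MinHash signature: the smallest hash value per seeded hash function."""
    grams = shingles(text, size)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{gram}".encode()).digest()[:8], "big")
            for gram in grams
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical page text before and after JavaScript rendering.
raw_html_text = "Toy internet page about python crawlers and rendering pipelines for search engines"
rendered_text = raw_html_text + " loaded by javascript"
unrelated_text = "completely different words here with nothing shared at all in any place"

same = similarity(minhash(raw_html_text), minhash(raw_html_text))
close = similarity(minhash(raw_html_text), minhash(rendered_text))
far = similarity(minhash(raw_html_text), minhash(unrelated_text))
```

A low similarity between the raw and rendered signatures is a cheap signal that rendering changed the page enough to be worth indexing the rendered version.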
  • 36. Result: A crawler written in Python that we are releasing as open source. Keep in mind:
      1. This was written in a month
      2. Google engineers would laugh at it
      3. It probably has bugs
      4. It is really fun to play around with
  • 37. Result: We also built a simple UI in Streamlit so you can play around with the results and parameters.
  • 38. Result: Complete with Ads!
  • 39. Thank You: Start playing at the link below. https://locomotive.agency/coal-crawler-renderer-indexer-caboose Find me on Twitter at: @jroakes
  • 40. Thanks for Viewing the Slideshare! Watch the Recording: https://youtube.com/session-example Or contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/