Crawl Budget - Some Insights & Ideas @ seokomm 2015

The slides of my talk at the seokomm in Salzburg, Austria (November 2015).

It covers the basics of web crawling, with a focus on search engine bots. It is relevant in the SEO space as well as in the general webmaster world. Besides quotes from key people at Bing, Yandex and Google, it gives actionable advice on how you can influence the Crawl Budget allocation of search engine spiders.

In the context of AJAX / JavaScript crawling, there is a brief excursion into the world of AngularJS.

The talk closes with some insights into how we wrote our own CMS in order to cover all the SEO needs we are facing (multi-language, multi-template, caching, if-modified-since, etc.).

If you like this talk, please follow me on Twitter:
http://twitter.com/jhmjacob

And don't miss signing up for the free account of OnPage.org:
http://onpa.ge/V141p

Last but not least -> visit the seokomm next year if you are around - it's worth it :)
http://www.seokomm.at/


Crawl Budget - Some Insights & Ideas @ seokomm 2015

  1. 1. Crawl Budget Some Insights + Ideas Jan Hendrik Merlin Jacob Founder + CTO Twitter: @jhmjacob
 Email: jhm@onpage.org
 LinkedIn: linkedin.com/in/jhmjacob

  2. 2. ! @jhmjacob Agenda » Philosophy » Parameters to influence Crawl Budget » Best practice & next steps
  3. 3. ! @jhmjacob Crawl Budget Definition The resources (aka money) Google invests in 
 your website by sending its crawlers
  4. 4. ! @jhmjacob Philosophy What would you do,
 if you were Google?
  5. 5. ! @jhmjacob Primary Target: 
 Make money!
 Secondary Target:
 The best search results Philosophy
  6. 6. ! @jhmjacob With their crawlers Google invests money to find the “best” webpages -
 in order to provide the best search results. Philosophy
  7. 7. ! @jhmjacob Problem 1: 
 The size of the web is infinite
 Problem 2: 
 Even Google's resources are limited Philosophy
  8. 8. ! @jhmjacob Source: Netcraft
  9. 9. ! @jhmjacob Size of the Google index: Something between 5 billion
 and 1 trillion documents* (means: around 5-1000 pages per domain) * = As a matter of fact, there is no real data on this. 
 Probably even Google doesn’t know. Philosophy
  10. 10. ! @jhmjacob Conclusion Search engines like Google have to
 constantly decide if they continue
 spending resources on the 
 current website or rather go to another.
  11. 11. ! @jhmjacob What is Bing saying about this? “By providing clear, deep, easy to find 
 content on your website, we are more likely 
 to index and show your content in search results.” More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  12. 12. ! @jhmjacob “clear” » Distinct Canonical settings » Valid redirects (not via Meta-Refresh!) » Exactly one main headline (H1) per page » Title, description, alt, links to relevant (!) content » Standard HTML links (“No Rich Media like JS or Flash”) » Clean and readable HTML site-navigation » Clean and normalized URL structure » “Clear keyword focus” What is Bing saying about this? More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  13. 13. ! @jhmjacob “deep” » No “Thin content” » “Do not copy from other websites” » Be as relevant as possible for one topic (“Holistic”) » Keep your pages updated (“freshness”) What is Bing saying about this? More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  14. 14. ! @jhmjacob “easy to find” » Clean and up-to-date Sitemap.xml (last-mod!) » “keep valuable content close to the home page” 
 (aka short click-path aka “page level”) » “use targeted keywords wherever possible”
 (regarding internal linking) » Well structured navigation
 (found in URL + Breadcrumbs) What is Bing saying about this? More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
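To illustrate the “last-mod!” point from the slide above: a minimal Python sketch (my illustration, not from the talk) that derives a sitemap <lastmod> value from a page file's modification time. The URL and file path are placeholders.

```python
import os
from datetime import datetime, timezone

def sitemap_entry(url, path):
    """One sitemap <url> entry; <lastmod> is taken from the file's mtime."""
    mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return (f"<url><loc>{url}</loc>"
            f"<lastmod>{mtime.strftime('%Y-%m-%d')}</lastmod></url>")

# Hypothetical article file - adjust URL and path to your setup:
print(sitemap_entry("https://example.com/article", "article/index.html"))
```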
  15. 15. ! @jhmjacob Between the lines: » Sitemap.xml is used to identify new articles and get 
 them indexed asap. » If the system recognizes regular updates on a page,
 it will be crawled more frequently. » Relevancy of a page is calculated based on internal 
 (& external) links as well as the “click distance from
 the homepage” (aka “page-level”). » Pagespeed matters: Otherwise Bounce-Rate can
 have negative effects on crawl budget (+ rankings) What is Bing saying about this? More: https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
  16. 16. ! @jhmjacob What is Yandex saying about this? More: https://yandex.com/support/webmaster/yandex-indexing/webmaster-advice.xml Summary of the Webmaster Guidelines: » Do not use cloaking » Do not use auto-generated / gibberish text » No thin content » No hidden text » Popups + Pop-unders = Bad Quality Indicator » Do not do “User Behaviour Emulation”
  17. 17. ! @jhmjacob What is Google saying about this? “The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.” More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  18. 18. ! @jhmjacob Reminder » Internal Links are responsible
 for passing Pagerank through your
 pages 
 (Some believe Pagerank is only
 generated out of external Backlinks) » Pagerank “0 to 10” is just a simplified
 display for humans. In reality this score
 is way more precise.
  19. 19. ! @jhmjacob “Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. There are a large number of pages on the web that have very little or close to zero PageRank. The pages that get linked to a lot tend to get discovered and crawled quite quickly. The lower PageRank pages are likely to be crawled not quite as often.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  20. 20. ! @jhmjacob “If we can only take two pages from a site at any given time, and we are only crawling over a certain period of time, that can then set some sort of upper bound on how many pages we are able to fetch from that host.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  21. 21. ! @jhmjacob “Imagine we crawl three pages from a site, and then we discover that the two other pages were duplicates of the third page. We’ll drop two out of the three pages and keep only one, and that’s why it looks like it has less good content. So we might tend to not crawl quite as much from that site.
 …
 If there are a large number of pages that we consider low value, then we might not crawl quite as many pages from that site, but that is independent of rel=canonical.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  22. 22. ! @jhmjacob “If you link to three pages that are duplicates, a search engine might be able to realize that those three pages are duplicates and transfer the incoming link juice to those merged pages.” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  23. 23. ! @jhmjacob “There are some things that we will run a HEAD for. For example, our image crawl may use HEAD requests because images might be much, much larger in content than web pages…In terms of crawling the web and text content and HTML, we’ll typically just use a GET and not run a HEAD query first” What is Google saying about this? More: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
  24. 24. ! @jhmjacob » “There is also not a hard limit on our crawl.” » Pages with higher Pagerank will get crawled more often » Free crawling resources will be spent on low-PR pages,
 but chances the bot will leave the page are higher
 (how are they chosen?!) » You compete against all other pages. Give the bots
 reasons to stay. » Limitation is not based on “Amount of URLs”, rather in the 
 form of “Machine-Hours” (time-based limits)
 (Loadtime matters!) » Bad page-quality + bad content metrics can scare away bots
 (Exit-Condition like “Amount of Unique Content / Time”) » Google tries to avoid waste of bandwidth
 (HEAD Requests for images + if-modified-since) What is Google saying about this?
  25. 25. ! @jhmjacob Google Search Console
  26. 26. ! @jhmjacob Searchability Definitions (aka Findability)
  27. 27. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability)
  28. 28. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Crawlability Is your Webpage (URL) accessible for crawlers?
  29. 29. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Indexability Should the crawled, extracted and interpreted content be added to a search index?
  30. 30. ! @jhmjacob ility! Crawlability + Indexability + Rankability = Searchability (aka Findability) Rankability Should a particular page
 be displayed in the 
 search results for a
 particular keyword 
 (search phrase)?
  31. 31. ! @jhmjacob Crawlability + Indexability + Rankability have a direct or indirect influence on the Crawl Budget
  32. 32. ! @jhmjacob Technical SEO Buzzword Bingo “Crawlability" “Indexability" “Rankability” robots.txt robots Directive
 (Response Header / Meta Tag) rel=prev
 (Response Header / Meta Tag) Status Code 
 (Response Header) Canonical 
 (Response Header / Meta Tag) hreflang Directives
 (Response Header / Meta Tag / Sitemap) Loadtime
 (DNS+Server) Redirects 
 (Response Header / Meta Tag) Device Directives
 (Response Header / Meta Tag) Fragment aka Ajax Crawling
 (Meta Tag) Unique Content
 (Content) Content Quality
 (Content) URL-Structure (URL) Encoding
 (Content) Rendertime
 (Server+Content) Vary
 (Response Header) File Size
 (Content) Location Directives
 (Content) if-modified-since Support
 (Response Header) Rendering
 (CSS+JS)
  33. 33. ! @jhmjacob Analyzed by OnPage.org “Crawlability" “Indexability” “Rankability” robots.txt robots Directive
 (Response Header / Meta Tag) rel=prev
 (Response Header / Meta Tag) Status Code 
 (Response Header) Canonical 
 (Response Header / Meta Tag) hreflang Directives
 (Response Header / Meta Tag / Sitemap) Loadtime
 (DNS+Server) Redirects 
 (Response Header) Device Directives
 (Response Header / Meta Tag) Fragment aka Ajax Crawling
 (Meta Tag) Unique Content
 (Content) Content Quality
 (Content) URL-Structure (URL) Encoding
 (Content) Rendertime
 (Server+Content) Vary
 (Response Header) File Size
 (Content) Location Directives
 (Content) if-modified-since Support
 (Response Header) Rendering
 (CSS+JS) We offer the most comprehensive analysis on website quality assurance!
  34. 34. ! @jhmjacob robots.txt This is obvious! » Learn how to set up your robots.txt file » Block irrelevant URLs, so the bots don’t waste 
 their time on those pages » Basics: https://en.onpage.org/wiki/Robots.txt Always remember: If a page is blocked via robots.txt,
 the bots can’t see additional settings like 
 Canonicals or “noindex” directives.
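A quick way to sanity-check the rules in a robots.txt, using only the Python standard library (my sketch; the blocked URL is a made-up example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.onpage.org/robots.txt")
rp.read()  # downloads and parses the robots.txt

# Would Googlebot be allowed to fetch these URLs?
for url in ("https://en.onpage.org/",
            "https://en.onpage.org/internal-search?q=crawl"):
    print(rp.can_fetch("Googlebot", url), url)
```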
  35. 35. ! @jhmjacob Even though a page might look fine - under the hood it can still be
 broken as hell. Status Code
  36. 36. ! @jhmjacob 200 Valid Page 
 301 Permanent redirect (after Redesigns) 302 Temporary Redirect 303 Alternative Version 304 Page did not change since last visit
 403 Access forbidden
 404 Page does not exist Status Code
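A minimal spot check for the status codes above, sketched with the Python requests library (URLs are placeholders). Setting allow_redirects=False makes 301/302 responses visible instead of silently following them:

```python
import requests

for url in ("https://en.onpage.org/", "https://en.onpage.org/old-page"):
    r = requests.get(url, allow_redirects=False, timeout=10)
    # For 3xx responses, the Location header shows the redirect target
    print(r.status_code, url, r.headers.get("Location", ""))
```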
  37. 37. ! @jhmjacob Loadtime Nice - only
 82 Milliseconds until Googlebot got
 the sourcecode of the
 page Not so nice - in average 
 1.76 Seconds until the sourcecode
 has been transfered Page A Page B
  38. 38. ! @jhmjacob 0 3.5 7 10.5 14 Page A Page B 0.59 Pages / Second 12.2 Pages / Second Loadtime
  39. 39. ! @jhmjacob Page A Page B Per Second 12.2 Pages 0.59 Pages Per Minute 731.71 Pages 35.29 Pages Per Hour 43,902.44 Pages 2,117.65 Pages Per Day 1,053,658.54 Pages 50,823.53 Pages Loadtime ouch!
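The arithmetic behind the table, for a crawler fetching one page at a time (note: the per-second figures on the slide imply roughly 1.7 s per page for Page B):

```python
# Pages a single-connection crawler can fetch in a period,
# given the server's response time per page.
def pages_per(latency_s, period_s):
    return period_s / latency_s

for name, latency in (("Page A", 0.082), ("Page B", 1.7)):
    print(name,
          f"{pages_per(latency, 1):.2f} pages/s,",
          f"{pages_per(latency, 86400):,.2f} pages/day")
```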
  40. 40. ! @jhmjacob Fragment aka Ajax Crawling More: https://angularjs.org/ excursion
  41. 41. ! @jhmjacob Why AngularJS? Tries to achieve a better User Experience, 
 by transferring only small segments
 instead of complete pages. Provides testing functionalities. Fragment aka Ajax Crawling excursion
  42. 42. ! @jhmjacob Fragment aka Ajax Crawling easy way to identify
 an AngularJS site (“ng-app”)
  43. 43. ! @jhmjacob Fragment aka Ajax Crawling
  44. 44. ! @jhmjacob <!DOCTYPE html> <!--<html lang="en" data-ng-app="MainApp">--> <html lang="en" id="ng-app" data-ng-app="MainApp"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, …”> <meta name="keywords" content="mercedes films, mercedes clip, …”> <link href="assets/images/favicon.ico" type="image/x-icon" rel="shortcut icon"> <title>Mercedes-Benz Video Channel</title> <meta name="keywords" content="{{keywords}}"/> </head>
 … Fragment aka Ajax Crawling
  45. 45. ! @jhmjacob <!DOCTYPE html> <!--<html lang="en" data-ng-app="MainApp">--> <html lang="en" id="ng-app" data-ng-app="MainApp"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, …”> <meta name="keywords" content="mercedes films, mercedes clip, …”> <link href="assets/images/favicon.ico" type="image/x-icon" rel="shortcut icon"> <title>Mercedes-Benz Video Channel</title> <meta name="keywords" content="{{keywords}}"/> </head>
 … Fragment aka Ajax Crawling angularjs placeholder
  46. 46. ! @jhmjacob Fragment aka Ajax Crawling
  47. 47. ! @jhmjacob Fragment aka Ajax Crawling
  48. 48. ! @jhmjacob Fragment aka Ajax Crawling
  49. 49. ! @jhmjacobMore: https://developers.facebook.com/tools/debug/ Fragment aka Ajax Crawling
  50. 50. ! @jhmjacobMore: https://cards-dev.twitter.com/validator Fragment aka Ajax Crawling
  51. 51. ! @jhmjacob Fragment aka Ajax Crawling
  52. 52. ! @jhmjacob Why? 1) There are also other JS testing frameworks
 Jasmine / PhantomJS 2) WallabyJS 
 Nice Plugin for realtime JS Unit Tests 3) IMO: AngularJS is rather suited 
 for web-apps,
 not so well suited for content-based sites which 
 rely on their fit in the web ecosystem
  53. 53. ! @jhmjacob Ajax Crawling Scheme 1) Within <head> Tag
 <meta name="fragment" content="!"/> 2) Hashbang URLs (“#!”)
 https://www.seokomm.at/#!agenda + Snapshot URL with “real” HTML
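How the scheme maps URLs: the crawler rewrites the “#!” fragment into an _escaped_fragment_ parameter and fetches the HTML snapshot there. A small Python sketch of that rewrite:

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    """Rewrite a hashbang URL the way AJAX-crawling bots do."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}_escaped_fragment_={quote(fragment)}"

print(escaped_fragment_url("https://www.seokomm.at/#!agenda"))
# -> https://www.seokomm.at/?_escaped_fragment_=agenda
```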
  54. 54. ! @jhmjacob Ajax Crawling Scheme What happens here? GET http://video.mercedes-benz.co.uk/#!/ Complete Sourcecode (9kb) 1
  55. 55. ! @jhmjacob Ajax Crawling Scheme GET http://video.mercedes-benz.co.uk/?_escaped_fragment_=/ Complete Sourcecode (9kb) without AngularJS placeholders 2 Two requests were required to gather the valid HTML code! What happens here?
  56. 56. ! @jhmjacob Ajax Crawling Scheme Support of Ajax Crawling “Ajax Crawling Scheme” Native Ajax Crawling Google Yes, but “deprecated” Yes Bing Yes Nope OnPage.org Yes Nope Facebook Nope Nope Twitter Nope Nope Pinterest Nope Nope
  57. 57. ! @jhmjacob URL Structure 1) Speaking URLs (aka Hackable URLs)
 https://www.ccc.de/events/2015/congress 2) Sort GET parameters (predefined order - see the sketch below)
 https://de.onpage.org/?currency=de&lang=de 3) Relevant Content on top tier (subfolder),
 should correlate with Pagerank flow 4) Session IDs in URLs are a No-Go!
 If no other way: Remove them via GSC
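A sketch of point 2 (my illustration): normalizing query parameters into one predefined order - here simply alphabetical - so the same content does not live under several URL variants:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_query(url):
    """Sort GET parameters alphabetically to get one canonical URL."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(parts._replace(query=query))

print(normalize_query("https://de.onpage.org/?lang=de&currency=de"))
# -> https://de.onpage.org/?currency=de&lang=de
```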
  58. 58. ! @jhmjacob Vary Response Header 1) Does the page provide compression? (Must!!!) 
 Vary: Accept-Encoding 2) Do Cookies (notably) change the content?
 Vary: Cookie 3) Is the page multi-lingual? (Within same URL!)
 Vary: Accept-Language
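A minimal Flask sketch (my illustration, not from the talk) setting the Vary header for a page whose response depends on compression support, cookies and the visitor's language:

```python
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/")
def index():
    resp = make_response("<html>…</html>")
    # Name the request headers that change this response,
    # so caches and crawlers keep the variants apart:
    resp.headers["Vary"] = "Accept-Encoding, Cookie, Accept-Language"
    return resp
```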
  59. 59. ! @jhmjacob “if-modified-since” Workflow
  60. 60. ! @jhmjacob 11/01/2015:
 GoogleBot calls en.onpage.org Server response: Complete Sourcecode (10.3kb)
 + Response Header “Last-Modified” “if-modified-since” Workflow
  61. 61. ! @jhmjacob 11/5/2015:
 GoogleBot calls en.onpage.org again 
 and includes an additional
 Request Header “If-Modified-Since” (with the stored “Last-Modified” value) Server response: Empty body (0kb)
 + Status Code “304 Not Modified” “if-modified-since” Workflow
  62. 62. ! @jhmjacob » Dramatically reduces downloaded file size 
 for unchanged content » Enables bots + users to download more relevant
 content within the same timespan » Requires good Infrastructure / CMS 
 like Page-Caching - more on that later! “if-modified-since” Workflow
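The same workflow replayed from the client side with the Python requests library (my sketch; it works against any server that supports conditional GETs):

```python
import requests

url = "https://en.onpage.org/"

first = requests.get(url)
last_modified = first.headers.get("Last-Modified")
print(first.status_code, len(first.content), "bytes -", last_modified)

if last_modified:
    # Second visit: send the stored value back as If-Modified-Since
    second = requests.get(url, headers={"If-Modified-Since": last_modified})
    print(second.status_code, len(second.content), "bytes")  # 304, 0 bytes
```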
  63. 63. ! @jhmjacob robots Directive 1) Within <head> Tag
 <meta name="robots" content="noindex,follow"/> 2) Via Response Header
 X-Robots-Tag: noindex,follow Remember: A lot of “noindex” pages have a negative effect
 on the crawl budget … because resources are wasted
 to find out that the URL has no real content.
  64. 64. ! @jhmjacob robots Directive: “unavailable-after” More: https://googleblog.blogspot.de/2007/07/robots-exclusion-protocol-now-with-even.html 1) Within <head> Tag
 <meta name="robots" content="unavailable_after: 20-Nov-2015 15:35:00 CET"> 2) Via Response Header
 X-Robots-Tag: unavailable_after: 20 Nov 2015 15:35:00 CET
  65. 65. ! @jhmjacob Canonical 1) Within <head> Tag
 <link rel="canonical" href="https://de.onpage.org/"/> 2) Via Response Header
 Link: <https://de.onpage.org/>; rel="canonical" The Response Header Version can also be used for
 PDF files and images (yummy). Remember: A lot of “canonicalized” pages (canonical to 
 other URL) have a negative effect on the crawl budget … 
 because resources are wasted to find out that the URL 
 has no real content.
  66. 66. ! @jhmjacob Redirects 1) Via Response Header
 Status Code: 301
 Location: https://de.onpage.org/ 2) Within <head> Tag
 <meta http-equiv="refresh" content="5; url=http://example.com/"> Redirect-Chains should be avoided. 
 Best practice is to avoid internal redirects at all.
 Rather update old links and point them to the new URL. Search Engines do not like redirects via Meta Tags or Javascript. 
 These should only be used with caution to navigate users.
 The semantically correct way is the response header (“301 vs 302”)
  67. 67. ! @jhmjacob Unique & Relevant Content 1) No thin content 2) No duplicate content 3) No auto-translated pages In terms of indexability
  68. 68. ! @jhmjacob Crawler: Behind the Scenes Bloomfilter De-Duplication Index
  69. 69. ! @jhmjacob The Challenge: Big Data Scale » Was a given URL already crawled?
 (if so: Does a reload make sense?)
 Solution: Bloomfilter + Key-Value Store » Is the content of a crawled URL 
 valuable enough to be added to the index? 
 Solution: Content-Fingerprinting + Hamming Distance
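A toy version of both ideas (my illustration, far from production scale): a Bloom filter answering “was this URL already crawled?” without storing the URLs themselves, and a simhash-style fingerprint whose Hamming distance flags near-duplicate content:

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.data = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.data[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):  # may rarely give a false positive
        return all(self.data[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def simhash(text, bits=64):
    """64-bit content fingerprint; similar texts get similar bits."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.sha1(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

seen = BloomFilter()
seen.add("https://example.com/page-1")
print("https://example.com/page-1" in seen)   # True
print(hamming(simhash("crawl budget insights and ideas"),
              simhash("crawl budget insights and tricks")))  # small distance
```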
  70. 70. ! @jhmjacob “Most algorithms for near-duplicate detection run in batch- mode over the entire collection of documents. For web crawling, an online algorithm is necessary because the decision to ignore the hyper-links in a recently- crawled page has to be made quickly” More: http://www2007.cpsc.ucalgary.ca/papers/paper215.pdf Crawler: Behind the Scenes
  71. 71. ! @jhmjacob Encoding 1) Via Response Header
 Content-Type: text/html; charset=UTF-8 2) Within <head> Tag
 <meta charset="UTF-8" /> Charset should always be defined.
 Try to work with UTF-8 - saves a lot of headaches in the long run.
  72. 72. ! @jhmjacob Encoding This is what an
 encoding f*ckup looks like
  73. 73. ! @jhmjacob File Size 1) Within “Google Search Appliance”: Max. 20 MB
 But that's the Enterprise version of Google 2) In the wild the limit is probably way lower
 (somewhere between 500 KB and 1 MB) The bigger the file, the longer it takes to download.
 Rule of thumb: The smaller, the better!
  74. 74. ! @jhmjacob Rendering 1) Javascript and CSS files have to be accessible 
 for GoogleBot
 OnPage.org provides good reports on that 2) If Google has issues rendering the page, indexation
 is at risk 3) Also make sure that the rendering does not take too long
 (Pagespeed Test). 4) Does the rendering on mobile devices look fine?
 (Viewport Tag)
  75. 75. ! @jhmjacob rel=prev 1) Within <head> Tag
 <link rel="prev" href="http://abc.com/article?page=1" />
 <link rel="next" href="http://abc.com/article?page=3" /> 2) Im Response Header
 Link: <http://abc.com/article?page=1>; rel="prev"
 Link: <http://abc.com/article?page=3>; rel="next" More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
  76. 76. ! @jhmjacob rel=prev » Semantic Markup 
 for Paginations
 Groups multiple pages
 into one ranking » Intended for multi-page
 articles (newspapers).
 But Google now also
 shows product-listings
 as a use case. More: http://googlewebmastercentral.blogspot.co.at/2011/09/pagination-with-relnext-and-relprev.html
  77. 77. ! @jhmjacob rel=prev alternative: the “view all” page More: http://googlewebmastercentral.blogspot.co.at/2011/09/view-all-in-search-results.html
  78. 78. ! @jhmjacob Already sleepy?!
 ;)
  79. 79. ! @jhmjacob hreflang Directives More: https://moz.com/blog/using-the-correct-hreflang-tag-a-new-generator-tool
  80. 80. ! @jhmjacob hreflang Directives More: https://moz.com/blog/using-the-correct-hreflang-tag-a-new-generator-tool Article XYZ
 (“de” = German) Article XYZ
 (“es” = Spanish) hreflang=“es” hreflang=“de” Article XYZ
 (English) hreflang=“x-default” hreflang=“de” hreflang=“es” hreflang=“x-default”
  81. 81. ! @jhmjacob Device Directives 1) Viewport Tag
 <meta name="viewport" content="width=device-width, initial- scale=1.0" /> 2) Media Queries
 <link rel="stylesheet" media="only screen and (max-width: 800px)" href="/mobile.min.css" /> 3) Dedicated URL for mobile devices
 <link rel="alternate" media="only screen and (max-width: 640px)” href="http://m.example.com/page-1" >
  82. 82. ! @jhmjacob Content Quality 1) The basics 
 Title, Description etc. 2) Zero tolerance for broken pages 3) Avoid internal redirects
 Update links instead 4) Lightweight Sourcecode
 Get rid of unnecessary inline JS + CSS, remove Whitespaces, Line Breaks, Tabs, etc.
  83. 83. ! @jhmjacob Location Directives 1) schema.org Markup (“LocalBusiness”)
 Seems to be used by Google for “Local Search” 2) Address / Telephone
 So your website also matches query modifications 3) Dublin Core Markup
 Not really relevant for SEO, but does not hurt (semantic!) More: https://plus.google.com/+JohnMueller/posts/1EwfjTuCzPQ More: http://schema.org/LocalBusiness
  84. 84. ! @jhmjacob Outlook
  85. 85. ! @jhmjacob Static CMS
  86. 86. ! @jhmjacob Static CMS More: https://www.staticgen.com/
  87. 87. ! @jhmjacob Static CMS
  88. 88. ! @jhmjacobMore: https://www.getkirby.com/ Static CMS
  89. 89. ! @jhmjacob Wordpress is kind of
 the Internet Explorer
 in the CMS space Static CMS
  90. 90. ! @jhmjacob Static File System in the Wild
  91. 91. ! @jhmjacob if-modified-since: OnPage.org 1. First download of the page: The system generates the final sourcecode
  92. 92. ! @jhmjacob 2. An optimized version of the sourcecode gets saved on disk (“Page-Caching”). 
 The cache filename is generated based on relevant cookie values.
 (in our case: language + currency of visitor) if-modified-since: OnPage.org
  93. 93. ! @jhmjacob 3. The same URL (+ same cookie settings) gets called again.
 Search Engines will append the “Last-Modified” value (from the previous request) to the Request Header. if-modified-since: OnPage.org
  94. 94. ! @jhmjacob 4. The response for the second call is just taken from the cache file
 Means: Ultra fast Time to First Byte, because server doesn’t need to “think” We dropped irrelevant characters (newlines, tabs, spaces) when we saved the cache file. -> We have seen clients who reduced 30% (!) of their filesizes with that simple step
 -> This results in better loadtimes if-modified-since: OnPage.org
  95. 95. ! @jhmjacob 5. Part of the returned response was the “Last-Modified” setting. 
 It was calculated based on the cache file timestamp. if-modified-since: OnPage.org
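Condensed into code, the workflow from slides 91-95 could look roughly like this Flask sketch (my reading of the slides, not OnPage.org's actual implementation; the cookie names and cache layout are assumptions):

```python
import os
from email.utils import formatdate, parsedate_to_datetime
from flask import Flask, Response, request

app = Flask(__name__)
CACHE_DIR = "cache"  # hypothetical cache location

def cache_path(path, cookies):
    # Cache filename depends on the cookie values that change the page
    lang = cookies.get("lang", "en")           # assumed cookie name
    currency = cookies.get("currency", "usd")  # assumed cookie name
    name = path.strip("/").replace("/", "_") or "index"
    return os.path.join(CACHE_DIR, f"{name}.{lang}.{currency}.html")

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def serve(path):
    cached = cache_path(path, request.cookies)
    if not os.path.exists(cached):
        return Response("…render the page and write the cache file here…")
    mtime = int(os.path.getmtime(cached))
    ims = request.headers.get("If-Modified-Since")
    if ims and parsedate_to_datetime(ims).timestamp() >= mtime:
        return Response(status=304)  # empty body: nothing changed
    with open(cached, "rb") as fh:
        resp = Response(fh.read())
    # Last-Modified is calculated from the cache file timestamp
    resp.headers["Last-Modified"] = formatdate(mtime, usegmt=True)
    return resp
```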
  96. 96. ! @jhmjacob » Super fast Time to First Byte
 When the file is cached » Sends optimized sourcecode to reduce
 bandwidth usage
 for both parties: Our servers + Google Crawlers » If the file was loaded before, only send what’s
 really required 
 => “304 Not modified” aka 
 “Everything is cool, you have the latest version in your index” » Bonus: This workflow enables us to set
 the last-mod attribute in sitemap.xml if-modified-since: OnPage.org
  97. 97. ! @jhmjacob Other design principles of 
 our homegrown static CMS
  98. 98. ! @jhmjacob Static CMS: Design Principles 1) File-Position: Folders in URL are the same 
 as on the filesystem
 Authors are conditioned to build a clean structure + file-hierarchy
  99. 99. ! @jhmjacob 2) Separation of Code, Design and Content
 Every member of the team sees his part Static CMS: Design Principles For Designers: affiliate.tpl
  100. 100. ! @jhmjacob 2) Separation of Code, Design and Content
 Every member of the team sees his part Static CMS: Design Principles For Copywriters: affiliate.de.json
  101. 101. ! @jhmjacob 2) Separation of Code, Design and Content
 Makes MS Word etc. redundant.
 
 If a new translation needs to be added, the translator gets the
 English version, renames the file, translates the contents, uploads the file. 
 
 Bam! It’s online.
 
 Text updates, Design changes and new images are versioned by git. Static CMS: Design Principles
  102. 102. ! @jhmjacob 3) Multilinguality by nature
 If a new translation is uploaded, the system starts a couple of 
 cool things Static CMS: Design Principles
  103. 103. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 If a user navigates to the wrong language version of a page, he will
 see a friendly reminder that there is a localized version for him
  104. 104. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 Links to the translated versions of the current page are automatically added to the footer
  105. 105. ! @jhmjacob Editor Friendliness + File-Management 3) Multilinguality by nature
 And hreflang markup is automatically added to the <head> section of the document
  106. 106. ! @jhmjacob Editor Friendliness + File-Management 4) Fast + Secure
 No Database which slows down server responses! Git keeps track of changes and provides rollback functionalities! No other dependencies / services which might cause security holes
  107. 107. ! @jhmjacob Editor Friendliness + File-Management 5) Transparent and logical structure
 Images reside where they belong: In the same folder as the article itself - like its template, translations, additional script logic.
 
 Cleaning up made easy: If an article needs to be deleted, just remove the folder -> All files are gone, no more orphaned files in 
 “images” folders or localization databases, etc.
  108. 108. ! @jhmjacob Outlook What we want to build next
  109. 109. ! @jhmjacob Outlook » Multi-Language Images
 The same URL for all localized versions of an image https://en.onpage.org/beispiel/teaser.jpg https://en.onpage.org/beispiel/teaser.jpg
  110. 110. ! @jhmjacob Outlook Be careful: This is untested freestyle code - just to give you an idea :) » Multi-Language Images
 htaccess file detects that an image file is requested
  111. 111. ! @jhmjacob Outlook » Multi-Language Images
 The browser exposes the preferred languages of the user
  112. 112. ! @jhmjacob Outlook » Multi-Language 
 Images
 A script takes the 
 request, checks if a localized
 version exists and returns
 the value (or the default image).
 
 Result is cached in 
 browser cache. Be careful: 
 This is untested freestyle 
 code - just to give 
 you an idea :)
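In the same “untested freestyle” spirit as the slides, a Python/Flask sketch of the script the .htaccess rule would route image requests to (paths and the localized naming convention are assumptions):

```python
import os
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/<path:folder>/<name>.jpg")
def localized_image(folder, name):
    # Browser sends e.g. "Accept-Language: de-DE,de;q=0.9,en;q=0.8"
    header = request.headers.get("Accept-Language", "")
    langs = [part.split(";")[0].split("-")[0].strip()
             for part in header.split(",") if part]
    for lang in langs:
        candidate = os.path.join(folder, f"{name}.{lang}.jpg")
        if os.path.exists(candidate):        # e.g. teaser.de.jpg
            return cacheable(send_file(candidate))
    # Fall back to the default image
    return cacheable(send_file(os.path.join(folder, f"{name}.jpg")))

def cacheable(resp):
    resp.headers["Cache-Control"] = "max-age=86400"  # browser cache
    resp.headers["Vary"] = "Accept-Language"         # keep variants apart
    return resp
```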
  113. 113. ! @jhmjacob Outlook » Last-Modified Logging
 To find out how popular a page is among search engines
  114. 114. ! @jhmjacob Outlook » Last-Modified Logging
 To find out how popular a page is among search engines » By setting the last-modified response header, Search engines will include its value in the next request of the page 
 (for if-modified-since checks)
 Knowing this, we can calculate the timespan between this visit and the last one.
  115. 115. ! @jhmjacob Outlook » Low timespan
 = URL seems to be relevant for the search engine
 = Good chances to rank » High timespan
 = URL seems to be rather irrelevant for the SE
 = Less chances to rank
 = Alerting based on the importance of the page
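A sketch of that idea (my illustration): if the server sets Last-Modified on every full response, the If-Modified-Since value a bot sends back tells you when it last fetched the page, so the gap to “now” is directly computable:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def days_since_last_crawl(if_modified_since):
    """Gap between the bot's previous visit and the current request."""
    previous = parsedate_to_datetime(if_modified_since)
    return (datetime.now(timezone.utc) - previous).total_seconds() / 86400

# Value as a bot would send it in the If-Modified-Since request header:
print(days_since_last_crawl("Fri, 20 Nov 2015 15:35:00 GMT"))
```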
  116. 116. ! @jhmjacob “It’s not that Google will penalize you, it’s the opportunity cost for dirty architecture based on a finite crawl budget.” More: http://www.blindfiveyearold.com/crawl-optimization Last words
  117. 117. Thanks! OnPage.org GmbH Twitter: http://twitter.com/onpage_org
 Facebook: http://fb.me/onpage.org
 Web: https://en.onpage.org
 Jan Hendrik Merlin Jacob Founder + CTO Twitter: https://twitter.com/jhmjacob
 Email: jhm@onpage.org
 LinkedIn: http://linkedin.com/in/jhmjacob
 http://onpa.ge/V141p
