Jamie Alberico — How to Leverage Insights from Your Site’s Server Logs | 5 Hours of Technical SEO

These slides were presented at the SEMrush webinar "How to Leverage Insights from Your Site’s Server Logs | 5 Hours of Technical SEO". Video replay and transcript are available at https://www.semrush.com/webinars/how-to-leverage-insights-from-your-site-s-server-logs-or-5-hours-of-technical-seo/


  1. 1. Leveraging Insights from Server Logs #5hoursoftechnicalSEO
  2. 2. All SEOs share a common goal: to be crawled, indexed, and ranked.
  3. 3. How can we answer all these questions? ● Which pages is Googlebot crawling? ● What user-agent is it using? ● Is Googlebot's crawl mirroring our understanding of site structure and assets? ● How's the site's tech health?
  4. 4. Logs are a record of every request a server receives.
  5. 5. Actions > Words.
  6. 6. Log Source 1 → Aggregate → Validate Googlebot → Translate → Parse logs for meaningful search and analysis
  7. 7. Logs can come from multiple places in your stack: Web Server 1, Web Server 2, Web Server 3, CDN, DDoS Mitigation/Bot Manager, Load Balancer.
  8. 8. You want enough log data to get an accurate picture.
  9. 9. Check your CDN data on edge node (cached) vs. origin server (uncached) hits.
  10. 10. Internal log requests. Ask: Is there already a log management platform in place? Be clear: We do not want Personally Identifiable Information (PII); request that it be removed. Be specific: Exported as .csv, please!
  11. 11. DIY Log Access: Apache (Linux Server), NGINX (Linux Server), IIS log files (Windows Server), AWS Load Balancer (Load Balancer), Google Cloud Load Balancer (Load Balancer), AWS CloudFront (CDN), Cloudflare log files (CDN, Enterprise account required), Incapsula (CDN/DDoS Mitigation), Akamai logs (CDN/DDoS Mitigation)
  12. 12. Standard WordPress site? Log into your hosting provider and look for Raw Access.
  13. 13. Log Source 1 → Aggregate → Validate Googlebot → Translate → Parse logs for meaningful search and analysis
  14. 14. Many tools, many languages. Paid: DeepCrawl, Botify, Logz.io, Sumo Logic, Splunk. Free(mium): SEMrush, Screaming Frog Log Analyzer, BigQuery. Code savvy: Python, JP. Masochistic: Excel, Command Line.
  15. 15. Leverage the tools and functionalities already in place.
  16. 16. Log Source 1 → Aggregate → Validate Googlebot → Translate → Parse logs for meaningful search and analysis
  17. 17. Manually validate Googlebot IPs Run a reverse DNS lookup on the accessing IP address from your logs, using the host command. jammer@Hypatia ~ % host 66.249.66.1 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
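A minimal Python sketch of that reverse-then-forward DNS check, using only the standard library (the function name and the forward-confirmation step are illustrative additions, not from the deck). Looping it over the distinct IPs in your logs gives you the bulk validation on the next slide.

    import socket

    def is_verified_googlebot(ip):
        # Reverse DNS: the IP should resolve to a *.googlebot.com or *.google.com host.
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward confirmation: that hostname should resolve back to the same IP.
        try:
            return socket.gethostbyname(hostname) == ip
        except socket.gaierror:
            return False

    print(is_verified_googlebot("66.249.66.1"))  # the slide's example IP; expect True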
  18. 18. Bulk validate Googlebot IPs with Scripts Source: Shell Script to Detect if the IP Address Is Googlebot, Dzone
  19. 19. Validate Googlebot IPs with Tool
  20. 20. Log Source 1 → Aggregate → Validate Googlebot → Translate → Parse logs for meaningful search and analysis
  21. 21. 216.150.168.131 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1 www.arrow.com 200 7352 616 - Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) https://www.arrow.com/en/ indiegogo The values captured in logs are unique to each site. Make a new engineering friend to learn exactly what they mean.
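As a sketch of what the "translate" step can look like in Python, the snippet below pulls named fields out of one raw line with a regular expression. It assumes a generic combined-log-style layout rather than the site-specific format on the slide; your own field order is exactly what that engineering friend can confirm.

    import re

    LINE = ('66.249.66.1 - - [07/Mar/2018:16:11:58 -0800] '
            '"GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352 '
            '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

    PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<bytes>\d+) '
        r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"')

    hit = PATTERN.match(LINE)
    if hit:
        print(hit.group("ip"), hit.group("path"), hit.group("status"))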
  22. 22. Unlock logs in ≤ 6 lines: 1. Data Source 2. Condition 3. Segments 4. Grouping 5. Sort 6. Limit*
  23. 23. Use Cases + Queries
  24. 24. Use Case (Basic Query) Legacy code being brought kicking and screaming into mobile-only index
  25. 25. Query: Are we migrating to mobile-only index? 1. Data Source: Your aggregated logs 2. Condition: where the requester is (verified) Googlebot 3. Group by: User-agent 4. Count: Number of hits (desc) 5. Limit: Start with ~10 results.
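One way those six components might look in Python with pandas (the deck names Python but not pandas, so treat the library, the column names, and "parsed_logs.csv" as assumptions):

    import pandas as pd

    hits = pd.read_csv("parsed_logs.csv")                # 1. data source: aggregated, parsed logs
    googlebot = hits[hits["is_verified_googlebot"]]      # 2. condition: verified Googlebot only
    counts = (googlebot.groupby("user_agent")            # 3. group by user-agent
              .size()                                    # 4. count hits, descending
              .sort_values(ascending=False)
              .head(10))                                 # 5. limit: start with ~10 results
    print(counts)  # a large share of smartphone Googlebot hits suggests the migration is underway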
  26. 26. (Query with grouping) Use case: Google chose a different canonical
  27. 27. Query: Are non-canonical hostnames being crawled? 1. Data Source: Aggregated logs 2. Condition: where Googlebot 3. Group by: Hostname 4. Count: Number of hits (desc) 5. Limit: 10
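Reusing the googlebot frame from the earlier sketch, the same pattern grouped by hostname (another hypothetical column) surfaces non-canonical hosts that are absorbing crawl:

    print(googlebot.groupby("hostname").size().sort_values(ascending=False).head(10))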
  28. 28. (Query with creative segments) Use case: Launching content in a new language.
  29. 29. Segmentation = pattern matching/creative thinking. Happy path: consistent URL structure. Plan B: the HTTP entity header Content-Language.
  30. 30. Query: Which languages are being crawled? 1. Data Source: Your aggregated logs 2. Condition: where Googlebot 3. Group by: Language 4. Count: Number of hits (desc) 5. Limit: Start with ~10 results.
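A short sketch of the happy path: derive the language from the first URL subfolder and count Googlebot hits per language (the two-letter-prefix pattern is an assumption about your URL structure):

    googlebot = googlebot.assign(
        lang=googlebot["path"].str.extract(r"^/([a-z]{2})/", expand=False))
    print(googlebot.groupby("lang").size().sort_values(ascending=False).head(10))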
  31. 31. (Query with parsed segments) Use case: Low index coverage
  32. 32. Build on the fly segments by parsing URL structure /en/products/blam-o/log-12345 }Language App } Manufacturer } SKU }
  33. 33. Query: Which subfolders are being crawled? 1. Data Source: Your aggregated logs 2. Condition: where Googlebot 3. Parse: subfolder 4. Aggregate: by Subfolder 5. Count: Number of hits (desc) 6. Limit: Start with ~10 results.
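In the same pandas sketch, the on-the-fly segments can be parsed with one regex over the path; the four named groups mirror the slide's example URL (an assumption about your structure), and grouping by the app segment answers the subfolder question (swap in whichever segment you need):

    parts = googlebot["path"].str.extract(
        r"^/(?P<language>[^/]+)/(?P<app>[^/]+)/(?P<manufacturer>[^/]+)/(?P<sku>[^/]+)")
    print(parts.groupby("app").size().sort_values(ascending=False).head(10))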
  34. 34. (Parsed Segments AND Conditions) Use case: Sudden crawl flux
  35. 35. Even search engines need to CYA Googlebot is designed to be a good citizen of the web... For Googlebot a speedy site is a sign of healthy servers... If the site slows down or responds with server errors, the [crawl rate] limit goes down and Googlebot crawls less. Official Google Webmaster Central Blog: What Crawl Budget Means for Googlebot
  36. 36. Starting query: What HTTP status codes are we returning? 1. Data Source: Your aggregated logs 2. Condition: where Googlebot 3. Aggregate: by HTTP Status 4. Count: Number of hits (desc) 5. Limit: Start with ~10 results.
  37. 37. Iterative query: What resources are returning 5XX? 1. Data Source: Your aggregated logs 2. Condition: where Googlebot AND 3. Condition: where 5XX 4. Parse: subfolder 5. Count: Number of hits (desc) 6. Limit: Start with ~10 results.
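Both queries fit the same sketch, assuming the status field was parsed as an integer and using the first URL subfolder as the parsed segment:

    # Starting query: which HTTP status codes is Googlebot getting back?
    print(googlebot.groupby("status").size().sort_values(ascending=False).head(10))

    # Iterative query: which subfolders are returning 5XX?
    errors = googlebot[googlebot["status"].between(500, 599)]
    errors = errors.assign(
        subfolder=errors["path"].str.extract(r"^/([^/]+)/", expand=False))
    print(errors.groupby("subfolder").size().sort_values(ascending=False).head(10))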
  38. 38. Advanced Use Cases + Blended Data
  39. 39. Query: Non-indexable pages with bot hits
  40. 40. Query: Indexable pages without bot hits
  41. 41. Query: Bot hits by indexability
  42. 42. Query: In sitemaps with no bot hits
  43. 43. Query: Empty dynamically generated pages
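The blended queries above share one join: a crawl-tool export keyed by URL merged onto log hit counts. A pandas sketch, assuming the export has url, indexable, and in_sitemap columns and that its URLs are normalized to match the logged paths (the empty-pages query would additionally need a word-count or byte-size column):

    crawl = pd.read_csv("crawl_export.csv")   # hypothetical crawl-tool export
    log_counts = googlebot.groupby("path").size().rename("bot_hits").reset_index()
    blended = crawl.merge(log_counts, left_on="url", right_on="path", how="left")
    blended["bot_hits"] = blended["bot_hits"].fillna(0)

    non_indexable_with_hits = blended[~blended["indexable"] & (blended["bot_hits"] > 0)]
    indexable_without_hits  = blended[blended["indexable"] & (blended["bot_hits"] == 0)]
    in_sitemap_without_hits = blended[blended["in_sitemap"] & (blended["bot_hits"] == 0)]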
  44. 44. | ̄ ̄ ̄ ̄ ̄ ̄ ̄ ̄| IT'S CHAOS. BE KIND. |________| (__/) || (•ㅅ•) || /   づ
  45. 45. I'm a mentor @ United Search Want to take stage as an SEO speaker? Want to stay in the audience but see more diversity in SEO events? United Search is an SEO speaker accelerator designed to specifically aid underrepresented groups, at no cost to students. ● Application - unitedsearch.org/apply ● Mentors - unitedsearch.org/mentors ● Mission - unitedsearch.org/about-us For more info check out unitedsearch.org or @search_united on Twitter.
