This document outlines plans for a web-crawling startup called SEOCRAWLER.CO. The startup is building a crawler that downloads pages from websites and stores metadata in reports, letting marketers and site owners analyze competitor websites and find broken pages. The founders have already crawled over 2.5 million pages and plan to improve the crawler, reports, and user interface. They currently host the application on their own servers but may move components to cloud services such as Azure, or to Redis for caching, as they scale.
2. TEAM
Goran Čandrlić
Conversion, Google AdWords &
Internet Marketing Specialist
Webiny Cofounder
Hrvoje Hudoletnjak
Software developer
Microsoft ASP.NET/IIS MVP
3. WHY?
Target market
Webmasters, site owners
Marketers
Usage scenarios
Find broken pages, redirects, noindex/nofollow pages, ...
On-site SEO quality
Crawl competitor pages and find out what they are doing
Business model
Free
Pay as you go
Share and get credits
4. THE PLAN
Let’s build a crawler
MVP version: download CSV file of all pages
Public launch: browsing crawled pages online, payments
Let’s spread the word
Use social channels to attract more users
Let’s see what we’re missing, what can be done better
Find out what people would like to pay for
Iterate, find new niche markets, ask and listen to people
5. GETTING HANDS DIRTY
ENGINE DEV
Basic engine: 2 days
Production ready (horizontal scalability, disaster recovery, ...): 60+ days
Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
Analysis (tags and content)
Store reports for user filtering and browsing
WEB APP
Landing page + admin UI (Themeforest)
Communication with crawlers
Browse reports, filters
Payment gateway integration (Paypal)
Ticketing support system
6. CURRENT STATUS
2.5M pages crawled
150GB transferred
800 registered users
Most important things:
we (think we) know what we should do next
polished some edge cases, made the service more stable
got the word spread
got a speaking slot at WebCampZg!!
11. FRONT END / ADMIN UI
Landing page + admin theme from Themeforest
ASP.NET MVC 4
Entity Framework 5 (POCO, EF migrations)
DotNetOpenAuth for Social login
EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messaging
SignalR (full duplex: WebSockets with Ajax long-polling fallback)
KnockoutJS, jQuery, Toastr
StructureMap IOC/DI, Automapper (db entities <> DTO)
12. RABBIT MQ
[Architecture diagram: the controller and logging talk to the database (ADO.NET/EF) through an in-process command/query bus (CQS); RabbitMQ distributes work from the crawler to multiple crawler worker processes.]
13. CRAWLER SERVICE
Multi-threaded Crawler (vs evented crawler)
Entity Framework 5 LINQ + raw SQL queries with EF + ADO.NET bulk insert
EasyNetQ, RabbitMQ, CQS pattern
Structuremap, HTMLAgilityPack, NLog
Protobuf
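The multi-threaded (rather than evented/async) choice above can be sketched as a fixed pool of threads pulling URLs from a shared queue. This is a minimal illustration, not the actual crawler's code; all class and method names are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of a multi-threaded crawler: N threads drain a shared work queue.
// processPage stands in for the real download + analysis step.
class MultiThreadedCrawler
{
    readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();
    int _processed;

    public void Enqueue(string url) => _queue.Enqueue(url);

    // Assumes all URLs are enqueued before CrawlAll is called; a real crawler
    // discovers links while running and needs a termination protocol instead.
    public int CrawlAll(int threadCount, Action<string> processPage)
    {
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                string url;
                while (_queue.TryDequeue(out url))
                {
                    processPage(url);                     // download + analyze
                    Interlocked.Increment(ref _processed); // thread-safe counter
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return _processed;
    }
}
```

The trade-off versus an evented design: each blocked download ties up a thread here, which is simpler to write and debug but caps parallelism at the thread count.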
14. CRAWLER WORKER PROCESS
Start or Resume
Resume: load state (SQL, serialized)
Get next page from queue (RabbitMQ, durable store)
Download HTML (200ms – 5sec delay), HEAD requests for external links
Check statuses, canonical, redirects
Run page analysers, extract data for report, prepare for bulk insert
Find links
Check for duplicate and blacklisted URLs
Check Robots.txt
Check if visited – cache & db
Normalize & store to queue (RabbitMQ)
Save state every N pages (Serialize with Protobuf, store byte[] to Db)
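The worker loop above can be sketched as follows. Queue&lt;T&gt; and HashSet&lt;T&gt; stand in for the RabbitMQ frontier and the cache/DB visited check; downloading, analysis, and Protobuf state saving are stubbed or marked with comments. All names are illustrative, not the actual crawler's classes.

```csharp
using System;
using System.Collections.Generic;

// Simplified single-threaded version of the crawler worker loop.
class CrawlerWorker
{
    readonly Queue<string> _frontier = new Queue<string>();    // RabbitMQ in production
    readonly HashSet<string> _visited = new HashSet<string>(); // cache + DB in production
    int _pagesSinceSave;
    const int SaveEvery = 100; // serialize state every N pages

    public void Enqueue(string url)
    {
        string normalized = Normalize(url);
        if (_visited.Add(normalized))   // duplicate check happens at enqueue time
            _frontier.Enqueue(normalized);
    }

    public int Run(Func<string, string> downloadHtml,
                   Func<string, IEnumerable<string>> extractLinks)
    {
        int crawled = 0;
        while (_frontier.Count > 0)
        {
            string url = _frontier.Dequeue();
            string html = downloadHtml(url);   // real worker: 200ms–5s delay, HEAD for externals
            if (html == null) continue;        // real worker: record status/canonical/redirects
            foreach (var link in extractLinks(html))
                Enqueue(link);                 // real worker: also blacklist + robots.txt checks
            crawled++;
            if (++_pagesSinceSave >= SaveEvery)
            {
                SaveState();                   // real worker: Protobuf-serialize, store byte[] in DB
                _pagesSinceSave = 0;
            }
        }
        return crawled;
    }

    // Crude normalization for illustration; real normalization must not
    // lowercase the path, only scheme and host.
    static string Normalize(string url) => url.TrimEnd('/').ToLowerInvariant();

    void SaveState() { /* serialize _frontier + _visited */ }
}
```

Because the visited set is checked before enqueueing, each URL enters the frontier at most once even when many pages link to it.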
16. COMMAND BUS (MEDIATOR)
bool alreadyVisited =
    _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

_bus.Execute(new SavePageCommand(pageData, webPage));

public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}
Encapsulate command / query into classes
IOC / DI for finding and matching handler with command/query types
Easy unit testing
AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
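A minimal in-process version of such a bus might look like this. The handler registry is a plain dictionary standing in for the IoC container, and the Request signature is simplified (it takes both query and result type parameters) rather than matching the bus above; the query and handler classes are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Commands mutate state and return nothing; queries return a result (CQS).
public interface IHandle<TCommand> { void Handle(TCommand command); }
public interface IQuery<TResult> { }
public interface IHandleQuery<TQuery, TResult> where TQuery : IQuery<TResult>
{
    TResult Handle(TQuery query);
}

// Mediator: maps message types to handlers so callers never see handler types.
public class CommandBus
{
    readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    public void Register<T>(IHandle<T> handler) => _handlers[typeof(T)] = handler;
    public void Register<TQuery, TResult>(IHandleQuery<TQuery, TResult> handler)
        where TQuery : IQuery<TResult> => _handlers[typeof(TQuery)] = handler;

    public void Execute<T>(T command)
        => ((IHandle<T>)_handlers[typeof(T)]).Handle(command);

    public TResult Request<TQuery, TResult>(TQuery query) where TQuery : IQuery<TResult>
        => ((IHandleQuery<TQuery, TResult>)_handlers[typeof(TQuery)]).Handle(query);
}

// Hypothetical query and handler for illustration:
public class VisitedPageQuery : IQuery<bool> { public string UrlHash; }
public class VisitedPageQueryHandler : IHandleQuery<VisitedPageQuery, bool>
{
    readonly HashSet<string> _seen = new HashSet<string> { "abc123" };
    public bool Handle(VisitedPageQuery q) => _seen.Contains(q.UrlHash);
}
```

Because every message passes through one dispatch point, cross-cutting concerns (logging, auth, caching) can be layered in by wrapping Execute/Request, which is the AOP interception mentioned above.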
17. ISSUES
Everything will crash: net connection, db, thread, VM, ...
Resuming / saving states
Memory issue/leaks with some frameworks
Don’t optimize before profiling (memory, db)
Log everything
DB indexes: how to store for fast filtering, paging
DB as queueing system (don’t)
CQS: command / query separation
Broken HTML, crazy links
Cloud services: connections fail
18. LEARNED
ORM
Go low level (raw SQL, bulk insert, SP) if needed
Profile: memory, SQL queries
Watch for 1st level cache (ORM unit of work or session)
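One concrete way to "go low level" is bulk-inserting report rows with SqlBulkCopy instead of ORM change tracking. The PageReports table and its columns are assumptions for illustration, not the actual schema.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Illustrative row type; the real report schema is not shown in the talk.
public class PageReportRow
{
    public int ProjectId;
    public string Url;
    public int StatusCode;
}

public static class PageReportBulkWriter
{
    // Shape the rows into a DataTable whose columns match the target table.
    public static DataTable BuildTable(IEnumerable<PageReportRow> rows)
    {
        var table = new DataTable();
        table.Columns.Add("ProjectId", typeof(int));
        table.Columns.Add("Url", typeof(string));
        table.Columns.Add("StatusCode", typeof(int));
        foreach (var r in rows)
            table.Rows.Add(r.ProjectId, r.Url, r.StatusCode);
        return table;
    }

    // Stream the rows to SQL Server in batches, bypassing EF change tracking.
    public static void Write(string connectionString, DataTable table)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.PageReports";
            bulk.BatchSize = 5000; // tune after profiling, per the advice above
            bulk.WriteToServer(table);
        }
    }
}
```

EF remains convenient for queries and migrations; the bulk path only replaces the hot insert loop, which also sidesteps the first-level cache growth that per-entity inserts cause.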
NoSQL?
Caching
in process – in memory
Plan moving to separate service (Redis, ...)
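One way to plan that move is to hide the cache behind an interface, so the in-memory implementation can later be swapped for a Redis-backed one without touching callers. The names here are illustrative.

```csharp
using System.Collections.Concurrent;

// Cache abstraction: callers only know "have we seen this URL hash before?"
public interface IVisitedCache
{
    bool TryAdd(string urlHash);   // true if the URL was not seen before
}

// In-process implementation; a Redis-backed class (e.g. via SET NX) could
// implement the same interface when the cache moves to a separate service.
public class InMemoryVisitedCache : IVisitedCache
{
    readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();

    public bool TryAdd(string urlHash) => _seen.TryAdd(urlHash, 0);
}
```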
SOA
Pipeline design
Pub/Sub, CQS pattern (Mediator)
Unit testing
Cloud resilience
19. HOSTING
Hosting:
All on one server for now
Started with EC2
Migrated to Azure VM (higher disk I/O, faster CPU); BizSpark gave a free VM and free inbound traffic!
Now on Hetzner (dedicated, i7, 32GB RAM, 2×SSD, Windows Server 2012 = €60/month)
Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)
Goal: 100 parallel crawlers on one VM with 2 CPUs and 4GB RAM (shared with OS and DB)
Will scale when needed
20. FUTURE PLANS
Fancy reports
Brand new web user interface
Integration with 3rd-party services (MajesticSEO, ...)
Special page analysis
NoSQL (RavenDb or Redis) for caching
Warehouse Db for browsing crawled pages
Lucene for full text search (RavenDb)
Refactor crawler, pipeline design, async evented design