My weekend startup project
SEOCRAWLER.CO
TEAM

Goran Čandrlić
Conversion, Google AdWords &
Internet Marketing Specialist
Webiny Cofounder

Hrvoje Hudoletnjak
Software developer
Microsoft ASP.NET/IIS MVP
WHY?
Target market
Web masters, site owners
Marketers

Usage scenarios
Get broken pages, redirects, noindex, nofollow, ...
On-site SEO quality
Crawl competitor pages and find out what they are doing

Business model
Free
Pay as you go
Share and get credits
THE PLAN
Let’s build a crawler
MVP version: download CSV file of all pages
Public launch: browsing crawled pages online, payments
Let’s spread the word
Use social channels to attract more users
Let’s see what we’re missing, what can be done better
Find out what people would be willing to pay for
Iterate, find new niche markets, ask and listen to people
GETTING HANDS DIRTY
ENGINE DEV
Basic engine: 2 days
Production ready (horizontal scalability, disaster recovery, ...): 60+ days
Find edge cases (broken HTML), keep crawler running for days/weeks without crashing
Analysis (tags and content)
Store reports for user filtering and browsing

WEB APP
Landing page + admin UI (ThemeForest)
Communication with crawlers
Browse reports, filters
Payment gateway integration (PayPal)
Ticketing support system
CURRENT STATUS
2.5M pages crawled
150 GB transferred
800 registered users
Most important things:
we (think we) know what we should do next
polished some edge cases, made the service more stable
got the word out
got a speaking slot at WebCampZg!
[Diagram. Components: User, Front end web app, RabbitMQ, Crawlers, DB, Cloud storage. Connection labels: HTML, CSS; AJAX / WebSockets.]
FRONT END / ADMIN UI
Landing page + admin theme from ThemeForest
ASP.NET MVC 4
Entity Framework 5 (POCO, EF migrations)
DotNetOpenAuth for social login
EasyNetQ for RabbitMQ (pub/sub), CQS pattern for in-process messaging
SignalR (full duplex: WebSockets, with Ajax polling as fallback)
KnockoutJS, jQuery, Toastr
StructureMap IoC/DI, AutoMapper (DB entities <-> DTOs)
[Diagram. Components: RabbitMQ, ADO.NET / EF, Log, Controller, Command/query bus (CQS), Crawler, and several Crawler workers.]
CRAWLER SERVICE
Multi-threaded Crawler (vs evented crawler)
Entity Framework 5: LINQ + raw SQL queries with EF + ADO.NET bulk insert
EasyNetQ, RabbitMQ, CQS pattern
StructureMap, HtmlAgilityPack, NLog
Protobuf
CRAWLER WORKER PROCESS
Start or Resume
Resume: load state (SQL, serialized)

Get next page from queue (RabbitMQ, durable store)
Download HTML (200 ms – 5 s delay), HEAD request for external links
Check statuses, canonical, redirects
Run page analysers, extract data for report, prepare for bulk insert
Find links
Check for duplicates and blacklisted URLs
Check robots.txt
Check if visited – cache & DB
Normalize & store to queue (RabbitMQ)

Save state every N pages (Serialize with Protobuf, store byte[] to Db)
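
The link-handling steps above (find links, drop duplicates, check if visited, normalize, queue) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `Frontier`, `Normalize`, `TryEnqueue` and `Next` are hypothetical names, an in-memory `Queue` and `HashSet` stand in for RabbitMQ and the cache + DB visited check, and staying on the start host is an assumption. Robots.txt and blacklist checks are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the link steps of the worker loop.
// In the real service the queue is RabbitMQ and the visited check is cache + DB.
public sealed class Frontier
{
    private readonly Queue<Uri> _queue = new Queue<Uri>();
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);
    private readonly string _host;

    public Frontier(Uri start)
    {
        _host = start.Host;
        TryEnqueue(start, start.AbsoluteUri);
    }

    public int Pending { get { return _queue.Count; } }

    // Resolve href against the page URL and drop the fragment.
    // Returns null for anything that is not http(s).
    public static Uri Normalize(Uri page, string href)
    {
        Uri resolved;
        if (!Uri.TryCreate(page, href, out resolved)) return null;
        if (resolved.Scheme != Uri.UriSchemeHttp &&
            resolved.Scheme != Uri.UriSchemeHttps) return null;
        return new Uri(resolved.GetLeftPart(UriPartial.Query));
    }

    // Queue a link only if it is on the crawled host and was not seen before.
    public bool TryEnqueue(Uri page, string href)
    {
        Uri url = Normalize(page, href);
        if (url == null) return false;
        if (!string.Equals(url.Host, _host, StringComparison.OrdinalIgnoreCase)) return false;
        if (!_seen.Add(url.AbsoluteUri)) return false; // duplicate
        _queue.Enqueue(url);
        return true;
    }

    // Next page to download, or null when the queue is empty.
    public Uri Next()
    {
        return _queue.Count > 0 ? _queue.Dequeue() : null;
    }
}
```

Keying the seen-set on the normalized absolute URL means `../b#top` and `/b` count as the same page. The slides key the visited check on a URL hash instead, which keeps the stored value at a fixed size.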
RABBITMQ + EASYNETQ
ADMIN UI
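// Publish a message asking the crawler service to rebuild a report.
rabbitBus.OpenChannel(c => c.Publish(new RecreateReportMessage(id)));

SERVICE
// Subscribe as "crawlerservice" and hand the work to the in-process command bus.
rabbitBus.Subscribe<RecreateReportMessage>("crawlerservice", message =>
{
    _commandBus.Execute(new MakeReportCommand(message.ProjectId));
});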
COMMAND BUS (MEDIATOR)
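// Query: ask whether this URL was already crawled.
bool alreadyVisited =
    _bus.Request<bool>(new VisitedPageQuery.Input(projectId, urlHash));

// Command: persist the page and its report data.
_bus.Execute(new SavePageCommand(pageData, webPage));

// The handler is matched to the command by its type.
public class SavePageReportHandler : IHandle<SavePageCommand>
{
    // implementation
}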

Encapsulate command / query into classes
IoC / DI container finds and matches the handler for each command/query type
Easy unit testing
AOP: intercept query or command, pre/post execution (logging, auth, caching, ...)
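
A rough sketch of how such a bus could match handlers to message types. It is an illustration under assumptions, not the project's implementation: a plain dictionary replaces the IoC container, `InProcessBus`, `IHandleQuery`, `MarkVisitedCommand`, `IsVisitedQuery` and `VisitedHandlers` are hypothetical names, and `Request` takes two type parameters here for simplicity, where the slide shows `Request<bool>(...)`.

```csharp
using System;
using System.Collections.Generic;

public interface IHandle<TCommand> { void Handle(TCommand command); }
public interface IHandleQuery<TQuery, TResult> { TResult Handle(TQuery query); }

// Minimal in-process bus: one handler per command or query type.
public sealed class InProcessBus
{
    private readonly Dictionary<Type, object> _handlers = new Dictionary<Type, object>();

    public void RegisterCommand<TCommand>(IHandle<TCommand> handler)
    {
        _handlers[typeof(TCommand)] = handler;
    }

    public void RegisterQuery<TQuery, TResult>(IHandleQuery<TQuery, TResult> handler)
    {
        _handlers[typeof(TQuery)] = handler;
    }

    // Commands change state and return nothing.
    public void Execute<TCommand>(TCommand command)
    {
        ((IHandle<TCommand>)Find(typeof(TCommand))).Handle(command);
    }

    // Queries return a value and should not change state.
    public TResult Request<TQuery, TResult>(TQuery query)
    {
        return ((IHandleQuery<TQuery, TResult>)Find(typeof(TQuery))).Handle(query);
    }

    private object Find(Type messageType)
    {
        object handler;
        if (!_handlers.TryGetValue(messageType, out handler))
            throw new InvalidOperationException("No handler for " + messageType.Name);
        return handler;
    }
}

// Hypothetical messages and handlers, backed by an in-memory set instead of a DB.
public sealed class MarkVisitedCommand
{
    public readonly string UrlHash;
    public MarkVisitedCommand(string urlHash) { UrlHash = urlHash; }
}

public sealed class IsVisitedQuery
{
    public readonly string UrlHash;
    public IsVisitedQuery(string urlHash) { UrlHash = urlHash; }
}

public sealed class VisitedHandlers :
    IHandle<MarkVisitedCommand>, IHandleQuery<IsVisitedQuery, bool>
{
    private readonly HashSet<string> _visited = new HashSet<string>();
    public void Handle(MarkVisitedCommand command) { _visited.Add(command.UrlHash); }
    public bool Handle(IsVisitedQuery query) { return _visited.Contains(query.UrlHash); }
}
```

Because every call goes through `Execute` or `Request`, those two methods are the natural place to add logging, auth or caching around a handler, and a test can register a fake handler without touching the caller.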
ISSUES
Everything will crash: net connection, db, thread, VM, ...
Resuming / saving states
Memory issues/leaks with some frameworks
Don’t optimize before profiling (memory, db)
Log everything
DB indexes: how to store for fast filtering, paging
DB as queueing system (don’t)
CQS: command / query separation
Broken HTML, crazy links
Cloud services: connections fail
LEARNED
ORM
Go low level (raw SQL, bulk insert, SP) if needed
Profile: memory, SQL queries
Watch for 1st level cache (ORM unit of work or session)
NoSQL?

Caching
in process – in memory
Plan moving to separate service (Redis, ...)
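
One way to keep the in-process cache easy to move to a separate service later is to hide it behind the same interface as the slower store. The sketch below assumes this design; `IVisitedStore`, `CountingStore` and `CachedVisitedStore` are hypothetical names, and `CountingStore` is only an in-memory stand-in for the DB that counts lookups.

```csharp
using System;
using System.Collections.Generic;

public interface IVisitedStore
{
    bool Contains(string urlHash);
    void Add(string urlHash);
}

// Stand-in for the DB table; counts lookups so the effect of the cache is visible.
public sealed class CountingStore : IVisitedStore
{
    private readonly HashSet<string> _rows = new HashSet<string>();
    public int Lookups { get; private set; }
    public bool Contains(string urlHash) { Lookups++; return _rows.Contains(urlHash); }
    public void Add(string urlHash) { _rows.Add(urlHash); }
}

// In-process cache in front of a slower store.
// Only positive answers are cached: a visited URL never becomes unvisited.
public sealed class CachedVisitedStore : IVisitedStore
{
    private readonly IVisitedStore _inner;
    private readonly HashSet<string> _known = new HashSet<string>();
    private readonly object _gate = new object();

    public CachedVisitedStore(IVisitedStore inner) { _inner = inner; }

    public bool Contains(string urlHash)
    {
        lock (_gate) { if (_known.Contains(urlHash)) return true; }
        bool found = _inner.Contains(urlHash);   // cache miss: ask the slow store
        if (found) { lock (_gate) { _known.Add(urlHash); } }
        return found;
    }

    public void Add(string urlHash)
    {
        _inner.Add(urlHash);                     // write through to the slow store
        lock (_gate) { _known.Add(urlHash); }
    }
}
```

The set here grows without bound, which is acceptable only for a sketch. Swapping the in-memory set for a Redis-backed implementation of `IVisitedStore` would leave the callers unchanged.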

SOA
Pipeline design
Pub/Sub, CQS pattern (Mediator)
Unit testing
Cloud resilience
HOSTING
Hosting:
All on one server for now
Started with EC2
Migrated to Azure VM (higher HDD IO, faster CPU), BizSpark (free VM), free inbound traffic!
Now on Hetzner (dedicated, i7, 32 GB RAM, 2x SSD, Win 2012 = €60/month)

Stack: Win 2012, SQL Server 2012, .NET 4.5, ASP.NET MVC 4
Load & stress testing (crawl 500k URLs)
Goal: 100 parallel crawlers on a VM with 2 CPUs and 4 GB RAM (including OS and DB)

Will scale when needed
FUTURE PLANS
Fancy reports
Brand new web user interface
Integration with 3rd-party services (MajesticSEO, ...)
Special page analysis
NoSQL (RavenDB or Redis) for caching
Warehouse DB for browsing crawled pages
Lucene for full-text search (RavenDB)
Refactor crawler, pipeline design, async evented design
THANK YOU! QUESTIONS?
Hrvoje Hudoletnjak
m: hrvoje@hudoletnjak.com
t: twitter.com/hhrvoje

Goran Čandrlić
m: gorancandrlic@gmail.com
t: twitter.com/chande
