The Distributed Crawler
v2.5.x-chaika
Service architecture overview
The Hierarchical Cluster Engine project
IOIX Ukraine, 2014-2015
Introduction
The HCE-DC is a multipurpose, high-performance, scalable and extensible
engine for web data collection and processing.
It is built on several HCE project products and technologies:
● The Distributed Tasks Manager (DTM) service.
● The hce-node network cluster application.
● The API bindings for the Python and PHP languages.
● The tools and library of crawling algorithms.
● The tools and library of scraping algorithms.
● The web administration console.
● The real-time REST API.
The engine provides flexible configuration and deployment automation to fit
an installation closely to a target project and to simplify integration.
The main functional purposes
● Crawling – scan web sites, analyze and parse pages, detect and collect
URL links and web resources' data. Download resources from web servers
using collected URLs or URLs provided with a request, and store them in
local raw-data file storage. Everything runs on a multi-host, multi-process
architecture.
● Processing – process web page contents with several customizable applied
algorithms, such as unstructured textual content scraping, and store the
results in local SQL database storage, again on a multi-host, multi-process
architecture.
● Task management – manage crawling and processing tasks with scheduling
and balancing, using either the multi-host tasks management service or the
real-time multi-threaded load-balancing client-server architecture with a
multi-host backend engine.
The extended functionality
● Developer's API – full access to configuration, deployment, monitoring and
management of processes and data.
● Applied API – a full-featured, multi-thread, multi-host REST HTTP-based
protocol to perform crawling and scraping batch requests (a request sketch
follows this list).
● Web administration console – management of the DC and DTM services,
user accounts, roles and permissions, crawling, scraping, results collection,
aggregation and conversion, statistical data collection and visualization,
notifications, triggering and other utility tools.
● Helper tools and libraries – several supporting applied utilities.
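As an illustration of the Applied API, the sketch below posts a small crawling and scraping batch to the REST gateway from Python. The endpoint path, JSON field names and token parameter are assumptions made for illustration only, not the documented DC protocol.

# Hedged sketch: submit a crawl-and-scrape batch to the DC REST gateway.
# The gateway URL, JSON field names and "auth" token are illustrative
# assumptions; the real request format is defined by the DC service.
import json
import urllib.request

batch = {
    "auth": "user-token",                        # assumed per-user authorization token
    "mode": "crawl+process",                     # assumed: crawl the URLs, then scrape them
    "urls": ["http://example.com/news/1.html"],
    "limits": {"pages": 10, "depth": 1},         # assumed limit names
}

req = urllib.request.Request(
    "http://dc-gateway.example:8080/api/batch",  # assumed gateway host and path
    data=json.dumps(batch).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=60) as resp:
    print(json.loads(resp.read().decode("utf-8")))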
Distributed asynchronous nature
The HCE-DC engine is architecturally a fully distributed system. It can be
deployed and configured as a single-host or multi-host installation.
Key features and properties of the distributed architecture:
● No central database or data storage for crawling and processing. Each
physical host shards data using the same structures, but the whole is
represented as a single service.
● Crawling and processing run on several physical hosts in a parallel,
multi-process way, including downloading, fetching, DOM parsing, URL
collection, field extraction, post-processing, metric calculation and other
tasks.
● Customizable strategies of data sharding and request balancing that
minimize data redundancy and optimize resource usage.
● Internal reduction of pages and scraped contents with smart merging that
avoids resource duplicates in the fetched-data client response.
Flexible balancing and scalability
The HCE-DC service can be deployed on a set of physical hosts. The number
of hosts depends on their hardware capacity (CPU core count, disk space,
network interface speed and so on) and can range from one to tens or more.
Key scalability principles are:
– The hardware computational unit is a physical or logical host (any kind of
virtualization or container is supported).
– Hardware units can be added to the system at run-time and gradually filled
with data during regular crawling iterations. No dedicated data migration is
required.
– Computational task balancing is optimized for resource usage. The task
scheduler selects the computational unit with the most free system resources
using a customizable estimation formula (see the sketch after this list).
Different resource usage indicators are available: CPU, RAM, disk, I/O wait,
process count, thread count and so on.
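A minimal sketch of the host-selection step described above: each candidate host reports its resource indicators and the scheduler picks the host with the best score from a weighted formula. The indicator names and weights here are assumptions for illustration; in the service the estimation formula is customizable.

# Hedged sketch of resource-optimized balancing: choose the host with the
# most free resources according to a customizable weighted formula.
# Indicator names and weights are illustrative assumptions.

def free_resource_score(host, weights=None):
    weights = weights or {"cpu": 0.4, "ram": 0.3, "disk": 0.2, "iowait": 0.1}
    return (weights["cpu"] * (1.0 - host["cpu_load"])        # lower CPU load scores higher
            + weights["ram"] * host["ram_free_ratio"]
            + weights["disk"] * host["disk_free_ratio"]
            + weights["iowait"] * (1.0 - host["io_wait"]))

def select_host(hosts):
    # The scheduler assigns the next batch to the host with the maximum score.
    return max(hosts, key=free_resource_score)

hosts = [
    {"name": "dc-host-1", "cpu_load": 0.45, "ram_free_ratio": 0.6,
     "disk_free_ratio": 0.7, "io_wait": 0.03},
    {"name": "dc-host-2", "cpu_load": 0.10, "ram_free_ratio": 0.8,
     "disk_free_ratio": 0.5, "io_wait": 0.01},
]
print(select_host(hosts)["name"])  # -> dc-host-2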
Extensible software and algorithms
The HCE-DC service for the Linux OS platform has three main parts:
● The core service daemon modules, with functionality including: the crawling
tasks scheduler, the tasks queues manager, the managers of periodical
processes, the computational units data manager, the storage resources
aging manager, the real-time API requests manager and so on. Typically the
core daemon process runs on a dedicated host and represents the service
itself.
● The computational unit modules set, including the crawler-task crawling
module, the processor-task scraping algorithms module with several scraper
modules, the db-task storage management module, the pre-processor and
finalizer modules, and several additional helper utilities. They act on
session-based principles and exit after the input batch data set has been
processed (a minimal invocation sketch follows this list).
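From the core daemon's side, the session-based principle roughly looks like the sketch below: a computational module is launched as a separate OS process for one batch and is expected to exit once that batch is processed. The executable name is borrowed from the module names above, but the command-line shape and batch format are assumptions for illustration.

# Hedged sketch: run a session-based computational module for one batch.
# The module is started per batch and exits when the batch is processed.
# The CLI arguments and batch layout are illustrative assumptions.
import json
import subprocess

batch = json.dumps({"items": [{"url": "http://example.com/"}]})

proc = subprocess.run(
    ["./crawler-task", "--batch", "-"],    # assumed CLI shape of a crawler-task run
    input=batch,
    capture_output=True,
    text=True,
    timeout=300,
)
if proc.returncode == 0:
    processed_batch = proc.stdout          # result batch, handed to the next chain step
else:
    print("module failed:", proc.stderr)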
Open processing architecture
The computational modules set can be extended with any kind of algorithms,
libraries and frameworks, for any platform and programming language. The
only limitation is the API interaction translation, which typically needs some
adapters. The key principles are:
● Data processing modules are invoked as native OS processes or via an API,
including REST and CLI.
● Process instances are isolated.
● The POSIX CLI API is the default for inter-process data exchange, or is
simulated by converter utilities.
● An open input/output protocol is used to process a batch sequentially, step
by step, by each element of the processing chain (a module sketch follows
this list).
● Streaming open data formats such as JSON and XML are easily serialized.
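The sketch below shows the shape of one such processing-chain step under the principles above: a standalone process reads a JSON batch from stdin, adds its own results, and writes the batch to stdout so the next module in the chain can consume it. The batch field names are assumptions for illustration, not the actual HCE-DC batch format.

# Hedged sketch of one step of the sequential batch processing chain:
# read a JSON batch from stdin, enrich each item, write JSON to stdout.
# Field names ("items", "raw_html", "word_count") are illustrative
# assumptions, not the real HCE-DC batch protocol.
import json
import sys

def process_item(item):
    # Example applied algorithm: a trivial metric over the raw page content.
    item["word_count"] = len(item.get("raw_html", "").split())
    return item

def main():
    batch = json.load(sys.stdin)
    batch["items"] = [process_item(i) for i in batch.get("items", [])]
    json.dump(batch, sys.stdout)

if __name__ == "__main__":
    main()

Modules written this way can be chained with ordinary pipes, which is the spirit of the POSIX CLI exchange described above.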
General DC service architecture
Real-time client-server architecture
Internal DC service architecture
Brief list of main DC features
Fully automated distributed web site crawling with: sets of root URLs, periodic re-crawling,
HTTP and HTML redirects, HTTP timeouts, dynamic HTML rendering, prioritization, limits
(size, pages, contents, errors, URLs, redirects, content types), request delaying,
robots.txt, rotating proxies, RSS (1, 2, RDF, Atom), scan depth, filters, page chains and
batching.
Fully automated distributed data processing: the News™ article engine (pre-defined
sequential scraping and use of extractors – Goose, Newspaper and Scrapy) and the
Template™ universal engine (definitions of tags and rules to extract data from pages
based on XPath and CSS path, joining of content parts, merging, best-result selection,
regular-expression post-processing, multi-item pages (product, search results, etc.),
multi-rule and multi-template support; an XPath-rule illustration follows this list), a
WYSIWYG template editor, processed content selection and merging, and an extensible
processing modules architecture.
Fully automated resource data management: periodic operations, data aging, updates,
re-crawling and re-processing.
Web administration console: full CRUD of projects for data collection and processing with
a set of parameters per project, users with roles and permission ACLs, DC and DTM
service statistics, and crawling and processing statistics per project.
Web REST API gateway: synchronous HTTP REST requests with batching to crawl and to
process, a full-featured parameter set with additional limitations per user account and
authorization state.
Real-time requests API: native CLI client, asynchronous and synchronous REST requests.
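For the Template-style scraping mentioned above, the fragment below illustrates the general idea of an XPath rule combined with regular-expression post-processing, using lxml. The rule layout, field names and sample page are assumptions for illustration and not the DC template definition format.

# Hedged illustration of XPath extraction with regex post-processing,
# in the spirit of the Template scraping engine. The rule structure is an
# assumption; real DC templates have their own definition format.
import re
from lxml import html  # third-party dependency: lxml

PAGE = "<html><body><h1> Sample title </h1><span class='price'>USD 19.99</span></body></html>"

RULES = {
    "title": {"xpath": "//h1/text()", "post": lambda s: s.strip()},
    "price": {"xpath": "//span[@class='price']/text()",
              "post": lambda s: re.search(r"[\d.]+", s).group(0)},
}

doc = html.fromstring(PAGE)
result = {}
for field, rule in RULES.items():
    values = doc.xpath(rule["xpath"])
    result[field] = rule["post"](values[0]) if values else None

print(result)  # {'title': 'Sample title', 'price': '19.99'}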
Statistics of a three-physical-host installation over one month
● Projects: 8
● Pages crawled: 6.2M
● Crawling batches: 60K
● Processing batches: 90K
● Purging batches: 16K
● Aging batches: 16K
● Project re-crawlings: 30K
● CPU load average: 0.45 avg / 3.5 max
● CPU utilization: 3% avg / 30% max
● I/O wait time: 0.31 avg / 6.4 max
● Network connections: 250 avg / 747 max
● Network traffic: 152 Kbps avg / 5.5 Mbps max
● Data hosts: 2
● Load balancing of OS resources keeps CPU load average, I/O wait and RAM usage close to linear, without spikes or overloads.
● Linear scalability of real-time requests per physical host.
● Linear scalability of automated crawling, processing and aging per physical host.