Crawling and Processing the
Italian Corporate Web
Alessio Guerrieri
SpazioDati S.R.L.
Your speaker
● Born in Trento
● Studied at UniTN and Georgia Tech
● PhD in Large Scale Graph Analytics
● Teaches Algorithms and Data Structures
● Data Scientist at SpazioDati
In my spare time:
● {Read|Watch|Play} {Science
Fiction|Fantasy} {Novels|TV|Board
Games}
SpazioDati S.R.L.
● Born in 2012
● Data integration
● Focus on corporate world:
○ Official data from Camera di Commercio
○ Open data
● Atoka
○ B2B database of company information
○ Sales intelligence
○ API
● Data analytics
○ Portfolio analysis
○ Lead generation
○ Risk evaluation
Always hard at work!
Internet Data Gathering (IDG)
IDG is an internal project to gather, process
and organize internet data about Italian
companies.
It uses many different technologies for Big
Data Gathering and Processing.
Entire pipeline runs on Amazon AWS
A representation of the Internet
Internet Data Gathering (IDG)
Takeaways:
● Web data is HORRIBLE
● OSS can help!
● For Big Data, you need a Big Framework
Crawling the Corporate Web
Web Crawler
Image from https://en.wikipedia.org/wiki/Web_crawler
Apache Nutch
● Distributed crawler runnable on Hadoop
● Highly configurable
Each iteration:
1. Injector adds new URLs
2. Generator runs Scoring Function to
select URLs
3. URLs are divided into segments
4. Each segment is downloaded in parallel
5. Pages are parsed
6. Newly discovered URLs are added to
CrawlDB
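The iteration above can be sketched in a few lines of Python. This is only a toy, in-memory illustration of the six steps, not Nutch's actual implementation: the `crawl_iteration` function, the `TOY_WEB` link graph and the length-based scoring function are all assumptions made up for the example, and segments are fetched one after another rather than in parallel.

```python
def crawl_iteration(crawl_db, seeds, fetch, score, top_n, num_segments):
    """Run one toy crawl iteration; crawl_db maps URL -> 'unfetched' or 'fetched'."""
    # 1. Inject: add new seed URLs without touching already known ones.
    for url in seeds:
        crawl_db.setdefault(url, "unfetched")

    # 2. Generate: pick the best-scoring URLs that were not fetched yet.
    candidates = [u for u, status in crawl_db.items() if status == "unfetched"]
    selected = sorted(candidates, key=score, reverse=True)[:top_n]

    # 3. Divide the selected URLs into segments (round-robin).
    segments = [selected[i::num_segments] for i in range(num_segments)]

    # 4-5. Fetch and parse each segment (sequentially here, for simplicity).
    discovered = set()
    for segment in segments:
        for url in segment:
            outlinks = fetch(url)  # hypothetical: returns the links found on the page
            crawl_db[url] = "fetched"
            discovered.update(outlinks)

    # 6. Update: newly discovered URLs enter the crawl database as unfetched.
    for url in discovered:
        crawl_db.setdefault(url, "unfetched")
    return selected


# Hypothetical three-page web used instead of real HTTP requests.
TOY_WEB = {
    "http://a.it/": ["http://a.it/about", "http://b.it/"],
    "http://a.it/about": [],
    "http://b.it/": ["http://a.it/"],
}


def toy_fetch(url):
    return TOY_WEB.get(url, [])


def prefer_short_urls(url):
    return -len(url)


db = {}
first = crawl_iteration(db, ["http://a.it/"], toy_fetch, prefer_short_urls, 10, 2)
second = crawl_iteration(db, [], toy_fetch, prefer_short_urls, 10, 2)
```

After the first call only the seed has been fetched and its two outlinks are waiting in `db`; the second call fetches those, shortest URL first.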
Nutch in SpazioDati
● Restricted to:
○ .it domains
○ domains registered in Italy (through
whois)
● Runs weekly:
○ Cluster of 15 machines
○ Uses the Elastic MapReduce service
○ 12M pages each week
● Keep complete history
○ 5.3T downloaded
○ Pages older than 4 months are no longer processed
Crawling is not easy!
Issues with crawling:
● People who do not want to be crawled
○ Be polite!
○ We follow robots.txt specification and
use unique User Agent
● Avoid accidental DDoS attacks
○ Each domain should be crawled
sequentially
● Never crawl too deeply
○ Filters on depth, URL length and queries
○ Try to avoid crawling a single domain
too much
“The crawlers delved too greedily and too deep”
https://www.amazon.it/s/ref=lp_1345828031_nr_p_n_binding_browse-b_0?fst=as%3Aoff&rh=n%3A411663031%2Cn%3A%21411664031%2Cn%3A1345828031%2Cp_n_binding_browse-bin%3A509801031&bbn=1345828031&ie=UTF8&qid=1504078452&rnid=509800031
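As a rough illustration of the politeness rules on this slide, the sketch below checks a URL against a robots.txt body and against simple filters on depth, URL length and query parameters. It uses Python's standard `urllib.robotparser`, not Nutch's own filter plugins; the user agent name and the three thresholds are assumptions chosen for the example, not the values used in the real pipeline.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-idg-bot"  # hypothetical user agent name
MAX_DEPTH = 5                   # assumed limit on link depth from the seed
MAX_URL_LENGTH = 200            # assumed limit on URL length in characters
MAX_QUERY_PARAMS = 3            # assumed limit on query-string parameters


def allowed_by_robots(robots_txt, url):
    """Return True if the given robots.txt body lets our user agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def passes_filters(url, depth):
    """Reject URLs that are too deep, too long, or carry too many query parameters."""
    query = urlsplit(url).query
    n_params = len(query.split("&")) if query else 0
    return (
        depth <= MAX_DEPTH
        and len(url) <= MAX_URL_LENGTH
        and n_params <= MAX_QUERY_PARAMS
    )


ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With these assumed limits, a long faceted-search URL like the one quoted on the slide would be rejected by `passes_filters` on length alone.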
Processing the Corporate Web
Extracting data from Crawl
The crawler gives us compressed JSON containing each page's HTML plus metadata
● Structured, useful information
● Domain based
● Distributed processing
Easy information: Text, Links, Codici Fiscali (Italian tax codes)
Medium information: Social Accounts, Logo, Language
Complex information: Technologies, Entities, People
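To give a feel for the "easy" and "medium" ends of this scale, here is a minimal sketch that pulls links out of a page's HTML and then keeps those pointing at social networks. It relies only on Python's standard `html.parser`; the `SOCIAL_HOSTS` set and the helper names are assumptions for illustration, and the real extractors are certainly more thorough.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed, deliberately short list of social network hosts.
SOCIAL_HOSTS = {"facebook.com", "twitter.com", "linkedin.com"}


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def social_accounts(links):
    """Keep only links whose host (without a leading 'www.') is a known social network."""
    found = []
    for link in links:
        host = (urlsplit(link).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_HOSTS:
            found.append(link)
    return found


sample_html = (
    '<a href="https://www.facebook.com/example">Facebook</a>'
    '<a href="/contatti">Contatti</a>'
)
sample_links = extract_links(sample_html)
```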
Hadoop for data processing:
● The user writes User Defined Functions (UDFs)
● Hadoop framework
○ Stores input data
○ Divides it into chunks
○ Makes it available to all machines
○ Runs UDFs on all chunks
○ Guarantees fault tolerance
○ Collects output
Hadoop
This guy does not have the energy to implement
fault tolerance...
PIG
Scripting language for Hadoop
● Scripts are written in Pig Latin
● Looks kinda like SQL
● Pipelines are easy to build
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
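For readers who do not know Pig Latin, the same word count can be written for a single machine in plain Python. This is only an approximate equivalent: `str.split()` stands in for Pig's `TOKENIZE`, which uses a slightly different set of separators, and nothing here is distributed.

```python
import re
from collections import Counter


def word_count(lines):
    """Count words made only of word characters, most frequent first."""
    counts = Counter()
    for line in lines:
        for token in line.split():           # roughly FLATTEN(TOKENIZE(line))
            if re.fullmatch(r"\w+", token):  # FILTER ... MATCHES '\\w+'
                counts[token] += 1           # GROUP BY word + COUNT
    return counts.most_common()              # ORDER BY count DESC


result = word_count(["speck and tech", "speck & tech", "speck"])
```

The Pig version earns its keep once the input no longer fits on one machine, since the same script then runs unchanged on a Hadoop cluster.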
Pig in SpazioDati
Our pipeline:
1. Computes domain for each page
2. Groups by domain
3. Extracts information for each domain
4. Integrates data from other sources (e.g.
whois)
5. Exports a JSON document for each domain
● Runs (roughly) monthly
● Cluster of 30 machines
● AWS’s Elastic MapReduce service
● Difficult to test :(
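The first steps of the pipeline can be sketched on a single machine as follows. The real pipeline is written in Pig and runs on a cluster; the page fields (`url`, `text`) and the shape of the exported JSON are assumptions made for this example only.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Step 1: compute the domain (here simply the lowercased host name) of a page."""
    return (urlsplit(url).hostname or "").lower()


def group_by_domain(pages):
    """Step 2: group pages by domain."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[domain_of(page["url"])].append(page)
    return grouped


def export_domain(domain, pages):
    """Steps 3 and 5: extract (trivial) per-domain information and export it as JSON."""
    record = {
        "domain": domain,
        "n_pages": len(pages),
        "text": " ".join(page["text"] for page in pages),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical crawled pages.
pages = [
    {"url": "http://www.example.it/", "text": "home"},
    {"url": "http://www.example.it/chi-siamo", "text": "about"},
    {"url": "http://other.it/", "text": "hello"},
]
grouped = group_by_domain(pages)
exported = {domain: export_domain(domain, group) for domain, group in grouped.items()}
```

Keeping each step as a small pure function like this is one way to make the logic easier to unit test outside the cluster.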
Querying the Corporate Web
Requirements
We want to index our extracted data.
● We should access it easily
● We should explore it efficiently
We will be able to:
● Match it with official data about
companies
● Serve it in the backend of our services
5M jsons without indexing
Elasticsearch
Open source search engine
● Based on Lucene index
○ Highly efficient index
○ Mostly on disk
● Full text search
● Nested fields support
● Cluster structure
● Web interface
● Allows (very) complex queries
5M indexed jsons
Sample query
Domains that contain the word ‘speck’ in the
text:
{
"_source": false,
"query":{
"term":{
"text": "speck"
}
},
"size": 5
}
{
"hits": {
"total": 15069,
"max_score": 11.716405,
"hits": [
{
"_id": "www.titospeck.it",
"_score": 11.716405
},
{
"_id": "derpsairer.it",
"_score": 11.6602
},
{
"_id": "www.speck.it",
"_score": 11.626965
},
{
"_id": "www.bayona-music.com",
"_score": 11.607182
},
{
"_id": "www.salumificiocoati.it",
"_score": 11.560882
}
]
}
}
Sample query (2)
Domains whose text contains phrases similar to
"speck and tech":
{
"_source": false,
"query":{
"match":{
"text": "speck and tech"
}
},
"size": 3
}
{
"hits": {
"total": 1003897,
"max_score" : 19.871191,
"hits": [
{
"_id": "speckand.tech" ,
"_score": 19.871191
},
{
"_id": "www.speckietechies.com" ,
"_score": 19.674822
},
{
"_id": "francescobonadiman.com" ,
"_score": 17.935522
}
]
}
}
Complex query
{
"size": 0,
"query":{
"bool":{
"must":[
{
"term":{
"technologies.cms.name" : "WordPress"
}
},
{
"term":{
"technologies.cms.version" :"3.0"
}
}
]
}
}
}
{
"took": 1,
"timed_out" : false,
"_shards": {
"total": 10,
"successful" : 10,
"failed": 0
},
"hits": {
"total": 211,
"max_score" : 0,
"hits": []
}
}
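The query above appears to count the domains whose detected CMS is WordPress at version 3.0: both `term` clauses must match, and `"size": 0` asks for the total only. A query body of this shape can be assembled programmatically, as in the sketch below; the `all_terms_query` helper is hypothetical, and sending the body to a cluster is left out.

```python
import json


def all_terms_query(terms, size=0):
    """Build a bool/must query in which every field must equal the given value."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"term": {field: value}} for field, value in terms.items()
                ]
            }
        },
    }


body = all_terms_query(
    {
        "technologies.cms.name": "WordPress",
        "technologies.cms.version": "3.0",
    }
)
print(json.dumps(body, indent=2))
```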
Complex query
Compute the distribution of the most used CMS software
{
"size": 0,
"aggregations" : {
"aggs" : {
"terms": {
"field" : "technologies.cms.name" ,
"size" : 20
}
}
}
}
{
"aggregations" : {
"aggs": {
"doc_count_error_upper_bound" : 997,
"sum_other_doc_count" : 43403,
"buckets" : [
{
"key": "WordPress" ,
"doc_count" : 590133
},
{
"key": "Joomla" ,
"doc_count" : 163595
},
{
"key": "Drupal" ,
"doc_count" : 33727
},
{
"key": "DM Polopoly" ,
"doc_count" : 30455
},
{
"key": "Weebly" ,
"doc_count" : 9861
}
]
}
}
}
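To turn the bucket counts into a distribution, each count can be divided by the sum of the counts. The sketch below does this for the five buckets shown on the slide. Note the assumption: the query asks for 20 buckets but only five are visible here, so the shares are relative to those five only and ignore `sum_other_doc_count`.

```python
def bucket_shares(buckets):
    """Return each bucket's share of the total doc_count over the given buckets."""
    total = sum(bucket["doc_count"] for bucket in buckets)
    return {bucket["key"]: bucket["doc_count"] / total for bucket in buckets}


# The five buckets shown in the response above.
buckets = [
    {"key": "WordPress", "doc_count": 590133},
    {"key": "Joomla", "doc_count": 163595},
    {"key": "Drupal", "doc_count": 33727},
    {"key": "DM Polopoly", "doc_count": 30455},
    {"key": "Weebly", "doc_count": 9861},
]
shares = bucket_shares(buckets)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```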
Getting value from the Corporate Web
The rest of the IDG pipeline
IDG is much more:
● Finding the correct domains for each
company
● Extracting information from social networks
● Validating emails collected on the web
● etc.
The real IDG pipeline
Conclusions
● There is a lot of Open Source
Software for Big Data processing
● You’ll need to tinker with available
features
● Web data is often:
○ Outdated
○ Badly formatted
○ Ambiguous
Thanks for your attention!
Questions?
Interested?
see www.spaziodati.eu/jobs for opportunities!
