Crawling and Processing the Italian Corporate Web

Alessio Guerrieri
SpazioDati S.R.L.

Your speaker
● Born in Trento
● Studied at UniTN and Georgia Tech
● PhD in Large Scale Graph Analytics
● Teaches Algorithms and Data Structures
● Data Scientist at SpazioDati
In my spare time:
● {Read|Watch|Play} {Science Fiction|Fantasy} {Novels|TV|Board Games}

SpazioDati S.R.L.
● Born in 2012
● Data integration
● Focus on corporate world:
○ Official data from Camera di Commercio
○ Open data
● Atoka
○ B2B database of company information
○ Sales intelligence
○ API
● Data analytics
○ Portfolio analysis
○ Lead generation
○ Risk evaluation
Always hard at work!

Internet Data Gathering (IDG)
IDG is an internal project to gather, process
and organize internet data about Italian
companies.
It uses many different technologies for Big
Data gathering and processing.
The entire pipeline runs on Amazon AWS.
[Image: a representation of the Internet]

Internet Data Gathering (IDG)
Takeaways:
● Web data is HORRIBLE
● OSS can help!
● For Big Data, you need a Big Framework

Web Crawler
Image from https://en.wikipedia.org/wiki/Web_crawler

Apache Nutch
● Distributed crawler runnable on Hadoop
● Highly configurable
Each iteration:
1. Injector adds new URLs
2. Generator runs the Scoring Function to select URLs
3. URLs are divided into segments
4. Each segment is downloaded in parallel
5. Pages are parsed
6. Newly discovered URLs are added to CrawlDB
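
The iteration above can be sketched as a toy, single-process loop. This is not Nutch code: the in-memory `TOY_WEB`, the URL-length scoring rule and all function names are assumptions made up for illustration, and real Nutch runs each step as a Hadoop job.

```python
# Toy in-memory "web": URL -> outgoing links (hypothetical data).
TOY_WEB = {
    "http://a.it/": ["http://a.it/about", "http://b.it/"],
    "http://a.it/about": ["http://a.it/"],
    "http://b.it/": ["http://c.it/"],
    "http://c.it/": [],
}


def inject(crawl_db, seeds):
    """Step 1: add new URLs to the crawl database as unfetched."""
    for url in seeds:
        crawl_db.setdefault(url, "unfetched")


def generate(crawl_db, top_n):
    """Step 2: select URLs with a stand-in scoring function (shorter first)."""
    candidates = [u for u, status in crawl_db.items() if status == "unfetched"]
    return sorted(candidates, key=lambda u: (len(u), u))[:top_n]


def split_segments(urls, n_segments):
    """Step 3: divide the selected URLs into segments."""
    return [urls[i::n_segments] for i in range(n_segments)]


def fetch_and_parse(segment):
    """Steps 4-5: 'download' each page and return its outgoing links."""
    return {url: TOY_WEB.get(url, []) for url in segment}


def crawl_iteration(crawl_db, seeds=(), top_n=10, n_segments=2):
    inject(crawl_db, seeds)
    selected = generate(crawl_db, top_n)
    for segment in split_segments(selected, n_segments):
        for url, links in fetch_and_parse(segment).items():
            crawl_db[url] = "fetched"
            # Step 6: newly discovered URLs go back into the crawl database.
            for link in links:
                crawl_db.setdefault(link, "unfetched")
    return selected


crawl_db = {}
first = crawl_iteration(crawl_db, seeds=["http://a.it/"])
second = crawl_iteration(crawl_db)
```

Each call to `crawl_iteration` only fetches URLs already known to the crawl database, so the frontier grows by one link hop per iteration.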

Nutch in SpazioDati
● Restricted to:
○ .it domains
○ domains registered in Italy (through
whois)
● Runs weekly:
○ Cluster of 15 machines
○ Use Elastic MapReduce service
○ 12M pages each week
● Keep complete history
○ 5.3 TB downloaded
○ Pages older than 4 months are no longer processed

Crawling is not easy!
Issues with crawling:
● People who do not want to be crawled
○ Be polite!
○ We follow the robots.txt specification and use a unique User-Agent
● Avoid accidental DDoS attacks
○ Each domain should be crawled sequentially
● Never crawl too deeply
○ Filters on depth, URL length and queries
○ Try to avoid crawling a single domain too much
“The crawlers delved too greedily and too deep”
https://www.amazon.it/s/ref=lp_1345828031_nr_p_n_binding_browse-b_0?fst=as%3Aoff&rh=n%3A411663031%2Cn%3A%21411664031%2Cn%3A1345828031%2Cp_n_binding_browse-bin%3A509801031&bbn=1345828031&ie=UTF8&qid=1504078452&rnid=509800031
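
A minimal sketch of the politeness rules and URL filters listed on this slide, using only the Python standard library. The thresholds, the `USER_AGENT` string and the function names are hypothetical; the actual Nutch configuration used at SpazioDati is not shown in the talk.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-idg-bot"  # hypothetical crawler name
MAX_DEPTH = 5                   # assumed limit on path segments
MAX_URL_LENGTH = 200            # assumed limit on URL length
MAX_QUERY_PARAMS = 3            # assumed limit on query parameters


def passes_filters(url):
    """Reject URLs that are too deep, too long or too query-heavy."""
    parts = urlsplit(url)
    depth = len([segment for segment in parts.path.split("/") if segment])
    n_params = len(parse_qsl(parts.query, keep_blank_values=True))
    return (
        len(url) <= MAX_URL_LENGTH
        and depth <= MAX_DEPTH
        and n_params <= MAX_QUERY_PARAMS
    )


def allowed_by_robots(robots_txt, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def group_by_host(urls):
    """Group URLs per host so each host can be fetched sequentially."""
    queues = defaultdict(list)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return dict(queues)
```

With these assumed limits, a faceted-search URL like the Amazon example above would be rejected because it carries six query parameters.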

Extracting data from Crawl
The crawler gives us compressed JSON containing the HTML of each page plus metadata
● Structured, useful information
● Domain based
● Distributed processing
Easy information             Medium information   Complex information
Text                         Social Accounts      Technologies
Links                        Logo                 Entities
Codici Fiscali (tax codes)   Language             People
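
As an illustration of the "easy" column, the sketch below pulls text, links and candidate Codici Fiscali out of one HTML page with the standard library. The regular expressions are simplified assumptions (16-character personal codes without omocodia variants, 11-digit company codes without checksum validation), and the parser does not skip script or style content.

```python
import re
from html.parser import HTMLParser

# Simplified patterns, assumed for illustration only.
CF_PERSON = re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")
CF_COMPANY = re.compile(r"\b\d{11}\b")


class LinkAndTextParser(HTMLParser):
    """Collects href values of <a> tags and all non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def extract_easy_fields(html):
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    codes = CF_PERSON.findall(text) + CF_COMPANY.findall(text)
    return {"text": text, "links": parser.links, "codici_fiscali": codes}


page = '<p>P.IVA 01234567890</p><a href="https://example.it/contatti">Contatti</a>'
fields = extract_easy_fields(page)
```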

Hadoop for data processing:
● User defines User Defined Functions
● Hadoop framework
○ Stores input data
○ Divides it into chunks
○ Makes it available to all machines
○ Runs UDFs on all chunks
○ Guarantees fault tolerance
○ Collects output
Hadoop
This guy does not have the energy to implement
fault tolerance...
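
The division of labour on this slide can be sketched in a few lines: the user supplies only the function, and the framework splits the input, runs the function on every chunk, retries failures and collects the output. This is a single-process toy with hypothetical names, not the Hadoop API.

```python
def split_into_chunks(records, chunk_size):
    """Divide the input into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_job(records, udf, chunk_size=2, max_attempts=3):
    """Run a user defined function on every chunk, retrying failed chunks."""
    output = []
    for chunk in split_into_chunks(records, chunk_size):
        for attempt in range(1, max_attempts + 1):
            try:
                result = [udf(record) for record in chunk]
                break  # chunk succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after too many failures
        output.extend(result)
    return output


calls = {"n": 0}


def flaky_upper(record):
    """Example UDF that fails on its very first call to simulate a crash."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated machine failure")
    return record.upper()


result = run_job(["a", "b", "c"], flaky_upper)
```

The UDF author never writes the retry loop, which is the point of delegating fault tolerance to the framework.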

PIG
Scripting language for Hadoop
● Scripts are written in Pig Latin
● Looks kinda like SQL
● Pipelines are easy to build
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
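
For readers unfamiliar with Pig Latin, here is a rough single-machine Python equivalent of the script above. It assumes the filter is meant to keep tokens made only of word characters, and it splits on whitespace only, which is simpler than Pig's TOKENIZE.

```python
import re
from collections import Counter


def word_count(lines):
    """Count word occurrences and return them ordered by descending count."""
    counts = Counter()
    for line in lines:
        for token in line.split():            # TOKENIZE + FLATTEN
            if re.fullmatch(r"\w+", token):   # FILTER ... MATCHES
                counts[token] += 1            # GROUP + COUNT
    return counts.most_common()               # ORDER ... BY count DESC


ordered = word_count(["speck and tech", "speck & tech speck"])
```

The Pig version expresses the same steps declaratively, which lets Hadoop distribute them over many machines.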

Pig in SpazioDati
Our pipeline:
1. Computes domain for each page
2. Groups by domain
3. Extracts information for each domain
4. Integrates data from other sources (e.g. whois)
5. Exports a JSON document for each domain
● Runs (roughly) monthly
● Cluster of 30 machines
● AWS’s Elastic MapReduce service
● Difficult to test :(
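
A minimal sketch of the five pipeline steps in plain Python, useful for seeing the data flow on a handful of pages. The field names (`url`, `text`, `n_pages`, `whois`) and the use of the hostname as the domain are assumptions; the real pipeline is written in Pig and extracts far more information.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Step 1: compute the domain of a page (here simply its hostname)."""
    return urlsplit(url).hostname or ""


def build_domain_documents(pages, whois_by_domain):
    # Step 2: group pages by domain.
    pages_by_domain = defaultdict(list)
    for page in pages:
        pages_by_domain[domain_of(page["url"])].append(page)

    documents = []
    for domain, domain_pages in sorted(pages_by_domain.items()):
        doc = {
            "domain": domain,
            # Step 3: extract information for the domain (toy examples).
            "n_pages": len(domain_pages),
            "text": " ".join(p["text"] for p in domain_pages),
            # Step 4: integrate data from another source.
            "whois": whois_by_domain.get(domain, {}),
        }
        # Step 5: export one JSON document per domain.
        documents.append(json.dumps(doc, sort_keys=True))
    return documents


pages = [
    {"url": "http://www.example.it/", "text": "home"},
    {"url": "http://www.example.it/chi-siamo", "text": "chi siamo"},
    {"url": "http://other.it/", "text": "altro"},
]
docs = build_domain_documents(pages, {"other.it": {"country": "IT"}})
```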

Requirements
We want to index our extracted data.
● We should access it easily
● We should explore it efficiently
We will be able to:
● Match it with official data about
companies
● Serve it in the backend of our services
5M jsons without indexing

Elasticsearch
Open source search engine
● Based on Lucene index
○ Highly efficient index
○ Mostly on disk
● Full text search
● Nested fields support
● Cluster structure
● Web interface
● Allows (very) complex queries
5M indexed jsons

Sample query
Domains that contain the word ‘speck’ in the
text:
{
  "_source": false,
  "query": {
    "term": {
      "text": "speck"
    }
  },
  "size": 5
}
Response:
{
  "hits": {
    "total": 15069,
    "max_score": 11.716405,
    "hits": [
      {
        "_id": "www.titospeck.it",
        "_score": 11.716405
      },
      {
        "_id": "derpsairer.it",
        "_score": 11.6602
      },
      {
        "_id": "www.speck.it",
        "_score": 11.626965
      },
      {
        "_id": "www.bayona-music.com",
        "_score": 11.607182
      },
      {
        "_id": "www.salumificiocoati.it",
        "_score": 11.560882
      }
    ]
  }
}

Sample query (2)
Domains whose text contains phrases similar to
"speck and tech":
{
  "_source": false,
  "query": {
    "match": {
      "text": "speck and tech"
    }
  },
  "size": 3
}
Response:
{
  "hits": {
    "total": 1003897,
    "max_score": 19.871191,
    "hits": [
      {
        "_id": "speckand.tech",
        "_score": 19.871191
      },
      {
        "_id": "www.speckietechies.com",
        "_score": 19.674822
      },
      {
        "_id": "francescobonadiman.com",
        "_score": 17.935522
      }
    ]
  }
}
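
A toy inverted index can illustrate the difference between looking up one exact term and matching any word of an analyzed phrase. This is a simplified sketch with hypothetical function names and whitespace-only analysis; it ignores scoring and is not how Lucene is implemented.

```python
from collections import defaultdict


def analyze(text):
    """Very rough analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map every token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    """Exact lookup: the input is NOT analyzed, it must equal one token."""
    return set(index.get(term, set()))


def match_any(index, phrase):
    """Analyzed lookup: documents containing any token of the phrase."""
    hits = set()
    for token in analyze(phrase):
        hits |= index.get(token, set())
    return hits


docs = {
    "a.it": "Speck and cheese",
    "b.it": "tech news",
    "c.it": "wine",
}
index = build_index(docs)
```

In this toy index, an exact lookup of the whole string "speck and tech" finds nothing, while the analyzed lookup returns every document that shares at least one word with it.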

Complex query
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "technologies.cms.name": "WordPress"
          }
        },
        {
          "term": {
            "technologies.cms.version": "3.0"
          }
        }
      ]
    }
  }
}
Response:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 211,
    "max_score": 0,
    "hits": []
  }
}
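
When such queries are generated from code, the body can be built as a plain dictionary and serialized to JSON. A small sketch follows; the helper name `cms_version_query` is hypothetical, and sending the body to a cluster is left out.

```python
import json


def cms_version_query(name, version):
    """Build a count-only bool query that requires both term clauses."""
    return {
        "size": 0,  # return no documents, only the total hit count
        "query": {
            "bool": {
                "must": [
                    {"term": {"technologies.cms.name": name}},
                    {"term": {"technologies.cms.version": version}},
                ]
            }
        },
    }


body = json.dumps(cms_version_query("WordPress", "3.0"), indent=2)
```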

Complex query (2)
Compute the distribution of the most used CMS software
{
  "size": 0,
  "aggregations": {
    "aggs": {
      "terms": {
        "field": "technologies.cms.name",
        "size": 20
      }
    }
  }
}
Response:
{
  "aggregations": {
    "aggs": {
      "doc_count_error_upper_bound": 997,
      "sum_other_doc_count": 43403,
      "buckets": [
        {
          "key": "WordPress",
          "doc_count": 590133
        },
        {
          "key": "Joomla",
          "doc_count": 163595
        },
        {
          "key": "Drupal",
          "doc_count": 33727
        },
        {
          "key": "DM Polopoly",
          "doc_count": 30455
        },
        {
          "key": "Weebly",
          "doc_count": 9861
        }
      ]
    }
  }
}
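
What a terms aggregation computes can be sketched on a single machine with a counter: the most frequent values become buckets, and everything else is summed into `sum_other_doc_count`. The flat `cms` field and the function name are assumptions, and the sketch ignores the per-shard approximation that produces `doc_count_error_upper_bound` on a real cluster.

```python
from collections import Counter


def terms_aggregation(docs, field, size):
    """Return the top `size` values of `field` with their document counts."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    top = counts.most_common(size)
    returned = sum(count for _, count in top)
    return {
        "buckets": [{"key": key, "doc_count": count} for key, count in top],
        "sum_other_doc_count": sum(counts.values()) - returned,
    }


sample = (
    [{"cms": "WordPress"}] * 3
    + [{"cms": "Joomla"}] * 2
    + [{"cms": "Drupal"}]
    + [{}]  # a domain with no detected CMS is skipped
)
distribution = terms_aggregation(sample, "cms", size=2)
```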

Getting value from the Corporate Web

The rest of the IDG pipeline
IDG is much more:
● Finding the correct domains for each
company
● Extracting information from social networks
● Validating emails collected on the web
● etc.
The real IDG pipeline

Conclusions
● There is a lot of Open Source
Software for Big Data processing
● You’ll need to tinker with available
features
● Web data is often:
○ Outdated
○ Badly formatted
○ Ambiguous

Thanks for your attention!
Questions?
Interested?
see www.spaziodati.eu/jobs for opportunities!
