SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Downloaden Sie, um offline zu lesen
Crawl the entire web
in 10 minutes...
Copyright ©: 2015 OnPage.org GmbH
Using AWS-EMR, AWS-S3, PIG, CommonCrawl
...and just 100 €
Since 2011 in Munich
Work at OnPage.org
Interested in Webcrawling and BigData Frameworks
Build low cost scalable BigData solutions
About Me
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
Do you want to build your own Search-
Engine?
- High Hardware / Cloud Costs
- Nutch needs ~ 1 Hour for 1 million URLs
- You want to crawl > 1 Billion URLs
Solution ?
Don‘t Crawl!
- Use Common-Crawl : https://commoncrawl.org
- Non-Profit-Organisation
- ~Monthly over 2 Billions Crawled URLs
- Over 1.000 TB total since 2009
- URL seeding list from Blekko: https://blekko.com
Don‘t Crawl! – Use Common Crawl!
- Scalably stored on Amazon AWS S3
- Hadoop compatible format powered by Archive.org (Wayback Machine)
- Partitionable with S3 Object Prefix possibility
- 100MB-1GB file Sizes (gzip) -> Hadoop size
Nice Data Format
Store the raw crawl data.
Format 1:
WARC
Store only the
Meta-Information
as JSON
Format 2:
WAT
Store only the
Plain Text Content
Format 3:
WET
Choose the right format
- WARC (Raw HTML): 1.000 MB
- WAT (Meta data as JSON) : 450 MB
- WET (Plain Text): 150 MB
Processing
- Pure Hadoop with MapReduce
- Input Classes: http://commoncrawl.org/the-data/get-started/
Processing
- High Level ETL-Layer like PIG: http://pig.apache.org
- Example Stuff :
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
PIG Example
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";
pages = LOAD '$INPUT_PATH'
USING FileLoaderClass
AS (url, html);
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('t');
Hadoop & PIG on AWS
- Support new Hadoop releases
- PIG Integration
- Replace HDFS with S3
- Easy UI to start quickly
- Pay per Hour to scale as much as posible
It‘s Demo Time!
Let's cross fingers now
That‘s it!
Customer:
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
And: We are hiring!
https://de.onpage.org/about/jobs/

Más contenido relacionado

Was ist angesagt?

Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Mirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceMirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceBram Luyten
 
Evolution of digital libraries
Evolution of digital librariesEvolution of digital libraries
Evolution of digital librariesMahesh V M
 
الفهرسة الاجتماعية والفهارس الهوائية
الفهرسة الاجتماعية والفهارس الهوائيةالفهرسة الاجتماعية والفهارس الهوائية
الفهرسة الاجتماعية والفهارس الهوائيةProf. Sherif Shaheen
 
Recursos electronicos
Recursos electronicosRecursos electronicos
Recursos electronicoslinita1818
 
IOT, Real World Things, & Linked data
IOT, Real World Things, & Linked dataIOT, Real World Things, & Linked data
IOT, Real World Things, & Linked datarobin fay
 
Digital rights management an essential feature in the digital era
Digital rights management an essential feature in the digital eraDigital rights management an essential feature in the digital era
Digital rights management an essential feature in the digital eraKishor Satpathy
 
El libro en la edad media
El libro en la edad mediaEl libro en la edad media
El libro en la edad mediasocazu
 
Tutarlı bir katalog için otorite kontrol
 Tutarlı bir katalog için otorite kontrol Tutarlı bir katalog için otorite kontrol
Tutarlı bir katalog için otorite kontrolEmine Gür
 

Was ist angesagt? (14)

Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Campos de control. Pilar Tejero López
Campos de control. Pilar Tejero LópezCampos de control. Pilar Tejero López
Campos de control. Pilar Tejero López
 
Mirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceMirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpace
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Evolution of digital libraries
Evolution of digital librariesEvolution of digital libraries
Evolution of digital libraries
 
الفهرسة الاجتماعية والفهارس الهوائية
الفهرسة الاجتماعية والفهارس الهوائيةالفهرسة الاجتماعية والفهارس الهوائية
الفهرسة الاجتماعية والفهارس الهوائية
 
Library ad Information Science Education in Germany
Library ad Information Science Education in GermanyLibrary ad Information Science Education in Germany
Library ad Information Science Education in Germany
 
Recursos electronicos
Recursos electronicosRecursos electronicos
Recursos electronicos
 
Web 3.0 Intro
Web 3.0 IntroWeb 3.0 Intro
Web 3.0 Intro
 
IOT, Real World Things, & Linked data
IOT, Real World Things, & Linked dataIOT, Real World Things, & Linked data
IOT, Real World Things, & Linked data
 
MeSH
MeSHMeSH
MeSH
 
Digital rights management an essential feature in the digital era
Digital rights management an essential feature in the digital eraDigital rights management an essential feature in the digital era
Digital rights management an essential feature in the digital era
 
El libro en la edad media
El libro en la edad mediaEl libro en la edad media
El libro en la edad media
 
Tutarlı bir katalog için otorite kontrol
 Tutarlı bir katalog için otorite kontrol Tutarlı bir katalog için otorite kontrol
Tutarlı bir katalog için otorite kontrol
 

Ähnlich wie Crawl the entire web in 10 minutes...and just 100€

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - finalMongoDB
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value storesBhasker Kode
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyLITTINRAJAN
 
Web Development in Perl
Web Development in PerlWeb Development in Perl
Web Development in PerlNaveen Gupta
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloudkidrane
 
Insight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionInsight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionTreasure Data, Inc.
 
Nosql-columbia-feb2011
Nosql-columbia-feb2011Nosql-columbia-feb2011
Nosql-columbia-feb2011siculars
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Germany
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Aad Versteden
 
Reducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGReducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGMaarten Balliauw
 
StartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStart Pad
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big DataAmazon Web Services
 
Terraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloTerraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloAnton Babenko
 

Ähnlich wie Crawl the entire web in 10 minutes...and just 100€ (20)

AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Mongo db and hadoop driving business insights - final
Mongo db and hadoop   driving business insights - finalMongo db and hadoop   driving business insights - final
Mongo db and hadoop driving business insights - final
 
thinking in key value stores
thinking in key value storesthinking in key value stores
thinking in key value stores
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
 
Web Development in Perl
Web Development in PerlWeb Development in Perl
Web Development in Perl
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloud
 
Insight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestionInsight Data Engineering: Open source data ingestion
Insight Data Engineering: Open source data ingestion
 
Open source data ingestion
Open source data ingestionOpen source data ingestion
Open source data ingestion
 
Scaling PHP apps
Scaling PHP appsScaling PHP apps
Scaling PHP apps
 
Nosql-columbia-feb2011
Nosql-columbia-feb2011Nosql-columbia-feb2011
Nosql-columbia-feb2011
 
JahiaOne - Semantic Web with Jahia
JahiaOne - Semantic Web with JahiaJahiaOne - Semantic Web with Jahia
JahiaOne - Semantic Web with Jahia
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
AWS Pop-up Loft Berlin: Cache is King - Running Lean Architectures: Optimizin...
 
Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016Semantic technologies in practice - KULeuven 2016
Semantic technologies in practice - KULeuven 2016
 
Reducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAGReducing latency on the web with the Azure CDN - DevSum - SWAG
Reducing latency on the web with the Azure CDN - DevSum - SWAG
 
StartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and YouStartPad Countdown 8 - Amazon Web Services and You
StartPad Countdown 8 - Amazon Web Services and You
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
Terraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group OsloTerraform Q&A - HashiCorp User Group Oslo
Terraform Q&A - HashiCorp User Group Oslo
 

Último

The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0DanBrown980551
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2DianaGray10
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024Brian Pichman
 

Último (20)

The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024CyberSecurity - Computers In Libraries 2024
CyberSecurity - Computers In Libraries 2024
 

Crawl the entire web in 10 minutes...and just 100€

  • 1. Crawl the entire web in 10 minutes... Copyright ©: 2015 OnPage.org GmbH Using AWS-EMR, AWS-S3, PIG, CommonCrawl ...and just 100 €
  • 2. Since 2011 in Munich Work at OnPage.org Interested in Webcrawling and BigData Frameworks Build low cost scalable BigData solutions About Me Twitter: @danny_munich Facebook: https://www.facebook.com/danny.linden2 E-mail: danny@onpage.org
  • 3. Do you want to build your own Search- Engine? - High Hardware / Cloud Costs - Nutch needs ~ 1 Hour for 1 million URLs - You want to crawl > 1 Billion URLs
  • 5. Don‘t Crawl! - Use Common-Crawl : https://commoncrawl.org - Non-Profit-Organisation - ~Monthly over 2 Billions Crawled URLs - Over 1.000 TB total since 2009 - URL seeding list from Blekko: https://blekko.com
  • 6. Don‘t Crawl! – Use Common Crawl! - Scalably stored on Amazon AWS S3 - Hadoop compatible format powered by Archive.org (Wayback Machine) - Partitionable with S3 Object Prefix possibility - 100MB-1GB file Sizes (gzip) -> Hadoop size
  • 8. Store the raw crawl data. Format 1: WARC
  • 10. Store only the Plain Text Content Format 3: WET
  • 11. Choose the right format - WARC (Raw HTML): 1.000 MB - WAT (Meta data as JSON) : 450 MB - WET (Plain Text): 150 MB
  • 12. Processing - Pure Hadoop with MapReduce - Input Classes: http://commoncrawl.org/the-data/get-started/
  • 13. Processing - High Level ETL-Layer like PIG: http://pig.apache.org - Example Stuff : - https://github.com/norvigaward/warcexamples - https://github.com/mortardata/mortar-examples - https://github.com/matpalm/common-crawl
  • 14. PIG Example REGISTER file:/home/hadoop/lib/pig/piggybank.jar DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader(); %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz"; -- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/"; %default OUTPUT_PATH "s3://example-bucket/out"; pages = LOAD '$INPUT_PATH' USING FileLoaderClass AS (url, html); meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title; filtered = FILTER meta_titles BY meta_title IS NOT NULL; STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('t');
  • 15. Hadoop & PIG on AWS - Support new Hadoop releases - PIG Integration - Replace HDFS with S3 - Easy UI to start quickly - Pay per Hour to scale as much as posible
  • 16. It‘s Demo Time! Let's cross fingers now
  • 17. That‘s it! Customer: Twitter: @danny_munich Facebook: https://www.facebook.com/danny.linden2 E-mail: danny@onpage.org And: We are hiring! https://de.onpage.org/about/jobs/

Hinweis der Redaktion

  1. Screenshot austauschen + shclecht lesbar