Crawl the entire web in 10 minutes...
...and just 100 €
Using AWS-EMR, AWS-S3, PIG, CommonCrawl
Copyright ©: 2015 OnPage.org GmbH
About Me
Since 2011 in Munich
Work at OnPage.org
Interested in web crawling and big data frameworks
Build low-cost, scalable big data solutions
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
Do you want to build your own search engine?
- High hardware / cloud costs
- Nutch needs ~1 hour for 1 million URLs
- You want to crawl > 1 billion URLs
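Put together, these two figures show why crawling yourself gets expensive. A rough back-of-the-envelope sketch, assuming the Nutch rate stays constant and nothing runs in parallel (both are simplifying assumptions, not measurements):

```python
# Back-of-the-envelope estimate using the figures from the slide.
# Assumption: the rate of ~1 million URLs per hour stays constant.
URLS_PER_HOUR = 1_000_000
TARGET_URLS = 1_000_000_000


def crawl_hours(target_urls: int, urls_per_hour: int) -> float:
    """Hours needed to fetch target_urls at a constant rate."""
    return target_urls / urls_per_hour


hours = crawl_hours(TARGET_URLS, URLS_PER_HOUR)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")
```

At that rate a single crawler would need roughly six weeks for one billion URLs, before any processing starts.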
Solution?
Don't crawl!
- Use Common Crawl: https://commoncrawl.org
- Non-profit organisation
- More than 2 billion URLs crawled roughly every month
- Over 1,000 TB in total since 2009
- URL seeding list from Blekko: https://blekko.com
Don't crawl! – Use Common Crawl!
- Scalably stored on Amazon AWS S3
- Hadoop-compatible format powered by Archive.org (Wayback Machine)
- Can be partitioned by S3 object prefix
- File sizes of 100 MB to 1 GB (gzip) -> a good fit for Hadoop
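Partitioning by object prefix means a job can pick a slice of the crawl just by choosing how much of the key path to match. The sketch below illustrates the idea on an in-memory list. The keys are hypothetical, modelled on the path layout in the Pig example, and `group_by_prefix` is an illustrative helper, not part of any AWS library:

```python
from collections import defaultdict


def group_by_prefix(keys, depth):
    """Group object keys by their first `depth` path components."""
    groups = defaultdict(list)
    for key in keys:
        parts = key.split("/")
        prefix = "/".join(parts[:depth]) + "/"
        groups[prefix].append(key)
    return dict(groups)


# Hypothetical keys; real file names in the bucket will differ.
keys = [
    "common-crawl/crawl-002/2010/09/25/0/a.arc.gz",
    "common-crawl/crawl-002/2010/09/25/1/b.arc.gz",
    "common-crawl/crawl-002/2010/09/26/0/c.arc.gz",
]

# Depth 5 groups by day: .../2010/09/25/ and .../2010/09/26/
for prefix, members in sorted(group_by_prefix(keys, 5).items()):
    print(prefix, len(members))
```

A shorter prefix selects a whole month, a longer one a single segment, so input size can be tuned without touching the job itself.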
Nice Data Format
- Format 1: WARC. Store the raw crawl data.
- Format 2: WAT. Store only the meta-information as JSON.
- Format 3: WET. Store only the plain text content.
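All three formats share the WARC record framing: a version line, named header fields, a blank line, then a payload whose size is given by `Content-Length`. The following is a minimal sketch of reading one uncompressed record from memory, assuming a well-formed record. The sample record is made up for illustration, and a real job would use an existing WARC reader rather than this helper:

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, body)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Made-up sample record in the style of a WET (plain text) entry.
body = b"Hello Common Crawl"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n"
    + body + b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(sample)
print(version, headers["WARC-Target-URI"], payload.decode("utf-8"))
```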
Choose the right format
- WARC (raw HTML): 1,000 MB
- WAT (metadata as JSON): 450 MB
- WET (plain text): 150 MB
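The sizes above translate directly into how much data a job has to read. A small sketch of the ratios, using only the three figures from the slide:

```python
# Sizes in MB as given on the slide.
SIZES_MB = {"WARC": 1000, "WAT": 450, "WET": 150}


def relative_size(fmt: str, baseline: str = "WARC") -> float:
    """Size of `fmt` as a fraction of the baseline format."""
    return SIZES_MB[fmt] / SIZES_MB[baseline]


for fmt in SIZES_MB:
    print(f"{fmt}: {relative_size(fmt):.0%} of WARC")
```

If plain text is enough for the task, WET cuts the input to about 15% of the raw WARC volume.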
Processing
- Pure Hadoop with MapReduce
- Input Classes: http://commoncrawl.org/the-data/get-started/
Processing
- High-level ETL layer like PIG: http://pig.apache.org
- Examples:
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
PIG Example
-- Load ARC files from Common Crawl, extract each page title, store (url, title) pairs.
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
-- Small sample of one segment; the commented-out path covers the whole month.
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";
pages = LOAD '$INPUT_PATH'
USING FileLoaderClass
AS (url, html);
-- REGEX_EXTRACT returns null when the pattern does not match.
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;
-- Write tab-separated output.
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
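For readers who do not know Pig, the same three steps (extract the title, drop pages without one, write tab-separated lines) can be sketched in plain Python. This runs on a hand-made list of `(url, html)` pairs rather than on Common Crawl data, and `extract_titles` is an illustrative name, not part of any library:

```python
import re

# Same pattern as in the Pig script.
TITLE_RE = re.compile(r"<title>(.*)</title>")


def extract_titles(pages):
    """Yield (url, title) for each page whose HTML contains a <title>."""
    for url, html in pages:
        match = TITLE_RE.search(html)
        if match is not None:
            yield url, match.group(1)


# Made-up sample input standing in for the loaded ARC records.
pages = [
    ("http://example.com/", "<html><head><title>Example</title></head></html>"),
    ("http://example.org/", "<html><body>No title here</body></html>"),
]

for url, title in extract_titles(pages):
    print(f"{url}\t{title}")
```

The Pig version expresses the same logic declaratively, which lets Hadoop spread the work across many input files.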
Hadoop & PIG on AWS
- Supports new Hadoop releases
- PIG integration
- Replaces HDFS with S3
- Easy UI to get started quickly
- Pay per hour and scale as much as possible
It's demo time!
Let's cross our fingers now
That's it!
Contact:
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: danny@onpage.org
And: We are hiring!
https://de.onpage.org/about/jobs/
