2. About Me
- In Munich since 2011
- Working at OnPage.org
- Interested in web crawling and Big Data frameworks
- Building low-cost, scalable Big Data solutions
- Twitter: @danny_munich
- Facebook: https://www.facebook.com/danny.linden2
- E-mail: danny@onpage.org
3. Do you want to build your own Search Engine?
- High hardware / cloud costs
- Nutch needs ~1 hour for 1 million URLs
- You want to crawl > 1 billion URLs
5. Don't Crawl!
- Use Common Crawl: https://commoncrawl.org
- A non-profit organisation
- Over 2 billion URLs crawled roughly every month
- Over 1,000 TB in total since 2009
- URL seed list from Blekko: https://blekko.com
6. Don't Crawl! Use Common Crawl!
- Stored scalably on Amazon AWS S3
- Hadoop-compatible format powered by Archive.org (Wayback Machine)
- Partitionable via S3 object prefixes
- File sizes of 100 MB to 1 GB (gzip), a good fit for Hadoop splits
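As a rough sketch of the prefix-partitioning idea: because the crawl is laid out as date-structured S3 keys, you can select one time slice of the corpus just by listing with a prefix. The bucket and path layout below follow the crawl-002 paths used later in the Pig example; boto3 and AWS credentials are assumptions, and only the prefix-building helper is pure Python.

```python
# Sketch: selecting one partition of the Common Crawl corpus by S3 object
# prefix. Bucket/prefix layout follows the crawl-002 paths from the Pig
# example later in the deck; boto3 is an assumed dependency.

def crawl_prefix(year, month, day=None):
    """Build an S3 key prefix that narrows the crawl to one time slice."""
    prefix = "common-crawl/crawl-002/%s/%s/" % (year, month)
    if day is not None:
        prefix += "%s/" % day
    return prefix

def list_crawl_files(bucket="aws-publicdatasets", prefix=""):
    """List matching .arc.gz objects (needs network access and credentials)."""
    import boto3  # assumption: boto3 installed, AWS credentials configured
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".arc.gz"):
                yield obj["Key"]

print(crawl_prefix("2010", "09"))        # common-crawl/crawl-002/2010/09/
print(crawl_prefix("2010", "09", "25"))  # common-crawl/crawl-002/2010/09/25/
```

The narrower the prefix, the smaller the slice of the corpus a job has to touch.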
11. Choose the right format
- WARC (raw HTML): 1,000 MB
- WAT (metadata as JSON): 450 MB
- WET (plain text): 150 MB
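To make the size difference concrete, here is a minimal sketch of what a WET record looks like and how a (decompressed) WET file splits into records. The two-record sample is synthetic and shows only a few headers; real WET records carry more WARC headers, but the overall shape is the same.

```python
# Sketch: splitting a decompressed WET payload into (url, plain_text) pairs.
# SAMPLE_WET is a synthetic two-record example, not real crawl data.

SAMPLE_WET = """WARC/1.0\r
WARC-Type: conversion\r
WARC-Target-URI: http://example.org/\r
Content-Length: 13\r
\r
Example text.
\r
\r
WARC/1.0\r
WARC-Type: conversion\r
WARC-Target-URI: http://example.org/about\r
Content-Length: 11\r
\r
About page.
\r
\r
"""

def parse_wet(text):
    """Yield (url, plain_text) pairs from a decompressed WET payload."""
    for raw in text.split("WARC/1.0\r\n"):
        if not raw.strip():
            continue
        headers, _, body = raw.partition("\r\n\r\n")
        url = None
        for line in headers.split("\r\n"):
            if line.startswith("WARC-Target-URI:"):
                url = line.split(":", 1)[1].strip()
        if url:
            yield url, body.strip()

for url, text in parse_wet(SAMPLE_WET):
    print(url, "->", text)
```

If you only need the visible text of each page, parsing WET like this processes roughly a seventh of the bytes that the raw WARC files would cost.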
12. Processing
- Pure Hadoop with MapReduce
- Input Classes: http://commoncrawl.org/the-data/get-started/
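The MapReduce shape of such a job can be sketched as plain Hadoop-Streaming-style mapper/reducer functions. The tab-separated `url <tab> html` input lines are a hypothetical stand-in; the input classes linked above are what actually feed WARC/WAT/WET records into Hadoop.

```python
# Sketch: counting crawled URLs per host in MapReduce style.
# Input format (url \t html) is hypothetical for illustration.

from urllib.parse import urlparse
from collections import defaultdict

def mapper(line):
    """Map step: emit (host, 1) for every crawled URL."""
    url, _, _html = line.partition("\t")
    yield urlparse(url).netloc, 1

def reducer(pairs):
    """Reduce step: sum counts per host (Hadoop sorts/shuffles in between)."""
    counts = defaultdict(int)
    for host, n in pairs:
        counts[host] += n
    return dict(counts)

lines = [
    "http://example.org/\t<html>...</html>",
    "http://example.org/about\t<html>...</html>",
    "http://commoncrawl.org/\t<html>...</html>",
]
pairs = [kv for line in lines for kv in mapper(line)]
print(reducer(pairs))  # {'example.org': 2, 'commoncrawl.org': 1}
```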
13. Processing
- A high-level ETL layer like Pig: http://pig.apache.org
- Example projects:
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
14. PIG Example
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";
pages = LOAD '$INPUT_PATH'
    USING FileLoaderClass
    AS (url, html);
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
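The core of the Pig script, `REGEX_EXTRACT(html, '<title>(.*)</title>', 1)` plus the `IS NOT NULL` filter, can be mirrored in plain Python to see exactly what it computes per page:

```python
# Sketch: the Pig title extraction and NULL filter, mirrored in Python.
# The sample pages are made up for illustration.

import re

TITLE_RE = re.compile(r"<title>(.*)</title>")

def meta_title(html):
    """Return the first <title> content, or None (Pig's NULL)."""
    m = TITLE_RE.search(html)
    return m.group(1) if m else None

pages = [
    ("http://example.org/", "<html><head><title>Example</title></head></html>"),
    ("http://example.org/notitle", "<html><head></head></html>"),
]
filtered = [(url, meta_title(html)) for url, html in pages
            if meta_title(html) is not None]
print(filtered)  # [('http://example.org/', 'Example')]
```

Note the greedy `(.*)`: like the Pig version, it grabs everything up to the last `</title>` on the page, which is usually fine since pages have one title.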
15. Hadoop & PIG on AWS
- Supports recent Hadoop releases
- Pig integration
- Replaces HDFS with S3
- Easy UI to get started quickly
- Pay per hour, scale out as much as possible