Dive into the world of big data as we discuss how open, public datasets can be harnessed using the AWS cloud. Drawing on large data collections such as the 1000 Genomes Project and the Common Crawl, this session shows how you can process billions of web pages and trillions of genes to find new insights into society.
BDT204 Awesome Applications of Open Data - AWS re:Invent 2012
18. Lucky Oyster
dive deep - discover pearls
$100 Worth of Priceless
Leveraging Common Crawl and Spot Instances to Data Mine The Web
Lisa Green, Common Crawl
Matthew Berk, Lucky Oyster
19. Common Crawl Data
• ~8 billion web pages
• ~120 TB of data
• Crawls from 2008–2012
• ARC files, JSON metadata, and text files
• Available to anyone through Amazon’s Public Data Sets program (see the sketch below)
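Because the corpus sits in a public S3 bucket, any HTTP client can stream it. Below is a minimal Ruby sketch (Ruby being the language the talk used) that streams one ARC file and prints each record's header line; the bucket layout and file key are placeholders for illustration, not real paths, and it assumes the file decompresses as a single gzip stream.

    require "open-uri"
    require "zlib"

    # Placeholder key: consult the Common Crawl docs for real segment paths.
    arc_url = "https://aws-publicdatasets.s3.amazonaws.com/" \
              "common-crawl/parse-output/segment/EXAMPLE/sample.arc.gz"

    URI.open(arc_url) do |remote|
      gz = Zlib::GzipReader.new(remote)
      # Each ARC record starts with a one-line header:
      #   <url> <ip-address> <archive-date> <content-type> <length>
      # followed by <length> bytes of payload and a blank separator line.
      # (ARC files packed as one gzip member per record would additionally
      # need Zlib::GzipReader#unused handling.)
      while (header = gz.gets)
        next if header.strip.empty?
        url, _ip, _date, _ctype, length = header.split(" ")
        puts "#{url}  #{length} bytes"
        gz.read(length.to_i) # skip the payload
      end
    end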
20. What Does $100 Buy You?
• 2 nosebleed seats at an NFL game
• 1/10 the cost of an entry-level Dell PowerEdge
• 80 minutes of a mid-level engineer’s time
• Omakase for 1 at Shiro’s Sushi in Seattle
or…
21. $100 + 14 hours + 300 lines of Ruby =
3.4 billion Web pages processed, data mined, and indexed for search and research.
Even a few years ago, this would have been unthinkable.
22. The Experiment
• Process the most recent (2012) Web crawl from Common Crawl
• Determine the extent and nature of hardcoded references to Facebook
• Extract structured metadata (Open Graph and Schema.org), as sketched below
• Store, analyze, and index entity metadata and link structure
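To make the extraction step concrete, here is a sketch of the per-page analysis: pull Open Graph meta tags and collect hardcoded Facebook references. The slides don't name an HTML parser; Nokogiri is assumed here, and the sample page is invented.

    require "nokogiri"

    def extract_page_signals(html)
      doc = Nokogiri::HTML(html)

      # Open Graph metadata: <meta property="og:..." content="...">
      og = doc.css('meta[property^="og:"]').each_with_object({}) do |m, h|
        h[m["property"]] = m["content"]
      end

      # Hardcoded Facebook references in links, scripts, iframes, etc.
      fb_refs = doc.css("*[href], *[src]")
                   .map { |node| node["href"] || node["src"] }
                   .compact
                   .grep(%r{facebook\.(com|net)})

      { open_graph: og, facebook_refs: fb_refs }
    end

    html = <<~HTML
      <html><head>
        <meta property="og:type" content="movie">
        <meta property="og:title" content="Example Film">
        <script src="https://connect.facebook.net/en_US/all.js"></script>
        <a href="https://www.facebook.com/example">Like us</a>
      </head><body></body></html>
    HTML

    p extract_page_signals(html)
    # => {:open_graph=>{"og:type"=>"movie", "og:title"=>"Example Film"},
    #     :facebook_refs=>["https://connect.facebook.net/en_US/all.js",
    #                      "https://www.facebook.com/example"]}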
23. Components
• AWS Spot Instances
– Peak of ~200 nodes
– ~5,000 hours of compute time
– Average cost of $0.02 per hour (5,000 hours × $0.02 ≈ $100)
• Custom Ruby code for extraction and analysis
• Beanstalkd, Apache httpd, Sinatra (queue sketch below)
• Some sysadmin elbow grease
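Beanstalkd is the glue between the master and the workers. A minimal sketch of the master side filling the queue with ARC file paths, assuming the beaneater client gem (the slides don't name a client library) and a placeholder tube name and paths:

    require "beaneater"

    beanstalk = Beaneater.new("127.0.0.1:11300")
    tube = beanstalk.tubes["arc-paths"] # hypothetical tube name

    # One job per ARC file; the paths are placeholders.
    [
      "common-crawl/parse-output/segment/EXAMPLE/0001.arc.gz",
      "common-crawl/parse-output/segment/EXAMPLE/0002.arc.gz",
    ].each { |path| tube.put(path) }

    beanstalk.close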
24. Architecture
• Master instance (m2.4xlarge)
– Queue for Common Crawl S3 paths
– Data collection and node control service
– Indexers and Solr instances
• Worker nodes (c1.medium)
– Spot instances with worker AMI
– Consume S3 paths; decompress and stream ARC files
– Extract and analyze
• Goals were simplicity, interruption tolerance, and high throughput (worker loop sketched below)
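Putting the pieces together, a worker's main loop under this architecture might look like the sketch below: reserve a path from the queue, stream and decompress the ARC file, analyze it, and only then delete the job. The host name, tube name, and the simple page counter are illustrative; the real analysis is the Open Graph/Facebook extraction sketched earlier.

    require "beaneater"
    require "open-uri"
    require "zlib"

    # Stand-in for the real per-record analysis.
    def process_arc_stream(gz)
      pages = 0
      while (header = gz.gets)
        next if header.strip.empty?
        _url, _ip, _date, _ctype, length = header.split(" ")
        gz.read(length.to_i)
        pages += 1
      end
      pages
    end

    beanstalk = Beaneater.new("master.internal:11300") # hypothetical host
    tube = beanstalk.tubes["arc-paths"]

    loop do
      job = tube.reserve # blocks until the master enqueues a path
      URI.open("https://aws-publicdatasets.s3.amazonaws.com/#{job.body}") do |io|
        puts "#{job.body}: #{process_arc_stream(Zlib::GzipReader.new(io))} pages"
      end
      job.delete # acknowledge only after the whole file is processed
    end

Because a job is deleted only after its file is fully processed, a reclaimed spot node costs at most one in-flight ARC file: beanstalkd returns the un-deleted job to the queue for another worker, which is the interruption tolerance noted above.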
25. Findings / Output
• Lucky Oyster Study (see appendix or http://blog.luckyoyster.com)
• Utility computing = major cost savings
• Reusable framework for low-complexity, Web-scale crawl processing
• Indexes of 400+ million structured entities for R&D (see the Solr sketch below)
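For the entity indexes, the slides place Solr instances on the master node. As an illustration of the indexing step, here is a sketch using the rsolr gem (an assumption; the slides don't name a Solr client), with invented field names and a hypothetical "entities" core:

    require "rsolr"

    solr = RSolr.connect(url: "http://localhost:8983/solr/entities")

    # One extracted entity, shaped from its Open Graph tags; the field
    # names are illustrative, not the actual Lucky Oyster schema.
    solr.add(
      id:     "http://example.com/film",
      type:   "movie",
      title:  "Example Film",
      source: "common-crawl-2012"
    )
    solr.commit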
28. The Lucky Oyster Study
• Based on 3.4 billion URLs from Common Crawl
• 22% of pages reference Facebook directly
• 8% of pages implement Open Graph tags
• Top Open Graph types: hotels, movies, activities, songs, games, books
• A study of the shift in the locus (away from the open Web) and nature (toward entities) of content
29. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.