Dive into the world of big data as we discuss how open, public datasets can be harnessed using the AWS cloud. Drawing on large data collections such as the 1000 Genomes Project and the Common Crawl, this session shows how you can process billions of web pages and trillions of genes to find new insights into society.
BDT204 Awesome Applications of Open Data - AWS re:Invent 2012
18. Lucky Oyster
dive deep - discover pearls
$100 Worth of Priceless
Leveraging Common Crawl and Spot Instances to Data Mine The Web
Lisa Green, Common Crawl
Matthew Berk, Lucky Oyster
19. Common Crawl Data
• ~8 billion web pages
• ~120 TB of data
• Crawls from 2008–2012
• ARC files, JSON metadata, and text files
• Available to anyone through Amazon’s Public Data Sets program (see the sketch below)
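Because the corpus sits in a public S3 bucket, any HTTP client can stream it. Below is a minimal Ruby sketch (Ruby being the language the talk used) that streams one ARC file and prints each record's header line; the bucket layout and file key are placeholders for illustration, not real paths, and it assumes the file decompresses as a single gzip stream.

    require "open-uri"
    require "zlib"

    # Placeholder key: consult the Common Crawl docs for real segment paths.
    arc_url = "https://aws-publicdatasets.s3.amazonaws.com/" \
              "common-crawl/parse-output/segment/EXAMPLE/sample.arc.gz"

    URI.open(arc_url) do |remote|
      gz = Zlib::GzipReader.new(remote)
      # Each ARC record starts with a one-line header:
      #   <url> <ip-address> <archive-date> <content-type> <length>
      # followed by <length> bytes of payload and a blank separator line.
      # (ARC files packed as one gzip member per record would additionally
      # need Zlib::GzipReader#unused handling.)
      while (header = gz.gets)
        next if header.strip.empty?
        url, _ip, _date, _ctype, length = header.split(" ")
        puts "#{url}  #{length} bytes"
        gz.read(length.to_i) # skip the payload
      end
    end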
20. What Does $100 Buy You?
• 2 nosebleed seats at an NFL game
• 1/10 the cost of an entry-level Dell PowerEdge
• 80 minutes of a mid-level engineer’s time
• Omakase for 1 at Shiro’s Sushi in Seattle
or…
21. $100 + 14 hours + 300 lines of Ruby =
3.4 billion Web pages processed, data mined, and indexed for search and research.
Even a few years ago, this would have been unthinkable.
22. The Experiment
• Process the most recent (2012) Web crawl from Common Crawl
• Determine the extent and nature of hardcoded references to Facebook
• Extract structured metadata (Open Graph and Schema.org), as sketched below
• Store, analyze, and index entity metadata and link structure
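To make the extraction step concrete, here is a sketch of the per-page analysis: pull Open Graph meta tags and collect hardcoded Facebook references. The slides don't name an HTML parser; Nokogiri is assumed here, and the sample page is invented.

    require "nokogiri"

    def extract_page_signals(html)
      doc = Nokogiri::HTML(html)

      # Open Graph metadata: <meta property="og:..." content="...">
      og = doc.css('meta[property^="og:"]').each_with_object({}) do |m, h|
        h[m["property"]] = m["content"]
      end

      # Hardcoded Facebook references in links, scripts, iframes, etc.
      fb_refs = doc.css("*[href], *[src]")
                   .map { |node| node["href"] || node["src"] }
                   .compact
                   .grep(%r{facebook\.(com|net)})

      { open_graph: og, facebook_refs: fb_refs }
    end

    html = <<~HTML
      <html><head>
        <meta property="og:type" content="movie">
        <meta property="og:title" content="Example Film">
        <script src="https://connect.facebook.net/en_US/all.js"></script>
        <a href="https://www.facebook.com/example">Like us</a>
      </head><body></body></html>
    HTML

    p extract_page_signals(html)
    # => {:open_graph=>{"og:type"=>"movie", "og:title"=>"Example Film"},
    #     :facebook_refs=>["https://connect.facebook.net/en_US/all.js",
    #                      "https://www.facebook.com/example"]}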
23. Components
• AWS Spot Instances
– Peak of ~200 nodes
– ~5,000 hours of compute time
– Average cost of $0.02 per hour (5,000 hours × $0.02 ≈ $100)
• Custom Ruby code for extraction and analysis
• Beanstalkd, Apache httpd, Sinatra (queue sketch below)
• Some sysadmin elbow grease
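Beanstalkd is the glue between the master and the workers. A minimal sketch of the master side filling the queue with ARC file paths, assuming the beaneater client gem (the slides don't name a client library) and a placeholder tube name and paths:

    require "beaneater"

    beanstalk = Beaneater.new("127.0.0.1:11300")
    tube = beanstalk.tubes["arc-paths"] # hypothetical tube name

    # One job per ARC file; the paths are placeholders.
    [
      "common-crawl/parse-output/segment/EXAMPLE/0001.arc.gz",
      "common-crawl/parse-output/segment/EXAMPLE/0002.arc.gz",
    ].each { |path| tube.put(path) }

    beanstalk.close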
24. Architecture
• Master instance (m2.4xlarge)
– Queue for Common Crawl S3 paths
– Data collection and node control service
– Indexers and Solr instances
• Worker nodes (c1.medium)
– Spot instances with worker AMI
– Consume S3 paths; decompress and stream ARC files
– Extract and analyze
• Goals were simplicity, interruption tolerance, and high throughput (worker loop sketched below)
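Putting the pieces together, a worker's main loop under this architecture might look like the sketch below: reserve a path from the queue, stream and decompress the ARC file, analyze it, and only then delete the job. The host name, tube name, and the simple page counter are illustrative; the real analysis is the Open Graph/Facebook extraction sketched earlier.

    require "beaneater"
    require "open-uri"
    require "zlib"

    # Stand-in for the real per-record analysis.
    def process_arc_stream(gz)
      pages = 0
      while (header = gz.gets)
        next if header.strip.empty?
        _url, _ip, _date, _ctype, length = header.split(" ")
        gz.read(length.to_i)
        pages += 1
      end
      pages
    end

    beanstalk = Beaneater.new("master.internal:11300") # hypothetical host
    tube = beanstalk.tubes["arc-paths"]

    loop do
      job = tube.reserve # blocks until the master enqueues a path
      URI.open("https://aws-publicdatasets.s3.amazonaws.com/#{job.body}") do |io|
        puts "#{job.body}: #{process_arc_stream(Zlib::GzipReader.new(io))} pages"
      end
      job.delete # acknowledge only after the whole file is processed
    end

Because a job is deleted only after its file is fully processed, a reclaimed spot node costs at most one in-flight ARC file: beanstalkd returns the un-deleted job to the queue for another worker, which is the interruption tolerance noted above.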
25. Findings / Output
• Lucky Oyster Study (see appendix or http://blog.luckyoyster.com)
• Utility computing = major cost savings
• Reusable framework for low-complexity, Web-scale crawl processing
• Indexes of 400+ million structured entities for R&D (see the Solr sketch below)
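For the entity indexes, the slides place Solr instances on the master node. As an illustration of the indexing step, here is a sketch using the rsolr gem (an assumption; the slides don't name a Solr client), with invented field names and a hypothetical "entities" core:

    require "rsolr"

    solr = RSolr.connect(url: "http://localhost:8983/solr/entities")

    # One extracted entity, shaped from its Open Graph tags; the field
    # names are illustrative, not the actual Lucky Oyster schema.
    solr.add(
      id:     "http://example.com/film",
      type:   "movie",
      title:  "Example Film",
      source: "common-crawl-2012"
    )
    solr.commit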
28. The Lucky Oyster Study
• Based on 3.4 billion URLs from Common Crawl
• 22% of pages reference Facebook directly
• 8% of pages implement Open Graph tags
• Top Open Graph types: hotels, movies, activities, songs, games, books
• A study of the shift in the locus (away from the open Web) and nature (toward entities) of content
29. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.