2. MY BACKGROUND
• Based in Seattle, WA
• Education:
– BS in Computer Science, The American University, 1985
– Graduate student in Digital Media, University of Washington, 2010
• Background:
– Microsoft Visual Studio team
– Consulting to startups and VCs
– Amazon employee since 2002
• Evangelist:
– Speak
– Write
– Tweet
• Author, “Host Your Web Site in the Cloud”
• Email: jbarr@amazon.com
• Twitter: @jeffbarr
3. AGENDA
• What is Big Data
• Elastic MapReduce Overview
• Example Use Cases
• Ecosystem and Tools
• Upcoming Features
• Discussion
4. WHAT IS BIG DATA?
• Doesn’t refer just to volume
– You can benefit from Big Data infrastructure
without having a ton of data
– Many existing technologies have little
problem physically handling large volumes
• Challenges result from the
combination of data volume, data
structure, and usage demands on
that data, usually tied to timeliness
• Big Data Tools are needed to provide
a holistic view of enterprise data and
systematically harness it for insights
and trends
5. WHAT IS AMAZON ELASTIC MAPREDUCE
• Enables customers to easily, securely and
cost-effectively process vast amounts of
data:
– Spin up hundreds of instances
– Process hundreds of terabytes of data
• Hosted Hadoop framework running on
Amazon’s web-scale infrastructure
6. • Launch and monitor job flows via:
– AWS Management Console
– Command line interface
– REST API
7. WHY USE AMAZON ELASTIC MAPREDUCE
• Elastic MapReduce removes “MUCK”
from Big Data processing
– Hard to manage compute clusters
– Hard to tune Hadoop
– Hard to monitor running Job Flows
– Hard to debug Hadoop jobs
– Hadoop issues prevent smooth
operation in the cloud
8. PROBLEMS CUSTOMERS SOLVE WITH
ELASTIC MAPREDUCE
• Targeted advertising / Clickstream analysis
• Data warehousing applications
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize JPEGs)
• Web indexing
• Data mining and BI
9. HARDWARE REQUIREMENTS FOR USE CASES
• Data or I/O Intensive (m1/m2 instances)
– Data Warehouse
– Data Mining
• Click stream, logs, events, etc.
• Compute or I/O Intensive (c1, cc1/HPC instances)
– Credit Ratings
– Fraud Models
– Portfolio analysis
– VaR calculation
10. CLICKSTREAM ANALYSIS – RAZORFISH AND BEST BUY
• Best Buy came to Razorfish
– 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads
required per day
[Figure: a user who recently purchased a home theater system and is searching for video games is served a targeted ad (1.7 million per day)]
• Leveraged AWS and Elastic MapReduce
– 100 node cluster on demand
– Processing time dropped from 2+ days to 8 hours
– Increased ROAS (Return on Advertising Spend) by 500%
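The figures on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: it treats "2+ days" as a lower bound of 48 hours, and it assumes the full record set is processed in one run and split evenly across the nodes. The function name `cluster_summary` is hypothetical.

```python
def cluster_summary(hours_before, hours_after, nodes, records):
    """Summarize a before/after comparison for a fixed-size cluster."""
    return {
        # How many times faster the job finishes (a lower bound if
        # hours_before is itself a lower bound).
        "speedup": hours_before / hours_after,
        # Total instance time consumed by one run of the cluster.
        "instance_hours": nodes * hours_after,
        # Records each node handles, ASSUMING an even split.
        "records_per_node": records // nodes,
    }


if __name__ == "__main__":
    # Slide figures: 2+ days -> 8 hours, 100 nodes, 3.5 billion records.
    summary = cluster_summary(
        hours_before=48, hours_after=8, nodes=100, records=3_500_000_000
    )
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Under these assumptions the job runs at least 6 times faster, uses 800 instance-hours per run, and gives each node about 35 million records.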
12. WHAT IS MAPREDUCE?
• Invented by Google
• New processing model
• Highly scalable
• Easy to understand
• Industry standard
• Something worth knowing
13. ELASTIC MAPREDUCE MODEL – OVERVIEW
• Take input data
• Break into sub-problems
• Distribute to worker nodes
• Worker nodes process sub-problems in parallel
• Take output of worker nodes and reduce to answer
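The five steps above can be sketched in a few lines of plain Python. This is a single-process illustration of the model, not Hadoop or Elastic MapReduce code: the word-count task and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions chosen for the example, and the map calls run in a loop here where a cluster would spread them across worker nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # Each record is an independent sub-problem; a real cluster would
    # run these map calls on different worker nodes in parallel.
    intermediate = []
    for record in records:
        intermediate.extend(map_phase(record))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big tools", "data tools"]))
```

For the two sample records, each of the words "big", "data", and "tools" ends up with a count of 2.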
25. NOTES / ATTRIBUTES
• Mapper and Reducer in Java JAR files
• Scale as large as needed
– Data
– Processing
– Add nodes (even while running) to speed up
• No need to manage intermediate data
• Suitable for certain types of problems
– Record-oriented input
– No dependencies between records
• No more MUCK – focus on your problem
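Why "no dependencies between records" matters can be shown with a small sketch: when every record is handled in isolation, the records can be split across any number of workers and the merged answer stays the same, which is what makes adding nodes safe. The log format ("user,page") and the names `process_record`, `worker`, and `run` are hypothetical, and the workers run one after another here instead of on separate machines.

```python
from collections import Counter


def process_record(line):
    """Handle one record in isolation: extract the page from a 'user,page' line."""
    _user, page = line.split(",")
    return page


def worker(chunk):
    """One worker: count pages in its own chunk without seeing any other chunk."""
    return Counter(process_record(line) for line in chunk)


def run(records, num_workers):
    # Round-robin split of the records across the workers.
    chunks = [records[i::num_workers] for i in range(num_workers)]
    total = Counter()
    for chunk in chunks:        # a cluster would run these in parallel
        total += worker(chunk)  # merge the partial counts
    return dict(total)


if __name__ == "__main__":
    log = ["u1,/home", "u2,/cart", "u1,/cart", "u3,/home", "u2,/home"]
    # The answer does not depend on how many workers share the records.
    print(run(log, 1) == run(log, 4))
    print(run(log, 4))
```

A job whose records depend on each other (for example, one that needs the previous line to interpret the current one) would not split this cleanly, which is why the slide limits the model to certain types of problems.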