3. An engineer's definition
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze, and share them
7. Human generated
• Tweet
• Surf the internet
• Buy and sell products
• Upload images and videos
• Play games
• Check in at restaurants
• Search for cafes
• Find deals
• Watch content online
• Look for directions
• Use social media
8. Machine generated
• Networks and security devices
• Mobile phones
• Cell phone towers
• Smart grids
• Smart meters
• Telematics from cars
• Sensors on machines
• Video from traffic and security cameras
10. Big Data Verticals and Use Cases
• Media/Advertising: targeted advertising; image and video processing
• Oil & Gas: seismic analysis
• Retail: recommendations; transaction analysis
• Life Sciences: genome analysis
• Financial Services: Monte Carlo simulations; risk analysis
• Security: anti-virus; fraud detection; image recognition
• Social Network/Gaming: user demographics; usage analysis; in-game metrics
17. Big Data + Cloud = Awesome Combination
Big data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats
AWS Cloud:
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Tools for managing structured and unstructured data
20. Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
31. More than 25 Million Streaming Members
• 50 billion events per day
• 30 million plays every day
• 2 billion hours of video in 3 months
• 4 million ratings per day
• 3 million searches
• Device location, time, day, week, etc.
• Social data
38. Who is Razorfish?
• Full-service digital agency
• Developed an ad-serving platform compatible with most browsers
• Clickstream analysis of data, current and historical trends, and segmentation of users
• Segmentation is used to serve ads and cross-sell
• 45TB of log data
• Problems at scale:
– Giant datasets
– Building infrastructure requires large, continuous investment
– Must build for the peak holiday season
– Traditional data stores are not scaling
51. Why AWS + EMR
• Perfect clarity of cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop the job takes 3 days; with EMR, 8 hours
• Scalability: 1 node x 100 hours = 100 nodes x 1 hour (the same 100 node-hours cost the same either way; the parallel run just finishes about 100x sooner)
• Meet SLAs
52. Playfish improves in-game experience for its users through data mining
Challenge: must understand player usage trends across 50M monthly users, multiple platforms, and tens of games, all in the face of rapid growth. This drives both in-game improvements and defines which games to target next.
Solution: EMR gives Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3, and analysts run ad hoc Hive queries that can slice the data by time, game, and user.
55. Data-Driven Game Design
Data is being used to understand what gamers are doing inside the game (behavioral analysis):
- Which features people like (rely on data instead of forum posts)
- Which features are abandoned
- A/B testing (see the sketch below)
- Monetization: in-game analytics
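To make the A/B testing item concrete, here is a minimal sketch in Python; the event format, the variant names, and the definition of "conversion" are assumptions for illustration, not Playfish's actual schema.

from collections import defaultdict

# Hypothetical events: (user_id, experiment variant, converted?)
events = [
    ("u1", "A", True), ("u2", "A", False),
    ("u3", "B", True), ("u4", "B", True),
]

users = defaultdict(int)        # users seen per variant
conversions = defaultdict(int)  # conversions per variant
for _user, variant, converted in events:
    users[variant] += 1
    conversions[variant] += converted

for variant in sorted(users):
    print(f"variant {variant}: {conversions[variant] / users[variant]:.0%}")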
59. Getting your data into AWS (pattern 1: moving data from your corporate data center into Amazon S3)
• Console upload
• FTP
• AWS Import/Export
• S3 API (see the sketch below)
• Direct Connect
• Storage Gateway
• 3rd-party commercial apps
• Tsunami UDP
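A minimal sketch of the S3 API option using Python and boto3 (the current AWS SDK for Python); the file, bucket, and key names are placeholders.

import boto3

# Upload one local log file into an S3 bucket (names are hypothetical).
s3 = boto3.client("s3")
s3.upload_file("access.log", "my-ingest-bucket", "logs/2013-04-01/access.log")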
60. Pattern 2: write directly to a data source
Your application, whether it runs on Amazon EC2 or elsewhere, writes its records straight to Amazon S3, DynamoDB, or any other data store, as sketched below.
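A minimal sketch of this pattern with boto3; the bucket, table name, and event attributes are assumptions for illustration.

import json
import boto3

event = {"user_id": "u1", "action": "check_in", "ts": "2013-04-01T12:00:00Z"}

# Write the raw event to S3 as an object...
boto3.client("s3").put_object(
    Bucket="my-events-bucket",
    Key="events/u1/2013-04-01.json",
    Body=json.dumps(event),
)

# ...and the same record to a DynamoDB table (schema is hypothetical).
boto3.resource("dynamodb").Table("events").put_item(Item=event)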
61. Pattern 3: queue, pre-process, and then write to a data source
Producers push raw events onto Amazon Simple Queue Service (SQS); worker processes consume the queue, pre-process each message, and write the results to Amazon S3, DynamoDB, or any other data store, as sketched below.
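A minimal producer/worker sketch with boto3; the queue name and the "pre-processing" step are placeholders.

import boto3

queue = boto3.resource("sqs").get_queue_by_name(QueueName="event-queue")

# Producer: enqueue a raw event.
queue.send_message(MessageBody='{"user_id": "u1", "action": "check_in"}')

# Worker: pull a batch, pre-process, write out, then delete from the queue.
for msg in queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=5):
    record = msg.body.strip()  # stand-in for real pre-processing
    # ... write `record` to S3, DynamoDB, or another store here ...
    msg.delete()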
62. Agency customer: video analytics on AWS
[Architecture diagram: edge servers on EC2 behind an Elastic Load Balancer write logs to Amazon S3; Amazon Simple Queue Service (SQS) feeds workers on EC2; an HDFS cluster and Amazon Elastic MapReduce turn the logs into reports.]
63. Pattern 4: aggregate and write to a data source
Flume, running on EC2, aggregates the incoming stream and writes it to Amazon S3, HDFS, or any other data store.
64. What is Flume?
• Collection and aggregation of streaming event data
– Typically used for log data, sensor data, GPS data, etc.
• Significant advantages over ad hoc solutions
– Reliable, scalable, manageable, customizable, and high performance
– Declarative, dynamic configuration
– Contextual routing
– Feature-rich
– Fully extensible
65. Typical Aggregation Flow
[Client]+ → Agent → [Agent]* → Destination
Flume uses a multi-tier approach in which multiple agents can send data to another agent that acts as an aggregator. Each agent can receive data from either an agent or a client, and can send it on to another agent or to a sink. A minimal configuration sketch follows.
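As a concrete illustration, a single-agent Flume NG configuration might look like the following; the agent name, source command, and sink path are assumptions, not a prescribed setup.

# flume.conf: tail a log file into HDFS through an in-memory channel
agent.sources = tail-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -F /var/log/app/events.log
agent.sources.tail-src.channels = mem-ch

agent.channels.mem-ch.type = memory

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs:///flume/events/
agent.sinks.hdfs-sink.channel = mem-ch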
81. Hadoop is…
The MapReduce computational paradigm
…implemented as an open-source, scalable, fault-tolerant, distributed system
82. Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
83-86. A Duration column is computed for each record, one record at a time (End minus Start, in seconds):
Person Start End Duration
Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17 18
Charlie 11:17:24 11:17:38 14
Bob 11:23:10 11:23:25 15
Alice 16:26:46 16:26:54 8
David 17:20:28 17:20:45 17
Alice 18:16:53 18:17:00 7
Charlie 19:33:44 19:33:59 15
Bob 21:13:32 21:13:43 11
David 22:36:22 22:36:34 12
Alice 23:42:01 23:42:11 10
87. The output keeps only (Person, Duration):
Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
88. map: each input record (Person, Start, End) is transformed independently into a (Person, Duration) pair.
90. The shuffle/sort phase then groups the pairs by Person:
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
91-96. reduce: for each Person group, the durations are summed into a total:
Person Total
Alice 25
Bob 49
Charlie 63
David 29
97-98. Recap: the input table (Person, Start, End) and the per-record map output (Person, Duration) from the preceding slides, shown again for the explanation that follows.
99. map: works on one record at a time; here it computes "end time minus start time", in parallel over all the records.
reduce: groups together records that share a key (e.g. "Alice", "Bob") and adds up all the results.
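The same computation in plain Python, as a minimal sketch (this simulates the map and reduce phases in a single process; Hadoop runs them distributed over many nodes):

from collections import defaultdict

records = [
    ("Bob", "00:44:48", "00:45:11"),
    ("Charlie", "02:16:02", "02:16:18"),
    ("Alice", "16:26:46", "16:26:54"),
    ("David", "17:20:28", "17:20:45"),
]

def seconds(t):
    h, m, s = map(int, t.split(":"))
    return h * 3600 + m * 60 + s

# map: works on one record at a time -> (person, duration in seconds)
pairs = [(person, seconds(end) - seconds(start))
         for person, start, end in records]

# shuffle/sort + reduce: group pairs by person and sum the durations
totals = defaultdict(int)
for person, duration in pairs:
    totals[person] += duration

for person in sorted(totals):
    print(person, totals[person])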
106. EMR makes it easy to use Hive and Pig
Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)
107. EMR makes it easy to use other tools and applications
R:
• Language and software environment for statistical computing and graphics
• Open source
Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining
109. Launch a Hive cluster from the CLI (step 1/1)
./elastic-mapreduce --create --alive \
  --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 \
  --instance-type m1.large \
  --hive-interactive \
  --hive-versions 0.7.1
110. SQL interface for working with data
• Simple way to use Hadoop
• A CREATE TABLE statement references the data location on S3 (see the sketch below)
• Language called HiveQL, similar to SQL
• An example of a query could be:
SELECT COUNT(1) FROM sometable;
• Requires setting up a mapping to the input data
• Uses SerDes (serializers/deserializers) to make different input formats queryable
• Powerful data types (Array, Map, …)
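Such a CREATE TABLE statement might look like the following HiveQL sketch; the table name, columns, delimiter, and bucket path are all hypothetical.

CREATE EXTERNAL TABLE sometable (
  user_id STRING,
  action  STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/logs/';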
111. SQL vs. HiveQL
Feature                  SQL                      HiveQL
Updates                  UPDATE, INSERT, DELETE   INSERT OVERWRITE TABLE
Transactions             Supported                Not supported
Indexes                  Supported                Not supported
Latency                  Sub-second               Minutes
Functions                Hundreds                 Dozens
Multi-table inserts      Not supported            Supported
Create table as select   Not valid SQL-92         Supported
126. What is Amazon Elastic MapReduce (EMR)?
EMR is Hadoop in the cloud.
128. 1 instance x 100 hours = 100 instances x 1 hour
129. How does EMR work?
1. Put the data into S3.
2. Choose your Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the EMR cluster using the EMR console, CLI, SDK, or APIs (a sketch follows below).
4. Get the output from S3.
(You can also store everything in HDFS instead.)
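A minimal launch sketch for step 3 using Python and boto3; the cluster name, release label, instance types, and log bucket are placeholders (the modern API uses a release label rather than the Hadoop versions shown elsewhere in this deck).

import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="Test Hive",
    ReleaseLabel="emr-6.15.0",          # assumption: any current release
    Applications=[{"Name": "Hive"}],
    LogUri="s3://my-bucket/emr-logs/",  # hypothetical bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": True,  # the CLI's --alive
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])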
134. Your choice of tools on Hadoop/EMR
[Architecture diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, and log-aggregation tools all feed Amazon EMR.]
135. SQL-based processing
[Same diagram, extended: Amazon EMR acts as a pre-processing framework in front of Amazon Redshift, a petabyte-scale columnar data warehouse.]
139. But data warehouses are…
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance out of
140. Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud
141-142. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Parallelize and distribute everything: MPP applied to load, query, resize, backup, and restore
• Dramatically reduce I/O: direct-attached storage, large data block sizes, column data store, data compression, zone maps
143-144. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Protect operations: Redshift data is encrypted; continuously backed up to S3; automatic node recovery; transparent handling of disk failure
• Simplify provisioning: create a cluster in minutes; automatic OS and software patching; scale up to 1.6PB with a few clicks and no downtime
145. Amazon Redshift: start small and grow big
• Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE; from 1 node (2TB) to a 2-32 node cluster (64TB)
• 8 Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE; 2-100 node cluster (1.6PB)
146. Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
• Easy to provision and scale
• No upfront costs, pay as you go
• High performance at a low price
• Open and flexible, with support for popular BI tools (connection sketch below)
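Redshift speaks the PostgreSQL wire protocol, so standard drivers work. A minimal Python sketch with psycopg2; the cluster endpoint, credentials, and table name are placeholders.

import psycopg2  # standard PostgreSQL driver; Redshift is wire-compatible

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(1) FROM events;")  # hypothetical table
    print(cur.fetchone()[0])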
147. Amazon Redshift is priced to let you analyze all your data
                     Price per hour       Effective hourly   Effective annual
                     (HS1.XL single node) price per TB       price per TB
On-Demand            $0.850               $0.425             $3,723
1-Year Reservation   $0.500               $0.250             $2,190
3-Year Reservation   $0.228               $0.114             $999
(The annual figure is simply the hourly per-TB price over a year: $0.425 x 24 x 365 ≈ $3,723.)
Simple pricing: number of nodes x cost per hour; no charge for the leader node; no upfront costs; pay as you go.
148. Your choice of BI tools on the cloud
[Same diagram: the sources (Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log-aggregation tools) feed Amazon EMR as a pre-processing framework for Amazon Redshift, with your BI tools on top.]
150. Collaboration and sharing insights
[Same diagram: results from Amazon EMR and Amazon Redshift are shared out of the core pipeline.]
151-153. Sharing results and visualizations
[Same diagram, extended step by step: a web app server and visualization tools are added on top of EMR and Redshift, scaled out as demand grows, and business-intelligence tools connect directly to both.]
154. Geospatial visualizations
[Same diagram, extended: GIS tools on Hadoop and standalone GIS tools join the visualization and BI layer.]
156. Rinse and repeat
[Same diagram, extended: AWS Data Pipeline orchestrates the recurring data movement and processing across the stack.]
157. The complete architecture
[Full diagram: ingestion via Amazon SQS and log-aggregation tools; storage in Amazon S3, DynamoDB, or any SQL/NoSQL store; processing in Amazon EMR and Amazon Redshift; visualization, BI, and GIS tools on top; AWS Data Pipeline tying it all together.]
159. Where do you start?
• Where is your data? (S3, SQL, NoSQL?)
– Are you collecting all your data?
– What is the format (structured or unstructured)?
– How much is this data going to grow?
• How do you want to process it?
– SQL (Hive)? Scripts (Python/Ruby/Node.js) on Hadoop?
• How do you want to use this data?
– Visualization tools
• Do it yourself, or engage an AWS partner
• Write to me: sinhaar@amazon.com