Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
4. Customer Needs
• Store Any Amount of Data
– Without Capacity Planning
• Perform Complex Analysis on Any Data
– Scale on Demand
• Store Data Securely
• Decrease Time to Market
– Build Environments Quickly
• Reduce Costs
– Reduce Capital Expenditure
• Enable Global Reach
6. Elastic Block Store & Simple Storage Service
Elastic Block Store: high performance block storage device, 1GB to 1TB in size; mounts as drives to instances, with snapshot/cloning functionality
Simple Storage Service: highly scalable object storage for the internet; 1 byte to 5TB in size; 99.999999999% durability
S3 at a glance:
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store (a web store, not a file system; no single points of failure; eventually consistent)
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.095/GB/month (DUB)
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects
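As a concrete illustration of the "web store, not a file system" paradigm, here is a minimal sketch using boto3, the AWS SDK for Python; the bucket and key names are invented, and credentials are assumed to be configured in the environment:

```python
import boto3

s3 = boto3.client("s3")

# Write once: store an object (anything from 1 byte to 5TB; the
# largest sizes go via multipart upload).
s3.put_object(Bucket="my-example-bucket", Key="data/readings.csv",
              Body=b"sensor_id,value\n42,3.14\n")

# Read many: objects are addressed over HTTP by bucket + key,
# not through a file system mount.
obj = s3.get_object(Bucket="my-example-bucket", Key="data/readings.csv")
print(obj["Body"].read().decode())
```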
8. Performance & Scalability
Amazon S3 provides near linear scalability
[Chart: S3 streaming performance, GB/second vs. reader connections]
100 VMs: 9.6GB/s aggregate at $26/hr
350 VMs: 28.7GB/s aggregate at $90/hr
34 secs per terabyte at the 350-VM rate
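A quick back-of-the-envelope check of those figures (assuming the quoted GB/s is aggregate across all reader VMs):

```python
# Per-VM throughput stays roughly constant as the fleet grows,
# which is what "near linear scalability" claims.
runs = [(100, 9.6, 26), (350, 28.7, 90)]  # (VMs, GB/s, $/hr)
for vms, gbps, cost in runs:
    print(f"{vms} VMs: {gbps / vms:.3f} GB/s per VM, ${cost / vms:.2f}/hr per VM")

# Time to stream one terabyte at the larger cluster's rate:
print(f"{1000 / 28.7:.0f} seconds per terabyte")  # ~35s, in line with the slide
```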
9. Spotify uses Amazon S3 for Music Storage
"Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability."
- Emil Fredriksson, Operations Director for Spotify
• Spotify is an online music service offering instant access to over 16 million licensed songs
• Over 15 million active users and 4 million paying subscribers
• Spotify adds over 20,000 tracks a day to its catalogue
10. Amazon Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
Designed for archival, not a file system; data is organised into vaults & archives, with a 3-5 hour retrieval time
Glacier at a glance:
Durability: 99.999999999%
Paradigm: Archive store
Performance: Configurable - low
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)
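Because Glacier is an archive store, reads are asynchronous: you initiate a retrieval job and collect the output hours later. A minimal boto3 sketch, with placeholder vault and archive IDs:

```python
import boto3

glacier = boto3.client("glacier")

# "-" means the caller's own AWS account.
job = glacier.initiate_job(
    accountId="-",
    vaultName="my-archive-vault",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": "EXAMPLE_ARCHIVE_ID"},
)
print("Job started, poll and fetch output after the 3-5 hour window:", job["jobId"])

# Later: glacier.get_job_output(accountId="-", vaultName="my-archive-vault",
#                               jobId=job["jobId"])
```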
11. Storage Lifecycle Integration
Simple Storage Service: highly scalable object storage, 1 byte to 5TB in size, 99.999999999% durability
Glacier: long term object archive, extremely low cost per gigabyte, 99.999999999% durability
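The integration works through S3 lifecycle rules that migrate objects to Glacier automatically. A minimal sketch (bucket name and prefix are made up):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # After 30 days the object lives in Glacier but is still
            # listed in S3; reads then follow Glacier's retrieval model.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```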
12. NoSQL Data Capture
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
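"Provisioned throughput" means you declare reads/sec and writes/sec up front and can change them later. A hedged boto3 sketch; table name, key schema and capacity figures are arbitrary:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Tags",
    AttributeDefinitions=[{"AttributeName": "TagId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "TagId", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 500},
)

# Scale on demand, e.g. ahead of a traffic spike:
dynamodb.update_table(
    TableName="Tags",
    ProvisionedThroughput={"ReadCapacityUnits": 10000, "WriteCapacityUnits": 5000},
)
```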
13. DynamoDB Consistency
Writes
• Writes are acknowledged (committed) once they exist in at least two physical data centers
• Writes are persisted to SSD
Reads
• No reduction in durability or consistency in order to achieve throughput
• Tunable for application requirements
Eventually Consistent Read: stale value reads possible; highest throughput
Strongly Consistent Read: no stale value reads; lower potential throughput
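That trade-off is tuned per request via a single flag. A boto3 sketch with illustrative table and key names:

```python
import boto3

dynamodb = boto3.client("dynamodb")
key = {"TagId": {"S": "tag-123"}}

# Eventually consistent (the default): stale values possible,
# highest throughput.
item = dynamodb.get_item(TableName="Tags", Key=key)

# Strongly consistent: no stale values, lower potential throughput.
item = dynamodb.get_item(TableName="Tags", Key=key, ConsistentRead=True)
```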
14. Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl ad
"AWS gave us the ability to bring a massive amount of capacity online in a short period of time."
- Jason Titus, Shazam CTO
• Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love
• When customers hear a song or see a TV program or ad they like, they simply activate the app to "tag" it
• Shazam realized it could support over 500,000 writes per second with DynamoDB
• Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second
19. Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption (see the sketch after this list)
Scenario #1: Job flow, duration 14 hours
Scenario #2: Job flow with added Spot capacity, duration 7 hours
#1: Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
#2: Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 Spot instances * 7 hrs * $0.25 = $8.75; total = $22.75
Time savings: 50%
Cost savings: ~20%
Other EMR + Spot use cases:
Run entire cluster on Spot for biggest cost savings
Reduce the cost of application testing
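A sketch of scenario #2 with boto3; the cluster ID, bid price and instance type are placeholders. Adding the Spot capacity as task nodes is the key design choice, since task nodes hold no HDFS data and an interruption loses no state:

```python
import boto3

emr = boto3.client("emr")

emr.add_instance_groups(
    JobFlowId="j-EXAMPLEID",
    InstanceGroups=[{
        "Name": "spot-task-nodes",
        "InstanceRole": "TASK",   # no HDFS data on task nodes
        "Market": "SPOT",
        "BidPrice": "0.25",       # max $/hr we are willing to pay
        "InstanceType": "m1.large",
        "InstanceCount": 5,
    }],
)
```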
20. Compute
Elastic Compute Cloud (EC2)
Basic unit of compute capacity
Range of CPU, memory & local disk options
13 instance types available, from micro to cluster compute
Vertical scaling from $0.02/hr
Feature: Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types, from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
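A minimal launch sketch tying several of those features together; the AMI ID and security group name are invented:

```python
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(
    ImageId="ami-12345678",          # AMI: a saved machine configuration
    InstanceType="t1.micro",         # smallest of the instance types
    MinCount=1,
    MaxCount=1,
    SecurityGroups=["my-app-tier"],  # firewall rules applied at launch
)
print(resp["Instances"][0]["InstanceId"])
```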
22. Cluster Compute
EC2 instance: 2nd generation cluster compute (CC2)
Cluster Compute instances implement HVM process execution
Intel® Xeon® E5-2670 processors
10 Gigabit Ethernet
80 EC2 Compute Units, 60GB RAM, 3TB local disk
23. Cluster Compute
Network placement groups
Cluster instances deployed in a 'Placement Group' enjoy low latency, full bisection 10Gbps bandwidth
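A placement-group sketch with boto3; group name, AMI ID and fleet size are illustrative:

```python
import boto3

ec2 = boto3.client("ec2")

# All instances launched into this group share the low-latency,
# full-bisection 10Gbps fabric.
ec2.create_placement_group(GroupName="hpc-group", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-12345678",       # an HVM AMI, as cluster compute requires
    InstanceType="cc2.8xlarge",
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-group"},
)
```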
24. CC2 Instance Cluster
240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11)
(Test performed Nov 2011, benchmark published June 2012)
25. Cluster GPU
EC2 instance: GPU compute instances with Intel® Xeon® X5570 processors
2 x NVIDIA Tesla "Fermi" M2050 GPUs @ >400 cores each
I/O performance: Very high (10 Gigabit Ethernet)
33.5 EC2 Compute Units, 20GB RAM
26. S&P Capital IQ Uses AWS for Big Data Processing
Provides data to 4200+ top global investment firms
Launched Hadoop faster, learned Hadoop faster
[Diagram: data flowing from S3 into a Hadoop cluster]
28. Structured Data Analysis
Relational Database Service (RDS): managed Oracle, MySQL & SQL Server
DynamoDB: managed NoSQL database
Amazon Redshift: massively parallel petabyte scale data warehouse
29. Structured Data Analysis
Relational Database Service
Database-as-a-Service: no need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline
30. Structured Data Analysis
Redshift
Managed massively parallel petabyte scale data warehouse
Streaming backup/restore to S3
Extensive security
Scales from 2TB to 1.6PB
31. Redshift parallelizes and distributes everything
[Diagram: common BI tools connect to the leader node via JDBC/ODBC; the leader node drives compute nodes over a 10GigE mesh]
Query, load, backup, restore and resize operations all run in parallel across the cluster
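Redshift speaks the PostgreSQL wire protocol, so ordinary JDBC/ODBC clients (or, in this sketch, psycopg2) connect to the leader node. Endpoint, credentials and table name are made up; the COPY statement is the parallel load path, where the leader plans the load and compute nodes pull from S3 in parallel:

```python
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
cur = conn.cursor()

# Parallel bulk load from S3 into a table on the compute nodes.
cur.execute("""
    COPY events
    FROM 's3://my-example-bucket/events/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '\\t' GZIP;
""")
conn.commit()
```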
32. Redshift lets you start small and grow
big
Extra Large Node (XL)
3 spindles, 2TB, 15GiB RAM
2 virtual cores, 10GigE
8 Extra Large Node (8XL)
24 spindles, 16TB, 120GiB RAM
16 virtual cores, 10GigE
Single Node (2TB)
Cluster 2-100 Nodes (32TB – 1.6PB)
Cluster 2-32 Nodes (4TB – 64TB)
33. Important Redshift Features
No Downtime Resize
Streaming Backup/Restore to S3
Automated Point-in-Time Snapshotting
Workload Management
Support for VPC
Support for Encrypted Data Loads
Cluster SSL Only Communications
34. Application Services
Data Pipeline
Automatically provision EC2 & EMR resources
Manage dependencies & scheduling
Automatically retry and notify of success & failure
Input Datanode: this could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: supports all the same data sources as the input datanode, but they don't have to be the same type.
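Not a literal API call, but an illustrative, trimmed-down pipeline definition showing the input datanode -> activity -> output datanode chain described above; the IDs, paths and object fields are invented for the sketch:

```python
# Each object is one node in the pipeline graph; "ref" fields wire
# the activity to its schedule, input and output.
pipeline_objects = [
    {"id": "Daily", "type": "Schedule", "period": "24 hours"},
    {"id": "InputLogs", "type": "S3DataNode",          # input datanode
     "directoryPath": "s3://my-example-bucket/logs/"},
    {"id": "CopyToDb", "type": "CopyActivity",         # scheduled activity
     "input": {"ref": "InputLogs"},
     "output": {"ref": "ReportTable"},
     "schedule": {"ref": "Daily"}},
    {"id": "ReportTable", "type": "SqlDataNode"},      # output of a different type
]
```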