Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Redshift, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
4. Customer Needs
• Store Any Amount of Data
– Without Capacity Planning
• Perform Complex Analysis on Any Data
– Scale on Demand
• Store Data Securely
• Decrease Time to Market
– Build Environments Quickly
• Reduce Costs
– Reduce Capital Expenditure
• Enable Global Reach
6. Elastic Block Store & Simple Storage Service
Elastic Block Store: high performance block storage device, 1GB to 1TB in size; mounts as drives to instances, with snapshot/cloning functionality
Simple Storage Service: highly scalable object storage for the internet; 1 byte to 5TB in size; 99.999999999% durability
S3 at a glance:
Availability: 99.99%
Durability: 99.999999999%
Paradigm: Object store (a web store, not a file system; no single points of failure; eventually consistent)
Performance: Very fast
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.095/GB/month (DUB)
Typical use case: Write once, read many
Limits: 100 buckets, unlimited storage, 5TB objects
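As a concrete illustration of the "web store, not a file system" paradigm, here is a minimal sketch using boto3, the AWS SDK for Python; the bucket and key names are invented, and credentials are assumed to be configured in the environment:

```python
import boto3

s3 = boto3.client("s3")

# Write once: store an object (anything from 1 byte to 5TB; the
# largest sizes go via multipart upload).
s3.put_object(Bucket="my-example-bucket", Key="data/readings.csv",
              Body=b"sensor_id,value\n42,3.14\n")

# Read many: objects are addressed over HTTP by bucket + key,
# not through a file system mount.
obj = s3.get_object(Bucket="my-example-bucket", Key="data/readings.csv")
print(obj["Body"].read().decode())
```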
8. Performance & Scalability
Amazon S3 provides near linear scalability
[Chart: S3 streaming performance, GB/second vs. reader connections]
100 VMs: 9.6GB/s aggregate at $26/hr
350 VMs: 28.7GB/s aggregate at $90/hr
34 secs per terabyte at the 350-VM rate
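A quick back-of-the-envelope check of those figures (assuming the quoted GB/s is aggregate across all reader VMs):

```python
# Per-VM throughput stays roughly constant as the fleet grows,
# which is what "near linear scalability" claims.
runs = [(100, 9.6, 26), (350, 28.7, 90)]  # (VMs, GB/s, $/hr)
for vms, gbps, cost in runs:
    print(f"{vms} VMs: {gbps / vms:.3f} GB/s per VM, ${cost / vms:.2f}/hr per VM")

# Time to stream one terabyte at the larger cluster's rate:
print(f"{1000 / 28.7:.0f} seconds per terabyte")  # ~35s, in line with the slide
```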
9. Spotify uses Amazon S3 for Music Storage
"Amazon S3 gives us confidence in our ability to expand storage quickly while also providing high data durability."
- Emil Fredriksson, Operations Director for Spotify
• Spotify is an online music service offering instant access to over 16 million licensed songs
• Over 15 million active users and 4 million paying subscribers
• Spotify adds over 20,000 tracks a day to its catalogue
10. Amazon Glacier
Long term object archive
Extremely low cost per gigabyte
99.999999999% durability
Designed for archival, not a file system; data is organised into vaults & archives, with a 3-5 hour retrieval time
Glacier at a glance:
Durability: 99.999999999%
Paradigm: Archive store
Performance: Configurable - low
Redundancy: Across Availability Zones
Security: Public key / private key
Pricing: $0.011/GB/month
Typical use case: Write once, read infrequently (< 10% / month)
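Because Glacier is an archive store, reads are asynchronous: you initiate a retrieval job and collect the output hours later. A minimal boto3 sketch, with placeholder vault and archive IDs:

```python
import boto3

glacier = boto3.client("glacier")

# "-" means the caller's own AWS account.
job = glacier.initiate_job(
    accountId="-",
    vaultName="my-archive-vault",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": "EXAMPLE_ARCHIVE_ID"},
)
print("Job started, poll and fetch output after the 3-5 hour window:", job["jobId"])

# Later: glacier.get_job_output(accountId="-", vaultName="my-archive-vault",
#                               jobId=job["jobId"])
```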
11. Storage Lifecycle Integration
Simple Storage Service: highly scalable object storage, 1 byte to 5TB in size, 99.999999999% durability
Glacier: long term object archive, extremely low cost per gigabyte, 99.999999999% durability
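The integration works through S3 lifecycle rules that migrate objects to Glacier automatically. A minimal sketch (bucket name and prefix are made up):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # After 30 days the object lives in Glacier but is still
            # listed in S3; reads then follow Glacier's retrieval model.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```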
12. NoSQL Data Capture
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Integration with EMR & Hive
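"Provisioned throughput" means you declare reads/sec and writes/sec up front and can change them later. A hedged boto3 sketch; table name, key schema and capacity figures are arbitrary:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Tags",
    AttributeDefinitions=[{"AttributeName": "TagId", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "TagId", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 500},
)

# Scale on demand, e.g. ahead of a traffic spike:
dynamodb.update_table(
    TableName="Tags",
    ProvisionedThroughput={"ReadCapacityUnits": 10000, "WriteCapacityUnits": 5000},
)
```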
13. DynamoDB Consistency
Writes
• Writes are acknowledged (committed) once they exist in at least two physical data centers
• Writes are persisted to SSD
Reads
• No reduction in durability or consistency in order to achieve throughput
• Tunable for application requirements
Eventually Consistent Read: stale value reads possible; highest throughput
Strongly Consistent Read: no stale value reads; lower potential throughput
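That trade-off is tuned per request via a single flag. A boto3 sketch with illustrative table and key names:

```python
import boto3

dynamodb = boto3.client("dynamodb")
key = {"TagId": {"S": "tag-123"}}

# Eventually consistent (the default): stale values possible,
# highest throughput.
item = dynamodb.get_item(TableName="Tags", Key=key)

# Strongly consistent: no stale values, lower potential throughput.
item = dynamodb.get_item(TableName="Tags", Key=key, ConsistentRead=True)
```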
14. Shazam scaled DynamoDB to 500,000 IOPS for a Super Bowl ad
"AWS gave us the ability to bring a massive amount of capacity online in a short period of time."
- Jason Titus, Shazam CTO
• Shazam connects more than 200 million people, in more than 200 countries and 33 languages, to the music, TV shows and brands they love
• When customers hear a song or see a TV program or ad they like, they simply activate the app to "tag" it
• Shazam realized it could support over 500,000 writes per second with DynamoDB
• Also using Amazon EMR for large-scale data analysis that can require more than 1 million writes per second
19. Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption (see the sketch after this list)
Scenario #1: Job flow, duration 14 hours
Scenario #2: Job flow with added Spot capacity, duration 7 hours
#1: Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
#2: Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 Spot instances * 7 hrs * $0.25 = $8.75; total = $22.75
Time savings: 50%
Cost savings: ~20%
Other EMR + Spot use cases:
Run entire cluster on Spot for biggest cost savings
Reduce the cost of application testing
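A sketch of scenario #2 with boto3; the cluster ID, bid price and instance type are placeholders. Adding the Spot capacity as task nodes is the key design choice, since task nodes hold no HDFS data and an interruption loses no state:

```python
import boto3

emr = boto3.client("emr")

emr.add_instance_groups(
    JobFlowId="j-EXAMPLEID",
    InstanceGroups=[{
        "Name": "spot-task-nodes",
        "InstanceRole": "TASK",   # no HDFS data on task nodes
        "Market": "SPOT",
        "BidPrice": "0.25",       # max $/hr we are willing to pay
        "InstanceType": "m1.large",
        "InstanceCount": 5,
    }],
)
```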
20. Compute
Elastic Compute Cloud (EC2)
Basic unit of compute capacity
Range of CPU, memory & local disk options
13 instance types available, from micro to cluster compute
Vertical scaling from $0.02/hr
Feature: Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types, from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
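A minimal launch sketch tying several of those features together; the AMI ID and security group name are invented:

```python
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(
    ImageId="ami-12345678",          # AMI: a saved machine configuration
    InstanceType="t1.micro",         # smallest of the instance types
    MinCount=1,
    MaxCount=1,
    SecurityGroups=["my-app-tier"],  # firewall rules applied at launch
)
print(resp["Instances"][0]["InstanceId"])
```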
22. Cluster Compute
EC2 instance: 2nd generation cluster compute (CC2)
Cluster Compute instances implement HVM process execution
Intel® Xeon® E5-2670 processors
10 Gigabit Ethernet
80 EC2 Compute Units, 60GB RAM, 3TB local disk
23. Cluster Compute
Network placement groups
Cluster instances deployed in a 'Placement Group' enjoy low latency, full bisection 10Gbps bandwidth
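A placement-group sketch with boto3; group name, AMI ID and fleet size are illustrative:

```python
import boto3

ec2 = boto3.client("ec2")

# All instances launched into this group share the low-latency,
# full-bisection 10Gbps fabric.
ec2.create_placement_group(GroupName="hpc-group", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-12345678",       # an HVM AMI, as cluster compute requires
    InstanceType="cc2.8xlarge",
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-group"},
)
```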
24. CC2 Instance Cluster
240 TFLOPS, making it the 72nd fastest supercomputer in the world (#42 when announced at SC'11)
(Test performed Nov 2011, benchmark published June 2012)
25. Cluster GPU
EC2 instance: GPU compute instances with Intel® Xeon® X5570 processors
2 x NVIDIA Tesla "Fermi" M2050 GPUs @ >400 cores each
I/O performance: Very high (10 Gigabit Ethernet)
33.5 EC2 Compute Units, 20GB RAM
26. S&P Capital IQ Uses AWS for Big Data Processing
Provides data to 4200+ top global investment firms
Launched Hadoop faster, learned Hadoop faster
[Diagram: data flowing from S3 into a Hadoop cluster]
28. Structured Data Analysis
Relational Database Service (RDS): managed Oracle, MySQL & SQL Server
DynamoDB: managed NoSQL database
Amazon Redshift: massively parallel petabyte scale data warehouse
29. Structured Data Analysis
Relational Database Service
Database-as-a-Service: no need to install or manage database instances
Scalable and fault tolerant configurations
Integration with Data Pipeline
30. Structured Data Analysis
Redshift
Managed massively parallel petabyte scale data warehouse
Streaming backup/restore to S3
Extensive security
Scales from 2TB to 1.6PB
31. Redshift parallelizes and distributes everything
[Diagram: common BI tools connect to the leader node via JDBC/ODBC; the leader node drives compute nodes over a 10GigE mesh]
Query, load, backup, restore and resize operations all run in parallel across the cluster
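Redshift speaks the PostgreSQL wire protocol, so ordinary JDBC/ODBC clients (or, in this sketch, psycopg2) connect to the leader node. Endpoint, credentials and table name are made up; the COPY statement is the parallel load path, where the leader plans the load and compute nodes pull from S3 in parallel:

```python
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
cur = conn.cursor()

# Parallel bulk load from S3 into a table on the compute nodes.
cur.execute("""
    COPY events
    FROM 's3://my-example-bucket/events/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    DELIMITER '\\t' GZIP;
""")
conn.commit()
```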
32. Redshift lets you start small and grow
big
Extra Large Node (XL)
3 spindles, 2TB, 15GiB RAM
2 virtual cores, 10GigE
8 Extra Large Node (8XL)
24 spindles, 16TB, 120GiB RAM
16 virtual cores, 10GigE
Single Node (2TB)
Cluster 2-100 Nodes (32TB – 1.6PB)
Cluster 2-32 Nodes (4TB – 64TB)
33. Important Redshift Features
No Downtime Resize
Streaming Backup/Restore to S3
Automated Point-in-Time Snapshotting
Workload Management
Support for VPC
Support for Encrypted Data Loads
Cluster SSL Only Communications
34. Application Services
Data Pipeline
Automatically provision EC2 & EMR resources
Manage dependencies & scheduling
Automatically retry and notify of success & failure
Input Datanode: this could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: supports all the same data sources as the input datanode, but they don't have to be the same type.
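Not a literal API call, but an illustrative, trimmed-down pipeline definition showing the input datanode -> activity -> output datanode chain described above; the IDs, paths and object fields are invented for the sketch:

```python
# Each object is one node in the pipeline graph; "ref" fields wire
# the activity to its schedule, input and output.
pipeline_objects = [
    {"id": "Daily", "type": "Schedule", "period": "24 hours"},
    {"id": "InputLogs", "type": "S3DataNode",          # input datanode
     "directoryPath": "s3://my-example-bucket/logs/"},
    {"id": "CopyToDb", "type": "CopyActivity",         # scheduled activity
     "input": {"ref": "InputLogs"},
     "output": {"ref": "ReportTable"},
     "schedule": {"ref": "Daily"}},
    {"id": "ReportTable", "type": "SqlDataNode"},      # output of a different type
]
```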