Big Data Day LA 2015 - The AWS Big Data Platform by Michael Limcaco of Amazon

AWS Big Data Platform
LA Big Data Day 2015
6/27/2015

Data volume scale: GB, TB, PB, EB, ZB
IDC Digital Universe Study projects 44 zettabytes (ZB) by 2020
Source: IDC

Hello Cloud.
(Amazon Web Services)

On demand
Pay as you go
Uniform
Available

Security
Scaling
Analytics
Database
Monitoring
Messaging
Workflow
DNS
Load Balancing
Backup
CDN
Compute
Storage

Enterprise Applications
Platform Services
Administration & Security
Core Services
Infrastructure
Analytics
App Services
Deployment & Management
Mobile Services

Infrastructure: Regions, Availability Zones, Points of Presence
Core Services:
  Storage (Object, Block and Archival)
  Compute (VMs, Auto-scaling and Load Balancing)
  Databases (Relational, NoSQL, Caching)
  Networking (VPC, DX, DNS)
  CDN
Administration & Security: Access Control, Usage Auditing, Monitoring and Logs, Key Storage, Identity Management
Platform Services:
  Deployment & Management: One-click web app deployment, Dev/ops resource management, Resource Templates
  Mobile Services: Push Notifications, Mobile Analytics, Identity, Sync
  App Services: Workflow, Transcoding, Email, Search, Queuing & Notifications, App streaming
  Analytics: Hadoop, Data Pipelines, Data Warehouse, Real-time Streaming Data
Enterprise Applications: Virtual Desktops, Collaboration and Sharing

AWS global footprint
US-WEST (N. California)
US-WEST (Oregon)
US-EAST (Virginia)
GOV CLOUD
SOUTH AMERICA (Sao Paulo)
EU-WEST (Ireland)
EU-CENT (Frankfurt)
ASIA PAC (Tokyo)
ASIA PAC (Singapore)
ASIA PAC (Sydney)
ASIA PAC (Beijing)

Big Data Pipeline
Data -> Collect -> Process -> Analyze -> Answers
Store

Store
Data Collection and Storage
Data Processing
Event Processing
Data Analysis

Object Storage
Streaming Ingest
NoSQL
RDBMS
Event-Driven Framework

S3
Kinesis
DynamoDB
RDS (RDBMS)
AWS Lambda
KCL Apps

Store anything
Object storage
Scalable
99.999999999% durability
Amazon S3
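
To put the eleven nines in concrete terms, the short sketch below converts the durability figure into an expected loss rate. It assumes the figure is an annual, per-object durability and that losses are independent; the object count of 10 million is purely illustrative.

```python
from decimal import Decimal

# Durability figure from the slide, read here as an annual per-object
# probability of survival (an assumption made for illustration).
DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count):
    """Expected number of objects lost per year, assuming independent losses."""
    return Decimal(object_count) * (Decimal(1) - DURABILITY)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_annual_losses(object_count)


if __name__ == "__main__":
    n = 10000000  # hypothetical number of stored objects
    print(expected_annual_losses(n))
    print(years_per_lost_object(n))
```

Under these assumptions, 10 million stored objects work out to roughly one lost object every 10,000 years on average.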

Why is Amazon S3 good for Big Data?
• Unlimited number of objects
• Files from 1 byte to 5 TB
• Encryption
• Versioning
• Lifecycle management
• Deep storage integration
(Diagram: an S3 object and its copies across data centers in a region)

Real-time processing
High throughput; elastic
Easy to use
S3, Redshift, DynamoDB Integrations
Amazon Kinesis

Amazon Kinesis
Managed service for streaming ingest & processing
Sending: HTTP Post, AWS SDK, AWS Mobile SDK, LOG4J, Flume, Fluentd
Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
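
As a rough illustration of the sending side through the AWS SDK, the sketch below writes JSON events to a stream. It assumes a client object exposing put_record(stream_name, data, partition_key), which is how the boto 2 Kinesis connection is understood to behave; the stream name, the `key_field` attribute, and the FakeKinesis stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
import json


def send_events(client, stream_name, events, key_field="user_id"):
    """Send each event dict to a Kinesis stream as one JSON record.

    `client` is any object exposing put_record(stream_name, data,
    partition_key), for example a connection obtained from
    boto.kinesis.connect_to_region(...). `key_field` is a hypothetical
    event attribute used as the partition key.
    """
    responses = []
    for event in events:
        data = json.dumps(event, sort_keys=True)
        partition_key = str(event[key_field])
        responses.append(client.put_record(stream_name, data, partition_key))
    return responses


class FakeKinesis(object):
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.records = []

    def put_record(self, stream_name, data, partition_key):
        self.records.append((stream_name, data, partition_key))
        return {"SequenceNumber": str(len(self.records))}


if __name__ == "__main__":
    fake = FakeKinesis()
    send_events(fake, "clickstream", [{"user_id": 42, "page": "/home"}])
    print(fake.records)
```

Records that share a partition key land on the same shard, so the choice of key decides how evenly the load spreads across the stream.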

Kinesis Stream & Shards
• Streams are made of Shards
• Each Shard ingests data up to 1 MB/sec, and up to 1000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging Shards
• Replay data inside of the 24-hour window
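
The per-shard limits above translate directly into a sizing estimate. The sketch below is one way to compute it, taking the three limits from the slide as constants; the workload numbers in the usage example are hypothetical.

```python
import math

# Per-shard limits quoted on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000.0
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    by_ingest = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC)
    # A stream always has at least one shard.
    return int(max(1, by_ingest, by_records, by_egress))


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/sec in, 3000 records/sec, and two
    # consumers that each read the full stream (10 MB/sec out in total).
    print(shards_needed(5, 3000, 10))
```

Whichever limit is tightest sets the shard count; if the workload later changes, the stream can be resized by splitting or merging shards.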

Real-time event processing frameworks
Kinesis Client Library
AWS Lambda

Example: Amazon Kinesis + Apache Storm
Producer -> Amazon Kinesis -> Apache Storm (Kinesis Storm Spout) -> ElastiCache (Redis) -> Node.js Client (D3)
http://blogs.aws.amazon.com/bigdata/post/Tx36LYSCY2R0A9B/Implement-a-Real-time-Sliding-Window-Application-Using-Amazon-Kinesis-and-Apache

Hadoop
MPP Warehouse
Machine Learning

EMR
Redshift
Machine Learning

Hadoop/HDFS clusters
Hive, Spark, MapReduce
Easy to use; fully managed
On-demand and spot pricing
Amazon EMR

Easy to add and remove compute capacity on your cluster
Match compute demands with cluster sizing.
Resizable clusters

Easy to use Spot Instances
On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed SLA at lower cost.
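
A small worked example of the core/task split: the sketch below estimates the hourly cost of a cluster whose core nodes run on-demand and whose task nodes run on Spot. The hourly price and node counts are hypothetical, and the 90% discount is the upper bound quoted above, not a typical Spot price.

```python
def cluster_hourly_cost(core_nodes, task_nodes, on_demand_price, spot_discount):
    """Hourly cost with on-demand core nodes and discounted Spot task nodes.

    `on_demand_price` is a hypothetical per-node hourly price and
    `spot_discount` is the fraction taken off that price for Spot capacity.
    """
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1.0 - spot_discount)
    return core_cost + task_cost


if __name__ == "__main__":
    price = 0.10  # hypothetical on-demand price per node-hour, in dollars
    all_on_demand = cluster_hourly_cost(4, 10, price, 0.0)
    with_spot = cluster_hourly_cost(4, 10, price, 0.90)
    print(round(all_on_demand, 2), round(with_spot, 2))
```

Because Spot capacity can be reclaimed, this split keeps the data-bearing core nodes on predictable pricing and treats task nodes as extra, cheaper throughput.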

Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3
(Diagram: two EMR clusters reading from Amazon S3)

create external table if not exists omniturelogs
( … )
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location 's3://my-bucket/input/omniturelogs';

The table location can also be an HDFS path: “hdfs://<namenode>/…” or “/<hdfs directory>”

Petabyte scale
Massively parallel
Relational data warehouse
Fully managed
Amazon Redshift

Amazon Redshift
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
(Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore)

Amazon Redshift works with your existing analysis tools
JDBC/ODBC
Amazon Redshift

Easy to use, managed ML
Based on tech used internally at Amazon
Use data stored in the AWS cloud
Deploy models in seconds
Amazon Machine Learning

Amazon Machine Learning
• Binary classification
– Will the customer buy this product or not?
– Is this email spam or not spam?
• Multiclass classification
– Is this movie a comedy, documentary, or thriller?
• Regression/forecasting
– How many units will sell?
– What price will this house sell at?

Building smart applications with Amazon ML
1. Build & train model
2. Evaluate and optimize
3. Retrieve predictions

1. Model Building: Ingest, Explore and understand your data

1. Model Building: Configuration & Training

Real-time predictions
Synchronous, low-latency, high-throughput prediction generation
Request through service API or server or mobile SDKs
Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}
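
An application usually needs only the numeric value from a response shaped like the one above. The helper below is a minimal sketch of that step; the function name is hypothetical, and it assumes the response has exactly the structure shown on the slide for a regression model.

```python
def predicted_value(response):
    """Return the numeric prediction from a REGRESSION response.

    Raises ValueError for other model types, since the layout of their
    responses is not shown here.
    """
    prediction = response["Prediction"]
    model_type = prediction["details"]["PredictiveModelType"]
    if model_type != "REGRESSION":
        raise ValueError("expected a regression model, got %s" % model_type)
    return float(prediction["predictedValue"])


if __name__ == "__main__":
    # Sample response copied from the slide.
    sample = {
        "Prediction": {
            "predictedValue": 13.284348,
            "details": {
                "Algorithm": "SGD",
                "PredictiveModelType": "REGRESSION",
            },
        }
    }
    print(predicted_value(sample))
```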

Batch predictions
Asynchronous, large-volume prediction generation
Request through service console or API
Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')

Sample Prediction Flow
Raw data (S3) -> Process data (EMR) -> Aggregated data (S3) -> Amazon ML -> Predictions (S3)
Your application: query for predictions with batch API

S3
Kinesis
EMR
Redshift
DynamoDB
Collect Process & Analyze Consume
Amazon Machine Learning

Cloud enables big data processing
