Introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures (e.g., streaming, real-time intelligence, and analytics). We will review the AWS big data portfolio of services including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), Redshift, Aurora and Machine Learning, and learn how customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
8. Infrastructure: Regions, Availability Zones, Points of Presence
Core Services
• Storage (Object, Block and Archival)
• Compute (VMs, Auto-scaling and Load Balancing)
• Databases (Relational, NoSQL, Caching)
• Networking (VPC, DX, DNS)
• CDN
Administration & Security
• Access Control
• Usage Auditing
• Monitoring and Logs
• Key Storage
• Identity Management
Platform Services
• Deployment & Management: One-click web app deployment, Dev/ops resource management, Resource Templates
• Mobile Services: Push Notifications, Mobile Analytics, Identity, Sync
• App Services: Workflow, Transcoding, Email, Search, Queuing & Notifications, App streaming
• Analytics: Hadoop, Data Pipelines, Data Warehouse, Real-time Streaming Data
Enterprise Applications
• Virtual Desktops
• Collaboration and Sharing
9. AWS global footprint
• US-EAST (Virginia)
• US-WEST (N. California)
• US-WEST (Oregon)
• GOV CLOUD
• SOUTH AMERICA (São Paulo)
• EU-WEST (Ireland)
• EU-CENT (Frankfurt)
• ASIA PAC (Tokyo)
• ASIA PAC (Singapore)
• ASIA PAC (Sydney)
• ASIA PAC (Beijing)
15. Why is Amazon S3 good for Big Data?
• Unlimited number of objects
• Objects from 1 byte to 5 TB
• Encryption
• Versioning
• Lifecycle management
• Deep storage integration
[Figure: an S3 object and its copies stored across data centers within a region]
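The lifecycle management bullet can be made concrete with a small sketch. It builds a rule that archives objects under a prefix and later expires them, in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, day counts, rule ID and the choice of boto3 are assumptions for illustration, not values from the slides.

```python
def build_lifecycle_configuration(prefix, archive_after_days, expire_after_days,
                                  rule_id="archive-then-expire"):
    """Build an S3 lifecycle configuration dict (hypothetical values).

    Objects under `prefix` move to the GLACIER storage class after
    `archive_after_days` and are deleted after `expire_after_days`.
    """
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_configuration("input/omniturelogs/", 30, 365)

# Applying it needs AWS credentials and an existing bucket (not run here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review before it is applied to a bucket.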
17. Amazon Kinesis
Managed service for streaming ingest & processing
Sending
• HTTP Post
• AWS SDK
• AWS Mobile SDK
• LOG4J
• Flume
• Fluentd
Consuming
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
18. Kinesis Stream & Shards
• Streams are made of Shards
• Each Shard ingests up to 1 MB/sec and up to 1,000 TPS (records per second)
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Replay data inside the 24-hour window
• Scale Kinesis streams by splitting or merging Shards
[Figure: a Kinesis stream of Shards read by multiple Workers]
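The per-shard limits above translate directly into a sizing rule: a stream needs enough shards to cover whichever limit is hit first. The following is a minimal sketch of that arithmetic, using the limits quoted on this slide. The function name and the example workloads are hypothetical.

```python
import math

# Per-shard limits as quoted on the slide.
SHARD_INGEST_MB_PER_SEC = 1.0
SHARD_INGEST_RECORDS_PER_SEC = 1000
SHARD_EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_ingest = math.ceil(ingest_mb_per_sec / SHARD_INGEST_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_INGEST_RECORDS_PER_SEC)
    by_egress = math.ceil(egress_mb_per_sec / SHARD_EGRESS_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_ingest, by_records, by_egress)


# Many small records: the records-per-second limit dominates.
small_records = shards_needed(0.2, 5000, 0.4)
# Several consumers reading the same data: the egress limit dominates.
fan_out = shards_needed(2.0, 1000, 12.0)
```

When the result changes as traffic grows or shrinks, splitting or merging shards is how the stream is brought to the new count.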
24. Resizable clusters
Easy to add and remove compute capacity on your cluster
Match compute demands with cluster sizing.
25. Easy to use Spot Instances
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet SLA at predictable cost.
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed SLA at lower cost.
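To see how the core/task split affects cost, here is a rough sketch that prices core nodes at the on-demand rate and task nodes at a discounted Spot rate. The hourly price, node counts and discount are hypothetical placeholders. Real Spot prices vary over time, and the slide only states "up to 90% off".

```python
def hourly_cluster_cost(on_demand_price, core_nodes, task_nodes, spot_discount):
    """Estimate hourly EC2 cost for an EMR cluster (illustrative only).

    Core nodes run on-demand; task nodes run on Spot at
    `spot_discount` (0.0 to 1.0) off the on-demand price.
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0.0 and 1.0")
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical numbers: $0.50/hour on-demand, 4 core nodes, 10 task nodes.
all_on_demand = hourly_cluster_cost(0.50, 4, 10, spot_discount=0.0)
with_spot = hourly_cluster_cost(0.50, 4, 10, spot_discount=0.5)
```

With these placeholder values, moving the task nodes to Spot lowers the hourly estimate while the core nodes, which hold HDFS data, stay on predictable on-demand capacity.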
26. Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR
clusters with no data loss
• Point multiple Amazon EMR clusters at
same data in Amazon S3
[Figure: multiple Amazon EMR clusters reading the same data in Amazon S3]
27. create external table if not exists omniturelogs
( … )
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location 's3://my-bucket/input/omniturelogs';
28. create external table if not exists omniturelogs
( … )
row format delimited
fields terminated by '\t'
lines terminated by '\n'
location 's3://my-bucket/input/omniturelogs';
The location can also be 'hdfs://<namenode>/…' or '/<hdfs directory>'.
35. Amazon Machine Learning
• Easy to use, managed ML
• Based on tech used internally at Amazon
• Use data stored in the AWS cloud
• Deploy models in seconds
36. Amazon Machine Learning
• Binary classification
– Will the customer buy this product or not?
– Is this email spam or not spam?
• Multiclass classification
– Is this movie a comedy, documentary, or thriller?
• Regression/forecasting
– How many units will sell?
– What price will this house sell at?
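The three problem types above differ mainly in the kind of target value being predicted. The sketch below is a simple heuristic that suggests a type from sample target values. It is an illustration of the distinction only, not how Amazon Machine Learning itself selects a model type, and the function name is hypothetical.

```python
def suggest_model_type(target_values):
    """Suggest BINARY, MULTICLASS or REGRESSION from sample targets.

    Heuristic for illustration: two distinct values suggest binary
    classification, more than two numeric values suggest regression,
    and anything else suggests multiclass classification.
    """
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    if len(distinct) == 2:
        return "BINARY"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in distinct):
        return "REGRESSION"
    return "MULTICLASS"


purchase = suggest_model_type(["buy", "not buy", "buy"])
genre = suggest_model_type(["comedy", "documentary", "thriller"])
units_sold = suggest_model_type([120, 85, 300, 42])
```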
41. Real-time predictions
Synchronous, low-latency, high-throughput prediction generation
Request through the service API or the server or mobile SDKs
Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}
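An application usually wants a single value out of the response shown above. The helper below is a minimal sketch that reads `predictedValue` for regression models and `predictedLabel` for classification models. The helper name is hypothetical, and the classification response in the example is made up to show the shape.

```python
def extract_prediction(response):
    """Pull the predicted value or label out of a predict() response."""
    prediction = response["Prediction"]
    model_type = prediction["details"]["PredictiveModelType"]
    if model_type == "REGRESSION":
        # Regression models return a number.
        return prediction["predictedValue"]
    # BINARY and MULTICLASS models return a label.
    return prediction["predictedLabel"]


# The regression response from the slide.
regression_response = {
    "Prediction": {
        "predictedValue": 13.284348,
        "details": {"Algorithm": "SGD", "PredictiveModelType": "REGRESSION"},
    }
}

# A made-up binary classification response for comparison.
binary_response = {
    "Prediction": {
        "predictedLabel": "1",
        "predictedScores": {"1": 0.93},
        "details": {"Algorithm": "SGD", "PredictiveModelType": "BINARY"},
    }
}

value = extract_prediction(regression_response)
label = extract_prediction(binary_response)
```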
42. Batch predictions
Asynchronous, large-volume prediction generation
Request through service console or API
Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')
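Because batch predictions are asynchronous, the caller has to check back for completion before reading results from the output location. The sketch below polls a client's `get_batch_prediction` call until the job reaches a terminal status. The function name, polling interval and retry limit are assumptions, and `ml` can be any object exposing that method, such as the boto connection used above.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "DELETED"}


def wait_for_batch_prediction(ml, batch_prediction_id, poll_seconds=30,
                              max_polls=120, sleep=time.sleep):
    """Poll until the batch prediction finishes; return its final status."""
    for _ in range(max_polls):
        status = ml.get_batch_prediction(batch_prediction_id)["Status"]
        if status in TERMINAL_STATUSES:
            return status
        # Still PENDING or INPROGRESS: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(
        "batch prediction %s did not finish in time" % batch_prediction_id)


# Usage against a real connection (needs AWS credentials, not run here):
# import boto
# ml = boto.connect_machinelearning()
# final_status = wait_for_batch_prediction(ml, "my_batch_prediction")
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.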
43. Sample Prediction Flow
[Figure: your application processes raw data in S3 with EMR, stores the aggregated data in S3, then queries Amazon ML for predictions with the batch API; the predictions are written to S3]