2. Agenda
• AWS Big data building blocks
• AWS Big data platform
– Log data collection & storage
– Introducing Amazon Kinesis
– Data Analytics & Computation
– Collaboration & sharing
19. Collection of Data
Sources → Aggregation Tool → Data Sink
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, queue)
• Data sink: one or more reliable and durable destinations
30. Kinesis Architecture
[Diagram: Amazon Kinesis inside Amazon Web Services, spanning three Availability Zones (AZ)]
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Ordered stream of events supports multiple readers
– Real-time dashboards and alarms
– Machine learning algorithms or sliding window analytics
– Aggregate and archive to S3
– Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
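To make the put price concrete, here is a minimal sketch of the arithmetic. It covers only the per-put figure quoted on the slide; the workload (5,000 puts per second for 30 days) is a hypothetical example, and any other charges are not modeled.

```python
# Price quoted on the slide; treat it as an input, not a current price list.
PRICE_PER_MILLION_PUTS_USD = 0.028


def put_cost_usd(num_puts: int) -> float:
    """Cost of the puts alone at the quoted per-million price."""
    return num_puts / 1_000_000 * PRICE_PER_MILLION_PUTS_USD


# Hypothetical workload: 5,000 puts per second, sustained for 30 days.
monthly_puts = 5_000 * 60 * 60 * 24 * 30
print(f"{monthly_puts:,} puts -> ${put_cost_usd(monthly_puts):,.2f}")
```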
31. Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data and up to 1,000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
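The per-shard limits above translate directly into a sizing rule: a stream needs enough shards to satisfy whichever limit is hit first. A minimal sketch, assuming the three limits from this slide and a hypothetical workload (the function name and the example numbers are illustrative only):

```python
import math

# Per-shard limits as stated on the slide.
INGEST_MB_PER_SEC_PER_SHARD = 1.0
INGEST_RECORDS_PER_SEC_PER_SHARD = 1000
EGRESS_MB_PER_SEC_PER_SHARD = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC_PER_SHARD),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC_PER_SHARD),
        math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC_PER_SHARD),
    )


# Hypothetical workload: 4.5 MB/sec in, 12,000 records/sec, 9 MB/sec read out.
print(shards_needed(4.5, 12_000, 9.0))
```

In this example the record rate, not the byte rate, decides the shard count, which is typical for streams of many small records.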
[Diagram: many Producers put records into a Kinesis stream made of Shard 1, Shard 2, Shard 3, Shard 4, … Shard n]
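How a Partition Key spreads PUTs across Shards can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch assumes the ranges are split evenly across shards, which need not hold for a real stream; the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, num_shards: int) -> int:
    """0-based index of the shard whose range holds the key.

    Assumption: the hash space is divided into num_shards equal ranges.
    """
    return hash_key(partition_key) * num_shards // HASH_SPACE


for key in ("user-1", "user-2", "device-abc"):
    print(key, "-> shard", shard_for(key, 4))
```

Records with the same partition key always land on the same shard, which is what preserves per-key ordering.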
33. Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
[Diagram: a Kinesis stream (Shard 1, Shard 2, Shard 3, Shard 4, … Shard n) read by KCL Worker 1 through KCL Worker n, with the workers spread across several EC2 instances]
• Key streaming application attributes:
– Be distributed, to handle multiple shards
– Be fault tolerant, to handle failures in hardware or software
– Scale up and down as the number of shards increases or decreases
• Kinesis Client Library (KCL) helps with distributed processing:
– Automatically starts a Kinesis Worker for each shard
– Simplifies reading from the stream by abstracting individual shards
– Increases or decreases the number of Kinesis Workers as the number of shards changes
– Checkpoints to keep track of a Worker’s location in the stream
– Restarts Workers if they fail
• Use the KCL with Auto Scaling Groups
– Automatically add EC2 instances when load increases
– The KCL redistributes Workers to use the new EC2 instances
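The two ideas above, one worker per shard spread over the available instances, and checkpoints that let a restarted worker resume, can be illustrated with a small simulation. This is not the KCL API; every name here is hypothetical, and the round-robin placement is an assumption made for the sketch.

```python
from collections import defaultdict


def assign_shards(shard_ids, instance_ids):
    """Place one worker per shard, spreading them round-robin over instances."""
    assignment = defaultdict(list)
    for i, shard_id in enumerate(sorted(shard_ids)):
        assignment[instance_ids[i % len(instance_ids)]].append(shard_id)
    return dict(assignment)


class CheckpointedWorker:
    """Processes one shard's records and checkpoints its position periodically."""

    def __init__(self, records, checkpoint_every=2):
        self.records = records
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0   # position to resume from after a restart
        self.processed = []   # everything handed to the application

    def run(self, fail_at=None):
        """Process from the last checkpoint; return False on a simulated crash."""
        for pos in range(self.checkpoint, len(self.records)):
            if pos == fail_at:
                return False  # progress since the last checkpoint is replayed later
            self.processed.append(self.records[pos])
            if (pos + 1) % self.checkpoint_every == 0:
                self.checkpoint = pos + 1
        self.checkpoint = len(self.records)
        return True


shards = ["shard-1", "shard-2", "shard-3", "shard-4", "shard-5"]
print(assign_shards(shards, ["i-a", "i-b"]))          # before scaling out
print(assign_shards(shards, ["i-a", "i-b", "i-c"]))   # after adding an instance

worker = CheckpointedWorker(["a", "b", "c", "d", "e"])
worker.run(fail_at=3)   # crashes before record "d"; last checkpoint is after "b"
worker.run()            # restart resumes from the checkpoint
print(worker.processed)
```

After the restart, record "c" is processed a second time. That replay is what at-least-once means in practice, so the application logic should tolerate duplicates.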
35. Customers using Amazon Kinesis
Mobile/Social Gaming
• Deliver continuous, real-time game insight data from 100s of game servers
• Custom-built solutions are operationally complex to manage and not scalable
– Delay in delivering critical business data
– Developer burden in building a reliable, scalable platform for real-time data ingestion and processing
– Slow-down of real-time customer insights
• Accelerate time to market of elastic, real-time applications while minimizing operational overhead
Digital Advertising Tech.
• Generate real-time metrics and KPIs for online ad performance for advertisers and publishers
• Store-and-forward fleet of log servers and a Hadoop-based processing pipeline
– Lost data with the store-and-forward layer
– Operational burden in managing a reliable, scalable platform for real-time data ingestion and processing
– Batch-driven real-time customer insights
• Generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients
38. Digital Ad. Tech Metering with Kinesis
• Continuous Ad Metrics Extraction
• Incremental Ad Statistics Computation
• Metering Record Archive
• Ad Analytics Dashboard
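Incremental statistics computation means updating a metric as each record arrives rather than recomputing it from the full history. A minimal sketch, assuming a per-ad running count and mean; the class, the ad ID, and the latency values are hypothetical and are not taken from the customer's pipeline.

```python
from collections import defaultdict


class RunningStats:
    """Count and mean that are updated in constant time per record."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        # Shift the mean toward the new value by 1/count of the difference.
        self.mean += (value - self.mean) / self.count


# One RunningStats per ad ID; the metric here is a hypothetical load time in ms.
stats_by_ad = defaultdict(RunningStats)
for ad_id, load_ms in [("ad-1", 2.0), ("ad-1", 4.0), ("ad-1", 6.0)]:
    stats_by_ad[ad_id].update(load_ms)

print(stats_by_ad["ad-1"].count, stats_by_ad["ad-1"].mean)
```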
43. Cloud Database and Storage Tier: Use the Right Tool for the Job!
[Diagram: Client Tier, App/Web Tier, Data Tier]
Database & Storage Tier: Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL
44. Cloud Database and Storage Tier: Use the Right Tool for the Job!
Database & Storage Tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR
45. What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold
49. What Data Store Should I Use?
 | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs
Data volume | GB | GB–TBs (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~ nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (64 KB max) | KB (~ row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | Very High | Very High | High | High | Low–Very High | Low–Very High (no limit) | Very Low (no limit)
Storage cost ($/GB/month) | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability | Low–Moderate | Very High | High | High | High | Very High | Very High
Data temperature runs from hot data (left) through warm data to cold data (right).
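One way to use the item-size row is as a first filter when picking a store. The sketch below encodes only the explicit maximums listed in the table (stores without a stated maximum are left out), assumes 1024-based units, and uses a hypothetical helper name. The limits are the ones on this slide and may not match current service limits.

```python
KB, MB, TB = 1024, 1024 ** 2, 1024 ** 4  # assumption: binary units

# Explicit item-size maximums from the table above.
MAX_ITEM_BYTES = {
    "Amazon DynamoDB": 64 * KB,
    "Amazon CloudSearch": 1 * MB,
    "Amazon S3": 5 * TB,
    "Amazon Glacier": 40 * TB,
}


def stores_that_fit(item_bytes: int) -> list:
    """Stores whose listed maximum item size can hold an item of this size."""
    return [name for name, limit in MAX_ITEM_BYTES.items() if item_bytes <= limit]


print(stores_that_fit(10 * KB))    # small record
print(stores_that_fit(200 * KB))   # too large for the DynamoDB limit listed here
print(stores_that_fit(10 * TB))    # only the archive tier remains
```

Size is only the first cut; latency, request rate, and cost from the other rows would narrow the choice further.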
50. Decouple your storage and analysis engine
1. Single Version of Truth
2. Choice of multiple analytics Tools
3. Parallel execution from different teams
4. Lower cost
51. S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3
56. Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
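An hourly report is the simplest form of this: read all the stored records, group them by hour, and count. A minimal sketch, assuming a hypothetical log format in which each line starts with an ISO 8601 timestamp followed by a space:

```python
from collections import Counter
from datetime import datetime


def hourly_report(log_lines):
    """Count log lines per hour; returns {"YYYY-MM-DD HH:00": count}."""
    counts = Counter()
    for line in log_lines:
        timestamp, _, _ = line.partition(" ")
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
        counts[hour] += 1
    return dict(sorted(counts.items()))


# Hypothetical log lines standing in for a large set of files at rest.
logs = [
    "2014-05-01T10:15:00 GET /index.html",
    "2014-05-01T10:47:12 GET /about.html",
    "2014-05-01T11:02:30 GET /index.html",
]
print(hourly_report(logs))
```

The whole input is available before the job starts, which is why the answer arrives minutes or hours after the data was produced.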
58. Stream Processing (AKA Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
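A 1-minute metric can be kept up to date as each event arrives by bucketing events into fixed one-minute windows. A minimal sketch, assuming events carry a timestamp in epoch seconds; the class name is hypothetical, and a real stream processor would also handle late events and window expiry.

```python
from collections import defaultdict


class MinuteCounter:
    """Counts events per one-minute window, updated one event at a time."""

    def __init__(self):
        self.windows = defaultdict(int)  # window start (epoch seconds) -> count

    def add(self, epoch_seconds: float) -> int:
        """Record one event and return the running count for its window."""
        window_start = int(epoch_seconds) // 60 * 60
        self.windows[window_start] += 1
        return self.windows[window_start]


counter = MinuteCounter()
for ts in (0, 59, 60, 61, 125):
    counter.add(ts)
print(dict(counter.windows))
```

Unlike the batch case, the current count for a window is available immediately after each event, without waiting for the full data set.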
60. AMPLab Big Data Benchmark
[Charts: scan query, aggregate query, join query]
https://amplab.cs.berkeley.edu/benchmark/
61. What Batch Processing Technology Should I Use?
 | Redshift | Impala | Presto | Spark | Hive
Query latency (low is better) | Low | Low | Low | Low–Medium | Medium–High
Durability | High | High | High | High | High
Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes
Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI tools | High | Medium | High | Low | High
62. What Stream Processing Technology Should I Use?
 | Spark Streaming | Apache Storm + Trident | Kinesis Client Library
Scale/throughput | ~ Nodes | ~ Nodes | ~ Nodes
Data volume | ~ Nodes | ~ Nodes | ~ Nodes
Manageability | Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault tolerance | Built-in | Built-in | KCL checkpointing
Programming languages | Java, Python, Scala | Java, Scala, Clojure | Java, Python
70. SQL based processing
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon Redshift (petabyte-scale columnar data warehouse)]
71. SQL based processing for unstructured data
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR (pre-processing framework); Amazon Redshift (petabyte-scale columnar data warehouse)]
72. Your choice of BI Tools on the cloud
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR (pre-processing framework); Amazon Redshift]
74. Collaboration and Sharing insights
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift]
75. Sharing results and visualizations
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; web app server; visualization tools]
76. Sharing results and visualizations and scale
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; web app server; visualization tools]
77. Sharing results and visualizations
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; business intelligence tools]
79. Rinse and Repeat
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; business intelligence tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]
80. The complete architecture
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; business intelligence tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]