AWS On-Site Seminar (Guro, Gasan, Pangyo) - How to Use Big Data on AWS (김일호, Solutions Architect)

Big data on AWS
김일호, Solutions Architect
09-Nov-2016

Agenda
• AWS Big data building blocks
• AWS Big data platform
– Log data collection & storage
– Introducing Amazon Kinesis
– Data Analytics & Computation
– Collaboration & sharing

AWS Big data building blocks (brief)

Use the right tools
• Amazon S3
• Amazon Kinesis
• Amazon DynamoDB
• Amazon Redshift
• Amazon Elastic MapReduce

Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability

Amazon Kinesis
• Real-time processing
• High throughput; elastic
• Easy to use
• Integrations: EMR, S3, Redshift, DynamoDB

Amazon DynamoDB
• NoSQL Database
• Seamless scalability
• Zero admin
• Single digit millisecond latency

Amazon Redshift
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/Year

Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis

[Diagram: Data Sources; AWS Data Pipeline; Data management (HDFS, Amazon Redshift, Amazon RDS, Amazon S3, Amazon DynamoDB, Amazon EMR, Amazon Kinesis); Hadoop Ecosystem analytical tools]

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

[Diagram: Generation; Amazon DynamoDB; Amazon RDS; Amazon Redshift; AWS Direct Connect; AWS Storage Gateway; AWS Import/Export; Amazon Glacier; S3; Amazon Kinesis; Amazon EMR]

[Diagram: Generation; Amazon EC2; Amazon EMR; Amazon Kinesis]

[Diagram: Generation; Amazon Redshift; Amazon DynamoDB; Amazon RDS; S3; Amazon EC2; Amazon EMR; Amazon CloudFront; AWS CloudFormation; AWS Data Pipeline]

The right tools.
At the right scale.
At the right time.

Collection of Data
• Sources
– Web Servers
– Application servers
– Connected Devices
– Mobile Phones
– Etc
• Aggregation Tool
– Scalable method to collect and aggregate
– Flume, Kafka, Kinesis, Queue
• Data Sink
– Reliable and durable destination OR destinations

Types of Data Ingest
• Transactional
– Database reads/writes
• File
– Click-stream logs
• Stream
– Click-stream logs
[Diagram: Database; Cloud Storage; Stream Storage]

Run your own log collector
[Diagram: Your application; Amazon EC2; Amazon S3; DynamoDB; any other data store]

Use a Queue
[Diagram: Amazon Simple Queue Service (SQS); Amazon S3; DynamoDB; any other data store]

Agency Customer: Video Analytics on AWS
[Diagram: Elastic Load Balancer; Edge Servers on EC2; Workers on EC2; Logs; Reports; HDFS Cluster; Amazon Simple Queue Service (SQS); Amazon Simple Storage Service (S3); Amazon Elastic MapReduce]

Use a tool like Flume, Kafka, Honu, etc.
[Diagram: Flume running on EC2; Amazon S3; any other data store; HDFS]

[Diagram: Stream Storage; Database; Cloud Storage]

Why Stream Storage?
• Convert multiple streams into fewer persistent sequential streams
• Sequential streams are easier to process
[Diagram: Producer 1, Producer 2, Producer 3, ... Producer N each send records 1 to 4 into Shard or Partition 1 of Amazon Kinesis or Kafka]

Why Stream Storage?
• Decouple producers and consumers
• Buffer
• Preserve client ordering
• Streaming MapReduce
• Consumer replay / reprocess
[Diagram: Producer 1, Producer 2, Producer 3, ... Producer N each send records 1 to 4 into Amazon Kinesis or Kafka; Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4]
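
The properties listed on the two "Why Stream Storage?" slides (per-producer ordering, buffering, and independent replay by consumers) can be illustrated with a small in-memory sketch. This is not how Amazon Kinesis or Kafka is implemented; `Shard`, `append`, and `read_from` are hypothetical names chosen for illustration, and the four colour-named producers mirror the counts shown in the diagram.

```python
from collections import Counter


class Shard:
    """Toy append-only log standing in for one shard or partition."""

    def __init__(self):
        self._records = []

    def append(self, producer, value):
        # Records are only ever appended, so arrival order is preserved.
        self._records.append((producer, value))
        return len(self._records) - 1  # position of the stored record

    def read_from(self, offset):
        # Reading never removes data, so any consumer can replay from any offset.
        return self._records[offset:]


shard = Shard()
for value in (1, 2, 3, 4):
    for producer in ("red", "violet", "blue", "green"):
        shard.append(producer, value)

# Two consumers read the same shard independently, each starting at offset 0.
consumer_1 = Counter(p for p, _ in shard.read_from(0) if p in ("red", "violet"))
consumer_2 = Counter(p for p, _ in shard.read_from(0) if p in ("blue", "green"))

# Each producer's records keep the order in which that producer sent them.
red_values = [v for p, v in shard.read_from(0) if p == "red"]
```

Because reads do not consume data, a consumer that fails can start again from an earlier offset without affecting the producers or the other consumer.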

Introducing Amazon Kinesis
Managed Service for Real-Time Processing of Big Data
[Diagram: Data Sources; AWS Endpoint; Shard 1, Shard 2, ... Shard N across three Availability Zones; App.1 [Aggregate & De-Duplicate]; App.2 [Metric Extraction]; App.3 [Sliding Window Analysis]; App.4 [Machine Learning]; S3; DynamoDB; Redshift; EMR]

Kinesis Architecture
• Millions of sources producing 100s of terabytes per hour
• Front End: Authentication, Authorization
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Ordered stream of events supports multiple readers
• Aggregate and archive to S3
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
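
As a rough worked example of the price quoted on this slide, the sketch below estimates the PUT charge for a steady workload. The rate of 1,000 puts per second and the 30-day month are assumptions for illustration, and `monthly_put_cost` is a hypothetical helper; other charges are not modelled.

```python
PRICE_PER_MILLION_PUTS = 0.028  # USD, as quoted on the slide


def monthly_put_cost(puts_per_second, days=30):
    """Estimate the PUT charge in USD for a constant put rate."""
    puts = puts_per_second * 60 * 60 * 24 * days
    return puts / 1_000_000 * PRICE_PER_MILLION_PUTS


cost = monthly_put_cost(1000)  # assumed rate: 1,000 puts per second
print(f"{cost:.2f} USD per 30 days")
```

At the assumed rate this is 2,592 million puts over 30 days, or roughly 72.58 USD.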

Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data, and up to 1000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
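
The per-shard limits above imply a simple sizing rule: a stream needs enough shards to satisfy whichever of the three limits is reached first. A minimal sketch, assuming the limits quoted on this slide; `shards_needed` and the example workload are hypothetical.

```python
import math

# Per-shard limits as quoted on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_s, records_s, egress_mb_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingest_mb_s / INGEST_MB_PER_SEC),
        math.ceil(records_s / INGEST_RECORDS_PER_SEC),
        math.ceil(egress_mb_s / EGRESS_MB_PER_SEC),
    )


# Assumed workload: 5 MB/s in, 12,000 records/s, 10 MB/s read by consumers.
print(shards_needed(5, 12_000, 10))
```

In this assumed workload the record rate, not the byte rate, determines the shard count.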
[Diagram: multiple Producers write to Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n of a Kinesis stream]
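
How a Partition Key spreads PUTs across Shards can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit value and each shard owns a range of that hash space; the equal-sized ranges and the `shard_for` helper below are simplifying assumptions, not the service's actual routing code.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal hash-key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# The same key always lands on the same shard, which keeps its records ordered.
for key in ("user-1", "user-2", "user-1"):
    print(key, "-> shard", shard_for(key, 4))
```

A key with very high traffic therefore concentrates load on one shard, so keys are usually chosen to have many distinct values.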

Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
• Key streaming application attributes:
– Be distributed, to handle multiple shards
– Be fault tolerant, to handle failures in hardware or software
– Scale up and down as the number of shards increases or decreases
• Kinesis Client Library (KCL) helps with distributed processing:
– Automatically starts a Kinesis Worker for each shard
– Simplifies reading from the stream by abstracting individual shards
– Increases / decreases Kinesis Workers as # of shards changes
– Checkpoints to keep track of a Worker’s location in the stream
– Restarts Workers if they fail
• Use the KCL with Auto Scaling Groups
– Automatically add EC2 instances when load increases
– KCL redistributes Workers to use the new EC2 instances
[Diagram: Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n of a Kinesis stream, read by KCL Worker 1 to KCL Worker n running on EC2 instances]
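
The combination of checkpointing and worker restarts is what gives at-least-once processing: a restarted worker resumes from the last checkpoint, so records handled after that checkpoint are seen again. The sketch below simulates this in plain Python; it does not use the KCL, and `run_worker`, `state`, and `fail_at` are hypothetical names.

```python
def run_worker(records, state, processed, fail_at=None, checkpoint_every=3):
    """Process records from the saved checkpoint onward.

    Raises RuntimeError at index `fail_at` to simulate a worker crash.
    """
    for index in range(state["checkpoint"], len(records)):
        if index == fail_at:
            raise RuntimeError("simulated worker failure")
        processed.append(records[index])
        if (index + 1) % checkpoint_every == 0:
            state["checkpoint"] = index + 1  # everything before this is done
    state["checkpoint"] = len(records)


records = list(range(10))
state = {"checkpoint": 0}  # stands in for a durable checkpoint store
processed = []

try:
    run_worker(records, state, processed, fail_at=5)
except RuntimeError:
    pass  # a replacement worker is started below

run_worker(records, state, processed)  # resumes from the last checkpoint
```

Here the first worker checkpoints after record 2 and fails before record 5, so records 3 and 4 are processed twice. Downstream logic therefore needs to tolerate duplicates.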

Amazon Kinesis: Key Developer Benefits
• Easy Administration: Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
• Real-time Performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
• High Throughput. Elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
• S3, EMR, Storm, Redshift, & DynamoDB Integration: Reliably collect, process, and transform all of your data in real-time & deliver to AWS data stores of choice, with Connectors for S3, Redshift, and DynamoDB.
• Build Real-time Applications: Client libraries that enable developers to design and operate real-time streaming data processing applications.
• Low Cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.

Customers using Amazon Kinesis

Mobile/Social Gaming
• Deliver continuous/real-time delivery of game insight data by 100’s of game servers
• Custom-built solutions operationally complex to manage, & not scalable
– Delay with critical business data delivery
– Developer burden in building reliable, scalable platform for real-time data ingestion/processing
– Slow-down of real-time customer insights
• Accelerate time to market of elastic, real-time applications while minimizing operational overhead

Digital Advertising Tech.
• Generate real-time metrics, KPIs for online ad performance for advertisers/publishers
• Store + Forward fleet of log servers, and Hadoop based processing pipeline
– Lost data with Store/Forward layer
– Operational burden in managing reliable, scalable platform for real-time data ingestion/processing
– Batch-driven real-time customer insights
• Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

Under NDA
Gaming Analytics with Amazon Kinesis

Digital Ad. Tech Metering with Kinesis
• Continuous Ad Metrics Extraction
• Incremental Ad. Statistics Computation
• Metering Record Archive
• Ad Analytics Dashboard

Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
Database & Storage Tier = All-in-one?

Cloud Database and Storage Tier: Use the Right Tool for the Job!
App/Web Tier
Client Tier
Data Tier
Database & Storage Tier
• Search
• Hadoop/HDFS
• Cache
• Blob Store
• SQL
• NoSQL

Cloud Database and Storage Tier: Use the Right Tool for the Job!
Database & Storage Tier
• Amazon RDS
• Amazon DynamoDB
• Amazon ElastiCache
• Amazon S3
• Amazon Glacier
• Amazon CloudSearch
• HDFS on Amazon EMR

What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold

Data Structure and Query Types vs Storage Technology
• Structured – Simple Query
– NoSQL: Amazon DynamoDB
– Cache: Amazon ElastiCache
• Structured – Complex Query
– SQL: Amazon RDS
– Search: Amazon CloudSearch
• Unstructured – No Query
– Cloud Storage: Amazon S3, Amazon Glacier
• Unstructured – Custom Query
– Hadoop/HDFS: Amazon Elastic MapReduce
Axes: Data Structure Complexity; Query Structure Complexity
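
The four quadrants on this slide amount to a lookup from data structure and query type to a storage technology. A minimal sketch of that lookup, with the mapping taken from the slide; the key spellings and the `suggest_storage` helper are hypothetical.

```python
# (data structure, query type) -> [(technology, AWS service)], as on the slide.
STORAGE_CHOICES = {
    ("structured", "simple"): [
        ("NoSQL", "Amazon DynamoDB"),
        ("Cache", "Amazon ElastiCache"),
    ],
    ("structured", "complex"): [
        ("SQL", "Amazon RDS"),
        ("Search", "Amazon CloudSearch"),
    ],
    ("unstructured", "none"): [
        ("Cloud Storage", "Amazon S3"),
        ("Cloud Storage", "Amazon Glacier"),
    ],
    ("unstructured", "custom"): [
        ("Hadoop/HDFS", "Amazon Elastic MapReduce"),
    ],
}


def suggest_storage(structure, query):
    """Return the candidate AWS services for one quadrant."""
    return [service for _, service in STORAGE_CHOICES[(structure, query)]]


print(suggest_storage("structured", "complex"))
```

The lookup only narrows the candidates; data temperature (hot, warm, cold) is a separate criterion.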

What is the Temperature of Your Data?

[Chart: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon Glacier plotted by Structure (Low to High) against Request Rate (High to Low), Cost/GB (High to Low), Latency (Low to High), and Data Volume (Low to High)]

What Data Store Should I Use?
| Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs
Data volume | GB | GB–TBs (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~ nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (64 KB max) | KB (~ row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | Very High | Very High | High | High | Low – Very High | Low – Very High (no limit) | Very Low (no limit)
Storage cost $/GB/month | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability | Low – Moderate | Very High | High | High | High | Very High | Very High
Columns run left to right from Hot Data through Warm Data to Cold Data.

Decouple your storage and analysis engine
1. Single Version of Truth
2. Choice of multiple analytics Tools
3. Parallel execution from different teams
4. Lower cost

S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3

Choose depending upon design
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; Kinesis]

Process
• Answering questions about data
• Questions
– Analytics: Think SQL/data warehouse
– Classification: Think sentiment analysis
– Prediction: Think page-views prediction
– Etc

Processing Frameworks
Generally come in three major types:
• Batch processing
• Stream processing
• Interactive query

Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: Generating hourly, daily, weekly reports
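
A batch job of this kind reads a complete, already stored data set and answers one question over all of it. The sketch below builds an hourly report from stored log lines; the log format, the sample lines, and `hourly_report` are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed log format: ISO timestamp, a space, then the requested path.
LOG_LINES = [
    "2016-11-09T10:05:00 /home",
    "2016-11-09T10:40:00 /cart",
    "2016-11-09T11:15:00 /home",
]


def hourly_report(lines):
    """Count requests per hour over the whole stored data set."""
    counts = Counter()
    for line in lines:
        timestamp, _path = line.split(" ", 1)
        parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
        counts[parsed.strftime("%Y-%m-%d %H:00")] += 1
    return dict(counts)


print(hourly_report(LOG_LINES))
```

The answer is only available after the whole input has been read, which is why latency grows with data volume.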

Stream Processing (AKA Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
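
A one-minute metric can be computed incrementally as events arrive, without storing the full data set. The sketch below uses a tumbling window and emits a count each time a minute closes. It assumes events arrive in timestamp order; `one_minute_counts` is a hypothetical helper and is not tied to any particular streaming framework.

```python
def one_minute_counts(events):
    """Yield (window_start, count) as each one-minute window closes.

    `events` is an iterable of epoch-second timestamps in arrival order.
    """
    window_start = None
    count = 0
    for ts in events:
        start = ts - ts % 60  # start of the minute this event belongs to
        if window_start is None:
            window_start = start
        if start != window_start:
            yield window_start, count  # the previous minute is complete
            window_start, count = start, 0
        count += 1
    if window_start is not None:
        yield window_start, count  # flush the last, still open window


for window, total in one_minute_counts([0, 10, 59, 60, 61, 130]):
    print(window, total)
```

Only the current window's counter is kept in memory, so results for a minute are available as soon as the first event of the next minute arrives.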

Processing Tools
• Batch processing/analytics
– Amazon Redshift
– Amazon EMR: Hive/Tez, Pig, Spark, Impala, Presto, ...
• Stream processing
– Apache Spark Streaming
– Apache Storm (+ Trident)
– Amazon Kinesis client and connector library

Amplab Big Data Benchmark
Scan query Aggregate query Join query
https://amplab.cs.berkeley.edu/benchmark/

What Batch Processing Technology Should I Use?
| Redshift | Impala | Presto | Spark | Hive
Query Latency (low is better) | Low | Low | Low | Low – Medium | Medium – High
Durability | High | High | High | High | High
Data Volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes
Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI Tools | High | Medium | High | Low | High

What Stream Processing Technology Should I Use?
| Spark Streaming | Apache Storm + Trident | Kinesis Client Library
Scale/Throughput | ~ Nodes | ~ Nodes | ~ Nodes
Data Volume | ~ Nodes | ~ Nodes | ~ Nodes
Manageability | Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault Tolerance | Built-in | Built-in | KCL checkpointing
Programming languages | Java, Python, Scala | Java, Scala, Clojure | Java, Python

Hadoop based Analysis
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR]

Your choice of tools on Hadoop/EMR
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR]

Hadoop based Analysis
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Spark and Shark; Cloudera Impala]

Hadoop is good for
1. Ad Hoc Query analysis
2. Large Unstructured Data Sets
3. Machine Learning and Advanced Analytics
4. Schema less

SQL based Low Latency Analytics on structured data

SQL based processing
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon Redshift; petabyte-scale columnar data warehouse]

SQL based processing for unstructured data
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; pre-processing framework; petabyte-scale columnar data warehouse]

Your choice of BI Tools on the cloud
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; pre-processing framework]

Collaboration and Sharing insights
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift]

Sharing results and visualizations
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; Web App Server; visualization tools]

Sharing results and visualizations and scale
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; Web App Server; visualization tools]

Sharing results and visualizations
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; Business Intelligence Tools]

Geospatial Visualizations
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; Business Intelligence Tools; GIS tools on Hadoop; GIS tools; visualization tools]

Rinse and Repeat
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; Business Intelligence Tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]

The complete architecture
[Diagram: Amazon SQS; Amazon S3; DynamoDB; any SQL or NoSQL store; log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; Business Intelligence Tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]

Expanding analytics architecture

Adding Amazon Kinesis Analytics, Amazon Machine Learning, and Amazon Elasticsearch Service
[Diagram services: Log Generator; Amazon Kinesis; AWS Lambda; Amazon Kinesis Analytics; Amazon Elasticsearch Service; Amazon Simple Storage Service; Amazon Glacier; Amazon Elastic MapReduce; Amazon Redshift; Amazon DynamoDB; Amazon Machine Learning; Other Apps]
[Diagram labels: Streaming; Data Lake; Archive; Semi-structured; Data Warehouse; NoSQL; Predictive Models; Creating summary tables from log table]
