How to build Business Intelligence System from scratch on AWS (Day1, Day2)
------------------------------------------------------------------------------------------
2020-03-18(수)~19(목) 2일 동안 온라인으로 진행한 Online AWS Analytics Immersion Day 전체 발표 자료 입니다.
BI(Business Intelligence) 시스템을 설계하는 과정에서 AWS Analytics 서비스들을 어떻게 활용할 수 있는지 설명 드리고자 만든 자료 입니다.
Target Audience
-------------------
Online Analytics Immersion Day는 다음과 같은 고객을 대상으로 진행됩니다.
- AWS Analytics Services (ex. Kinesis, Athena, Redshift, EMR, etc)의 기본 개념을 알고 있지만, 이러한 서비스 활용 방법 및 데이터 분석 시스템 구축 과정이 궁금하신 분
- 데이터 분석 시스템을 구축한 경험은 있지만, 자신이 만든 시스템을 아키텍처 관점에서
어떻게 평가하고 확인할 수 있는지 궁금하신 분
8. Data Temperature SpectrumStructure
Hot data Warm data Cold data
Low
High
High Request rate
LowHigh Cost / GB
Low HighLatency
Low HighData Volume
Low
In-Memory SQL
NoSQL Search
Object Storage
Archive
Storage
Graph
9. Simplify Big Data Processing
Collect ConsumeStore Process/
Analyze
Data
1 4
0 9
5
Answers &
Insights
Time to answer(Latency)
Throughput
Cost
13. Relational
Databases
Flat Files
And Many Others!
Retail Data
Ops Data
Marketing Data
Amazon QuickSight
Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone
17. Is it Well-Architected?
QuickSightAmazon RDS
Operation excellence
ü Scale-up: CPU, Memory, Disk
ü Connection Pooling
ü Schema changes: alter table
ü Schema validation
Reliability
ü High availability: Master-Slave fail-over
ü Data loss
Performance efficiency
ü Slow SQL queries caused by poor
database indexing
ü Number of writes >>> Number of reads
Cost optimization
ü Management of RDMS
ü Compute(Instance), Storage cost
Security
ü Encryption data at rest and in transition
18. Key considerations for storing data
• Storage Volume
• Data Model
• Data Scheme
• Access Patterns
ü SQL Supported
ü Put/Get(key, value)
ü Range Query
ü Join Query
• Serverless or Managed Service
• Current skill set
25. Storage Engine Comparison
S3 Redshift Elasticsearch DocumentDB DynamoDB
Storage volume Unlimited
Limited
(~ tens of TB)
Limited
(~ tens of TB)
Limited
(~ tens of TB)
Unlimited
Data model Object Relational Document Document Key-value
Data scheme Schema-free Fixed schema Schema-free (JSON) Schema-free (JSON) (Key, value)
SQL supported Query engine dependent Yes Yes No No
Put/Get (key, value) Yes Yes Yes Yes Yes
Range query Query engine dependent Yes Yes Yes Difficult
Join query Query engine dependent Yes No Difficult Very Difficult
Serverless Yes No No Yes Yes
v https://db-engines.com/en/system/Amazon+DocumentDB%3BAmazon+DynamoDB%3BAmazon+Redshift%3BElasticsearch
26. Use Amazon S3 as Your Persistent Data Store
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Amazon EMR clusters with Amazon EC2 Spot
Instances
• Multiple & heterogeneous analysis clusters and services can use the
same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost
31. Comparison of SQL Processing engines
Data Structure Semi Semi Semi Full
Languages API/SQL SQL SQL SQL
Data Store
S3 (Glue),
S3/HDFS (Spark)
S3/HDFS S3 Local
Use case Transformation
SQL Queries
for S3/HDFS
Serverless SQL
Queries for S3
Fully Featured
SQL Database
Performance
AWS Glue Amazon Athena Amazon Redshift
33. Data Processing: ETL vs ELT
Source1
Source2 Target
Source1
Source2
Staging
tables
Final
tables
Target
(MPP database)
Extract Transform Load
E → T → L
Extract & Load Transform
E → L → T
Source3Source3
34. How about Real-time Analytics?
Kinesis
Data Firehose
S3 Athena QuickSight
Real-time Analysis Layer
ü Streaming data processing
ü Latency: milliseconds ~ seconds
Messages
CDC
Event
Streams
35. Real-time Analytics
Kinesis
Data Firehose
S3 Athena QuickSight
Real-time Analysis Layer
SQS
Kinesis
Data Streams
MSK
ü Streaming data processing
ü Latency: milliseconds ~ seconds
36. • Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
Producer 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
Producer 2
Producer 3
Producer n
Key = Red
Key = Green
Key = Blue
Key = Violet
Amazon Kinesis
Data Streams
Amazon Managed
Streaming for Kafka
…
Why Stream Storage?
37. • Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
• FIFO queue preserves client
ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to
multiple SNS subscribers
(queues or Lambda functions)
Consumers
4 3 2 1
12344 3 2 1
1234
2134
13342
Standard
FIFO
Producers
Amazon SQS Queue
Publisher
Amazon SNS
Topic
AWS Lambda
function
Amazon SQS
queue
Queue
Subscriber
What about Amazon SQS?
40. RDD Operator RDD RDD …Operator
Operator
Data
Sink
OperatorOperator
Data
Source
Amazon EMR
Applications
Framework
Process Layer
Data Layer
Infrastructure
S3
EMRFS
Amazon
S3
Instances Spot Instances
41. Real-time Analytics with Spark or Flink on EMR, RDS
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
QuickSightEMR Amazon RDS
42. Real-time Analytics with Spark or Flink on EMR, DynamoDB
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
EMR DynamoDB real-time
dashboard
43. Real-time Analytics with Spark or Flink on EMR, ElastiCache
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
EMR real-time
dashboard
ElastiCache
44. Amazon Kinesis Data Analytics
Easily process and analyze streaming data with standard SQL
45. Real-time Analytics with Kinesis Data Analytics, RDS
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Kinesis
Data Analytics
Lambda
function
QuickSightAmazon RDSKinesis
Data Streams
46. Real-time Analytics with Kinesis Data Analytics, DynamoDB
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Kinesis
Data Analytics
Lambda
function
Kinesis
Data Streams
real-time
dashboard
DynamoDB
47. Real-time Analytics with Kinesis Data Analytics, ElastiCache
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Kinesis
Data Analytics
Lambda
function
Kinesis
Data Streams
real-time
dashboard
ElastiCache
49. Real-time Analytics with Elasticsearch Service
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Lambda function Elasticsearch Service Kibana
50. 3 ways to build Real-time Analytics
Kinesis
Data Streams
Lambda
function
Elasticsearch Service Kibana
EMR
real-time
dashboard
ElastiCache
Kinesis
Data Analytics
Lambda
function
QuickSight
Amazon RDS
Kinesis
Data Streams
DynamoDB
1
2
3
53. Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Lambda function Elasticsearch Service Kibana
Batch Layer Serving Layer
Speed Layer
Business Intelligence System
Lambda Architecture
57. Beyond Data Analytics – Network Analysis(with Neptune)
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Lambda function Elasticsearch Service Kibana
NeptuneLambda function API GatewayLambda function
59. Beyond Data Analytics: Recommendation(with Personalize)
Kinesis
Data Firehose
S3 Athena QuickSight
Kinesis
Data Streams
Lambda function Elasticsearch Service Kibana
S3
ElastiCache API Gateway
Event
(time-based)
Lambda
function
Personalize
Lambda function
EMR
60. Fully managed
hosting with
auto scaling
One-click
deployment
Notebook
instances
Built-in, high-
performance
algorithms
Automatic
model tuning
One-click
training
Build Train Deploy
Amazon SageMaker
Build, Train, and Deploy Machine Learning Models Quickly & Easily, at
Scale
61. Beyond Data Analytics: Machine Learning(with SageMaker)
Kinesis Data Firehose S3 Athena QuickSight
Kinesis
Data Streams
Lambda function Elasticsearch Service Kibana
TrainNotebook
Model
Models
bucket
SageMaker
Endpoint
SageMaker
64. Collect ConsumeStore Process/
Analyze
Data
1 4
0 9
5 Answers &
Insights
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Streams
Amazon Managed
Streams for Kafka
Amazon S3
Amazon Kinesis
Data Analytics
AWS Glue
Amazon EMR
Amazon Athena Amazon QuickSight
Amazon Redshift
Amazon Elasticsearch
Service
Amazon Machine
Learning
AWS Lambda
65. Lessons Learned: Architectural Principles
• Build decoupled systems
- Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
- Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
- Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
- Immutable logs (data lake), materialized views
• Be cost-conscious
- Big data ≠ Big cost
• Working backwards
- Design from consume to collect
66. Learn more
- Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AWS re:Invent 2018
https://www.slideshare.net/AmazonWebServices/big-data-analytics-architectural-patterns-and-best-
practices-ant201r1-aws-reinvent-2018
- Everything You Need to Know About Big Data: From Architectural Principles to Best Practices
https://www.slideshare.net/AmazonWebServices/everything-you-need-to-know-about-big-data-
from-architectural-principles-to-best-practices
- Big Data Analytics Options on AWS
https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
- AWS Big Data Blog
https://aws.amazon.com/ko/blogs/big-data
- AWS Well-Architected Labs
https://wellarchitectedlabs.com