This session walks through best-practice architectures for big data analytics on AWS and introduces the features and latest capabilities of Amazon Athena, an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL, along with customer case studies.
Speaker: Greg Khairallah, Business Development Manager, Amazon Big Data and Athena, Amazon Web Services
3. Amazon S3 as your persistent data store
• Amazon S3
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down clusters with no data loss
• Match use case to Analytics services
• Easily evolve your analytic infrastructure as technology evolves
(Diagram: Amazon S3 as the central data store, with Amazon EMR, Amazon Athena, and Amazon Redshift clusters reading from it)
4. Data Ingestion into S3
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
AWS Storage Gateway
S3 Transfer Acceleration
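Of these options, Amazon Kinesis Firehose is the simplest way to stream records into S3 without managing any servers. A minimal boto3 sketch, assuming a delivery stream named `events-to-s3` (a hypothetical name) already exists and delivers to an S3 bucket:

```python
import json
import boto3

# Firehose buffers incoming records and delivers them to S3 in batches.
firehose = boto3.client("firehose", region_name="us-east-1")

def send_event(event: dict) -> None:
    # Each record must be bytes; newline-delimited JSON keeps the resulting
    # S3 objects easy to query later with Athena.
    firehose.put_record(
        DeliveryStreamName="events-to-s3",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 42, "action": "page_view"})
```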
5. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: multiple clusters access the same data
Query data in place using open file formats
Full Amazon Redshift SQL support
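To make "query data in place" concrete: Spectrum queries go through a normal Redshift connection, with external tables in the catalog pointing at files in S3. A hedged sketch using the psycopg2 driver; the endpoint, schema, and table names below are illustrative assumptions, not from the talk:

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol, so psycopg2 works as a driver.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn.cursor() as cur:
    # The external table points at files in S3; Redshift Spectrum scans
    # those files directly, with no loading step required.
    cur.execute("""
        SELECT event_date, COUNT(*)
        FROM spectrum_schema.page_views   -- hypothetical external table over S3 data
        WHERE event_date >= '2017-01-01'
        GROUP BY event_date
        ORDER BY event_date;
    """)
    for row in cur.fetchall():
        print(row)
```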
6. Amazon EMR – Hadoop, Spark, Presto in the Cloud
• Managed platform
• Launch a cluster in minutes
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
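To make "launch a cluster in minutes" and "save with Spot" concrete, here is a minimal boto3 sketch that starts a small transient Spark cluster with core nodes on Spot instances; the instance types, counts, bid price, and log bucket are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient Spark cluster; because the data lives in S3,
# the cluster can be resized or terminated without losing anything.
response = emr.run_job_flow(
    Name="adhoc-spark",
    ReleaseLabel="emr-5.4.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-log-bucket/emr/",        # hypothetical bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.10"},  # pay by the hour, save with Spot
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
)
print(response["JobFlowId"])
```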
7. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
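A minimal sketch of driving Athena from code with boto3: submit a query, poll until it reaches a terminal state, then read the results. The database, table, and results bucket names are assumptions for illustration:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Athena reads the files in S3 directly; query results are also written to S3.
query = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) FROM events.page_views GROUP BY action",
    QueryExecutionContext={"Database": "events"},            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query succeeds, fails, or is cancelled.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```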
8. AWS Glue – Coming Soon
Data Catalog
Hive metastore compatible metadata repository of data sources.
Crawls data source to infer table, data type, partition format.
Job Execution
Runs jobs in Spark containers – automatic scaling based on SLA.
Glue is serverless - only pay for the resources you consume.
Job Authoring
Generates Python code to move data from source to destination.
Edit with your favorite IDE; share code snippets using Git.
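Glue had not launched at the time of this talk, so purely as a forward-looking sketch using the boto3 Glue API as it eventually shipped: a crawler that scans an S3 path and populates the Data Catalog might look like this (crawler, role, database, and bucket names are hypothetical):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A crawler scans the S3 path, infers schema and partitions, and registers
# tables in the Hive-compatible Data Catalog.
glue.create_crawler(
    Name="events-crawler",                       # hypothetical crawler name
    Role="AWSGlueServiceRole-events",            # IAM role Glue assumes
    DatabaseName="events",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/events/"}]},
)
glue.start_crawler(Name="events-crawler")
```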
9. Security
Identity and Access Management (IAM) policies
Bucket policies
Access Control Lists (ACLs)
Private VPC endpoints to Amazon S3
Pre-signed S3 URLs
Encryption
SSL endpoints
Server-Side Encryption (SSE-S3)
S3 Server-Side Encryption with provided keys (SSE-C, SSE-KMS)
Client-side Encryption
Audit & Compliance
Bucket access logs
Lifecycle Management Policies
Versioning & MFA deletes
Certifications – HIPAA, PCI, SOC 1/2/3, etc.
Implement the right cloud security controls
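Two of these controls are easy to show in code. A boto3 sketch that generates a pre-signed S3 URL and writes an object with server-side encryption; the bucket and key names are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Pre-signed URL: grants time-limited access to a private object
# without sharing credentials.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "reports/2017-04.csv"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)

# Server-side encryption: S3 encrypts the object at rest (SSE-S3 here;
# SSE-KMS would pass ServerSideEncryption="aws:kms" plus a key id).
s3.put_object(
    Bucket="my-private-bucket",
    Key="reports/2017-04.csv",
    Body=b"user_id,action\n42,page_view\n",
    ServerSideEncryption="AES256",
)
```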
10. How we’ve created and tamed (or been tamed by) the monster called big data
11. 이상현 Kurt Lee
Vingle Inc
https://www.vingle.net
iOS / Frontend / Backend
Technical Leader
kurt@vingle.net
https://github.com/breath103
12.
13. 1. How to collect
2. Where to store
3. How to query
From millions of clients, with 3 developers, without losing data, cheaply
Really stably, at TB scale
With plain SQL, hopefully
27. a. When Fluentd fails, the “service” goes down
b. Why are we spending EC2 on just a simple proxy API call? (see the sketch after this list)
c. Do you even know how many records are coming in?
d. Are we losing data or not?
e. Using Ruby just to make a network call? Really?
f. I want microservices
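These pain points are what pushed the collection path toward Lambda + Kinesis Firehose: no EC2 proxy to babysit, and delivery handling comes with the service. A minimal sketch of a Lambda handler behind API Gateway that simply forwards each event to Firehose; the stream and field names are assumptions:

```python
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # API Gateway proxy integration passes the request body as a string.
    body = json.loads(event["body"])

    # Forward the record to Firehose; it handles buffering, retries, and
    # delivery to S3, so there is no long-running "service" to keep alive.
    firehose.put_record(
        DeliveryStreamName="client-events",   # hypothetical stream name
        Record={"Data": (json.dumps(body) + "\n").encode("utf-8")},
    )
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```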
34. 1. Too much old data (older than 3 months)
2. We don’t use it “THAT” intensively
3. “Managing” a “cluster”
4. So many “simple aggregation” jobs
(count - group by - 1 day / 1 hour / 5 minutes)
5. Missing the Hadoop ecosystem
Even Redshift can get slow
44. If I need to rebuild “EVERYTHING”
1. Use Lambda / Kinesis Firehose to Collect Data
2. Put “EVERYTHING” at S3
3. Use Athena
1. Intensive / Frequent Query -> Redshift
2. Hadoop ecosystem -> Amazon EMR
3. Complicated data pipeline -> AWS Data Pipeline
1. Luigi / Oozie / Airflow..
48. Serverless characteristics
No servers to provision or manage
Scales with usage
Never pay for idle
Availability and fault tolerance built in
49. Simple Pricing - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset                               | Size on Amazon S3     | Query Run Time | Data Scanned          | Cost
Logs stored as text files             | 1.15 TB               | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
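The three cost levers above (compression, columnar formats, partitioning) all come down to how the table is laid out in S3. A hedged sketch of Athena DDL for a partitioned Parquet table, submitted through boto3; the bucket, database, table, and column names are illustrative:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# DDL statements are free in Athena; only the data a SELECT scans is billed.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events.page_views_parquet (
    user_id BIGINT,
    action  STRING
)
PARTITIONED BY (dt STRING)            -- e.g. dt=2017-04-01
STORED AS PARQUET
LOCATION 's3://my-data-bucket/page_views_parquet/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# After new partitions land in S3, register them so queries can prune by dt
# and scan only the partitions (and columns) they actually need.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE events.page_views_parquet",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```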
50. Familiar Technologies Under the Covers
• Presto – used for SQL queries
  • In-memory distributed query engine
  • ANSI-SQL compatible with extensions
• Hive – used for DDL functionality
  • Complex data types
  • Multitude of formats
  • Supports data partitioning