Building an Agile and Cost-Effective Data Lake - Jongmin Moon (문종민), Solutions Architect, AWS

© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jongmin Moon (문종민)
Solutions Architect
Building an Agile and Cost-Effective Data Lake

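How to ask questions during the talk
Questions you submit appear in the GoToWebinar "Questions" pane. By default all questions are answered publicly; if you want a private answer, mark your question as (private).
Disclaimer
This content was produced separately for an online seminar, for customers' convenience, to explain AWS services. If there is any difference or inconsistency between the AWS site and this content, the AWS site (aws.amazon.com) prevails. Likewise, if the Korean translation and the English original on the AWS site differ (including because of translation delays), the English original prevails.
AWS accepts no liability for damages of any kind arising from the use of any information, content, materials, products (including software), or services included in or provided to customers through this content, including but not limited to direct, indirect, incidental, punitive, and consequential damages.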

Agenda for this session
• Data challenges today
• What is a data lake?
• Cost-effective use of a data lake
• Delivering results faster
• Modern data architecture

Today's data types
Diverse data sources and formats, with steadily growing data volume
Categories: documents and files, records, streams
Sources shown: Amazon RDS, Amazon DynamoDB, AWS IoT, on-premises databases, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon Redshift, ERP, spreadsheets, infrastructure logs, clickstream data, mobile app data, social media data, device data, sensor data

What is the problem?
Data types: structured data, unstructured and semi-structured data, dark data
Examples: web and mobile data, logs, social media data, streaming data, IoT data, spreadsheets

What is the problem?
The Data Gap (chart): data volume from 1990 to 2020, generated data versus data available for analysis

Diverse data consumers and requirements
Consumers: data scientists, analysts, business users, applications
Requirements: agile, real time, flexible, scale
Issue shown: data duplication

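What is an AWS data lake?
A data lake is an architecture with a virtually unlimited, centralized storage platform that can categorize, process, analyze, and consume heterogeneous data sets.
Key attributes of an AWS data lake
• Decoupled compute and storage
• Rapid ingestion and transformation of data
• Secure multi-tenancy
• Query in place
• Schema on read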
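As a rough illustration of the "schema on read" attribute, the sketch below keeps raw text untouched and applies column names and types only when the data is read. The sample log lines, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical and are not part of any AWS service.

```python
import csv
import io
from datetime import datetime

# Raw data is stored untouched, as it might sit in an object store.
RAW_LOG = "2018-05-01T10:00:00,10.0.0.1,200\n2018-05-01T10:00:05,10.0.0.2,404\n"

# Hypothetical schema: (column name, converter applied when the data is read).
SCHEMA = [
    ("ts", datetime.fromisoformat),
    ("src_ip", str),
    ("status", int),
]


def read_with_schema(raw_text, schema):
    """Yield one typed dict per row, applying the schema at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


rows = list(read_with_schema(RAW_LOG, SCHEMA))
errors = [r for r in rows if r["status"] >= 400]
print(errors)
```

A different consumer could read the same raw text with a different schema, which is the point of deferring the schema to read time.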

Benefits of an AWS data lake
• Ingest and store all types of data, at any scale, quickly and at low cost
• Quickly find relevant data from a single source of data
• Easily use the data through other AWS services

FINRA: big data analytics on a data lake
Stages: INTAKE, MANAGEMENT, ANALYTICS (validation, normalization, linkage)
Data sources
• Exchange data: 12 equities markets, 4 options markets
• SIP data: SIP trades, SIP NBBO, OPRA
• Broker-dealer data: 4,000-plus firms
• Third-party data: Bloomberg, Thomson Reuters, DTCC, OCC
Characteristics
• Structured and unstructured data
• Millions of documents
• 25K data checks daily
• Normalization
• 33,000 servers daily
• Centralized data
• Normalized data
• Integrated data
• Discoverable
• Direct data query
• ML/AI platforms
• Applications / visualizations
Services shown: Amazon S3, Amazon Glacier, Amazon EMR, Amazon Redshift, Machine Learning, RDS, KMS, IAM, APIs
Highlights
• Analyzes logs of 32 to 35 billion transactions per day
• Runs analytics directly on the data stored in S3, without keeping replicas
• Encrypts data with KMS
• 60% lower cost than on-premises

AWS data lake components
Run any analytics workload, at any scale, at the lowest possible cost
• Insights: QuickSight, SageMaker, Comprehend
• Analytics: Redshift + Spectrum (DW), EMR (big data processing), Athena (interactive), Elasticsearch Service and Kinesis Data Analytics (real-time)
• Data lake: S3/Glacier (storage), Glue (ETL and Data Catalog)
• Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams

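Why use Amazon S3 for your data lake
• Unmatched durability, availability, and scalability
• Best-in-class security, compliance, and audit capabilities
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• A wide range of data ingestion methods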

Cost-effective use of Data Lake

Cost optimization through data tiering
Hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier
• Use EMR/Hadoop with local HDFS for data sets that need frequent access
• Keep less frequently accessed data in S3, and back up rarely accessed data to Glacier to minimize cost
• Use S3 Analytics to analyze storage tiers and optimize the tiering strategy
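One way to automate the S3 part of this tiering is a bucket lifecycle rule. The sketch below builds such a rule as a plain dictionary; the 30-day and 90-day thresholds, the `logs/` prefix, and the helper names are assumptions for illustration, and `s3_client` is expected to be a boto3 S3 client supplied by the caller.

```python
def tiering_lifecycle(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that moves objects to colder tiers."""
    return {
        "Rules": [
            {
                "ID": "tier-down-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_tiering(s3_client, bucket, prefix):
    """Attach the rule to a bucket; s3_client would be boto3.client('s3')."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=tiering_lifecycle(prefix)
    )


config = tiering_lifecycle("logs/")
print(config["Rules"][0]["Transitions"])
```

The thresholds would normally come from observed access patterns, for example the output of S3 Analytics, rather than from fixed defaults.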

Processing data with S3 as the source
Engines: Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Glue
Source: Amazon S3

Amazon EMR: decoupling compute and storage
• Highly distributed processing frameworks such as Hadoop and Spark
• Mix EC2 On-Demand, Reserved, and Spot Instances
• Compress data sets
• Use columnar file formats
• Combine small files
• S3DistCp "groupBy" option
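To make the "groupBy" bullet concrete, here is a minimal sketch of an EMR step that runs S3DistCp, plus a local preview of which files a pattern would merge. The file naming scheme, the regular expression, the bucket paths, and the 128 MiB target size are assumptions; the preview uses Python's `re` module and only approximates how S3DistCp applies the pattern.

```python
import re
from collections import defaultdict

# Hypothetical pattern: files sharing the captured date are merged together.
GROUP_BY = r".*/app-(\d{4}-\d{2}-\d{2})-\d+\.log"


def s3distcp_step(src, dest, group_by=GROUP_BY, target_size_mib=128):
    """Build an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": "Combine small files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=" + src,
                "--dest=" + dest,
                "--groupBy=" + group_by,
                "--targetSize=" + str(target_size_mib),
            ],
        },
    }


def preview_groups(keys, group_by=GROUP_BY):
    """Local approximation of which keys would share an output file."""
    groups = defaultdict(list)
    for key in keys:
        match = re.fullmatch(group_by, key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)


keys = [
    "logs/app-2018-05-01-1.log",
    "logs/app-2018-05-01-2.log",
    "logs/app-2018-05-02-1.log",
]
groups = preview_groups(keys)
step = s3distcp_step("s3://example-bucket/logs/", "s3://example-bucket/merged/")
print(groups)
```

With a boto3 EMR client, the step could then be submitted with `emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of a running cluster.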

Amazon Redshift Spectrum: query exabyte-scale data in S3
• Structured data that can be joined
• Scale concurrency with multiple on-demand clusters
• Columnar file formats
• Data partitioning
• Better query performance through predicate filtering
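The partitioning and predicate-filtering bullets can be pictured with Hive-style `name=value` path segments, a layout commonly used for data in S3. The sketch below is a local simulation only: the bucket name, the `year`/`month` partition columns, and the helper functions are hypothetical, and real engines perform this pruning themselves.

```python
def partition_key(table_prefix, year, month, filename):
    """Build a Hive-style partitioned key: .../year=YYYY/month=MM/file."""
    return f"{table_prefix}/year={year:04d}/month={month:02d}/{filename}"


def parse_partitions(key):
    """Extract partition columns encoded as name=value path segments."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested filter."""
    return [
        k for k in keys
        if all(parse_partitions(k).get(col) == val for col, val in wanted.items())
    ]


keys = [
    partition_key("s3://example-bucket/logs", 2018, m, "part-0.parquet")
    for m in (3, 4, 5)
]
selected = prune(keys, year="2018", month="05")
print(selected)
```

A query that filters on the partition columns only needs to touch the selected objects, which is where the performance and cost benefit comes from.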

Amazon Athena: query without ETL
• Serverless service
• Schema on read
• Compress data sets
• Columnar file formats
• Optimize file sizes
• Optimize queries
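For a sense of what "serverless" means in practice, the sketch below submits a query through the Athena API and polls until it finishes, returning the amount of data scanned. The `athena` argument is assumed to be a boto3 Athena client created by the caller; the function name, database, and output location are placeholders.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for a terminal state, return (state, bytes scanned)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        details = execution["QueryExecution"]
        state = details["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            scanned = details.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)


# Example call, assuming credentials and an existing table:
# import boto3
# state, scanned = run_athena_query(
#     boto3.client("athena"),
#     "SELECT status, COUNT(*) FROM logs GROUP BY status",
#     database="example_db",
#     output_location="s3://example-bucket/athena-results/",
# )
```

Tracking the scanned bytes per query is a simple way to check whether compression, columnar formats, and partitioning are paying off.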

Use the right data format
Billed by the amount of data scanned per query
Use compressed columnar formats
• Parquet
• ORC
Easy integration with a wide range of tools
Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
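The cost column can be reproduced with a short calculation. This is a sketch under two assumptions that are not stated on the slide: a price of $5 per TB scanned (the rate implied by 1.15 TB costing $5.75) and 1 TB = 1024 GB.

```python
PRICE_PER_TB_USD = 5.00  # assumed rate; consistent with 1.15 TB -> $5.75
GB_PER_TB = 1024         # assumed unit conversion


def scan_cost_usd(scanned_gb):
    """Cost of a query that scans the given number of gigabytes."""
    return scanned_gb / GB_PER_TB * PRICE_PER_TB_USD


text_cost = scan_cost_usd(1.15 * GB_PER_TB)   # text logs: 1.15 TB scanned
parquet_cost = scan_cost_usd(2.69)            # Parquet: 2.69 GB scanned

storage_saving = 1 - 130 / (1 * GB_PER_TB)    # 130 GB versus 1 TB on S3
cost_saving = 1 - parquet_cost / text_cost

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
print(f"storage saved: {storage_saving:.0%}, cost saved: {cost_saving:.1%}")
```

Because the charge depends only on bytes scanned, the saving comes from two effects together: the Parquet files are smaller, and a columnar layout lets the query read only the columns it needs.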

How typical tools have worked until now
Download the whole object from S3 into the application, then pick out the needed data in the application

Until now...
Restore the needed objects from Glacier to S3, then use them
Services shown: Amazon S3, Amazon Glacier

Select
Amazon S3 Select and Amazon Glacier Select
Retrieve a subset of data from a single object using SQL expressions

Amazon S3 Select
Retrieve a filtered subset of data from within a single object using standard SQL statements
• The first API that can perform content-aware operations inside Amazon S3
• Unlike Amazon Athena and Spectrum, it runs inside the Amazon S3 system
• A SQL statement runs against a single object, not a group of objects
• Accessible through the SDKs (Java, Python), the AWS CLI, and the Presto connector, with more support to be added
• Well suited for
• Amazon Redshift Spectrum, Amazon Athena, Presto on AWS, and custom query engines
• Log mining

Amazon S3 Select
Input
• Format: delimited text (CSV, TSV, custom), JSON, Parquet
• Compression: none, GZIP, BZIP2
• Encryption: server-side (SSE-C, SSE-S3, SSE-KMS)
Output
• Format: CSV, JSON
Supported SQL
• Clauses: SELECT, FROM, WHERE
• Data types: String; Integer, Float, Decimal; Timestamp; Boolean
• Operators: Conditional, Math, Logical, String (LIKE, ||)
• Functions: String, Cast, Math, Aggregate
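The input and output options above map onto the serialization arguments of the S3 Select API. The sketch below builds the keyword arguments for a GZIP-compressed CSV object with a header row and JSON output; the bucket, key, column names, and the `select_request` helper are hypothetical.

```python
def select_request(bucket, key, sql, compressed=True):
    """Build keyword arguments for an S3 Select call on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {
            # "USE" lets the query refer to columns by their header names.
            "CSV": {"FileHeaderInfo": "USE", "FieldDelimiter": ","},
            "CompressionType": "GZIP" if compressed else "NONE",
        },
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


request = select_request(
    "example-bucket",
    "logs/2018-05-01.csv.gz",
    "SELECT s.src_ip FROM S3Object s WHERE s.status = '404'",
)
print(request["InputSerialization"])
```

With a boto3 S3 client, the request could be sent as `s3_client.select_object_content(**request)`.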

Amazon S3 Select: serverless applications
Components: Amazon S3, Lambda trigger, AWS Lambda, S3 Select, Amazon SNS
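One plausible reading of this diagram is that an S3 upload triggers a Lambda function, which uses S3 Select to pull only the matching rows and then notifies an SNS topic. The sketch below follows that reading; the column layout (third column holds an HTTP status), the `TOPIC_ARN` environment variable, and the helper names are assumptions and do not come from the slide.

```python
import json
from urllib.parse import unquote_plus


def collect_records(response):
    """Join the 'Records' chunks of an S3 Select event stream into text."""
    chunks = [e["Records"]["Payload"] for e in response["Payload"] if "Records" in e]
    return b"".join(chunks).decode("utf-8")


def process_record(s3_client, sns_client, topic_arn, record):
    """Run S3 Select on one uploaded object and publish a summary to SNS."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT s._1, s._3 FROM S3Object s WHERE s._3 = '500'",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    matches = collect_records(response).splitlines()
    if matches:
        message = json.dumps({"bucket": bucket, "key": key, "matches": len(matches)})
        sns_client.publish(TopicArn=topic_arn, Message=message)
    return len(matches)


def handler(event, context):
    """Lambda entry point; clients are created here so the helpers stay testable."""
    import os
    import boto3  # provided by the Lambda Python runtime

    s3, sns = boto3.client("s3"), boto3.client("sns")
    topic_arn = os.environ["TOPIC_ARN"]  # hypothetical configuration value
    return sum(process_record(s3, sns, topic_arn, r) for r in event["Records"])
```

Keeping the AWS clients as parameters of `process_record` means the filtering logic can be exercised locally with stand-in objects before it is deployed.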

Amazon S3 Select: serverless MapReduce
Before
200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….
After
95 seconds and 2.8 cents
# Select IP address and keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key, ExpressionType='SQL',
        Expression='SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj',
        InputSerialization={'CSV': {}}, OutputSerialization={'CSV': {}})
    # The result arrives as an event stream under 'Payload', not as a 'Body'
    records = b''.join(event['Records']['Payload']
                       for event in response['Payload'] if 'Records' in event)
    for line in records.decode('utf-8').splitlines():
        line_count += 1
        try:
            ….
About 2x faster at about a quarter of the cost (200 to 95 seconds, 11.2 to 2.8 cents)

Amazon S3 Select with Presto
Compatible with the existing Hive Metastore; queries are converted to S3 Select requests automatically, with no query changes
Amazon S3
S3 Select

Amazon S3 Select: accelerating big data analytics
Up to 400% faster and up to 80% lower cost
Before: Amazon S3
After: Amazon S3 with S3 Select

Amazon S3 Select support planned for
Amazon Athena, Amazon EMR, Amazon Redshift Spectrum

Using Amazon Glacier Select
Existing restore-object API call arguments: object id, Tier
New (optional) restore-object API call arguments for Glacier Select: SQL query, output S3 path, SNS topic

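How Amazon Glacier Select works
• The app calls Glacier Select (archive ID, SQL, tier, S3 bucket for the output, SNS topic)
• Amazon Glacier responds with 200 OK
• Glacier retrieves and filters the data
• The results are written to Amazon S3
• Amazon SNS sends a notification when the results are ready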
Delivering Results Faster

Data lake performance optimization
Combine small files and data
• EMR: S3DistCp
• Amazon Kinesis Firehose
S3 Select
• Faster, cheaper big data processing
• Up to 400% performance improvement
Data formats
• Columnar formats
• EMRFS consistent view
Services shown: Amazon S3, Amazon DynamoDB

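Amazon Kinesis: real time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams: capture, process, and store video streams for analytics
• Kinesis Data Streams: build applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL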
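As a small illustration of the ingestion side of Kinesis Data Streams, the sketch below writes one JSON event to a stream. The stream name, the event fields, and the choice of `device_id` as the partition key are assumptions; `kinesis_client` is expected to be a boto3 Kinesis client.

```python
import json


def send_event(kinesis_client, stream_name, event, partition_field="device_id"):
    """Write one JSON-encoded event to a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        PartitionKey=str(event[partition_field]),
    )


# Example call, assuming credentials and an existing stream:
# import boto3
# send_event(boto3.client("kinesis"), "example-stream",
#            {"device_id": "sensor-42", "temperature": 21.5})
```

Choosing a partition key with many distinct values helps spread records evenly across shards.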

80% of the work goes into data preparation
Chart categories: building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, other

AWS Glue: serverless data catalog and ETL service
• Data Catalog: discovers data automatically, extracts and stores the schema
• ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Scheduling and execution: schedules and runs ETL jobs
• Serverless
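The catalog and job steps can also be driven through the Glue API. The sketch below registers a scheduled crawler over an S3 path and starts an existing ETL job; the crawler name, IAM role, database, schedule, job name, and job argument are all placeholders, and `glue_client` is assumed to be a boto3 Glue client.

```python
def register_crawler(glue_client, name, role_arn, database, s3_path,
                     schedule="cron(0 2 * * ? *)"):
    """Create a crawler that catalogs an S3 path on a schedule, then run it once."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Schedule=schedule,
    )
    glue_client.start_crawler(Name=name)


def run_etl_job(glue_client, job_name, arguments=None):
    """Start an existing Glue ETL job and return its run ID."""
    response = glue_client.start_job_run(
        JobName=job_name, Arguments=arguments or {}
    )
    return response["JobRunId"]


# Example calls, assuming credentials, an IAM role, and an existing job:
# import boto3
# glue = boto3.client("glue")
# register_crawler(glue, "example-crawler",
#                  "arn:aws:iam::123456789012:role/ExampleGlueRole",
#                  "example_db", "s3://example-bucket/raw/")
# run_etl_job(glue, "example-job", {"--target_path": "s3://example-bucket/staged/"})
```

Once the crawler has populated the Data Catalog, the same table definitions are available to query engines that read from the catalog.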

Amazon SageMaker
The fastest, easiest way to build ML models, from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second

Modern Data Architecture

Modern data architecture
(Architecture diagram)
• Data sources: transactions, web logs / cookies, ERP, connected devices, social media
• Ingest: AWS Database Migration, AWS Direct Connect, internet interfaces, Amazon Kinesis
• Scale (batch): Amazon S3 raw data, Amazon EMR ETL, Amazon S3 staged data (data lake), Amazon Redshift data warehouse, Amazon RDS, advanced analytics (MLlib)
• Speed (real-time): event capture (Amazon Kinesis), stream analysis (Amazon EMR), event scoring (Amazon AI), event handler (AWS Lambda), response handler (AWS Lambda), near-zero latency (Amazon DynamoDB)
• Serving: Amazon Elasticsearch, Amazon Athena, legacy apps, data analysts, data scientists, business users, engagement platforms, automation / events
• Also shown: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS

Choose as needed
(Diagram) Pipeline stages: Collect, Store, Analyze, Visualize
• Data types: transactional data, file data, stream data, search data
• Collect: iOS, Android, web apps, mobile apps, Logstash, logging, IoT, applications
• Store: Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Apache Kafka, Amazon Glacier, Amazon Kinesis, Amazon ElastiCache (search, SQL, NoSQL, cache, file storage, stream storage)
• Analyze: Amazon Redshift, Impala, Pig, Amazon ML, streaming, Amazon Kinesis, AWS Lambda, Amazon Elastic MapReduce, ETL (stream processing, batch, interactive, ML)
• Visualize: Amazon QuickSight, notebooks, predictions, apps and APIs, IDE
• Labels: hot, warm, cold; fast, slow

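AWS Data Lake Solution
• Data lake reference implementation
• Provides a user interface
• Provides a command line interface
• Managed storage encrypted with KMS
• Access control through IAM and related services
• Active Directory integration
• Integrates with Glue and Athena
Can be deployed with a CloudFormation template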

References
S3 Select, Glacier Select - https://aws.amazon.com/ko/blogs/korea/s3-glacier-select/
Data Lake on AWS - https://aws.amazon.com/ko/answers/big-data/data-lake-solution/
Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - https://docs.aws.amazon.com/aws-technical-content/latest/building-data-lakes/building-data-lake-aws.html
Build a Data Lake Foundation with AWS Glue and Amazon S3 - https://aws.amazon.com/ko/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/

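Please leave your feedback to help us improve future seminars!
A survey will start after the webinar ends.
• We will answer your questions.
• Presentation slides and the recorded video will be provided.
http://bit.ly/awskr-webinar
Thank you for attending the AWS Data-Driven Decision Making web seminar.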
