Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. This session covers best practices and considerations for building a data warehouse and analyzing data with Redshift, along with practical considerations for using Redshift Spectrum, which lets you run complex queries directly against exabyte-scale data in Amazon S3.
Speaker: YoungJoon Jeong, AWS Solutions Architect
2017 AWS DB Day | A Deep Dive into Amazon Redshift
1. A Deep Dive into Amazon Redshift
Building Access Log Analysis System for Rainist
YoungJoon Jeong - AWS Solutions Architect
Time : 16:20 – 17:20
Sunghyun Hwang - CTO of Rainist
2. We start with the customer… and innovate
Customers told us… → We created…
Managing databases is painful & difficult → Amazon RDS
SQL DBs do not perform well at scale → Amazon DynamoDB
Hadoop is difficult to deploy and manage → Amazon EMR
DWs are complex, costly, and slow → Amazon Redshift
Commercial DBs are punitive & expensive → Amazon Aurora
Streaming data is difficult to capture & analyze → Amazon Kinesis
BI tools are expensive and hard to manage → Amazon QuickSight
Enterprise-class, accelerated computing instances → X1, P2, G2, I3 instances*
*https://aws.amazon.com/intel/
3. AWS Big Data Portfolio
Collect: Kinesis, Kinesis Firehose, Database Migration, Import/Export, Direct Connect
Store: S3, Glacier, DynamoDB, RDS, Aurora
Analyze: Redshift, EMR, EC2, Machine Learning, Elasticsearch, Data Pipeline, CloudSearch, Kinesis Analytics, QuickSight, Amazon Athena
5. Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
6. NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology
Selected Amazon Redshift Customers
7. Redshift is used for mission-critical workloads
Payments to suppliers and billing workflows
Web/Mobile clickstream and event analysis
Recommendation and predictive analytics
Financial and management reporting
8. Amazon Redshift Architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
Ingestion, backup, and restore via Amazon S3
Client access via JDBC/ODBC
10 GigE (HPC) network between nodes
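The ingestion path in this architecture is typically driven by the COPY command, which the compute nodes run in parallel straight from S3. A minimal sketch; the table name, bucket path, and IAM role ARN are placeholders, not from the deck:

```sql
-- Load pipe-delimited, gzip-compressed files in parallel from S3.
-- Table name, bucket path, and role ARN are hypothetical.
COPY logs
FROM 's3://my-bucket/logs/2016/06/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP
REGION 'us-east-1';
```

Each compute node slice loads a share of the input files, so splitting input into multiple files scales the load across the cluster.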
10. SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table: block [MIN, MAX] date ranges overlap, so most blocks must be read:
[01-JUNE, 20-JUNE], [08-JUNE, 30-JUNE], [12-JUNE, 20-JUNE], [02-JUNE, 25-JUNE]
Sorted by date: block ranges are disjoint, so the zone map skips every block but one:
[01-JUNE, 06-JUNE], [07-JUNE, 12-JUNE], [13-JUNE, 18-JUNE], [19-JUNE, 24-JUNE]
Benefit #1: Amazon Redshift is fast
Sort Keys and Zone Maps
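Declaring the sort key is a one-line table attribute. A minimal sketch, assuming a hypothetical logs table shaped like the one in the example query above:

```sql
-- Rows are stored on disk in log_date order, so each 1 MB block's
-- zone-map [min, max] range stays narrow and disjoint.
CREATE TABLE logs (
  log_id   BIGINT,
  log_date DATE,
  message  VARCHAR(2000)
)
SORTKEY (log_date);

-- Blocks whose [min, max] range excludes the date are skipped entirely:
SELECT COUNT(*) FROM logs WHERE log_date = '2016-06-09';
```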
11. Benefit #1: Amazon Redshift is fast
Parallel and Distributed
Query
Load
Export
Backup
Restore
Resize
12. Benefit #1: Amazon Redshift is fast
Distribution Keys
A table (ID, Name) with rows 1-6 is spread across slices, e.g.:
Slice 1: (1, John Smith), (4, Pat Partridge)
Slice 2: (2, Jane Jones), (5, Sarah Cyan)
Slice 3: (3, Peter Black), (6, Brian Snail)
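The distribution style is likewise declared at table creation. A hedged sketch with made-up tables: a DISTKEY co-locates rows that share a join key, while DISTSTYLE ALL replicates a small dimension table to every node:

```sql
-- Rows with the same customer_id land on the same slice,
-- so joins on customer_id need no network redistribution.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  total       DECIMAL(12,2)
)
DISTKEY (customer_id);

-- Small dimension table copied in full to every node:
CREATE TABLE regions (
  region_id   INT,
  region_name VARCHAR(64)
)
DISTSTYLE ALL;
```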
14. Benefit #1: Amazon Redshift is fast
H/W optimized for I/O intensive workloads, 4GB/sec/node
Enhanced networking, over 1M packets/sec/node
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Example: Our new Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: same as our prior generation!
16. Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD), DS2.XL single node: price per hour / effective annual price per TB compressed
On-Demand: $1.150 / $5,037
1 Year Reservation: $0.670 / $2,910
3 Year Reservation: $0.280 / $1,226

DC1 (SSD), DC1.L single node: price per hour / effective annual price per TB compressed
On-Demand: $0.300 / $15,768
1 Year Reservation: $0.190 / $9,360
3 Year Reservation: $0.110 / $5,782

Pricing is simple
Number of nodes x price/hour
No charge for leader node
No up front costs
Pay as you go
17. Benefit #2: Amazon Redshift lets you start small and grow big
Dense Storage (DS2.XL)
2 TB HDD, 31 GB RAM, 2 slices/4 cores
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Dense Storage (DS2.8XL)
16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
Cluster 2-128 Nodes (32 TB – 2 PB)
Note: Nodes not to scale
18. Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to S3
Continuous and incremental backups across regions
Streaming restore
Backups flow to Amazon S3 in Region 1 and are copied to Amazon S3 in Region 2
Benefit #3: Amazon Redshift is fully managed
19. Benefit #3: Amazon Redshift is fully managed
Cross-region replication: Amazon S3 (Region 1) to Amazon S3 (Region 2)
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/Region level disasters
20. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
Diagram: clients connect over JDBC/ODBC from the customer VPC; leader and compute nodes run in an internal VPC with a 10 GigE (HPC) network; ingestion, backup, and restore go through S3.
21. •Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo, Singapore, Sydney) Regions
•MANIFEST option for the COPY & UNLOAD commands
•SQL Functions: Most recent queries
•Resource-level IAM, CRC32
•Data Pipeline
•Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
•Cross-Region Snapshot Copy
•Audit features, cursor support, 500 concurrent client connections
•EIP Address for VPC Cluster
•New system views to tune table design and track WLM query queues
•Custom ODBC/JDBC drivers; Query Visualization
•Mobile Analytics auto export
•KMS for GovCloud Region; HIPAA BAA
•Interleaved sort keys
•New Dense Storage Nodes (DS2) with better RAM and CPU.
•New Reserved Storage Nodes: No, Partial & All Upfront Options
•Cross-region backups for KMS encrypted clusters
•Scalar UDFs in Python
•AVRO Ingestion; Kinesis Firehose; Database Migration Service (DMS)
•Modify Cluster Dynamically
•Tag-based permissions and BZIP2
•System tables for query tuning
•Dense Compute Nodes
•Gzip & Lzop; JSON , RegEx, Cursors
•EMR Data Loading & Bootstrap Action with COPY command; WLM concurrency limit to 50; support for the ECDH cipher suites for SSL connections; FedRAMP
•Cross-region ingestion
•Free trials & price reductions in Asia Pacific
•CloudWatch Alarm for Disk Usage
•AES 128-bit encryption; UTF-16; KMS Integration
•EU (Frankfurt); GovCloud Regions
•S3 server-side encryption support for UNLOAD
•Tagging Support for Cost-allocation
•WLM Queue-Hopping for timed-out queries
•Append rows & Export to BZIP-2
•Lambda for Clusters in VPC; Data Schema Conversion Support from ML Console
•US West (N. California) Region.
Benefit #5: We innovate quickly
2013 2014 2015 2016
100+ new features added since launch
Release every two weeks
Automatic patching
22. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
Amazon ML
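Scalar UDFs in Python (a feature from the 2015 timeline above) are declared with CREATE FUNCTION; the function name and body here are illustrative, not from the deck:

```sql
-- A minimal scalar Python UDF; runs per-row inside Redshift.
CREATE FUNCTION f_hypotenuse (a FLOAT, b FLOAT)
RETURNS FLOAT
IMMUTABLE
AS $$
  import math
  return math.sqrt(a * a + b * b)
$$ LANGUAGE plpythonu;

-- Usable anywhere a scalar expression is allowed:
SELECT f_hypotenuse(3, 4);  -- 5.0
```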
23. Benefit #7: Amazon Redshift has a large ecosystem
Data Integration | Systems Integrators | Business Intelligence
38. Building Access Log Analysis System
Using AWS Redshift
Purpose of the log analysis system
1. Collecting statistics for financial institutions
2. Analysis and feedback data for designing a better customer experience
3. Detecting abnormal requests
39. Building Access Log Analysis System
Using AWS Redshift
Performance reports for financial institutions
1. Impression/click/application counts for all products, per institution
2. Impression/click/application counts for each institution's popular products
3. Entry counts and average entry amounts for the merchants users visit most
4. Entry counts and average entry amounts for the merchants where users spend the most
40. Challenges in building a log analysis system
Building Access Log Analysis System
Using AWS Redshift
1. Log data that keeps growing as the service grows
42. Challenges in building a log analysis system
Building Access Log Analysis System
Using AWS Redshift
1. Log data that keeps growing as the service grows
2. Data needed for analysis is fragmented (in an MSA environment)
3. A small team (one server engineer and one data engineer)
2016-01-30: 50,000 records
2017-02-14: 3,000,000 records
43. Building Access Log Analysis System
Using AWS Redshift
Why Banksalad adopted Redshift
1. Log data that keeps growing as the service grows
• Performance suited to a data warehouse deployment
2. Data needed for analysis is fragmented (in an MSA environment)
• Easy pipeline construction using AWS services
3. A small team (one server engineer and one data engineer)
• A familiar, comfortable environment for the team (SQL, Postgres interface)
44. Building Access Log Analysis System
Using AWS Redshift
Banksalad log analysis system architecture
Amazon Route53, ELB, Amazon S3, AWS Data Pipeline, Amazon Redshift,
Amazon ECS (Card Domain), Amazon RDS (Aurora), Amazon ECS (Data Analysis),
Amazon SES, Amazon CloudWatch
45. Banksalad log analysis system results
Building Access Log Analysis System
Using AWS Redshift
47. Big Data in real world
When your data sets become so large and diverse
that you have to start innovating around how to
collect, store, process, analyze and share them
49. Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers generating over 1 PB/day
Amazon S3 lets you collect and store all this data
Store exabytes of data in S3
51. The Dark Data Problem
Chart: data volume by year (1990-2020); generated data far outpaces data available for analysis
Most generated data is unavailable for analysis
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
52. The tyranny of “OR”
• Amazon Redshift
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
• Optimized for data warehousing
• Amazon EMR
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
53. You don’t need to choose.
I shouldn’t have to choose
I want “all of the above”
54. Amazon Redshift Spectrum
S3
SQL
Fast @ exabyte scale; elastic & highly available; on-demand, pay-per-query
High concurrency: multiple clusters access the same data
No ETL: query data in place using open file formats
Full Amazon Redshift SQL support
Run SQL queries directly against data in S3 using thousands of nodes
55. Amazon Redshift Spectrum is fast
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against S3 data
Efficient join processing within the Amazon Redshift cluster
56. Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
Each query can leverage 1000s of Amazon Redshift Spectrum nodes
You can reduce the TB scanned and improve query performance by:
• Partitioning data
• Using a columnar file format
• Compressing data
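All three levers are set when the external data is laid out in S3, and the scanned volume can be checked afterwards in the SVL_S3QUERY_SUMMARY system view. A sketch with assumed names (the spectrum schema and sales table are hypothetical):

```sql
-- Partitioned, columnar, compressed layout: Spectrum reads only the
-- partitions and columns a query touches.
CREATE EXTERNAL TABLE spectrum.sales (
  item_id BIGINT,
  price   DECIMAL(12,2)
)
PARTITIONED BY (saledate DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- After a query, inspect how much S3 data it actually scanned:
SELECT query, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY query DESC
LIMIT 5;
```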
57. Amazon Redshift Spectrum is secure
End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in private VPC; Spectrum nodes in private VPC, store no state
Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
Certifications & compliance: PCI-DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
58. Amazon Redshift Spectrum uses standard SQL
Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
Support for complex joins, nested queries & window functions
Support for data partitioned in S3 by any key: date, time, and any other custom keys (e.g., year, month, day, hour)
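Because partition columns appear as ordinary columns, a partition-pruned aggregate reads like any SQL query. A sketch assuming a hypothetical external table s3.clicks partitioned by year/month/day:

```sql
-- The WHERE clause on partition columns prunes the S3 scan
-- down to a single day's partition.
SELECT page, COUNT(*) AS views
FROM s3.clicks
WHERE year = 2017 AND month = 4 AND day = 19
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```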
59. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
60. The Emerging Analytics Architecture
Amazon Athena
Interactive Query
AWS Glue
ETL & Data Catalog
Storage | Serverless Compute | Data Processing
Amazon S3
Exabyte-scale Object Storage
Amazon Kinesis Firehose
Real-Time Data Streaming
Amazon EMR
Managed Hadoop Applications
AWS Lambda
Trigger-based Code Execution
AWS Glue Data Catalog
Hive-compatible Metastore
Amazon Redshift Spectrum
Fast @ Exabyte scale
Amazon Redshift
Petabyte-scale Data Warehousing
62. Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or from Amazon Redshift with CREATE EXTERNAL TABLE syntax
CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY (<column_name> <data_type>, …)]
STORED AS file_format
LOCATION 's3_location'
[TABLE PROPERTIES ('property_name'='property_value', …)];
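A concrete (illustrative) instance of the template above, registering a schema through the Athena data catalog; the database, role ARN, and table names are assumptions:

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.orders (
  order_id BIGINT,
  amount   DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/orders/';

-- Queried exactly like a local table:
SELECT COUNT(*) FROM spectrum.orders;
```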
63. Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default key
Column types
• Numeric: bigint, int, smallint, float, double and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key
Table type
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-DD/..)
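For the partitioned layout, each partition is registered explicitly against its S3 prefix. A sketch assuming a hypothetical spectrum.orders table declared with PARTITIONED BY (saledate DATE):

```sql
-- Register one day's partition; table, column, and path are placeholders.
ALTER TABLE spectrum.orders
ADD PARTITION (saledate='2017-04-19')
LOCATION 's3://mybucket/orders/date=2017-04-19/';
```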
64. Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data:
CREATE TABLE data_converted
STORED AS PARQUET
AS
SELECT col_1, col_2, col_3 FROM data_source;
Or use Spark - 20 lines of Pyspark code, running on Amazon EMR
• 1TB of text data reduced to 130 GB in Parquet format with snappy compression
• Total cost of EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
65. Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
  P.ASIN,
  P.TITLE
FROM
  products P
WHERE
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling'
66. Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
  P.ASIN,
  P.TITLE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  products P
WHERE
  D.ASIN = P.ASIN AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
67. Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
  P.ASIN,
  P.TITLE,
  P.RELEASE_DATE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  asin_attributes A,
  products P
WHERE
  D.ASIN = P.ASIN AND
  P.ASIN = A.ASIN AND
  A.EDITION LIKE '%FIRST%' AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling' AND
  D.ORDER_DAY::DATE >= P.RELEASE_DATE AND
  D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
68. Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
  P.ASIN,
  P.TITLE,
  R.POSTAL_CODE,
  P.RELEASE_DATE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  asin_attributes A,
  products P,
  regions R
WHERE
  D.ASIN = P.ASIN AND
  P.ASIN = A.ASIN AND
  D.REGION_ID = R.REGION_ID AND
  A.EDITION LIKE '%FIRST%' AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling' AND
  R.COUNTRY_CODE = 'US' AND
  R.CITY = 'Seattle' AND
  R.STATE = 'WA' AND
  D.ORDER_DAY::DATE >= P.RELEASE_DATE AND
  D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
69. Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years.
190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster would take over 5 years.*
• Compression ……………..….……..5X
• Columnar file format……….......…10X
• Scanning with 2500 nodes…....2500X
• Static partition elimination…............2X
• Dynamic partition elimination..….350X
• Redshift’s query optimizer……......40X
---------------------------------------------------
Total reduction……….…………3.5B X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
70. Recap: Redshift and Redshift Spectrum
Redshift can support:
• A relational data warehouse
• Data streamed in and processed in real time
• Expansion to exabyte scale
You can mix and match:
• On-premises and cloud
• Custom development and managed services
• Infrastructure with managed scaling and security