Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. This session covers best practices and considerations for building a data warehouse and analyzing data with Redshift, along with practical considerations for using Redshift Spectrum, which lets you run complex queries directly against exabyte-scale data in Amazon S3.
Speaker: YoungJoon Jeong, AWS Solutions Architect
2017 AWS DB Day | A Deep Dive into Amazon Redshift
1. A Deep Dive into Amazon Redshift
Building Access Log Analysis System for Rainist
YoungJoon Jeong - AWS Solutions Architect
Time : 16:20 – 17:20
Sunghyun Hwang - CTO of Rainist
2. We start with the customer… and innovate
Customers told us… → We created…
Managing databases is painful & difficult → Amazon RDS
SQL DBs do not perform well at scale → Amazon DynamoDB
Hadoop is difficult to deploy and manage → Amazon EMR
DWs are complex, costly, and slow → Amazon Redshift
Commercial DBs are punitive & expensive → Amazon Aurora
Streaming data is difficult to capture & analyze → Amazon Kinesis
BI tools are expensive and hard to manage → Amazon QuickSight
Enterprise-class, accelerated computing instances → X1, P2, G2, I3 instances*
*https://aws.amazon.com/intel/
3. AWS Big Data Portfolio
Collect: Kinesis, Kinesis Firehose, Database Migration, Import/Export, Direct Connect
Store: S3, Glacier, DynamoDB, RDS, Aurora
Analyze: Redshift, EMR, EC2, Machine Learning, Elasticsearch, Data Pipeline, CloudSearch, Kinesis Analytics, QuickSight, Amazon Athena
5. Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
6. NTT Docomo | Telecom
FINRA | Financial Svcs
Philips | Healthcare
Yelp | Technology
NASDAQ | Financial Svcs
The Weather Company | Media
Nokia | Telecom
Pinterest | Technology
Foursquare | Technology
Coursera | Education
Coinbase | Bitcoin
Amazon | E-Commerce
Etix | Entertainment
Spuul | Entertainment
Vivaki | Ad Tech
Z2 | Gaming
Neustar | Ad Tech
SoundCloud | Technology
BeachMint | E-Commerce
Civis | Technology
Selected Amazon Redshift Customers
7. Redshift is used for mission-critical workloads
Payments to suppliers and billing workflows
Web/Mobile clickstream and event analysis
Recommendation and predictive analytics
Financial and management reporting
8. Amazon Redshift Architecture
Leader Node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute Nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS1/DS2: HDD; scale from 2 TB to 2 PB
Ingestion, backup, and restore via Amazon S3
Client access via JDBC/ODBC
10 GigE (HPC) network between nodes
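The ingestion path in this architecture is typically driven by the COPY command, which the compute nodes run in parallel straight from S3. A minimal sketch; the table name, bucket path, and IAM role ARN are placeholders, not from the deck:

```sql
-- Load pipe-delimited, gzip-compressed files in parallel from S3.
-- Table name, bucket path, and role ARN are hypothetical.
COPY logs
FROM 's3://my-bucket/logs/2016/06/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP
REGION 'us-east-1';
```

Each compute node slice loads a share of the input files, so splitting input into multiple files scales the load across the cluster.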
10. SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2016'
Unsorted table: block [MIN, MAX] date ranges overlap, so most blocks must be read:
[01-JUNE, 20-JUNE], [08-JUNE, 30-JUNE], [12-JUNE, 20-JUNE], [02-JUNE, 25-JUNE]
Sorted by date: block ranges are disjoint, so the zone map skips every block but one:
[01-JUNE, 06-JUNE], [07-JUNE, 12-JUNE], [13-JUNE, 18-JUNE], [19-JUNE, 24-JUNE]
Benefit #1: Amazon Redshift is fast
Sort Keys and Zone Maps
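Declaring the sort key is a one-line table attribute. A minimal sketch, assuming a hypothetical logs table shaped like the one in the example query above:

```sql
-- Rows are stored on disk in log_date order, so each 1 MB block's
-- zone-map [min, max] range stays narrow and disjoint.
CREATE TABLE logs (
  log_id   BIGINT,
  log_date DATE,
  message  VARCHAR(2000)
)
SORTKEY (log_date);

-- Blocks whose [min, max] range excludes the date are skipped entirely:
SELECT COUNT(*) FROM logs WHERE log_date = '2016-06-09';
```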
11. Benefit #1: Amazon Redshift is fast
Parallel and Distributed
Query
Load
Export
Backup
Restore
Resize
12. Benefit #1: Amazon Redshift is fast
Distribution Keys
A table (ID, Name) with rows 1-6 is spread across slices, e.g.:
Slice 1: (1, John Smith), (4, Pat Partridge)
Slice 2: (2, Jane Jones), (5, Sarah Cyan)
Slice 3: (3, Peter Black), (6, Brian Snail)
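The distribution style is likewise declared at table creation. A hedged sketch with made-up tables: a DISTKEY co-locates rows that share a join key, while DISTSTYLE ALL replicates a small dimension table to every node:

```sql
-- Rows with the same customer_id land on the same slice,
-- so joins on customer_id need no network redistribution.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  total       DECIMAL(12,2)
)
DISTKEY (customer_id);

-- Small dimension table copied in full to every node:
CREATE TABLE regions (
  region_id   INT,
  region_name VARCHAR(64)
)
DISTSTYLE ALL;
```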
14. Benefit #1: Amazon Redshift is fast
H/W optimized for I/O intensive workloads, 4GB/sec/node
Enhanced networking, over 1M packets/sec/node
Choice of storage type, instance size
Regular cadence of auto-patched improvements
Example: Our new Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: same as our prior generation!
16. Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD), DS2.XL single node: price per hour / effective annual price per TB compressed
On-Demand: $1.150 / $5,037
1 Year Reservation: $0.670 / $2,910
3 Year Reservation: $0.280 / $1,226

DC1 (SSD), DC1.L single node: price per hour / effective annual price per TB compressed
On-Demand: $0.300 / $15,768
1 Year Reservation: $0.190 / $9,360
3 Year Reservation: $0.110 / $5,782

Pricing is simple
Number of nodes x price/hour
No charge for leader node
No up front costs
Pay as you go
17. Benefit #2: Amazon Redshift lets you start small and grow big
Dense Storage (DS2.XL)
2 TB HDD, 31 GB RAM, 2 slices/4 cores
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Dense Storage (DS2.8XL)
16 TB HDD, 244 GB RAM, 16 slices/36 cores, 10 GigE
Cluster 2-128 Nodes (32 TB – 2 PB)
Note: Nodes not to scale
18. Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups to S3
Continuous and incremental backups across regions
Streaming restore
Backups flow to Amazon S3 in Region 1 and are copied to Amazon S3 in Region 2
Benefit #3: Amazon Redshift is fully managed
19. Benefit #3: Amazon Redshift is fully managed
Cross-region replication: Amazon S3 (Region 1) to Amazon S3 (Region 2)
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/Region level disasters
20. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
Diagram: clients connect over JDBC/ODBC from the customer VPC; leader and compute nodes run in an internal VPC with a 10 GigE (HPC) network; ingestion, backup, and restore go through S3.
21. •Initial release in US East (N. Virginia), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo, Singapore, Sydney) Regions
•MANIFEST option for the COPY & UNLOAD commands
•SQL Functions: Most recent queries
•Resource-level IAM, CRC32
•Data Pipeline
•Event notifications, encryption, key rotation, audit logging, on-premises or AWS CloudHSM; PCI, SOC 1/2/3
•Cross-Region Snapshot Copy
•Audit features, cursor support, 500 concurrent client connections
•EIP Address for VPC Cluster
•New system views to tune table design and track WLM query queues
•Custom ODBC/JDBC drivers; Query Visualization
•Mobile Analytics auto export
•KMS for GovCloud Region; HIPAA BAA
•Interleaved sort keys
•New Dense Storage Nodes (DS2) with better RAM and CPU.
•New Reserved Storage Nodes: No, Partial & All Upfront Options
•Cross-region backups for KMS encrypted clusters
•Scalar UDFs in Python
•AVRO Ingestion; Kinesis Firehose; Database Migration Service (DMS)
•Modify Cluster Dynamically
•Tag-based permissions and BZIP2
•System tables for query tuning
•Dense Compute Nodes
•Gzip & Lzop; JSON , RegEx, Cursors
•EMR Data Loading & Bootstrap Action with COPY command; WLM concurrency limit to 50; support for the ECDH cipher suites for SSL connections; FedRAMP
•Cross-region ingestion
•Free trials & price reductions in Asia Pacific
•CloudWatch Alarm for Disk Usage
•AES 128-bit encryption; UTF-16; KMS Integration
•EU (Frankfurt); GovCloud Regions
•S3 server-side encryption support for UNLOAD
•Tagging Support for Cost-allocation
•WLM Queue-Hopping for timed-out queries
•Append rows & Export to BZIP-2
•Lambda for Clusters in VPC; Data Schema Conversion Support from ML Console
•US West (N. California) Region.
Benefit #5: We innovate quickly
2013 2014 2015 2016
100+ new features added since launch
Release every two weeks
Automatic patching
22. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
Amazon ML
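Scalar UDFs in Python (a feature from the 2015 timeline above) are declared with CREATE FUNCTION; the function name and body here are illustrative, not from the deck:

```sql
-- A minimal scalar Python UDF; runs per-row inside Redshift.
CREATE FUNCTION f_hypotenuse (a FLOAT, b FLOAT)
RETURNS FLOAT
IMMUTABLE
AS $$
  import math
  return math.sqrt(a * a + b * b)
$$ LANGUAGE plpythonu;

-- Usable anywhere a scalar expression is allowed:
SELECT f_hypotenuse(3, 4);  -- 5.0
```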
23. Benefit #7: Amazon Redshift has a large ecosystem
Data Integration | Systems Integrators | Business Intelligence
38. Building Access Log Analysis System
Using AWS Redshift
Purpose of the log analysis system
1. Collecting statistics for financial institutions
2. Analysis and feedback data for designing a better customer experience
3. Detecting abnormal requests
39. Building Access Log Analysis System
Using AWS Redshift
Performance reports for financial institutions
1. Impression/click/application counts for all products, per institution
2. Impression/click/application counts for each institution's popular products
3. Entry counts and average entry amounts for the merchants users visit most
4. Entry counts and average entry amounts for the merchants where users spend the most
40. Challenges in building a log analysis system
Building Access Log Analysis System
Using AWS Redshift
1. Log data that keeps growing as the service grows
42. Challenges in building a log analysis system
Building Access Log Analysis System
Using AWS Redshift
1. Log data that keeps growing as the service grows
2. Data needed for analysis is fragmented (in an MSA environment)
3. A small team (one server engineer and one data engineer)
2016-01-30: 50,000 records
2017-02-14: 3,000,000 records
43. Building Access Log Analysis System
Using AWS Redshift
Why Banksalad adopted Redshift
1. Log data that keeps growing as the service grows
• Performance suited to a data warehouse deployment
2. Data needed for analysis is fragmented (in an MSA environment)
• Easy pipeline construction using AWS services
3. A small team (one server engineer and one data engineer)
• A familiar, comfortable environment for the team (SQL, Postgres interface)
44. Building Access Log Analysis System
Using AWS Redshift
Banksalad log analysis system architecture
Amazon Route53, ELB, Amazon S3, AWS Data Pipeline, Amazon Redshift,
Amazon ECS (Card Domain), Amazon RDS (Aurora), Amazon ECS (Data Analysis),
Amazon SES, Amazon CloudWatch
45. Banksalad log analysis system results
Building Access Log Analysis System
Using AWS Redshift
47. Big Data in real world
When your data sets become so large and diverse
that you have to start innovating around how to
collect, store, process, analyze and share them
49. Generate
Collect & Store
Analyze
Collaborate & Act
Individual AWS customers generating over 1 PB/day
Amazon S3 lets you collect and store all this data
Store exabytes of data in S3
51. The Dark Data Problem
Chart: data volume by year (1990-2020); generated data far outpaces data available for analysis
Most generated data is unavailable for analysis
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
52. The tyranny of “OR”
• Amazon Redshift
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
• Optimized for data warehousing
• Amazon EMR
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
53. You don’t need to choose.
I shouldn’t have to choose
I want “all of the above”
54. Amazon Redshift Spectrum
S3
SQL
Fast @ exabyte scale; elastic & highly available; on-demand, pay-per-query
High concurrency: multiple clusters access the same data
No ETL: query data in place using open file formats
Full Amazon Redshift SQL support
Run SQL queries directly against data in S3 using thousands of nodes
55. Amazon Redshift Spectrum is fast
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against S3 data
Efficient join processing within the Amazon Redshift cluster
56. Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
Each query can leverage 1000s of Amazon Redshift Spectrum nodes
You can reduce the TB scanned and improve query performance by:
• Partitioning data
• Using a columnar file format
• Compressing data
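All three levers are set when the external data is laid out in S3, and the scanned volume can be checked afterwards in the SVL_S3QUERY_SUMMARY system view. A sketch with assumed names (the spectrum schema and sales table are hypothetical):

```sql
-- Partitioned, columnar, compressed layout: Spectrum reads only the
-- partitions and columns a query touches.
CREATE EXTERNAL TABLE spectrum.sales (
  item_id BIGINT,
  price   DECIMAL(12,2)
)
PARTITIONED BY (saledate DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- After a query, inspect how much S3 data it actually scanned:
SELECT query, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY query DESC
LIMIT 5;
```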
57. Amazon Redshift Spectrum is secure
End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in private VPC; Spectrum nodes in private VPC, store no state
Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
Certifications & compliance: PCI-DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
58. Amazon Redshift Spectrum uses standard SQL
Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
Support for complex joins, nested queries & window functions
Support for data partitioned in S3 by any key: date, time, and any other custom keys (e.g., year, month, day, hour)
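Because partition columns appear as ordinary columns, a partition-pruned aggregate reads like any SQL query. A sketch assuming a hypothetical external table s3.clicks partitioned by year/month/day:

```sql
-- The WHERE clause on partition columns prunes the S3 scan
-- down to a single day's partition.
SELECT page, COUNT(*) AS views
FROM s3.clicks
WHERE year = 2017 AND month = 4 AND day = 19
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```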
59. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
60. The Emerging Analytics Architecture
Amazon Athena
Interactive Query
AWS Glue
ETL & Data Catalog
Storage | Serverless Compute | Data Processing
Amazon S3
Exabyte-scale Object Storage
Amazon Kinesis Firehose
Real-Time Data Streaming
Amazon EMR
Managed Hadoop Applications
AWS Lambda
Trigger-based Code Execution
AWS Glue Data Catalog
Hive-compatible Metastore
Amazon Redshift Spectrum
Fast @ Exabyte scale
Amazon Redshift
Petabyte-scale Data Warehousing
62. Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or from Amazon Redshift with CREATE EXTERNAL TABLE syntax
CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY (<column_name> <data_type>, …)]
STORED AS file_format
LOCATION 's3_location'
[TABLE PROPERTIES ('property_name'='property_value', …)];
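A concrete (illustrative) instance of the template above, registering a schema through the Athena data catalog; the database, role ARN, and table names are assumptions:

```sql
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.orders (
  order_id BIGINT,
  amount   DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/orders/';

-- Queried exactly like a local table:
SELECT COUNT(*) FROM spectrum.orders;
```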
63. Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default key
Column types
• Numeric: bigint, int, smallint, float, double and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key
Table type
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-DD/..)
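For the partitioned layout, each partition is registered explicitly against its S3 prefix. A sketch assuming a hypothetical spectrum.orders table declared with PARTITIONED BY (saledate DATE):

```sql
-- Register one day's partition; table, column, and path are placeholders.
ALTER TABLE spectrum.orders
ADD PARTITION (saledate='2017-04-19')
LOCATION 's3://mybucket/orders/date=2017-04-19/';
```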
64. Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data:
CREATE TABLE data_converted
STORED AS PARQUET
AS
SELECT col_1, col_2, col_3 FROM data_source;
Or use Spark - 20 lines of Pyspark code, running on Amazon EMR
• 1TB of text data reduced to 130 GB in Parquet format with snappy compression
• Total cost of EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
65. Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
  P.ASIN,
  P.TITLE
FROM
  products P
WHERE
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling'
66. Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
  P.ASIN,
  P.TITLE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  products P
WHERE
  D.ASIN = P.ASIN AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
67. Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
  P.ASIN,
  P.TITLE,
  P.RELEASE_DATE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  asin_attributes A,
  products P
WHERE
  D.ASIN = P.ASIN AND
  P.ASIN = A.ASIN AND
  A.EDITION LIKE '%FIRST%' AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling' AND
  D.ORDER_DAY::DATE >= P.RELEASE_DATE AND
  D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
68. Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
  P.ASIN,
  P.TITLE,
  R.POSTAL_CODE,
  P.RELEASE_DATE,
  SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
  s3.d_customer_order_item_details D,
  asin_attributes A,
  products P,
  regions R
WHERE
  D.ASIN = P.ASIN AND
  P.ASIN = A.ASIN AND
  D.REGION_ID = R.REGION_ID AND
  A.EDITION LIKE '%FIRST%' AND
  P.TITLE LIKE '%Potter%' AND
  P.AUTHOR = 'J. K. Rowling' AND
  R.COUNTRY_CODE = 'US' AND
  R.CITY = 'Seattle' AND
  R.STATE = 'WA' AND
  D.ORDER_DAY::DATE >= P.RELEASE_DATE AND
  D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
69. Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years.
190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster would take over 5 years.*
• Compression ……………..….……..5X
• Columnar file format……….......…10X
• Scanning with 2500 nodes…....2500X
• Static partition elimination…............2X
• Dynamic partition elimination..….350X
• Redshift’s query optimizer……......40X
---------------------------------------------------
Total reduction……….…………3.5B X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
70. Recap: Redshift and Redshift Spectrum
Redshift can support:
• A relational data warehouse
• Data streamed in and processed in real time
• Expansion to exabyte scale
You can mix and match:
• On-premises and cloud
• Custom development and managed services
• Infrastructure with managed scaling and security