Amazon Redshift and Amazon Redshift Spectrum: Build a Modern Data Warehouse (Level 300)
1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Jia-Ren, Lin
Cloud Support Engineer, Amazon Web Services
Modernize Your Data Warehouse
With Amazon Redshift And Amazon
Redshift Spectrum
2. Why Modernize?
Performance Scalability Cost
3. Amazon Redshift Architecture
Massively parallel, shared-nothing columnar architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL
processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
Amazon Redshift Spectrum nodes
• Execute queries directly against
Amazon Simple Storage Service
(Amazon S3)
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; compute nodes (128 GB RAM, 16 TB disk, 16 cores each) execute in parallel behind it; Amazon Redshift Spectrum nodes 1…N read directly from Amazon S3.]
4. Amazon Redshift Best Practices
5. Data Distribution
• Distribution Styles
• Goals
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
[Diagram: row placement under the KEY, EVEN, and ALL distribution styles across Node 1 (slices 1–2) and Node 2 (slices 3–4).]
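The distribution style is chosen per table in the DDL. A minimal sketch, with illustrative table and column names:

```sql
-- KEY: rows hash on customer_id, so joins on that column are co-located
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      NUMERIC(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- EVEN: round-robin across slices when no single join column dominates
CREATE TABLE web_events (
    event_id BIGINT,
    url      VARCHAR(256)
)
DISTSTYLE EVEN;

-- ALL: a full copy on every node; suited to small dimension tables
CREATE TABLE dim_region (
    region_id   INTEGER,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
```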
6. Table Design
• Materialize often filtered columns from dimension tables
into fact tables
• Materialize often calculated values into tables
• Avoid DIST KEYS on temporal columns
• Keep data types as wide as necessary, but no wider
• VARCHAR, CHAR and NUMERIC
• Add compression to columns
• Optimal compression can be found using ANALYZE COMPRESSION
• Add SORT KEYS on the primary columns that are filtered on
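Put together, a fact-table DDL following these guidelines might look like the sketch below (table, columns, and encodings are illustrative; ANALYZE COMPRESSION should confirm the encoding choices):

```sql
CREATE TABLE fact_orders (
    order_id     BIGINT         ENCODE lzo,
    order_date   DATE           ENCODE delta,  -- filtered on most often
    customer_id  BIGINT         ENCODE lzo,
    status       CHAR(1)        ENCODE zstd,   -- materialized from a dimension table
    order_total  NUMERIC(12, 2) ENCODE zstd    -- pre-calculated value
)
DISTKEY (customer_id)   -- a join column, not a temporal column
SORTKEY (order_date);   -- the primary filter column

-- Let Amazon Redshift recommend encodings from a data sample
ANALYZE COMPRESSION fact_orders;
```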
7. Copy & Unload
• Delimited files are recommended
• Split input into a number of files that is a multiple of the number of slices
• File sizes should be 1 MB – 1 GB after compression
• Use UNLOAD to extract large amounts of data from the
cluster
• Non-parallel UNLOAD only for very small amounts of data
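As a sketch (bucket, prefix, and IAM role ARN are placeholders):

```sql
-- Load gzip-compressed, pipe-delimited files; the prefix should match a set
-- of files split into a multiple of the slice count
COPY fact_orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- Extract in parallel (the default): one output file per slice
UNLOAD ('SELECT * FROM fact_orders WHERE order_date >= ''2018-01-01''')
TO 's3://my-bucket/exports/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;

-- Add PARALLEL OFF only when the result set is very small
```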
8. Extract, Load & Transform (ELT)
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE
Staging Tables
• Use temporary table or permanent table with the “BACKUP NO” option
• If possible use DISTSTYLE KEY on both the staging table and production table to speed
up the INSERT AS SELECT statement
• Turn off automatic compression - COMPUPDATE OFF
• Copy compression settings from production table or use ANALYZE COMPRESSION
statement
• Use CREATE TABLE LIKE or write encodings into the DDL
• For copying a large number of rows (> hundreds of millions) consider using ALTER
TABLE APPEND instead of INSERT AS SELECT
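The staging pattern above, sketched end to end (object names and the IAM role are illustrative):

```sql
BEGIN;  -- wrap the workflow in one explicit transaction

-- Temporary staging table inherits encodings, DISTKEY and SORTKEY
CREATE TEMP TABLE stage_orders (LIKE fact_orders);

COPY stage_orders
FROM 's3://my-bucket/incoming/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP
COMPUPDATE OFF;  -- encodings already copied from the production table

-- Upsert: remove superseded rows, then insert the new versions
DELETE FROM fact_orders
USING stage_orders
WHERE fact_orders.order_id = stage_orders.order_id;

INSERT INTO fact_orders SELECT * FROM stage_orders;

COMMIT;

-- For appends of hundreds of millions of rows, ALTER TABLE APPEND
-- (from a permanent staging table) moves storage blocks instead of copying rows:
-- ALTER TABLE fact_orders APPEND FROM stage_orders_perm;
```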
9. Vacuum & Analyze
• VACUUM should be run as necessary
• Typically nightly or weekly
• Consider “Deep Copy” for large or wide tables
• ANALYZE can be run periodically after ingestion on just predicate
columns
• Utility to VACUUM and ANALYZE all the tables in the cluster:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility
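Run manually, the equivalent maintenance statements look like this (table and column names are illustrative):

```sql
-- Reclaim deleted space and re-sort; typically scheduled nightly or weekly
VACUUM fact_orders;

-- Refresh statistics on just the filter/join columns
ANALYZE fact_orders (order_date, customer_id);

-- Or let Amazon Redshift pick the columns used in predicates
ANALYZE fact_orders PREDICATE COLUMNS;
```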
10. WLM & QMR
• Keep the number of WLM queues to a minimum, typically just 3 queues
to avoid having unused queues
• https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_apex_hourly.sql
• To maximize query throughput use WLM to throttle number of
concurrent queries to 15 or less
• Use QMR rather than WLM to set query timeouts
• Use QMR to log long running queries
• Save the superuser queue for administration tasks and cancelling
queries
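QMR actions are recorded in the system log table STL_WLM_RULE_ACTION; a quick check of recent rule hits might look like this (column list assumed from the system-table documentation):

```sql
-- Most recent queries that triggered a QMR rule
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;
```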
11. Cluster Sizing
Use at least two compute nodes (a multi-node cluster) in production for data mirroring
• The leader node is provided at no additional cost
Amazon Redshift is significantly faster in a VPC compared to EC2 Classic
Maintain at least 20% free space or 3x the size of the largest table
• Scratch space for re-writing tables
• Free space is required for vacuum to resort table
• Temporary tables used for intermediate query results
If you’re using DC1 instances, upgrade to the DC2 instance type
• Same price as DC1, significantly faster
• Reserved Instances do not automatically transfer over
12. Amazon Redshift Spectrum
14. Life Of A Query
[Diagram: a SQL client connects over JDBC/ODBC to the Amazon Redshift leader node; compute nodes fan out to Amazon Redshift Spectrum nodes 1…N, which read from Amazon S3 (exabyte-scale object storage); the Data Catalog provides an Apache Hive-compatible metastore.]
Step 2: The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
15. Life Of A Query
Step 3: The query plan is sent to all compute nodes.
16. Life Of A Query
Step 4: Compute nodes obtain partition information from the Data Catalog and dynamically prune partitions.
17. Life Of A Query
Step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
18. Life Of A Query
Step 6: Amazon Redshift Spectrum nodes scan your S3 data.
19. Life Of A Query
Step 7: Amazon Redshift Spectrum projects, filters, and aggregates.
20. Life Of A Query
Step 8: Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
21. Life Of A Query
Step 9: The result is sent back to the client.
22. Amazon Redshift Spectrum Is Fast
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
23. Amazon Redshift Spectrum Is Cost-effective
• You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
• Each query can leverage 1000s of Amazon Redshift Spectrum nodes
• You can reduce the TB scanned and improve query performance by:
• Partitioning data
• Using a columnar file format
• Compressing data
24. Amazon Redshift Spectrum Uses Standard SQL
• Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for complex joins, nested queries & window functions
• Support for data partitioned in S3 by any key
• Date, Time and any other custom keys
• e.g., Year, Month, Day, Hour
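Defining and querying partitioned S3 data uses standard DDL. A sketch with placeholder names, bucket paths, and IAM role (the local `customers` table is hypothetical):

```sql
-- External schema backed by the Data Catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Partitioned, Parquet-backed table over S3 data
CREATE EXTERNAL TABLE spectrum.clickstream (
    user_id BIGINT,
    url     VARCHAR(256)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/';

ALTER TABLE spectrum.clickstream
ADD PARTITION (event_date = '2018-06-01')
LOCATION 's3://my-bucket/clickstream/2018/06/01/';

-- Join S3 data with a local table; the partition filter limits what is scanned
SELECT c.url, COUNT(*) AS hits
FROM spectrum.clickstream c
JOIN customers cu ON cu.customer_id = c.user_id
WHERE c.event_date = '2018-06-01'
GROUP BY c.url;
```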
25. Amazon Redshift + Spectrum
• Fast Queries: performance at EB scale
• Elastic: elastic and highly available
• Cost Effective: on-demand, pay-per-query
• High Concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Standardized: full Amazon Redshift SQL support
26. Additional Resources
27. AWS Labs On GitHub – Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin Scripts
• Collection of utilities for running diagnostics on your cluster
Admin Views
• Collection of utilities for managing your cluster, generating schema DDL, etc.
Analyze Vacuum Utility
• Utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
Column Encoding Utility
• Utility that will apply optimal column encoding to an established schema with data already loaded
28. Thank You