Amazon Redshift and Amazon Redshift Spectrum: Build a Modern Data Warehouse (Level 300)
1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Jia-Ren, Lin
Cloud Support Engineer, Amazon Web Services
Modernize Your Data Warehouse
With Amazon Redshift And Amazon
Redshift Spectrum
2. Why Modernize?
Performance Scalability Cost
3. Amazon Redshift Architecture
Massively parallel, shared-nothing columnar architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL
processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
Amazon Redshift Spectrum nodes
• Execute queries directly against
Amazon Simple Storage Service
(Amazon S3)
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; compute nodes (128 GB RAM, 16 TB disk, 16 cores each) execute in parallel behind it; Amazon Redshift Spectrum nodes 1…N read directly from Amazon S3.]
4. Amazon Redshift Best Practices
5. Data Distribution
• Distribution Styles
• Goals
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
[Diagram: row placement under the KEY, EVEN, and ALL distribution styles across Node 1 (slices 1–2) and Node 2 (slices 3–4).]
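The distribution style is chosen per table in the DDL. A minimal sketch, with illustrative table and column names:

```sql
-- KEY: rows hash on customer_id, so joins on that column are co-located
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      NUMERIC(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- EVEN: round-robin across slices when no single join column dominates
CREATE TABLE web_events (
    event_id BIGINT,
    url      VARCHAR(256)
)
DISTSTYLE EVEN;

-- ALL: a full copy on every node; suited to small dimension tables
CREATE TABLE dim_region (
    region_id   INTEGER,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
```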
6. Table Design
• Materialize often filtered columns from dimension tables
into fact tables
• Materialize often calculated values into tables
• Avoid DIST KEYS on temporal columns
• Keep data types as wide as necessary, but no wider
• VARCHAR, CHAR and NUMERIC
• Add compression to columns
• Optimal compression can be found using ANALYZE COMPRESSION
• Add SORT KEYS on the primary columns that are filtered on
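Put together, a fact-table DDL following these guidelines might look like the sketch below (table, columns, and encodings are illustrative; ANALYZE COMPRESSION should confirm the encoding choices):

```sql
CREATE TABLE fact_orders (
    order_id     BIGINT         ENCODE lzo,
    order_date   DATE           ENCODE delta,  -- filtered on most often
    customer_id  BIGINT         ENCODE lzo,
    status       CHAR(1)        ENCODE zstd,   -- materialized from a dimension table
    order_total  NUMERIC(12, 2) ENCODE zstd    -- pre-calculated value
)
DISTKEY (customer_id)   -- a join column, not a temporal column
SORTKEY (order_date);   -- the primary filter column

-- Let Amazon Redshift recommend encodings from a data sample
ANALYZE COMPRESSION fact_orders;
```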
7. Copy & Unload
• Delimited files are recommended
• Split input into a number of files that is a multiple of the number of slices
• File sizes should be 1 MB – 1 GB after compression
• Use UNLOAD to extract large amounts of data from the
cluster
• Non-parallel UNLOAD only for very small amounts of data
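As a sketch (bucket, prefix, and IAM role ARN are placeholders):

```sql
-- Load gzip-compressed, pipe-delimited files; the prefix should match a set
-- of files split into a multiple of the slice count
COPY fact_orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- Extract in parallel (the default): one output file per slice
UNLOAD ('SELECT * FROM fact_orders WHERE order_date >= ''2018-01-01''')
TO 's3://my-bucket/exports/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;

-- Add PARALLEL OFF only when the result set is very small
```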
8. Extract, Load & Transform (ELT)
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE
Staging Tables
• Use temporary table or permanent table with the “BACKUP NO” option
• If possible use DISTSTYLE KEY on both the staging table and production table to speed
up the INSERT AS SELECT statement
• Turn off automatic compression - COMPUPDATE OFF
• Copy compression settings from production table or use ANALYZE COMPRESSION
statement
• Use CREATE TABLE LIKE or write encodings into the DDL
• For copying a large number of rows (> hundreds of millions) consider using ALTER
TABLE APPEND instead of INSERT AS SELECT
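The staging pattern above, sketched end to end (object names and the IAM role are illustrative):

```sql
BEGIN;  -- wrap the workflow in one explicit transaction

-- Temporary staging table inherits encodings, DISTKEY and SORTKEY
CREATE TEMP TABLE stage_orders (LIKE fact_orders);

COPY stage_orders
FROM 's3://my-bucket/incoming/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP
COMPUPDATE OFF;  -- encodings already copied from the production table

-- Upsert: remove superseded rows, then insert the new versions
DELETE FROM fact_orders
USING stage_orders
WHERE fact_orders.order_id = stage_orders.order_id;

INSERT INTO fact_orders SELECT * FROM stage_orders;

COMMIT;

-- For appends of hundreds of millions of rows, ALTER TABLE APPEND
-- (from a permanent staging table) moves storage blocks instead of copying rows:
-- ALTER TABLE fact_orders APPEND FROM stage_orders_perm;
```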
9. Vacuum & Analyze
• VACUUM should be run as necessary
• Typically nightly or weekly
• Consider “Deep Copy” for large or wide tables
• ANALYZE can be run periodically after ingestion on just predicate
columns
• Utility to VACUUM and ANALYZE all the tables in the cluster:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility
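Run manually, the equivalent maintenance statements look like this (table and column names are illustrative):

```sql
-- Reclaim deleted space and re-sort; typically scheduled nightly or weekly
VACUUM fact_orders;

-- Refresh statistics on just the filter/join columns
ANALYZE fact_orders (order_date, customer_id);

-- Or let Amazon Redshift pick the columns used in predicates
ANALYZE fact_orders PREDICATE COLUMNS;
```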
10. WLM & QMR
• Keep the number of WLM queues to a minimum, typically just 3 queues
to avoid having unused queues
• https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_apex_hourly.sql
• To maximize query throughput use WLM to throttle number of
concurrent queries to 15 or less
• Use QMR rather than WLM to set query timeouts
• Use QMR to log long running queries
• Save the superuser queue for administration tasks and cancelling
queries
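QMR actions are recorded in the system log table STL_WLM_RULE_ACTION; a quick check of recent rule hits might look like this (column list assumed from the system-table documentation):

```sql
-- Most recent queries that triggered a QMR rule
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;
```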
11. Cluster Sizing
Use at least two compute nodes (a multi-node cluster) in production for data mirroring
• The leader node is provided at no additional cost
Amazon Redshift is significantly faster in a VPC compared to EC2 Classic
Maintain at least 20% free space or 3x the size of the largest table
• Scratch space for re-writing tables
• Free space is required for vacuum to resort table
• Temporary tables used for intermediate query results
If you’re using DC1 instances, upgrade to the DC2 instance type
• Same price as DC1, significantly faster
• Reserved Instances do not automatically transfer over
12. Amazon Redshift Spectrum
14. Life Of A Query
[Diagram: a SQL client connects over JDBC/ODBC to the Amazon Redshift leader node; compute nodes fan out to Amazon Redshift Spectrum nodes 1…N, which read from Amazon S3 (exabyte-scale object storage); the Data Catalog provides an Apache Hive-compatible metastore.]
Step 2: The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
15. Life Of A Query
Step 3: The query plan is sent to all compute nodes.
16. Life Of A Query
Step 4: Compute nodes obtain partition information from the Data Catalog and dynamically prune partitions.
17. Life Of A Query
Step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
18. Life Of A Query
Step 6: Amazon Redshift Spectrum nodes scan your S3 data.
19. Life Of A Query
Step 7: Amazon Redshift Spectrum projects, filters, and aggregates.
20. Life Of A Query
Step 8: Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
21. Life Of A Query
Step 9: The result is sent back to the client.
22. Amazon Redshift Spectrum Is Fast
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
23. Amazon Redshift Spectrum Is Cost-effective
• You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
• Each query can leverage 1000s of Amazon Redshift Spectrum nodes
• You can reduce the TB scanned and improve query performance by:
• Partitioning data
• Using a columnar file format
• Compressing data
24. Amazon Redshift Spectrum Uses Standard SQL
• Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for complex joins, nested queries & window functions
• Support for data partitioned in S3 by any key
• Date, Time and any other custom keys
• e.g., Year, Month, Day, Hour
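Defining and querying partitioned S3 data uses standard DDL. A sketch with placeholder names, bucket paths, and IAM role (the local `customers` table is hypothetical):

```sql
-- External schema backed by the Data Catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Partitioned, Parquet-backed table over S3 data
CREATE EXTERNAL TABLE spectrum.clickstream (
    user_id BIGINT,
    url     VARCHAR(256)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/';

ALTER TABLE spectrum.clickstream
ADD PARTITION (event_date = '2018-06-01')
LOCATION 's3://my-bucket/clickstream/2018/06/01/';

-- Join S3 data with a local table; the partition filter limits what is scanned
SELECT c.url, COUNT(*) AS hits
FROM spectrum.clickstream c
JOIN customers cu ON cu.customer_id = c.user_id
WHERE c.event_date = '2018-06-01'
GROUP BY c.url;
```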
25. Amazon Redshift + Spectrum
• Fast Queries: performance at EB scale
• Elastic: elastic and highly available
• Cost Effective: on-demand, pay-per-query
• High Concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Standardized: full Amazon Redshift SQL support
26. Additional Resources
27. AWS Labs On GitHub – Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin Scripts
• Collection of utilities for running diagnostics on your cluster
Admin Views
• Collection of utilities for managing your cluster, generating schema DDL, etc.
Analyze Vacuum Utility
• Utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
Column Encoding Utility
• Utility that will apply optimal column encoding to an established schema with data already loaded
28. Thank You