© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Jia-Ren, Lin
Cloud Support Engineer, Amazon Web Services
Modernize Your Data Warehouse With Amazon Redshift And Amazon Redshift Spectrum
Why Modernize?
• Performance
• Scalability
• Cost
Amazon Redshift Architecture
Massively parallel, shared nothing columnar architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
Amazon Redshift Spectrum nodes
• Execute queries directly against Amazon Simple Storage Service (Amazon S3)
[Diagram labels: SQL Clients/BI Tools; JDBC/ODBC; Leader Node; three Compute Nodes; node specification 128GB RAM, 16TB disk, 16 cores; Amazon Redshift Spectrum nodes 1 2 3 4 ... N; Amazon S3; operations Load, Unload, Query, Backup, Restore.]
Amazon Redshift Best Practices
Data Distribution
• Distribution Styles: KEY, EVEN, ALL
• Goals
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
[Diagram: each distribution style (KEY, EVEN, ALL) is illustrated on two nodes with two slices each: Node 1 holds Slice 1 and Slice 2, Node 2 holds Slice 3 and Slice 4.]
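The three distribution styles can be illustrated with a small simulation. This is a minimal sketch, not Redshift's actual placement logic: the CRC32 hash, the row layout, and the column names `order_id` and `customer_id` are assumptions made for illustration. The node and slice layout follows the diagram (two nodes, two slices each).

```python
import zlib
from itertools import cycle

# Layout from the slide diagram: two nodes with two slices each.
NODES = {1: [1, 2], 2: [3, 4]}
SLICES = [s for slices in NODES.values() for s in slices]

def distribute_even(rows):
    """EVEN: rows are dealt out round-robin, regardless of their content."""
    placement = {s: [] for s in SLICES}
    for row, s in zip(rows, cycle(SLICES)):
        placement[s].append(row)
    return placement

def distribute_key(rows, key):
    """KEY: rows sharing a key value always land on the same slice."""
    placement = {s: [] for s in SLICES}
    for row in rows:
        # crc32 is a stand-in hash for illustration only.
        idx = zlib.crc32(str(row[key]).encode()) % len(SLICES)
        placement[SLICES[idx]].append(row)
    return placement

def distribute_all(rows):
    """ALL: every node keeps a full copy of the table."""
    return {node: list(rows) for node in NODES}

# Hypothetical sample data: 12 orders spread over 3 customers.
rows = [{"order_id": i, "customer_id": i % 3} for i in range(12)]
even = distribute_even(rows)
by_key = distribute_key(rows, "customer_id")
everywhere = distribute_all(rows)
```

With only three distinct `customer_id` values and four slices, KEY distribution must leave at least one slice empty here. That is the skew the first goal warns about, while the co-location of equal keys is what serves the second goal for joins on that column.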
Table Design
• Materialize frequently filtered columns from dimension tables into fact tables
• Materialize frequently calculated values into tables
• Avoid DIST KEYS on temporal columns
• Keep data types only as wide as necessary
• VARCHAR, CHAR and NUMERIC
• Add compression to columns
• Optimal compression can be found using ANALYZE COMPRESSION
• Add SORT KEYS on the primary columns that are filtered on
Copy & Unload
• Delimited files are recommended
• Split files so that the number of files is a multiple of the number of slices
• File sizes should be 1MB – 1GB after compression
• Use UNLOAD to extract large amounts of data from the cluster
• Use non-parallel UNLOAD only for very small amounts of data
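The file-splitting guidance above can be turned into a small planning helper. This is a sketch under stated assumptions: the function name `plan_file_count` is hypothetical, 1MB and 1GB are taken as binary units, and the helper only reports whether the resulting file size falls inside the recommended range rather than trying to fix it.

```python
import math

MIN_BYTES = 1024**2  # 1MB lower bound from the slide (binary units assumed)
MAX_BYTES = 1024**3  # 1GB upper bound from the slide (binary units assumed)

def plan_file_count(total_compressed_bytes, slice_count):
    """Pick a file count that is a multiple of the slice count.

    Returns (file_count, bytes_per_file, within_recommended_range).
    """
    if total_compressed_bytes <= 0 or slice_count <= 0:
        raise ValueError("sizes and slice count must be positive")
    # Fewest files that keep every file at or below the upper bound.
    min_files = math.ceil(total_compressed_bytes / MAX_BYTES)
    # Round up to the next multiple of the slice count.
    file_count = math.ceil(min_files / slice_count) * slice_count
    bytes_per_file = total_compressed_bytes / file_count
    within_range = MIN_BYTES <= bytes_per_file <= MAX_BYTES
    return file_count, bytes_per_file, within_range

# 10GB of compressed data on a hypothetical 4-slice cluster.
files, per_file, ok = plan_file_count(10 * 1024**3, 4)
```

For the example input the helper settles on 12 files, the smallest multiple of 4 that keeps each file under 1GB. For very small inputs the third return value is False, which suggests that splitting across every slice is not worthwhile.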
Extract, Load & Transform (ELT)
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE
Staging Tables
• Use a temporary table or a permanent table with the “BACKUP NO” option
• If possible, use DISTSTYLE KEY on both the staging table and the production table to speed up the INSERT AS SELECT statement
• Turn off automatic compression - COMPUPDATE OFF
• Copy compression settings from the production table or use the ANALYZE COMPRESSION statement
• Use CREATE TABLE LIKE or write encodings into the DDL
• For copying a large number of rows (> hundreds of millions), consider using ALTER TABLE APPEND instead of INSERT AS SELECT
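The staging-table workflow above might be assembled as follows. This is a sketch only: the helper `build_elt_statements` and all table names, the S3 prefix, and the IAM role ARN are hypothetical placeholders, and the statements are built as strings without being run against a cluster.

```python
def build_elt_statements(prod_table, staging_table, s3_prefix, iam_role):
    """Return the SQL statements for one transaction-wrapped staging load."""
    return [
        # Explicit transaction around the whole workflow.
        "BEGIN;",
        # Temporary staging table; LIKE copies the production table's definition.
        f"CREATE TEMP TABLE {staging_table} (LIKE {prod_table});",
        # Load the staging table with automatic compression analysis turned off.
        f"COPY {staging_table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' COMPUPDATE OFF;",
        # Move the staged rows into the production table.
        f"INSERT INTO {prod_table} SELECT * FROM {staging_table};",
        # Drop the staging table rather than deleting its rows.
        f"DROP TABLE {staging_table};",
        "COMMIT;",
    ]

# All identifiers below are made-up examples.
statements = build_elt_statements(
    prod_table="sales",
    staging_table="sales_staging",
    s3_prefix="s3://example-bucket/sales/2018-06-01/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

TRUNCATE is left out of this sketch on purpose, since it commits implicitly in Amazon Redshift and would end the surrounding transaction early.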
Vacuum & Analyze
• VACUUM should be run as necessary
• Typically nightly or weekly
• Consider “Deep Copy” for larger or wide tables
• ANALYZE can be run periodically after ingestion on just the predicate columns
• Utility to VACUUM and ANALYZE all the tables in the cluster:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility
WLM & QMR
• Keep the number of WLM queues to a minimum, typically just 3 queues, to avoid having unused queues
• https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_apex_hourly.sql
• To maximize query throughput, use WLM to throttle the number of concurrent queries to 15 or fewer
• Use QMR rather than WLM to set query timeouts
• Use QMR to log long-running queries
• Save the superuser queue for administration tasks and cancelling queries
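The queue-count and concurrency guidance above can be checked mechanically. In this sketch the mapping of queue name to concurrency slots is a hypothetical structure chosen for illustration; it is not the format of the actual WLM configuration, and the function name `check_wlm` is made up.

```python
MAX_QUEUES = 3              # "typically just 3 queues" from the slide
MAX_TOTAL_CONCURRENCY = 15  # "15 or fewer" concurrent queries from the slide

def check_wlm(queues):
    """Return warnings for a {queue_name: concurrency_slots} mapping."""
    warnings = []
    if len(queues) > MAX_QUEUES:
        warnings.append(
            f"{len(queues)} queues defined; consider at most {MAX_QUEUES}"
        )
    total = sum(queues.values())
    if total > MAX_TOTAL_CONCURRENCY:
        warnings.append(
            f"total concurrency {total} exceeds {MAX_TOTAL_CONCURRENCY}"
        )
    return warnings

# Hypothetical queue layouts.
good = check_wlm({"etl": 3, "reporting": 7, "adhoc": 5})
bad = check_wlm({"etl": 10, "reporting": 10, "adhoc": 5, "misc": 5})
```

The first layout sums to exactly 15 slots across three queues and produces no warnings; the second trips both checks.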
Cluster Sizing
Use at least two compute nodes (multi-node cluster) in production for data mirroring
• The leader node is provided at no additional cost
Amazon Redshift is significantly faster in a VPC compared to EC2 Classic
Maintain at least 20% free space or 3x the size of the largest table
• Scratch space for re-writing tables
• Free space is required for VACUUM to re-sort the table
• Temporary tables used for intermediate query results
If you’re using DC1 instances, upgrade to the DC2 instance type
• Same price as DC1, significantly faster
• Reserved Instances do not automatically transfer over
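The free-space rule above can be expressed as a quick headroom check. This sketch reads "20% free space or 3x the size of the largest table" conservatively and requires both conditions, which is an assumption, not something the slide states. The function name and the sample sizes are hypothetical.

```python
def has_enough_free_space(capacity_gb, used_gb, largest_table_gb):
    """Check both headroom rules using integer arithmetic.

    Assumption: both rules must hold (the stricter reading of the slide).
    """
    free_gb = capacity_gb - used_gb
    # At least 20% of total capacity is free (free / capacity >= 1/5).
    meets_20_percent = free_gb * 5 >= capacity_gb
    # Free space is at least three times the largest table.
    meets_3x_largest = free_gb >= 3 * largest_table_gb
    return meets_20_percent and meets_3x_largest

# Hypothetical 32,000GB cluster in three states.
healthy = has_enough_free_space(32000, 24000, 1000)
too_full = has_enough_free_space(32000, 28000, 1000)
big_table = has_enough_free_space(32000, 24000, 3000)
```

In the third case 25% of the cluster is free, yet a 3,000GB table would need 9,000GB of scratch space, so the check fails even though the percentage rule passes.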
Amazon Redshift Spectrum
Life Of A Query
[Diagram labels, repeated on every step: Amazon Redshift; JDBC/ODBC; Amazon Redshift Spectrum nodes 1 2 3 4 ... N; Amazon S3 (exabyte-scale object storage); Data Catalog (Apache Hive compatible metastore).]
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data
7. Amazon Redshift Spectrum projects, filters, and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to the client
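Partition pruning, the step where compute nodes use partition info from the Data Catalog, can be sketched as a simple filter over partition metadata. The metadata layout, the bucket name, and the function `prune_partitions` are hypothetical; this illustrates the idea and does not describe how Redshift implements it.

```python
def prune_partitions(partitions, predicate):
    """Keep only partitions whose key values satisfy the query predicate."""
    return [p for p in partitions if predicate(p["values"])]

# Hypothetical catalog entries: one partition per month of 2018.
catalog = [
    {
        "values": {"year": 2018, "month": m},
        "location": f"s3://example-bucket/events/year=2018/month={m:02d}/",
    }
    for m in range(1, 13)
]

# Equivalent of a filter such as: WHERE year = 2018 AND month >= 11
to_scan = prune_partitions(
    catalog, lambda v: v["year"] == 2018 and v["month"] >= 11
)
```

Only two of the twelve partitions survive the filter, so only their S3 prefixes would need to be scanned in the later steps.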
Amazon Redshift Spectrum Is Fast
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
Amazon Redshift Spectrum Is Cost-effective
• You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
• Each query can leverage 1000s of Amazon Redshift Spectrum nodes
• You can reduce the TB scanned and improve query performance by:
• Partitioning data
• Using a columnar file format
• Compressing data
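The effect of scanning less data on cost is simple arithmetic. This sketch uses the $5 per TB figure from the slide and assumes 1TB means 2^40 bytes; the table size and the fractions scanned are made-up illustrative numbers, and actual billing rules (rounding, minimums, current pricing) are not modeled.

```python
PRICE_PER_TB_USD = 5.0  # figure quoted on the slide
BYTES_PER_TB = 1024**4  # assumption: binary terabyte

def spectrum_scan_cost(bytes_scanned):
    """Estimated Spectrum charge for one query, excluding cluster cost."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

# Hypothetical 10TB table scanned in full.
full_scan = spectrum_scan_cost(10 * BYTES_PER_TB)

# Same table, assuming partition pruning leaves 1/4 of the data and a
# columnar format lets the query read 1/8 of what remains.
reduced_scan = spectrum_scan_cost(10 * BYTES_PER_TB * 0.25 * 0.125)
```

Under these assumptions the estimate drops from $50.00 to about $1.56 per query, which is why the three techniques above matter for both cost and performance.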
Amazon Redshift Spectrum Uses Standard SQL
• Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for complex joins, nested queries & window functions
• Support for data partitioned in S3 by any key
• Date, Time and any other custom keys
• e.g., Year, Month, Day, Hour
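A Year/Month/Day/Hour layout in S3 is commonly written with Hive-style `key=value` path segments. The sketch below builds such a prefix from a timestamp; the bucket name, the base path, and the function `partition_prefix` are hypothetical, and the key names are just one possible choice.

```python
from datetime import datetime

def partition_prefix(base, ts):
    """Build a Hive-style S3 prefix partitioned by year, month, day, hour."""
    return (
        f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )

# Hypothetical event timestamp and bucket.
prefix = partition_prefix(
    "s3://example-bucket/events", datetime(2018, 6, 1, 13)
)
# prefix == "s3://example-bucket/events/year=2018/month=06/day=01/hour=13/"
```

Zero-padded values keep prefixes sorted lexicographically in date order, which makes range listings straightforward.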
Amazon Redshift + Spectrum
• Fast Queries: Performance at EB scale
• Elastic: Elastic and highly available
• Cost Effective: On-demand, pay-per-query
• High Concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Standardized: Full Amazon Redshift SQL support
Additional Resources
AWS Labs On GitHub – Amazon Redshift
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin Scripts
• Collection of utilities for running diagnostics on your cluster
Admin Views
• Collection of utilities for managing your cluster, generating schema DDL, etc.
Analyze Vacuum Utility
• Utility that can be scheduled to vacuum and analyze the tables within your Amazon Redshift cluster
Column Encoding Utility
• Utility that will apply optimal column encoding to an established schema with data already loaded
Thank You

Amazon Redshift and Amazon Redshift Spectrum Help You Build a Modern Data Warehouse (Level 300)
