Amazon Redshift is a fully managed, petabyte-scale data warehouse service. It combines a massively parallel processing (MPP) architecture with columnar storage to speed up analytic queries. Defining sort keys and distribution keys appropriately is crucial, because they determine how data is laid out on disk and how queries are processed in parallel across nodes. Features such as concurrency scaling, resize operations, and automated backups help the warehouse scale and remain available as data and usage grow over time.
3. Agenda
1. What is AWS Redshift?
2. Columnar vs Row-based storage
3. Data compression
4. Redshift as MPP
5. Distkey and Sortkey
6. Scaling
7. Features and Bugs
8. Q&A
5. What is Amazon Redshift?
Fully managed, petabyte-scale data warehouse service in the cloud
Enterprise-class relational database query and management system.
6. Amazon Redshift SQL
Based on PostgreSQL 8.0.2.
Amazon Redshift and PostgreSQL have a number of very important differences.
8. Columnar vs Row data storage
Customer ID | Name | Surname  | Email
1           | Ivan | Sidorov  | IvanSidorov@somemail.com   → Block 1
2           | Juan | Rodrigez | JuanRodrigez@somemail.com  → Block 2
3           | Ian  | Noel     | IanNoel@somemail.com       → Block 3
4           | John | Smith    | JohnSmith@somemail.com     → Block 4
9. Row-oriented
Each disk block stores one complete record:
Customer ID | Name | Surname  | Email
1           | Ivan | Sidorov  | IvanSidorov@somemail.com   → Block 1
2           | Juan | Rodrigez | JuanRodrigez@somemail.com  → Block 2
3           | Ian  | Noel     | IanNoel@somemail.com       → Block 3
4           | John | Smith    | JohnSmith@somemail.com     → Block 4
10. Row-oriented
Data blocks store values sequentially
Inefficient use of disk space
11. Row-oriented
Designed to return a complete record in as few operations as possible
Optimal for OLTP databases
Disadvantage: inefficient use of disk space
12. Columnar
Logical table:
Customer ID | Name | Surname  | Email
1           | Ivan | Sidorov  | IvanSidorov@somemail.com
2           | Juan | Rodrigez | JuanRodrigez@somemail.com
3           | Ian  | Noel     | IanNoel@somemail.com
4           | John | Smith    | JohnSmith@somemail.com
Physical layout, one column per block:
Block 1: 1, 2, 3, 4
Block 2: Ivan, Juan, Ian, John
Block 3: Sidorov, Rodrigez, Noel, Smith
Block 4: IvanSidorov@somemail.com, JuanRodrigez@somemail.com, IanNoel@somemail.com, JohnSmith@somemail.com
13. Columnar
A data block stores values of a single column for multiple rows
Far fewer I/O operations to read the same column field values for the same number of records, compared to row-wise storage
Same type of data in a block ⇒ a type-specific compression scheme can be used
15. Data compression in Redshift
A compression encoding specifies the type of compression that is applied to a column of data values as rows are added to a table
Chosen during the table design stage
17. Default encodings
Columns that are defined as sort keys are assigned RAW compression.
Columns of the BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
All other columns are assigned LZO compression.
19. Encodings
Raw Encoding
Data is stored in raw, uncompressed form.
Byte-Dictionary Encoding
A separate dictionary of unique values is created for each block of column values on disk
Effective when a column contains a limited number (fewer than 256) of unique values
20. Encodings
LZO Encoding
Provides a very high compression ratio with good performance
Works especially well for CHAR and VARCHAR columns that store very long character strings
Mostly Encoding
Useful when the data type of a column is larger than most of the stored values require
21. Encodings
Delta Encoding
Compresses data by recording the difference between values that follow each other in the column
Runlength Encoding
Replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences
DON'T apply it to a SORTKEY column
22. Encodings
Text255 and Text32k Encodings
Useful for compressing VARCHAR columns in which the same words recur often
A separate dictionary of unique words is created for each block of column values on disk
Zstandard Encoding
Provides a high compression ratio with very good performance across diverse data sets
23. Encoding types
Encoding type        | Keyword in CREATE TABLE | Data types
Raw (no compression) | RAW                     | All
Byte dictionary      | BYTEDICT                | All except BOOLEAN
Delta                | DELTA                   | SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
                     | DELTA32K                | INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO                  | LZO                     | All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostly               | MOSTLY8                 | SMALLINT, INT, BIGINT, DECIMAL
                     | MOSTLY16                | INT, BIGINT, DECIMAL
                     | MOSTLY32                | BIGINT, DECIMAL
Run-length           | RUNLENGTH               | All
Text                 | TEXT255                 | VARCHAR only
                     | TEXT32K                 | VARCHAR only
Zstandard            | ZSTD                    | All
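As an illustration, these keywords map onto column-level ENCODE clauses in DDL. A minimal sketch with a hypothetical table (the names and encoding choices are assumptions, not from this deck):

```sql
-- Hypothetical table showing per-column compression encodings.
CREATE TABLE dwh.dim_customer (
    customer_id  BIGINT       ENCODE delta,      -- ids grow in small increments
    country_code CHAR(2)      ENCODE bytedict,   -- fewer than 256 distinct values
    email        VARCHAR(256) ENCODE lzo,        -- long, varied strings
    is_active    BOOLEAN      ENCODE runlength,  -- long runs of repeated values
    signup_ts    TIMESTAMP    ENCODE raw SORTKEY -- sort key columns stay RAW
);
```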
24. What is MPP?
In computing, massively parallel refers to the use of a large number of processors to perform a set of coordinated computations in parallel.
32. Sortkey and Distkey
Applied during the table design stage (initial DDL)
Can be thought of as indexes
Can improve performance dramatically
33. Sortkey and Distkey
Both are specified at the table design stage:
CREATE TABLE dwh.fact_page_views (
    page_type    varchar(32)  encode zstd,
    page_view_ts timestamp    SORTKEY,
    event_id     varchar(36)  encode zstd DISTKEY,
    session_id   varchar(100) encode zstd
    ...
34. Sortkey
Amazon Redshift stores your data on disk in sorted order according to the sort key.
The Amazon Redshift query optimizer uses the sort order when it determines optimal query plans.
35. Best Sortkey
If recent data is queried most frequently, specify the timestamp column as the leading column of the sort key.
Queries will be more efficient because they can skip entire blocks that fall outside the time range.
36. Best Sortkey
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored in each block.
37. Best Sortkey
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join.
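A sketch of this pattern (table and column names are hypothetical): declaring the join column as both SORTKEY and DISTKEY collocates matching rows on the same node in sorted order, which is what makes the merge join possible.

```sql
-- Hypothetical fact/dimension pair joined on customer_id.
CREATE TABLE dwh.fact_orders (
    order_id    BIGINT,
    customer_id BIGINT SORTKEY DISTKEY,
    amount      DECIMAL(12,2)
);

CREATE TABLE dwh.dim_customer (
    customer_id BIGINT SORTKEY DISTKEY,
    name        VARCHAR(100)
);

-- Collocated and pre-sorted on the join key: the optimizer can
-- pick a merge join with no redistribution and no sort phase.
SELECT o.order_id, c.name
FROM dwh.fact_orders o
JOIN dwh.dim_customer c ON o.customer_id = c.customer_id;
```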
38. Main Rule for Sortkey
For Developers:
Define the column that is (or will be) used to filter, and make it the SORTKEY
39. Main Rule for Sortkey
For Data Users:
Find out which column is the SORTKEY and use it in your queries to filter the data
40. The MOST important Rule for Sortkey
For Developers:
Let your data USERS know the SORTKEY of each table
41. Sortkey benefits
1. Queries are more efficient because they can skip entire blocks that fall outside the predicate range, since Redshift keeps track of the minimum and maximum column values stored in each block
2. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join
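For example, with the dwh.fact_page_views table from earlier (page_view_ts is its SORTKEY), a range predicate on the sort key column lets the scan skip every block whose stored min/max values fall outside the range:

```sql
-- Only blocks whose min/max of page_view_ts overlap the requested
-- range are read; all other blocks are skipped.
SELECT count(*)
FROM dwh.fact_page_views
WHERE page_view_ts >= '2019-01-01'
  AND page_view_ts <  '2019-02-01';
```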
49. Compound Sortkey
The default sort key type
More efficient when query predicates use a prefix of the sort key columns, in order
50. Example
...
    ,local_session_ts timestamp   encode lzo
    ,vendor_id        varchar(80) encode text255
    ,is_onsite        boolean     encode runlength
)
SORTKEY (session_type, session_first_ts);
ALTER TABLE dwh.fact_traffic_united OWNER TO etl;
...
51. Interleaved sort key
Gives equal weight to each column in the sort key, so query predicates can use any subset of the sort key columns, in any order
Can use a maximum of eight columns
Tables with an interleaved sort key are not eligible for Concurrency Scaling
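A sketch of the DDL (table and column names are hypothetical):

```sql
-- Predicates on any subset of these four columns benefit equally.
CREATE TABLE dwh.fact_events (
    event_ts    TIMESTAMP,
    country     CHAR(2),
    device_type VARCHAR(16),
    event_name  VARCHAR(64)
)
INTERLEAVED SORTKEY (event_ts, country, device_type, event_name);
```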
53. Vacuum and analyze
VACUUM: reclaims space and re-sorts rows in either a specified table or all tables in the current database
ANALYZE: updates table statistics for use by the query planner
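Both are plain SQL commands. Run against a single table from the earlier example, they look like:

```sql
-- Reclaim space and re-sort rows in one table:
VACUUM FULL dwh.fact_page_views;

-- Refresh the planner's statistics for the same table:
ANALYZE dwh.fact_page_views;
```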
54. Auto Vacuum and Auto Analyze
Auto Vacuum
Since 19 Dec 2018: auto vacuum on DELETE
Routinely scheduled VACUUM DELETE jobs don't need to be modified
All vacuum operations now run only on a portion of a table at a given time
Auto Analyze
Since Jan 2019: analyze runs automatically in the background
An explicit ANALYZE skips tables with up-to-date table statistics
55. Columnar and Sortkey
When columns are sorted appropriately, the query
processor is able to rapidly filter out a large subset of
data blocks.
56. MPP and DISTKEY
Redshift distributes the rows of a table to the compute
nodes so that the data can be processed in parallel.
57. MPP and DISTKEY
The optimizer decides where the data needs to be located for each query
Some rows, or the entire table, may be moved between nodes
Substantial data movement slows overall system performance
Using a DISTKEY minimizes data redistribution
58. Data distribution goals
To distribute the workload uniformly among the nodes
in the cluster.
To minimize data movement during query execution.
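The distribution style is declared per table. A sketch with hypothetical tables, covering three commonly used styles (KEY, ALL, EVEN):

```sql
-- KEY: rows with the same session_id land on the same node slice.
CREATE TABLE dwh.fact_sessions (
    session_id VARCHAR(100) DISTKEY,
    started_at TIMESTAMP
);

-- ALL: a full copy of the table on every node; suits small dimensions.
CREATE TABLE dwh.dim_country (
    country_code CHAR(2),
    country_name VARCHAR(64)
) DISTSTYLE ALL;

-- EVEN: round-robin distribution; suits tables with no clear join key.
CREATE TABLE dwh.stg_raw_events (
    payload VARCHAR(65535)
) DISTSTYLE EVEN;
```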
64. Concurrency Scaling
Automatically adds additional cluster capacity when you need it
WLM queues manage which queries are sent to the Concurrency Scaling cluster
Cluster requirements:
Node types: dc2.8xlarge, ds2.8xlarge, dc2.large, or ds2.xlarge
Node count: more than 1 and fewer than 32
65. Concurrency Scaling
The following types of queries are candidates for Concurrency Scaling:
Read-only SELECT queries
Queries that don't reference tables that use an interleaved sort key
Queries that don't use Redshift Spectrum to reference external tables
Queries that encounter queueing; only queued queries are routed to a Concurrency Scaling cluster
66. Cluster Resize
As your data warehousing capacity and performance needs change or grow, you can resize your cluster by using one of the following approaches:
Elastic resize
Quickly add or remove nodes from a cluster
The cluster is unavailable briefly, usually only a few minutes
Redshift tries to hold connections open, and queries are paused temporarily
Classic resize
Change the node type, the number of nodes, or both
Your cluster is put into a read-only state for the duration of the operation
Snapshot, restore, and resize
Used to keep your cluster available during a classic resize
67. When to use Elastic vs Classic Resize
Question                                            | Elastic Resize | Classic Resize
Scaling for a 'new normal' (not a transient spike)? | No             | Yes
More than doubling/halving the number of nodes?     | No             | Yes
Changing node type?                                 | No             | Yes
68. Bonus
The Omio BI team's Redshift resize experience:
The resize operation
Using the snapshot, restore, and resize operations to resize a cluster
Our difficulties and our own approach
69. Snapshot, restore, and resize experience
1. Pause the Snowplow pipeline
2. Take a snapshot of the production cluster (30 mins)
3. Spin up a new cluster from the snapshot (3 hours)
4. Resize the new cluster (~8 hours)
5. Rename the new cluster's ID to the production ID
6. Grant the IAM roles as they were
7. Resume the Snowplow pipeline
We enable people to find and book tickets for trains, buses and flights in more than 35 countries across Europe. We’re fully operational in 15 of those countries, where it’s possible to book travel to major cities and towns, and even lots of smaller villages.
Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.