1. Amazon Redshift
for Data Analysts
D. Can Abacıgil, CTO, DataRow
Eren Baydemir, CEO, DataRow
www.datarow.com
2. Are you an
Amazon Redshift user?
Have you used
TeamSQL before?
Do you know
what DataRow is?
3. Today’s Overview
Amazon Redshift System Overview
Cluster Management
Importing & Exporting Data
Break
Data Modeling and Table Design
Maintenance
6. Amazon Redshift Performance
Massively Parallel Processing
Fast execution of the most complex queries operating on large amounts of data.
Columnar Data Storage
Drastically reduces the overall disk I/O requirements.
Data Compression
Reduces storage requirements, thereby reducing disk I/O, which improves query performance.
Query Optimizer
Implements significant enhancements and extensions for processing complex analytic queries.
Result Caching
Caches the results of certain types of queries in memory on the leader node.
Compiled Code
The leader node distributes fully optimized compiled code across all of the nodes of a cluster.
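Queries served from the result cache are visible in the SVL_QLOG system view, where the `source_query` column points at the original cached query. A quick check might look like this (a sketch; assumes access to system views):

```sql
-- Queries answered from the result cache have a non-null source_query
SELECT userid, query, elapsed, source_query
FROM svl_qlog
WHERE source_query IS NOT NULL
ORDER BY query DESC
LIMIT 10;
```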
8. Launch an Amazon Redshift Cluster
1. Decide which node type you’ll use
2. Figure out how many nodes to use
3. Complete the additional setup options
4. Configure the networking options
5. Launch the cluster
9. User Management
● Cluster Management Permissions
○ Authentication
■ AWS account root user
■ IAM user
■ IAM role
○ Access Control
Creating an Amazon Redshift cluster, IP addresses, Security Groups, Snapshots, and more.
● Access to Database Permissions
The ability to control a database’s objects, such as tables and views. You must be a superuser to create an Amazon Redshift user.
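Creating a database user and granting it access to a table might look like this (a sketch; the user name, password, and table name are placeholders):

```sql
-- Requires superuser privileges
CREATE USER analyst PASSWORD 'Str0ngPassw0rd';

-- Grant read-only access to a table
GRANT SELECT ON TABLE public.customer TO analyst;
```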
11. Load Data Into Amazon Redshift
● Access Rights and Credentials
To allow an Amazon Redshift cluster to access and manipulate other AWS resources, you need to authenticate it. There are two options available: role-based and key-based access.
● Importing Data
The COPY command loads data into a table from data files or from an Amazon DynamoDB table.
● Sources to Load your Data
The COPY command supports a wide range of sources to load data from.
○ Amazon S3
○ Amazon EMR Cluster
○ Remote Hosts
○ DynamoDB
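A typical role-based COPY from Amazon S3 looks roughly like this (the bucket, table name, and IAM role ARN are placeholders):

```sql
-- Load CSV files from S3, skipping a header row in each file
COPY customer
FROM 's3://example-bucket/customer/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
```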
12. Overview of System Tables and Views
An Amazon Redshift cluster has many system tables and views you can query to
understand how your system behaves.
● STL_LOAD_ERRORS
Displays the records of all Amazon Redshift load errors.
● STL_FILE_SCAN
Returns the files that Amazon Redshift read while loading data via the COPY command.
● STL_S3CLIENT_ERROR
Records errors encountered by a slice while loading a file from Amazon S3.
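For example, the most recent load failures can be inspected with a query along these lines:

```sql
-- Latest COPY failures, with the offending file, line, and reason
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```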
13. Export Data from Amazon Redshift
● What is the UNLOAD command?
UNLOAD writes the result of a query to one or more files on Amazon S3.
● UNLOAD command syntax
Create a sample table and insert a few records into it.
● DataRow UNLOAD Command Wizard
Perform your UNLOAD command in seconds, and easily upload data to a table.
● Reading Data directly from Amazon Redshift
Access your data directly on Amazon Redshift.
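A minimal UNLOAD to S3 might look like this (the bucket, prefix, and role ARN are placeholders):

```sql
-- Write the query result as a single delimited file with a header row
UNLOAD ('SELECT * FROM customer')
TO 's3://example-bucket/exports/customer_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
DELIMITER ','
HEADER
PARALLEL OFF;
```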
16. Table Distribution Styles
● Understanding Redshift Distribution Key
Redshift Distribution Keys (DIST Keys) determine where data is stored in
Redshift.
● Amazon Redshift Distribution Styles
○ All
○ Even
○ Key
● Choosing the right Distribution Styles
As the DISTKEY, choose a column used in your queries that leads to the least skew. A good choice is a column with many distinct values, such as a timestamp.
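The distribution style and key are declared at table creation, e.g. (a sketch; table and column names are illustrative):

```sql
-- Rows with the same customer_id land on the same slice,
-- so joins on customer_id avoid redistribution
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (customer_id);
```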
17. Understanding and Selecting Sort Keys
● Introduction to Redshift Sort Key
Redshift Sort Key determines the order in which rows in a table are stored. Amazon Redshift supports
two kinds of Sort Keys:
○ Compound Sort Keys
○ Interleaved Sort Keys
● Choosing Sorting Keys
Selecting the right kind requires knowledge of your query patterns.
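Both kinds are declared at table creation, e.g. (a sketch with illustrative names):

```sql
-- Compound: fastest when filters use the leading column (event_ts)
CREATE TABLE events_compound (
    event_ts TIMESTAMP,
    user_id  BIGINT,
    payload  VARCHAR(256)
)
COMPOUND SORTKEY (event_ts, user_id);

-- Interleaved: gives each sort column equal weight
CREATE TABLE events_interleaved (
    event_ts TIMESTAMP,
    user_id  BIGINT,
    payload  VARCHAR(256)
)
INTERLEAVED SORTKEY (event_ts, user_id);
```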
18. Column Compression Settings
● How Column Compression Works
It is possible to define a Column Compression Encoding manually or ask Amazon Redshift to select an
Encoding automatically during the execution of a COPY command.
● Compression Encoding
A compression encoding specifies the type of compression that is applied to a column of data values as
rows are added to a table.
● Analyze Compression
Performs a compression analysis on your data and returns suggestions for the compression encoding to
be used.
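Running the analysis on an existing table is a single command (the table name is a placeholder):

```sql
-- Samples the table's data and suggests an encoding per column
ANALYZE COMPRESSION customer;
```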
19. Choosing a Column Compression Type
The following statement creates a CUSTOMER table that has columns with various data types. This CREATE
TABLE statement shows one of many possible combinations of compression encodings for these columns.
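One possible combination of encodings looks like this (a sketch; the column types and encoding choices are illustrative, not prescriptive):

```sql
CREATE TABLE customer (
    custkey    INT          ENCODE delta,
    custname   VARCHAR(30)  ENCODE raw,
    gender     VARCHAR(7)   ENCODE text255,
    address    VARCHAR(200) ENCODE text32k,
    city       VARCHAR(30)  ENCODE text32k,
    state      CHAR(2)      ENCODE raw,
    zipcode    CHAR(5)      ENCODE bytedict,
    start_date DATE         ENCODE delta32k
);
```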
21. Why Vacuum Amazon Redshift?
● Why Vacuum?
Amazon Redshift reclaims deleted space and sorts new data when a VACUUM
query is issued.
● When to run Vacuum?
Run VACUUM based on the amount of space that needs to be reclaimed and
the amount of unsorted data in the table.
● Vacuum types
You can run VACUUM on a single table or on the entire database, either by
running a query or by using DataRow.
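The main variants look like this (the table name is a placeholder):

```sql
VACUUM FULL customer;        -- reclaim space and re-sort rows
VACUUM SORT ONLY customer;   -- re-sort without reclaiming space
VACUUM DELETE ONLY customer; -- reclaim space without sorting
VACUUM;                      -- vacuum every table in the database
```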
22. Why Redshift Analyze?
● Why Analyze?
The ANALYZE operation updates the statistical metadata that the query planner
uses to choose optimal plans.
● When to run Analyze?
The COPY command performs an ANALYZE automatically after it loads data into an empty table.
● How to run Analyze?
The ANALYZE command can be run as a query. Alternatively, and more
easily, you can use DataRow to perform an ANALYZE command.
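As a query (the table name is a placeholder):

```sql
-- Update statistics for one table
ANALYZE customer;

-- Restrict the analysis to columns likely used in predicates
ANALYZE customer PREDICATE COLUMNS;
```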
23. Monitoring Query Performance
Amazon Redshift provides performance metrics and data so that you can track the
health and performance of your clusters and databases.
You can get information about the query:
1. Query ID
2. Run time
3. Start time
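These fields can be pulled from the STL_QUERY system table, e.g. (a sketch):

```sql
SELECT query AS query_id,
       starttime,
       DATEDIFF(ms, starttime, endtime) AS run_time_ms
FROM stl_query
ORDER BY starttime DESC
LIMIT 10;
```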
24. LET’S KEEP IN TOUCH!
https://datarow.com
support@datarow.com
@getdatarow