© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Typical questions:
• What is my revenue growth month by month?
• How is my marketing campaign working?
• Which age group had the most insurance claims?
• What is the crime rate by city?
Organic revenue growth: data grows >10x every 5 years, so data that lives for 15 years grows roughly 1,000x in scale, and it becomes more valuable as it accumulates.
How do I provide democratized access to data to enable informed decisions, while at the same time enforcing data governance and preventing mismanagement of the data?
The tension: democratization of data (open tools such as Hadoop, Elasticsearch, Presto, and Spark) versus governance & control.
A broken view of your business and your customers.
I WANT SUPPORT FOR…
• Any scale and concurrency, with low cost and high throughput & performance
• Data from new sources: streaming, batch, real-time
• Increasingly diverse types of data
• Democratization of data: usage by many people of various skill levels; make it easy to run & operate
• Choice of tools, techniques, and applications
Data Lakes Provide Customers with what they want…
• Single source of truth in a single store (data lake)
• Flexibility to grow to any scale, with low costs
• Choice to analyze data in a variety of ways
• Avoid lock-in; store data in open formats
• Democratize analytics with security & governance
The AWS data platform (with AWS Marketplace solution counts):
• Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (key-value, document), ElastiCache (Redis, Memcached), Neptune (graph), Timestream (time series), QLDB (ledger database), RDS on VMware; 730+ database solutions
• Analytics: Redshift (data warehousing), EMR (Hadoop + Spark), Athena (interactive analytics), Kinesis Analytics (real-time), Elasticsearch Service (operational analytics); 600+ analytics solutions
• Data lake: S3/Glacier, Glue (ETL & data catalog), Lake Formation (data lakes); 20+ data lake solutions
• Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect; 30+ solutions
• Business intelligence & machine learning: QuickSight, SageMaker, Comprehend, Rekognition, Lex, Transcribe, DeepLens; 250+ solutions
• Blockchain: Managed Blockchain, Blockchain Templates; 25+ blockchain solutions
#awsanalytics / #awsbuilders
Amazon Redshift: the most popular and fastest cloud data warehouse, with more than 15K customers, processing more than 2 exabytes of data.
[Chart: queries per hour (higher is better) for Redshift vs. Vendor 1 and Vendor 2, based on the cloud DW benchmark derived from the TPC-DS 30 TB dataset on a 4-node cluster.]
Fastest and most cost-effective: up to 75% less than the #2 cloud DW with on-demand pricing, and 75% less with Reserved Instances (RIs). Based on IDC's "ROI of Amazon Redshift" paper (2017): $758,845 average annual benefits per 100 TB, $319,300 higher revenue per 100 TB per year, and a 469% ROI.
Fastest, most cost-effective, and integrates with your data lake.
Analyst recognition:
• Forrester Wave™ Big Data Warehouse, Q4 2018: AWS rated top in the leader bracket and received a score of 5/5 (the highest score possible) in a number of areas, such as Use Cases, Roadmap, Market Awareness, and Ability to Execute
• Gartner Magic Quadrant for Data Management Solutions for Analytics, 2018: AWS positioned as a Leader
Amazon Redshift: a data warehouse that extends to, and integrates seamlessly with, the data lake.
• Fully managed
• Massively parallel OLAP architecture that scales to query GBs to EBs of data
• Automatic scaling
• Secure
• Highly rated and most popular
Five Key Highlights
• Amazon Redshift has a service SLA of 99.9%
• Amazon Redshift mirrors data onto a second node
• Amazon Redshift automatically detects and recovers from a disk or node failure
• Amazon Redshift automatically backs up your data
• Amazon Redshift can automatically replicate your backups to another AWS region (e.g. for a DR site)
[Architecture diagram: SQL clients/BI tools connect via JDBC/ODBC to the leader node, which coordinates the compute nodes (each e.g. 128 GB RAM, 16 TB disk, 16 cores, with slices 1, 2, 3, 4 … N) over a 10 GigE (HPC) network. Load, unload, backup, and parallel restore run against Amazon Simple Storage Service (S3); Amazon Redshift Spectrum queries data directly in S3.]
Compute node slices:
• A compute node is partitioned into either 2 or 16 slices; a slice can be thought of as a "virtual compute node"
• Each slice is allocated a portion of the compute node's memory and disk space, where it processes a portion of the workload assigned to the compute node by the leader node
• The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices
• Slices are Redshift's symmetric multiprocessing (SMP) mechanism; they work in parallel to complete operations
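You can verify the slice layout of your own cluster from SQL. A minimal sketch using the STV_SLICES system view, which maps slices to nodes:

-- Count the slices on each compute node
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;

On a 16-slice node type, each node should report 16 slices; skew in this count would indicate a configuration problem.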
A Redshift cluster can have up to 128 ds2.8xlarge nodes (2 petabytes of local storage) and can support exabytes of data with its Redshift Spectrum feature.
Note: AWS reserves the right to change instance types at any time. For example, DC1 is a deprecated dense-compute instance type that should not be used; instead, upgrade from DC1 to DC2 for the same price with better performance.
Instance Family | Instance Type | Disk Type | Capacity | Memory | # CPUs | # Slices | $
Dense-Compute   | DC2 large     | NVMe SSD  | 160 GB   | 16 GB  | 2      | 2        | $
Dense-Compute   | DC2 8xlarge   | NVMe SSD  | 2.56 TB  | 244 GB | 32     | 16       | $$
Dense-Storage   | DS2 xlarge    | Magnetic  | 2 TB     | 32 GB  | 4      | 2        | $
Dense-Storage   | DS2 8xlarge   | Magnetic  | 16 TB    | 244 GB | 36     | 16       | $$

• Dense-Compute (DC2) nodes: solid-state disks
• Dense-Storage (DS2) nodes: magnetic disks
• The key difference between instance types is the compute/storage ratio and storage latency (SSD vs. magnetic storage)

Redshift instance types are named according to their corresponding Amazon EC2 instance types; for more information, visit https://aws.amazon.com/ec2/instance-types/
How a query over S3 data executes (SQL clients/BI tools connect to the leader node via JDBC/ODBC; compute nodes communicate over 10 GigE (HPC); the Redshift Spectrum fleet sits between the cluster and Amazon S3, exabyte-scale object storage, with a Data Catalog / Apache Hive Metastore holding table metadata):

SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …

1. The query is optimized and compiled using ML at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
2. The query plan is sent to all compute nodes
3. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
4. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
5. Amazon Redshift Spectrum nodes scan your S3 data
6. Amazon Redshift Spectrum projects, filters, joins and aggregates
7. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
8. The result is sent back to the client
• Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for complex joins, nested queries and window functions
• Support for data partitioned in S3 by any key: date, time, and any other custom keys (e.g. year, month, day, hour)
• Leverages the AWS Glue Data Catalog or an Amazon EMR Hive Metastore
• No data loading required; reads different file formats, compressed files, and encrypted files; ANSI SQL
See https://docs.amazonaws.cn/en_us/redshift/latest/dg/c-using-spectrum.html
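Registering S3 data is done with CREATE EXTERNAL SCHEMA and CREATE EXTERNAL TABLE. A minimal sketch assuming the Glue Data Catalog; the schema name spectrum, database spectrumdb, IAM role ARN, S3 path, and column definitions are hypothetical:

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.sales (
  sale_id BIGINT,
  sale_dt DATE,
  amount  DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://mybucket/data/sales/';

-- Query it like any other table, including joins with local cluster tables
SELECT COUNT(*) FROM spectrum.sales WHERE sale_dt >= '2019-01-01';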
• Amazon Athena, Redshift, & EMR have some shared analytical & data lake use cases, but they each address different needs & scenarios
• Amazon Redshift provides the fastest query performance for enterprise reporting & business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries. Redshift also supports querying an S3 data lake & joins between S3 data and local cluster data
• Amazon EMR makes it simple & cost-effective to run Hadoop, Spark, & Presto. EMR is flexible: you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize for your analytic requirements
• Amazon Athena is a standalone service that provides the easiest way to run data exploration and discovery queries, as well as analytical queries on data lakes, geospatial data, and service logs, without the need to set up or manage any servers

When is Redshift strongly recommended over Athena?
• Latency has to be sub-second; Redshift employs multiple caches & an optimized query planner
• Data and workloads require a data warehouse
• Data is highly relational (e.g. normalized data that would be difficult or otherwise disadvantageous for the use case to de-normalize)
• Data has a transactional nature to it (e.g. data gets updated)
• Workloads involve many complex joins
• Workloads involve joins between data warehouse data & an S3 data lake: use Redshift (Redshift Spectrum)
• Redshift is a fully ACID- and ANSI SQL-compliant data warehouse
• Use cases relying on indexes can alternatively achieve fast query performance through parallelism and efficient data storage & I/O
• Table distribution styles, data compression, and sort keys significantly impact parallelism and the efficiency of data storage and I/O
• Redshift creates one database by default, but other databases can be created (note: having multiple databases could lead to one DB monopolizing the cluster's resources)
• Databases are autonomous units in Redshift, i.e. queries can join tables within a single database only
Redshift: Popular Data Models
Redshift can be used with a number of data models, including star, snowflake, and highly denormalized models.
• Row storage (e.g. MySQL): all row fields are stored together on disk (typically in a sequential file)
• Accessing a column (example: scanning the SSN of all residents) with row storage:
  • Scan the entire table
  • Resulting in unnecessary I/O and caching overhead
• Column storage (e.g. Amazon Redshift): each table column is stored separately on disk (typically in a separate file or set of files)
• Accessing a column (example: scanning the SSN of all residents) with columnar storage:
  • Only scan blocks for the relevant column(s)
  • Significantly less I/O
Given the following table definition and data for the deep_dive table, how will a simple SQL query behave in a row-based data store, and then in a column-based store?

CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);

SELECT min(dt) FROM deep_dive;

Row-based storage behavior:
• Need to read everything
• Excessive & unnecessary I/O

Column-based storage behavior:
• Only scan blocks for the relevant column
• Significantly less I/O
• Redshift is a columnar database, which means data on disk is physically organized by column
• Column data is stored in 1 MB immutable blocks; a full block can contain as little as one value or as many as millions of values
• Each slice stores a set of blocks that contain a range of the values for each column
• Column stores compress very nicely: each value in a single column is the same data type, and a column is likely to have repeating values
• Redshift can typically achieve 3x-4x data compression ratios
• Compression reduces storage requirements, but also improves performance by reducing I/O
• Columns grow and shrink independently in Redshift
Note: in Redshift jargon, "column encoding" refers to compression.
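Encodings are declared per column with the ENCODE keyword. A minimal sketch on the deep_dive table from earlier; the specific codecs are illustrative choices, not recommendations:

CREATE TABLE deep_dive (
  aid INT     ENCODE zstd,     -- audience_id
  loc CHAR(3) ENCODE bytedict, -- low-cardinality column; dictionary encoding fits
  dt  DATE    ENCODE raw       -- sort key column left uncompressed (see the tips that follow)
)
SORTKEY (dt);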
• Redshift supports a number of compression algorithms (e.g. LZO, ZSTD, RUNLENGTH, etc.)
• Compression algorithms can achieve different compression ratios for different data types
• Use PG_TABLE_DEF to view/verify the current encoding applied to each column in a table:

SELECT * FROM pg_table_def
WHERE schemaname = 'myschema' AND tablename = 'mytable';
• Columnar compression is automatically and intelligently applied by the COPY command to empty tables
• Redshift's ANALYZE COMPRESSION command will analyze an existing table and recommend the best compression settings
• Compress everything except sort key columns
• In some cases, RAW (no compression) is the best compression option (e.g. sparse columns or relatively small tables: ~10k rows)
• Redshift's Column Encoding Utility automates the use of the ANALYZE COMPRESSION command with a data migration to change compression in place

Note: beware cases where you've tested COPY with a small number of rows before doing a full load; COPY will not re-evaluate encodings on non-empty tables.

ANALYZE COMPRESSION
[ [ table_name ]
[ ( column_name [, ...] ) ] ]
[COMPROWS numrows]

COPY's COMPUPDATE option controls automatic compression:
• COMPUPDATE PRESET: column compression is set based on the column's data type; no data is sampled
• COMPUPDATE [ON]: the best column compression is determined & set by applying different compression codecs to a sample set of column data
• COMPUPDATE OFF: skips any compression analysis
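For example, a first load into an empty table can let COPY pick encodings, and ANALYZE COMPRESSION can review an already-loaded table. A sketch; the S3 path and IAM role are hypothetical:

COPY deep_dive
FROM 's3://mybucket/deep_dive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole'
COMPUPDATE PRESET;  -- encodings chosen from column data types, no sampling

-- Report recommended encodings for an existing table (samples its data)
ANALYZE COMPRESSION deep_dive;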
• Redshift is a distributed database with a single leader node and one or more compute nodes, where data is stored on compute nodes or on Amazon S3
• Distribution style: a table property which dictates how that table's data is distributed on internal storage
• Distribution goals:
  • Distribute data evenly for parallel processing
  • Ensure each node has the same amount of data
  • Minimize data movement during query processing

Data distribution tips:
• A sub-optimal data distribution can lead to data skew and poor query performance; if unsure which distribution style to choose for a table, let Redshift pick for you (AUTO)
• Redshift's Column Encoding Utility can be used to change a table's distribution style
Four distribution styles to choose from in Redshift (see the sketch after this list):
• KEY: a column value is hashed, and the same hash value is placed on the same slice
• EVEN: data is evenly distributed across all slices using a round-robin distribution
• ALL: full table data is placed on each compute node's first slice
• AUTO: default option; Redshift starts the table with ALL, but switches the table to EVEN when the table grows larger

Data distribution tips:
• Consider using the ALL distribution style for all infrequently-modified small tables (~3 million rows or less)
• Distribution keys should have high cardinality to avoid data skew and "hot" nodes
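Distribution style and key are declared at table creation. A minimal sketch of the classic fact/dimension pattern; all table and column names are hypothetical:

CREATE TABLE sales_fact (
  customer_id BIGINT,
  sale_dt     DATE,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);  -- co-locates rows that join on customer_id

CREATE TABLE customer_dim (
  customer_id BIGINT,
  name        VARCHAR(100)
)
DISTSTYLE ALL;          -- small, infrequently modified dimension table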
• Good distribution keys are frequently joined to other tables (e.g. a fact table joined with a dimension table)
• High cardinality: a high count of distinct values relative to the overall row count

select count(distinct <my_column>) unique_values,
count(9) total_rows from <my_table>;

• Low skew: each unique value in the column appears roughly the same number of times as every other value
• Use a date column only if cardinality is high enough, and queries don't typically filter on a very narrow date period (to avoid workload skew among the node slices)

Data distribution tips: use the query above to check the cardinality of your key column; an even distribution of values is better, and a skew check follows below.
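A sketch of a per-value frequency check to complement the cardinality query above; the table and column names are hypothetical:

-- A long tail of similar counts indicates low skew; a few dominant
-- values indicate a poor distribution key
SELECT customer_id, COUNT(*) AS rows_per_value
FROM sales_fact
GROUP BY customer_id
ORDER BY rows_per_value DESC
LIMIT 20;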
• Zone maps are minimum and maximum values for each 1 MB block of data
• Zone maps are stored in memory and automatically generated
• Zone maps allow Redshift to effectively prune blocks that cannot contain data needed for a given query, minimizing unnecessary I/O
• Along with sort keys, zone maps play a crucial role in enabling range-restricted scans to prune blocks and reduce I/O

[Diagram: Redshift stores data in 1 MB blocks; a zone map records the MIN and MAX values per block for each column, e.g. sales_dt and price.]
Redshift Sorting
• Redshift uses sort keys to physically order data on disk
• In combination with zone maps, sort keys enable range-restricted scans to prune blocks and reduce I/O
• Sort keys combined with zone maps function like an index for a given set of columns
• Sort keys benefit MERGE JOIN performance with a much faster sort
• Redshift supports two types of sort keys: compound sort keys (the default) and interleaved sort keys
Sort keys can be added to a table by specifying the SORTKEY table property on one or more columns.
• Optimal sort key:
  • Should consist of the columns most commonly found in WHERE clause filter predicates
  • Extremely common for the sort key to be a date
• Compound sort key tips:
  • Column order matters: there is no skip scanning
  • Order columns from lowest cardinality to highest, if possible (columns added to a sort key after a high-cardinality column are not effective)
  • Define four or fewer sort key columns; more will result in marginal gains and increased ingestion overhead
• If your table is frequently joined, then include the DISTKEY in the sort key as the first column (see the sketch after this list)
• A column that is CAST() to be joined or filtered will not be used as a sort key (e.g. casting DATE to TIMESTAMPTZ); modify the underlying data & then set this value as the sort key
• Sort keys are less beneficial on small tables
• With an established workload, the Redshift GitHub has scripts to help you find sort key suggestions
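A sketch combining the join and filter tips above, assuming deep_dive is frequently joined on aid and filtered on dt:

CREATE TABLE deep_dive (
  aid INT,
  loc CHAR(3),
  dt  DATE
)
DISTKEY (aid)
COMPOUND SORTKEY (aid, dt);  -- DISTKEY first because the table is frequently joined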
SELECT count(*)
FROM deep_dive
WHERE dt = '06-09-2017';

Sorted by date (block zone maps):
MIN: 01-JUNE-2017, MAX: 06-JUNE-2017
MIN: 07-JUNE-2017, MAX: 12-JUNE-2017
MIN: 13-JUNE-2017, MAX: 21-JUNE-2017
MIN: 21-JUNE-2017, MAX: 30-JUNE-2017

Unsorted table (block zone maps):
MIN: 01-JUNE-2017, MAX: 20-JUNE-2017
MIN: 08-JUNE-2017, MAX: 30-JUNE-2017
MIN: 12-JUNE-2017, MAX: 20-JUNE-2017
MIN: 02-JUNE-2017, MAX: 25-JUNE-2017

Zone maps and sort keys can serve as a significant optimization by reducing the number of blocks examined (and therefore I/O) during query execution: in the sorted table only one block can contain the filtered date, while in the unsorted table nearly every block might.
Redshift Temporary Tables
• Redshift supports the TEMPORARY keyword on CREATE TABLE and CREATE TABLE AS, and through the #<NAME> marker on SELECT:
SELECT ... INTO #MY_TEMP_TABLE FROM ...
• Temporary table characteristics:
  • Stored like all other Redshift tables, but only live for the lifetime of the session (dropped on session termination)
  • Default to no columnar compression & EVEN distribution; this is often the worst possible configuration for table storage
  • Do not have statistics by default
Tip: define temporary tables as you would a permanent table, with columnar compression and an appropriate distribution style, to increase performance.
• Capabilities:
  • Temp tables can be used exactly as permanent tables would in ETL jobs or analytics
  • Temp tables can participate in complex, multi-statement transactions
  • Temp tables exhibit faster I/O (they are not mirrored to other nodes)
  • You can COPY and UNLOAD temporary tables
  • SELECT INTO # does not provide the ability to set DISTSTYLE or column encoding
• Best practices:
  • Avoid the use of SELECT INTO # (use explicit CREATE TEMPORARY TABLE (AS) statements instead)
  • Include column encoding settings on the CREATE command
  • Include distribution keys or a distribution style when creating temp tables
  • Compute statistics when creating large temp tables as part of an ETL process
• Create a temporary table that is LIKE another table so that it inherits the parent table's column definitions, distribution style and sort keys:

create temp table temp_tbl (like parent_tbl);
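A CTAS sketch that follows these best practices instead of SELECT INTO #; the table names and filter are hypothetical:

CREATE TEMPORARY TABLE stage_deep_dive
DISTKEY (aid)
SORTKEY (dt)
AS
SELECT aid, loc, dt
FROM deep_dive
WHERE dt >= '2017-06-01';

-- Compute statistics for a large temp table used later in the ETL job
ANALYZE stage_deep_dive;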
The "SET DW" checklist:
• S (Sort keys): ensure sort keys exist to facilitate filters in the WHERE clause
• E (Encoding/compression): reduced I/O improves query performance
• T (Table maintenance: vacuum, analyze): current table statistics increase sort key effectiveness, and table defragmentation reduces wasted storage while improving query performance
• D (Data distribution): ensure distribution keys exist to facilitate the most common joins
• W (Workload management): machine learning algorithms profile queries to place them in the appropriate queue with the appropriate resources
Redshift/Data Lake Interactions
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
• A table is a table, right? Nope: with data lakes, tables are collections of files
• Data lake file types have a huge influence on the performance of Redshift Spectrum queries
• Best practices:
  • The number of files in the data lake should be a multiple of your Redshift slice count (general best practice)
  • Redshift Spectrum can automatically split Parquet, ORC, text-format, and Bz2 files for optimal processing
  • File sizes should be in the range 64 MB – 512 MB
  • Files should be of uniform size (especially files that can't be automatically split by Redshift Spectrum, such as Avro and Gzip) to avoid execution skew
Redshift/Data Lake Interactions
Redshift Spectrum is a feature of Redshift that enables queries to reference external tables.
• Understanding query mechanics can maximize the work done by Redshift Spectrum
• Do as much as possible in Redshift Spectrum before bringing data back to your cluster
• Data lake best practices for Redshift Spectrum:
  • Use data lake file formats that are optimized for reads by Redshift Spectrum (and Athena!)
  • ORC and Parquet apply columnar encoding, similar to how data is stored inside Redshift
  • Redshift Spectrum can also work with Avro, CSV and JSON data, but these files are *much* larger on S3 than ORC/Parquet
Redshift/Data Lake Interactions
Open file formats such as Parquet and ORC are optimal for Redshift/data lake interactions because of their columnar structure.
• Partitions should be based on:
  • Frequently filtered columns (either through a join or the WHERE clause)
  • Business groups (user cohorts, application names, business units, etc.)
  • Date & time
• Consider how your users query data:
  • Do they look month by month, or current month and year vs. the previous year for the same month, etc.?
  • Do they understand the columns you have created?
• Date-based partition columns have a type:
  • Full dates included in a single value may be formatted or not (yyyy-mm-dd or yyyymmdd)
  • Formatted dates can only be strings
  • Either type of date needs to consider ordering (date=dd-mm-yyyy cannot be used in an ORDER BY clause, but date=yyyy-mm-dd can!)
S3 layout for external tables:
• Keep data with a similar security model in the same prefix: s3://mybucket/data
• Application or business unit prefixes can be helpful: s3://mybucket/data/marketing
• Each table resides in its own prefix: s3://mybucket/data/marketing/impressions
• Add high-level business unit partitions:
  s3://mybucket/data/marketing/impressions/application=flux_capacitor
  s3://mybucket/data/marketing/impressions/application=cold_fusion
• Add dates:
  s3://mybucket/data/marketing/impressions/application=flux_capacitor/date=20180122
  s3://mybucket/data/marketing/impressions/application=cold_fusion/date=20180123
  or
  s3://mybucket/data/marketing/impressions/application=flux_capacitor/yyyy=2018/mm=01/dd=22
Redshift Spectrum extends the same MPP principle used by Redshift clusters to query external data, using multiple Redshift Spectrum instances as needed to scan files. Place the files in a separate folder for each table.
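A sketch of a partitioned external table over the layout above, reusing the hypothetical spectrum schema from the earlier sketch; the column definitions are assumptions, and the partition column is named event_date rather than date to sidestep reserved-word concerns (the ADD PARTITION location mapping is explicit, so the S3 prefix can still say date=):

CREATE EXTERNAL TABLE spectrum.impressions (
  user_id    BIGINT,
  event_time TIMESTAMP
)
PARTITIONED BY (application VARCHAR(64), event_date CHAR(8))
STORED AS PARQUET
LOCATION 's3://mybucket/data/marketing/impressions/';

-- Register one partition per S3 prefix
ALTER TABLE spectrum.impressions
ADD PARTITION (application = 'flux_capacitor', event_date = '20180122')
LOCATION 's3://mybucket/data/marketing/impressions/application=flux_capacitor/date=20180122/';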
Redshift Workload Management
Workload management (WLM) is a feature that helps manage workloads and avoid short, fast-running queries getting stuck in queues behind long-running queries. The amount of memory available to a query is a function of the WLM queue where it runs, the percentage of memory assigned to that queue, and the number of query slots consumed by the query.

Three WLM methods that are complementary to each other:
• Queues (basic WLM): WLM always assigns every query executed in Redshift to a specific queue on the basis of user group, query group, or WLM rules (e.g. [return_row_count > 1000000])
• Short-Query Acceleration (SQA): Redshift uses machine learning to determine what constitutes a "short" running query in your cluster; "short" running queries are then automatically identified & run immediately in the short-query queue if queuing occurs
• Concurrency Scaling: Redshift uses machine learning to predict queuing in your cluster, and when queuing occurs, transient Amazon Redshift clusters are added to your cluster and queries are routed to them for execution
Default WLM setup:
• One default WLM queue: concurrency level of five (enables up to 5 queries to run concurrently) and no timeout
• Auto WLM enabled (automatic query concurrency & memory allocation)
• One superuser queue: concurrency level of one and no timeout
• SQA enabled (enabled/disabled via a checkbox in the Redshift console)
• Concurrency scaling disabled (enabled/disabled via the Redshift console)

Customizing WLM:
• Customize WLM queues via a few clicks on the Redshift console
• Up to 8 custom queues are allowed in a Redshift cluster
• WLM queues have four main "levers": concurrency level (aka "query slots"), memory allocation (%), targets (i.e. user groups, query groups, or query monitoring rules), and timeout (ms)

WLM queue setup via the Redshift console:
1. Click on Parameter Groups in the navigation pane and choose Create Cluster Parameter Group
2. Click the Add Queue button to add a new WLM queue
3. Associate the parameter group with your cluster
Auto WLM
• Automatic workload management ("Auto WLM") lets Amazon Redshift automatically manage query concurrency and memory allocation
• Auto WLM can create up to eight queues, each with a priority
• Auto WLM automatically determines the amount of resources that queries need, and adjusts the concurrency based on the workload:
  • Concurrency is set lower when queries requiring large amounts of resources are in the system (e.g. hash joins between large tables)
  • Concurrency is set higher when lighter queries (e.g. inserts, deletes, scans, or simple aggregations) are submitted
• Auto WLM & SQA work together to allow short-running and lightweight queries to complete even while long-running, resource-intensive queries are active
• Auto WLM is enabled by default when the default parameter group is used, and must be explicitly enabled when a custom parameter group is used. It can be enabled in a custom parameter group through the Amazon Redshift console by choosing Switch WLM mode and then choosing Auto WLM; with this choice, one queue is used to manage queries, and the memory and concurrency on main fields are both set to auto. When Auto WLM is not enabled, manual WLM requires you to specify values for query concurrency and memory allocation
Redshift Query Priorities
• WLM queues can be defined with a specific priority (relative importance) & queries inherit their queue's priority
• There are six possible query priorities: CRITICAL (superusers only), HIGHEST, HIGH, NORMAL (the default), LOW, and LOWEST
• Administrators can use priorities to prioritize different workloads (e.g. ETL, ingestion, audit, BI, etc.)
• Amazon Redshift uses priority when letting queries into the system and to determine the amount of resources allocated to a query
• Predictable performance for a high-priority workload comes at the cost of other, lower-priority workloads; lower-priority queries are not starved, but might run longer because they wait behind more important queries or run with fewer resources
• You can enable concurrency scaling to maintain predictable performance for lower-priority workloads
• Auto WLM automatically creates and assigns queues corresponding to priorities
• The CRITICAL priority is higher than HIGHEST and is available to superusers. To set it, use the functions CHANGE_QUERY_PRIORITY, CHANGE_SESSION_PRIORITY, and CHANGE_USER_PRIORITY (see the sketch below). Only one CRITICAL query can run at a time
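A sketch of the priority functions; the query ID 12345 is hypothetical:

-- Raise a specific running query to CRITICAL (superusers only)
SELECT CHANGE_QUERY_PRIORITY(12345, 'critical');

-- Lower the priority of your own session
SELECT CHANGE_SESSION_PRIORITY(PG_BACKEND_PID(), 'lowest');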
Redshift QMR
• Query monitoring rules (QMR) are intended to help automatically handle runaway (poorly written) local or Spectrum queries
• QMR can be defined for a WLM queue via the Redshift console (max 25 rules across all WLM queues), and each rule can take one of four actions for offending queries:
  • LOG: log info about the query in the STL_WLM_RULE_ACTION table
  • ABORT: log the action and terminate the query
  • HOP: log the action and move the query to another appropriate queue if one exists, otherwise terminate it
  • PRIORITY: change the query priority (only available with Auto WLM)
• Each query monitoring rule includes up to three conditions, or predicates, and one action, similar to an if-then statement: if {predicate(s)} then {action}. A predicate consists of a metric, an operator, and a value (e.g. rows_scanned > 1000000); if all of the predicates for any rule are met, that rule's action is triggered
• Common QMR use cases:
  • Guard against wasteful resource utilization, runaway costs, etc.
  • Log resource-intensive queries
  • "That user": every DB has that user who loves to execute queries with unnecessarily expensive behavior (e.g. a Cartesian product)
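Queries whose rules fire are recorded in STL_WLM_RULE_ACTION; a minimal sketch for reviewing recent actions:

SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 50;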
Concurrency Scaling
• Concurrency Scaling is a Redshift feature that automatically adds transient clusters to your cluster within seconds to handle concurrent requests with consistently fast performance
• Free for over 97% of Redshift customers: for every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling; beyond that, customers are billed on a per-second basis per transient cluster
• Applies to both Redshift local & Spectrum queries
• Email notifications are issued when concurrency scaling occurs
Resizing a cluster is easily achieved with a few clicks on the Redshift console, and there are two resizing approaches to choose from.

Elastic Resize
• The existing cluster is modified to add or remove nodes
• During the actual resize, existing connections to the Redshift cluster are put on hold, no new connections are accepted until the resize finishes, and the cluster is unavailable for querying
• Typically completes within ~15 minutes or less

Classic Resize
• The Redshift cluster can be reconfigured to a different node count and instance type
• Involves streaming all data from the original Redshift cluster to a newly created Redshift cluster with the new configuration; during the resize, the original Redshift cluster is in read-only mode, and the customer is only charged for one cluster
• Depending on data size, may take several hours to complete
• Redshift provides a Postgres-compatible driver endpoint
• Two driver options for connecting to Redshift:
  • JDBC/ODBC Postgres driver
  • Proprietary Redshift driver: 35% faster than the Postgres driver, with support for IAM SSO
• Like other Postgres clients, you connect to Redshift as a database user, using a hostname, port, and database name (viewable on the Redshift console)

Examples (jdbc:[redshift|postgresql]://endpoint:port/databaseName):
• jdbc:redshift://demo.dsi9zn4ccku4.us-east-1.redshift.amazonaws.com:8192/pocdb
• jdbc:postgresql://demo.dsi9zn4ccku4.us-east-1.redshift.amazonaws.com:8192/pocdb
• Query Editor is a web-based query interface for running single-SQL-statement queries on an Amazon Redshift cluster directly from the AWS Management Console, without having to install & set up an external JDBC/ODBC client
• Query results are viewable in the console & downloadable to a CSV file
• Queries can be saved for convenient repeat execution
• Query execution steps & times can be viewed to isolate bottlenecks & optimize queries
• Other considerations:
  • Max 50 Query Editor users at the same time per cluster
  • Query Editor is applicable for short queries (runtime < 10 min)
  • Query result sets are paginated with 100 rows per page
  • Transactions & Enhanced VPC Routing are not supported
  • Access to Query Editor requires specific IAM permissions
• By default, Amazon Redshift clusters are locked down so nobody has access
• To grant other users inbound access, you must associate the Redshift cluster with a security group
• Use security groups to authorize other VPC security groups, or CIDR blocks, to connect:
  • VPC security groups should be used for AWS service & EC2 connectivity, or cross-account access (recommended approach)
  • CIDR blocks should be used for connections from on-prem/the other side of a customer gateway
• Having separate cluster security groups per application or cluster is a good practice
Redshift Security
• Schemas: collections of database tables and other database objects (similar to namespaces). In Amazon Redshift, schemas are similar to operating system directories, except that schemas cannot be nested
• Users: named user accounts that can connect to a database. Users can be granted access to a single schema or to multiple schemas
• Groups: collections of users that can be collectively assigned privileges for easier security maintenance
Redshift Security
Create a view to conceal rows or columns that a user or group of users is not authorized to access:

CREATE VIEW secure_view AS
SELECT col1, col3 FROM underlying_table;

GRANT SELECT ON secure_view TO GROUP restricted_group;

REVOKE ALL ON underlying_table FROM GROUP restricted_group;
Security has always been priority one at AWS, & Amazon Redshift is no exception:
• End-to-end data encryption
• IAM integration & integration with SAML IdPs for federation (SSO)
• Amazon VPC for network isolation
• Database security model (users, groups, privileges)
• Audit logging and notifications
• Certifications that include SOC 1/2/3, PCI-DSS, FedRAMP, & HIPAA
Redshift Security
SSL encryption can be used with client connections to Amazon Redshift; setting the require_ssl parameter to true in the cluster's parameter group requires SSL for all connections.
Redshift Security
Redshift clusters can be configured to encrypt data at rest through a simple checkbox in the Redshift console.
• Redshift can encrypt data at rest (data stored locally or in S3 backups) using the AES algorithm with a 256-bit key
• Key management can be performed by Redshift, AWS KMS, or your HSM
• You control rotation of encryption keys via the API
• Redshift blocks of data backed up to S3 are encrypted using the cluster's encryption key
• Redshift uses hardware-based crypto modules to keep the performance impact to ~20% or less
• Redshift clusters that need to comply with PCI, SOX, or HIPAA must be configured with encryption enabled
Audit Logs
• Stored in three log files: the connection log, the user log, and the user activity log
• Must be explicitly enabled
• Stored indefinitely unless S3 lifecycle rules are in place to archive or delete files automatically
• Cluster restarts don't affect audit logs in S3
• Access to the log files does not require access to the Redshift database
• S3 charges apply

System (STL) Tables
• Stored in multiple tables, including SVL_STATEMENTTEXT and STL_CONNECTION_LOG
• Automatically available on every node in the data warehouse cluster
• Log history is stored for two to five days, depending on log usage and available disk space
• Access to STL tables requires access to the Amazon Redshift database (see the sketch below)
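For example, recent statements can be pulled from SVL_STATEMENTTEXT; a minimal sketch:

-- SQL statements run in the last day (requires access to the database)
SELECT userid, starttime, TRIM(text) AS sql_text
FROM svl_statementtext
WHERE starttime > DATEADD(day, -1, GETDATE())
ORDER BY starttime DESC;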
CloudTrail Logs
• Stored indefinitely in S3, unless S3 lifecycle rules are in place to archive or delete files automatically
• CloudTrail captures the last 90 days of management events by default without charge (available using the CloudTrail APIs or via the console)
• Maintaining a longer history of events is possible, but additional delivery charges may apply, including S3 charges
• Access to the log files does not require access to the Redshift database
[Architecture diagram: web app data, on-premises data, streaming data, and other databases feed into Amazon S3; AWS Glue crawlers and the AWS Glue Data Catalog sit alongside Amazon RDS, Amazon Athena, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and Amazon Redshift.]
1. Crawlers scan your data sets and populate the Glue Data Catalog
2. The Glue Data Catalog serves as a central metadata repository
3. Once catalogued in Glue, your data is immediately available for analytics
[Diagram: a marketing data source and other source systems land in S3, go through an ETL process back to S3, and are queried via Redshift Spectrum and Athena.]
aws.amazon.com/
Ricardo Serafim
Analytics Specialist Solutions Architect
rserafim@amazon.com
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.Amazon Web Services LATAM
 
Automatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAutomatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAmazon Web Services LATAM
 
Automatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAutomatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAmazon Web Services LATAM
 
Ransomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSRansomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSAmazon Web Services LATAM
 
Ransomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSRansomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSAmazon Web Services LATAM
 
Aprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAmazon Web Services LATAM
 
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAmazon Web Services LATAM
 
Cómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosCómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosAmazon Web Services LATAM
 
Os benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSOs benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSAmazon Web Services LATAM
 

Mehr von Amazon Web Services LATAM (20)

AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
 
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvemAWS para terceiro setor - Sessão 1 - Introdução à nuvem
AWS para terceiro setor - Sessão 1 - Introdução à nuvem
 
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e BackupAWS para terceiro setor - Sessão 2 - Armazenamento e Backup
AWS para terceiro setor - Sessão 2 - Armazenamento e Backup
 
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
AWS para terceiro setor - Sessão 3 - Protegendo seus dados.
 
Automatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWSAutomatice el proceso de entrega con CI/CD en AWS
Automatice el proceso de entrega con CI/CD en AWS
 
Automatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWSAutomatize seu processo de entrega de software com CI/CD na AWS
Automatize seu processo de entrega de software com CI/CD na AWS
 
Cómo empezar con Amazon EKS
Cómo empezar con Amazon EKSCómo empezar con Amazon EKS
Cómo empezar con Amazon EKS
 
Como começar com Amazon EKS
Como começar com Amazon EKSComo começar com Amazon EKS
Como começar com Amazon EKS
 
Ransomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWSRansomware: como recuperar os seus dados na nuvem AWS
Ransomware: como recuperar os seus dados na nuvem AWS
 
Ransomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWSRansomware: cómo recuperar sus datos en la nube de AWS
Ransomware: cómo recuperar sus datos en la nube de AWS
 
Ransomware: Estratégias de Mitigação
Ransomware: Estratégias de MitigaçãoRansomware: Estratégias de Mitigação
Ransomware: Estratégias de Mitigação
 
Ransomware: Estratégias de Mitigación
Ransomware: Estratégias de MitigaciónRansomware: Estratégias de Mitigación
Ransomware: Estratégias de Mitigación
 
Aprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWSAprenda a migrar y transferir datos al usar la nube de AWS
Aprenda a migrar y transferir datos al usar la nube de AWS
 
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWSAprenda como migrar e transferir dados ao utilizar a nuvem da AWS
Aprenda como migrar e transferir dados ao utilizar a nuvem da AWS
 
Cómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administradosCómo mover a un almacenamiento de archivos administrados
Cómo mover a un almacenamiento de archivos administrados
 
Simplifique su BI con AWS
Simplifique su BI con AWSSimplifique su BI con AWS
Simplifique su BI con AWS
 
Simplifique o seu BI com a AWS
Simplifique o seu BI com a AWSSimplifique o seu BI com a AWS
Simplifique o seu BI com a AWS
 
Os benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWSOs benefícios de migrar seus workloads de Big Data para a AWS
Os benefícios de migrar seus workloads de Big Data para a AWS
 

Kürzlich hochgeladen

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Immersion Day - Como simplificar o acesso ao seu ambiente analítico

  • 13. Fastest, most cost-effective: up to 75% less than the #2 cloud DW with on-demand pricing, and 75% less with Reserved Instances (RIs); $758,845 average annual benefits per TB per year; $319,300 higher revenue per 100 TB per year; 469% ROI (*based on IDC's "ROI of Amazon Redshift" paper, 2017)
  • 14. Fastest; most cost-effective; integrates with your data lake
  • 15. Forrester Wave™ Big Data Warehouse, Q4 2018: AWS rated top in the leader bracket and received a score of 5/5 (the highest score possible) in a number of areas, such as Use Cases, Roadmap, Market Awareness, and Ability to Execute. AWS positioned as a Leader in the 2018 Gartner Magic Quadrant for Data Management Solutions for Analytics. Amazon Redshift: a data warehouse that extends to, and integrates seamlessly with, the data lake
    • Fully managed
    • Massively parallel OLAP architecture that scales to query GBs to EBs of data
    • Automatic scaling
    • Secure
    • Highly rated and most popular
  • 16. Five Key Highlights:
    • Amazon Redshift has a service SLA of 99.9%
    • Amazon Redshift mirrors data onto a second node
    • Amazon Redshift automatically detects and recovers from a disk or node failure
    • Amazon Redshift automatically backs up your data
    • Amazon Redshift can automatically replicate your backups to another AWS region (e.g. a DR site)
  • 17. (Architecture diagram: SQL clients/BI tools connect via JDBC/ODBC to a leader node, which coordinates compute nodes 1..N (e.g. 128 GB RAM, 16 TB disk, 16 cores each); load, unload, backup, and parallel restore run against Amazon S3; Amazon Redshift Spectrum loads and queries S3 data directly.)
  • 18. Compute node slices:
    • A compute node is partitioned into either 2 or 16 slices; a slice can be thought of as a "virtual compute node"
    • Each slice is allocated a portion of the compute node's memory and disk space, where it processes a portion of the workload assigned to the compute node by the leader node
    • The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices
    • Slices are Redshift's Symmetric Multiprocessing (SMP) mechanism: they work in parallel to complete operations
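  For reference, the slice layout of a running cluster can be inspected with a simple query; this is a minimal sketch using the STV_SLICES system view, which maps each slice to the node that owns it:

    -- List each slice and the compute node it belongs to
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;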
  • 19. A Redshift cluster can have up to 128 ds2.8xlarge nodes (2 petabytes of local storage) and can support exabytes of data with its Redshift Spectrum feature.
    • Dense-Compute (DC2) nodes: solid state disks
    • Dense-Storage (DS2) nodes: magnetic disks
    • The key difference between instance types is the compute/storage ratio and storage latency (SSD vs. magnetic storage)

    Instance Family | Instance Type | Disk type | Capacity | Memory | # CPUs | # Slices | $
    Dense-Compute   | DC2 large     | NVMe SSD  | 160 GB   | 16 GB  | 2      | 2        | $
    Dense-Compute   | DC2 8xlarge   | NVMe SSD  | 2.56 TB  | 244 GB | 32     | 16       | $$
    Dense-Storage   | DS2 xlarge    | Magnetic  | 2 TB     | 32 GB  | 4      | 2        | $
    Dense-Storage   | DS2 8xlarge   | Magnetic  | 16 TB    | 244 GB | 36     | 16       | $$

    Note: AWS reserves the right to change instance types at any time. For example, DC1 is a DEPRECATED dense-compute instance type that SHOULD NOT BE USED; instead, upgrade from DC1 to DC2 for the same price with better performance.
    Redshift instance types are named according to their corresponding Amazon EC2 instance types; for more information, visit https://aws.amazon.com/ec2/instance-types/
  • 20. Life of a Redshift Spectrum query (e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY ...):
    1. Query is optimized and compiled using ML at the leader node; determine what gets run locally and what goes to Amazon Redshift Spectrum
    2. Query plan is sent to all compute nodes
    3. Compute nodes obtain partition info from the Data Catalog (AWS Glue Data Catalog or Apache Hive Metastore); dynamically prune partitions
    4. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
    5. Amazon Redshift Spectrum nodes scan your S3 data (exabyte-scale object storage)
    6. Amazon Redshift Spectrum projects, filters, joins, and aggregates
    7. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
    8. Result is sent back to the client
    (Diagram labels: SQL clients/BI tools connect via JDBC/ODBC; leader node and compute nodes communicate with the Redshift Spectrum fleet over 10 GigE (HPC).)
  • 21. Redshift Spectrum integration:
    • Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
    • Support for complex joins, nested queries, and window functions
    • Support for data partitioned in S3 by any key: date, time, and any other custom keys (e.g. year, month, day, hour)
    • Leverages the AWS Glue Data Catalog or an Amazon EMR Hive Metastore
    • ANSI SQL; no data loading required; reads different file formats, compressed files, and encrypted files
    • https://docs.amazonaws.cn/en_us/redshift/latest/dg/c-using-spectrum.html
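  As a rough sketch of how S3 data is exposed to Redshift Spectrum through the Glue Data Catalog; the schema, database, table, role ARN, and S3 path below are all illustrative, not from the deck:

    -- Register an external schema backed by the AWS Glue Data Catalog (hypothetical names)
    CREATE EXTERNAL SCHEMA spectrum_demo
    FROM DATA CATALOG
    DATABASE 'demo_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over Parquet files in S3
    CREATE EXTERNAL TABLE spectrum_demo.clicks (
      user_id BIGINT,
      url     VARCHAR(2048)
    )
    PARTITIONED BY (dt DATE)
    STORED AS PARQUET
    LOCATION 's3://mybucket/data/clicks/';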
  • 22. Amazon Athena, Redshift, & EMR have some shared analytical & data lake use cases, but they each address different needs & scenarios:
    • Amazon Redshift provides the fastest query performance for enterprise reporting & business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and subqueries. Redshift also supports querying an S3 data lake & joins between S3 data and local cluster data
    • Amazon EMR makes it simple & cost-effective to run Hadoop, Spark, & Presto. EMR is flexible: you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements
    • Amazon Athena is a standalone service that provides the easiest way to run data exploration and discovery queries, as well as analytical queries on data lakes, geospatial data, and service logs, without the need to set up or manage any servers
    When is Redshift strongly recommended over Athena?
    • Latency has to be sub-second; Redshift employs multiple caches & an optimized query planner
    • Data and workloads require a data warehouse
    • Data is highly relational (e.g. normalized data that would be difficult or otherwise disadvantageous for the use case to de-normalize)
    • Data has a transactional nature to it (e.g. data gets updated)
    • Workloads involve many, complex joins
    • Workloads involve joins between data warehouse data & an S3 data lake: use Redshift (Redshift Spectrum)
  • 23. Redshift is a fully ACID-compliant and ANSI SQL-compliant data warehouse:
    • Use cases relying on indexes can alternatively achieve fast query performance through parallelism and efficient data storage & I/O
    • Table distribution styles, data compression, and sort keys significantly impact parallelism and efficient data storage and I/O
    • Redshift creates one database by default, but other databases can be created (note: having multiple databases could lead to one DB monopolizing the cluster's resources)
    • Databases are autonomous units in Redshift, i.e. queries can join tables within a single database only
  • 24. Redshift: Popular Data Models. Redshift can be used with a number of data models, including star, snowflake, and highly denormalized.
  • 25. Row vs. column storage:
    • Row storage (e.g. MySQL): all row fields are stored together on disk (typically in a sequential file). Accessing a column (example: scanning the SSN of all residents) requires scanning the entire table, with resultant unnecessary I/O and caching overhead
    • Column storage (e.g. Amazon Redshift): each table column is stored separately on disk (typically in a separate file or set of files). Accessing a column only scans blocks for the relevant column(s): significantly less I/O
  • 26. Given the following table definition and data for the deep_dive table, how will a simple SQL query behave in a row-based data store, and then in a column-based store?

    CREATE TABLE deep_dive (
      aid INT        -- audience_id
      ,loc CHAR(3)   -- location
      ,dt DATE       -- date
    );

    SELECT min(dt) FROM deep_dive;

    • Row-based storage behavior: needs to read everything; excessive & unnecessary I/O
    • Column-based storage behavior: only scans blocks for the relevant column; significantly less I/O
  • 27. Redshift is a columnar database, which means data on disk is physically organized by column:
    • Column data is stored in 1 MB immutable blocks; a full block can contain as little as one value or as many as millions of values
    • Each slice stores a set of blocks that contain a range of the values for each column
  • 28. Column stores compress very nicely (note: in Redshift jargon, "column encoding" refers to compression):
    • Each value in a single column is the same data type, and a single column is likely to have repeating values
    • Redshift can typically achieve 3x-4x data compression ratios
    • Compression reduces storage requirements, but also improves performance by reducing I/O
    • Columns grow and shrink independently in Redshift
  • 29. Compression algorithms:
    • Redshift supports a number of compression algorithms (e.g. LZO, ZSTD, RUNLENGTH, etc.)
    • Compression algorithms can achieve different compression ratios for different data types
    • Use PG_TABLE_DEF to view/verify the current encoding applied to each column in a table:

    SELECT * FROM PG_TABLE_DEF
    WHERE SCHEMANAME = 'myschema' AND TABLENAME = 'mytable';
  • 30. Applying compression:
    • Columnar compression is automatically and intelligently applied by the COPY command to empty tables (a load example follows below)
    • Redshift's ANALYZE COMPRESSION command will analyze an existing table and recommend the best compression settings
    • Compress everything except sort key columns
    • In some cases, RAW (no compression) is the best compression option (e.g. sparse columns or relatively small tables: ~10k rows)
    • Redshift's Column Encoding Utility automates the use of the ANALYZE COMPRESSION command with a data migration to change compression in place
    • Note: beware cases where you've tested COPY with a small number of rows before doing a full load; COPY will not re-evaluate compression on non-empty tables

    ANALYZE COMPRESSION [ [ table_name ] [ ( column_name [, ...] ) ] ] [COMPROWS numrows]

    COPY's COMPUPDATE options:
    • COMPUPDATE PRESET: column compression is set based on the column's data type; no data is sampled
    • COMPUPDATE [ON]: the best column compression is determined & set by applying different compression codecs to a sample set of column data
    • COMPUPDATE OFF: skips any compression analysis
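  A minimal sketch of both paths, reusing the deck's deep_dive table; the S3 path and IAM role ARN are placeholders:

    -- Initial load into an empty table: COPY samples the data and applies compression
    COPY deep_dive
    FROM 's3://mybucket/staging/deep_dive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    COMPUPDATE ON;

    -- Later, ask Redshift to recommend encodings for the now-populated table
    ANALYZE COMPRESSION deep_dive;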
  • 31. Redshift is a distributed database with a single leader and one or more compute nodes, where data is stored on compute nodes or on Amazon S3:
    • Distribution style: a table property which dictates how that table's data is distributed on internal storage
    • Distribution goals: distribute data evenly for parallel processing; ensure each node has the same amount of data; minimize data movement during query processing
    Data distribution tips:
    • A sub-optimal data distribution can lead to data skew and poor query performance; if unsure which distribution style to choose for a table, let Redshift pick for you (AUTO)
    • Redshift's Column Encoding Utility can be used to change a table's distribution style
  • 32. Four distribution styles to choose from in Redshift:
    • KEY: a column value is hashed, and the same hash value is placed on the same slice
    • ALL: full table data is placed on each compute node's first slice
    • EVEN: data is evenly distributed across all slices using a round-robin distribution
    • AUTO: default option; Redshift starts the table with ALL, but switches the table to EVEN when the table grows larger
    Data distribution tips:
    • Consider using the ALL distribution style for all infrequently modified small tables (~3 million rows or less)
    • Distribution keys should have high cardinality to avoid data skew and "hot" nodes (illustrative DDL sketches follow below)
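  Hedged DDL sketches of the four styles; table and column names are illustrative, not from the deck:

    CREATE TABLE sales (customer_id INT, amount DECIMAL(10,2))
      DISTSTYLE KEY DISTKEY (customer_id);  -- co-locate rows by the join column
    CREATE TABLE dim_country (code CHAR(3), name VARCHAR(64))
      DISTSTYLE ALL;                        -- copy a small dimension to every node
    CREATE TABLE staging_events (payload VARCHAR(65535))
      DISTSTYLE EVEN;                       -- round-robin across slices
    CREATE TABLE app_events (event_id BIGINT, dt DATE)
      DISTSTYLE AUTO;                       -- let Redshift decide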
  • 33. Good distribution keys are frequently joined to other tables (e.g. a fact table joined with a dimension table) and have:
    • High cardinality: a high frequency of unique values relative to the overall row count
    • Low skew: each unique value in the column appears roughly the same number of times as every other value
    • Use a date column only if cardinality is high enough, and queries don't typically filter on a very narrow date period (to avoid workload skew among the node slices)
    Data distribution tip: use the query below to compare unique values against total rows for your candidate key column (an even distribution is better):

    SELECT count(distinct <my_column>) unique_values,
           count(9) total_rows
    FROM <my_table>;
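  To see actual per-value skew, rather than just the unique/total ratio, a hedged variant groups by the candidate column; <my_table> and <my_column> remain placeholders, as in the slide:

    -- Top 20 heaviest values; a flat profile indicates low skew
    SELECT <my_column>, count(*) AS rows_per_value
    FROM <my_table>
    GROUP BY <my_column>
    ORDER BY rows_per_value DESC
    LIMIT 20;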
  • 34. Zone maps:
    • Zone maps are minimum and maximum values for each 1 MB block of data
    • Zone maps are stored in-memory and automatically generated
    • Zone maps allow Redshift to effectively prune blocks that cannot contain data needed for a given query, minimizing unnecessary I/O
    • Along with sort keys, zone maps play a crucial role in enabling range-restricted scans to prune blocks and reduce I/O
    (Diagram: Redshift stores data in 1 MB blocks; the zone map records each block's MIN and MAX per column, e.g. for sales_dt and price.)
  • 35. Redshift Sorting. Sort keys can be added to a table by specifying the SORTKEY table property on one or more columns:
    • Redshift uses sort keys to physically order data on disk
    • In combination with zone maps, sort keys enable range-restricted scans to prune blocks and reduce I/O
    • Sort keys combined with zone maps function like an index for a given set of columns
    • Sort keys benefit MERGE JOIN performance with a much faster sort
    • Redshift supports two types of sort keys: Compound Sort Key (default) and Interleaved Sort Key
  • 36. Optimal sort key: should consist of the columns most commonly found in WHERE clause filter predicates; it is extremely common for the sort key to be a date.
    Compound sort key tips:
    • Column order matters: there is no skip scanning
    • Order columns by lowest cardinality to highest, if possible
    • Define four or fewer sort key columns; more will result in marginal gains and increased ingestion overhead
    • If your table is frequently joined, then include the DISTKEY in the sort key as the first column
    • A column that is CAST() to be joined or filtered will not be used as a sort key (e.g. casting DATE to TIMESTAMPTZ); modify the underlying data & then set this value as the sort key
    • Sort keys are less beneficial on small tables
    • Columns added to a sort key after a high-cardinality column are not effective
    • With an established workload, the Redshift GitHub has scripts to help you find sort key suggestions (an illustrative DDL example follows below)
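  An illustrative DDL putting these tips together on the deck's deep_dive table, assuming, for the sake of the example, that it is frequently joined on aid and filtered on dt:

    CREATE TABLE deep_dive (
      aid INT,       -- audience_id
      loc CHAR(3),   -- location
      dt  DATE       -- date
    )
    DISTSTYLE KEY
    DISTKEY (aid)                  -- frequently joined column
    COMPOUND SORTKEY (aid, dt);    -- DISTKEY first, then the filter column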
  • 37. Example: SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017';
    • Sorted by date, block zone maps are narrow and non-overlapping (01-JUNE-2017..06-JUNE-2017, 07-JUNE..12-JUNE, 13-JUNE..21-JUNE, 21-JUNE..30-JUNE), so only the one block whose range covers the predicate needs to be read
    • In an unsorted table, block ranges overlap (01-JUNE..20-JUNE, 08-JUNE..30-JUNE, 12-JUNE..20-JUNE, 02-JUNE..25-JUNE), so most blocks must be examined
    Zone maps and sort keys can serve as a significant optimization by reducing the number of blocks examined (and therefore I/O) during query execution.
  • 38. Redshift Temporary Tables. Redshift supports the TEMPORARY keyword on CREATE TABLE and CREATE TABLE AS, and the #<NAME> marker on SELECT:

    SELECT ... INTO #MY_TEMP_TABLE FROM ...;

    Temporary table characteristics:
    • Stored like all other Redshift tables, but only live for the session (dropped on session termination)
    • Default to no columnar compression & EVEN distribution
    • Do not have statistics by default
    These defaults are often the worst possible configuration for table storage. Tip: define temporary tables as you would a permanent table, with columnar compression and an appropriate distribution style, to increase performance.
  • 39. Temporary table capabilities:
    • Temp tables can be used exactly as permanent tables would be in ETL jobs or analytics
    • Temp tables can participate in complex/multi-statement transactions
    • Temp tables exhibit faster I/O (not mirrored to other nodes)
    • You can COPY and UNLOAD temporary tables
    • SELECT INTO # does not provide the ability to set DISTSTYLE or column encoding
    Best practices:
    • Avoid the use of SELECT INTO # (use explicit CREATE TEMPORARY TABLE (AS) statements instead)
    • Include column encoding settings on the CREATE command
    • Include distribution keys or style when creating temp tables
    • Compute statistics when creating large temp tables as part of an ETL process
    • Create a temporary table that is LIKE another table so that it inherits the parent table's column definitions, distribution style, and sort keys:

    create temp table temp_tbl (like parent_tbl);
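  A hedged sketch of the recommended pattern, with explicit encoding and distribution instead of SELECT INTO #; the table, column, and source names are illustrative:

    CREATE TEMPORARY TABLE stage_sales (
      customer_id INT           ENCODE ZSTD,
      amount      DECIMAL(10,2) ENCODE ZSTD
    )
    DISTSTYLE KEY
    DISTKEY (customer_id);

    INSERT INTO stage_sales SELECT customer_id, amount FROM sales;
    ANALYZE stage_sales;   -- compute statistics for large ETL temp tables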
  • 40. The "SET DW" checklist:
    • S (Sort Keys): ensure sort keys exist to facilitate filters in the WHERE clause
    • E (Encoding/compression): reduced I/O improves query performance
    • T (Table Maintenance: VACUUM, ANALYZE): current table statistics increase sort key effectiveness, and table defragmentation reduces wasted storage while improving query performance
    • D (Data Distribution): ensure distribution keys exist to facilitate the most common joins
    • W (Workload Management): machine learning algorithms profile queries to place them in the appropriate queue with the appropriate resources
  • 41. Redshift/Data Lake Interactions. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
    • A table is a table, right? Nope: with data lakes, tables are collections of files
    • Data lake file types have a huge influence on the performance of Redshift Spectrum queries
    Best practices:
    • The number of files in the data lake should be a multiple of your Redshift slice count (general best practice)
    • Redshift Spectrum can automatically split Parquet, ORC, text-format, and Bz2 files for optimal processing
    • File sizes should be in the range 64 MB - 512 MB
    • Files should be of uniform size (especially files that can't be automatically split by Redshift Spectrum, such as Avro and Gzip) to avoid execution skew
  • 42. Redshift Spectrum is a feature of Redshift that enables queries to reference external tables. Understanding query mechanics can maximize the work done by Redshift Spectrum:
    • Do as much as possible in Redshift Spectrum before bringing data back to your cluster
    Data lake best practices for Redshift Spectrum:
    • Use data lake file formats that are optimized for reads by Redshift Spectrum (and Athena!)
    • ORC and Parquet apply columnar encoding, similar to how data is stored inside Redshift
    • Redshift Spectrum can also work with Avro, CSV, and JSON data, but these files are *much* larger on S3 than ORC/Parquet
  • 43. Partitioning (open file formats such as Parquet and ORC are optimal for Redshift/data lake interactions because of their columnar structure). Partitions should be based on:
    • Frequently filtered columns (either through a join or the WHERE clause)
    • Business units and business groups (user cohorts, application names, etc.)
    • Date & time
    Consider how your users query data:
    • Do they look month by month, or at the current month and year vs. the previous year for the same month, etc.?
    • Do they understand the columns you have created?
    Date-based partition columns have a type:
    • Full dates included in a single value may be formatted or not (yyyy-mm-dd or yyyymmdd); formatted dates can only be strings
    • Either type of date needs to consider ordering (date=dd-mm-yyyy cannot be used in an ORDER BY clause, but date=yyyy-mm-dd can!)
  • 44. S3 layout for external tables. Redshift Spectrum extends the same MPP principle used by Redshift clusters to query external data, using multiple Redshift Spectrum instances as needed to scan files; place the files in a separate folder for each table:
    • Keep data with a similar security model in the same prefix: s3://mybucket/data
    • Application or business unit prefixes can be helpful: s3://mybucket/data/marketing
    • Each table resides in its own prefix: s3://mybucket/data/marketing/impressions
    • Add high-level business unit partitions:
      s3://mybucket/data/marketing/impressions/application=flux_capacitor
      s3://mybucket/data/marketing/impressions/application=cold_fusion
    • Add dates:
      s3://mybucket/data/marketing/impressions/application=flux_capacitor/date=20180122
      s3://mybucket/data/marketing/impressions/application=cold_fusion/date=20180123
      or
      s3://mybucket/data/marketing/impressions/application=flux_capacitor/yyyy=2018/mm=01/dd=22
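  Assuming the impressions data above were registered as an external table, for example in the hypothetical spectrum_demo schema from the earlier sketch, new date partitions would be added along these lines:

    -- Register one application/date partition with its S3 location (illustrative)
    ALTER TABLE spectrum_demo.impressions
    ADD IF NOT EXISTS
    PARTITION (application = 'flux_capacitor', date = '20180122')
    LOCATION 's3://mybucket/data/marketing/impressions/application=flux_capacitor/date=20180122/';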
  • 45. Redshift Workload Management. Workload management (WLM) is a feature that helps manage workloads and avoid short, fast-running queries getting stuck in queues behind long-running queries. Three WLM methods are complementary to each other:
    • Queues (basic WLM): WLM always assigns every query executed in Redshift to a specific queue on the basis of user group, query group, or WLM rules (e.g. [return_row_count > 1000000])
    • Short-Query Acceleration (SQA): Redshift uses machine learning to determine what constitutes a "short" running query in your cluster; "short" running queries are then automatically identified & run immediately in the short-query queue if queuing occurs
    • Concurrency Scaling: Redshift uses machine learning to predict queuing in your cluster, and when queuing occurs, transient Amazon Redshift clusters are added to your cluster and queries are routed to them for execution
    The amount of memory available to a query is a function of: the WLM queue where it runs; the percentage of memory assigned to that WLM queue; and the number of query slots consumed by the query.
  • 46. Default WLM setup:
    • One default WLM queue: concurrency level of five (enables up to 5 queries to run concurrently) and no timeout
    • Auto WLM enabled (automatic query concurrency & memory allocation)
    • One superuser queue: concurrency level of one and no timeout
    • SQA enabled (enabled/disabled via a checkbox in the Redshift console)
    • Concurrency Scaling disabled (enabled/disabled via the Redshift console)
    Customizing WLM:
    • Customize WLM queues via a few clicks on the Redshift console
    • Up to 8 custom queues are allowed in a Redshift cluster
    • WLM queues have four main "levers": concurrency level (aka "query slots"); memory allocation (%); targets (i.e. user groups, query groups, or query monitoring rules); timeout (ms)
    WLM queue setup via the Redshift console:
    1. Click on Parameter Groups in the navigation pane and choose Create Cluster Parameter Group
    2. Click the Add Queue button to add a new WLM queue
    3. Associate the parameter group with your cluster
  • 47. Auto WLM:
    • Automatic workload management ("Auto WLM") lets Amazon Redshift automatically manage query concurrency and memory allocation
    • Auto WLM can create up to eight queues, with each queue having a priority
    • Auto WLM automatically determines the amount of resources that queries need and adjusts the concurrency based on the workload: concurrency is set lower when queries requiring large amounts of resources are in the system (e.g. hash joins between large tables), and higher when lighter queries (e.g. inserts, deletes, scans, or simple aggregations) are submitted
    • Auto WLM & SQA work together to allow short-running and lightweight queries to complete even while long-running, resource-intensive queries are active
    • Auto WLM is enabled by default when the default parameter group is used, and must be explicitly enabled when a custom parameter group is used. It can be enabled in a custom parameter group through the Amazon Redshift console by choosing Switch WLM mode and then choosing Auto WLM; with this choice, one queue is used to manage queries, and the memory and concurrency on main fields are both set to auto. When Auto WLM is not enabled, manual WLM requires you to specify values for query concurrency and memory allocation
  • 48. Redshift Query Priorities. There are six possible query priorities: LOWEST, LOW, NORMAL (default), HIGH, HIGHEST, and CRITICAL (superusers only).
    • WLM queues can be defined with a specific priority (relative importance), and queries inherit their queue's priority
    • Administrators can use priorities to prioritize different workloads (e.g. ETL, ingestion, audit, BI, etc.)
    • Amazon Redshift uses priority when letting queries into the system and to determine the amount of resources allocated to a query
    • Predictable performance for a high-priority workload comes at the cost of other, lower-priority workloads; lower-priority queries are not starved, but might run longer because they wait behind more important queries or run with fewer resources
    • Concurrency Scaling can be enabled to maintain predictable performance for lower-priority workloads
    • Auto WLM automatically creates and assigns queues corresponding to priorities
    • The CRITICAL priority is higher than HIGHEST and is available to superusers; only one CRITICAL query can run at a time. To set this priority, you can use the functions CHANGE_QUERY_PRIORITY, CHANGE_SESSION_PRIORITY, and CHANGE_USER_PRIORITY
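  Hedged examples of the priority-change functions named above; the query id, session pid, user name, and chosen priorities are all illustrative:

    -- Bump one running query to CRITICAL (superuser only)
    SELECT change_query_priority(1234567, 'critical');
    -- Raise the priority of everything in session pid 30311
    SELECT change_session_priority(30311, 'high');
    -- Lower the priority of all queries from a given user
    SELECT change_user_priority('reporting_user', 'low');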
  • 49. Redshift Query Monitoring Rules (QMR):
    • QMR are intended to help automatically handle runaway (poorly written) local or Spectrum queries
    • QMR can be defined for a WLM queue via the Redshift console (max 25 rules across all WLM queues), and each rule can take one of four actions for offending queries:
      • LOG: log info about the query in the STL_WLM_RULE_ACTION table
      • ABORT: log the action and terminate the query
      • HOP: log the action and move the query to another appropriate queue if one exists, otherwise terminate it
      • PRIORITY: change the query priority (only available with Auto WLM)
    • Each query monitoring rule includes up to three conditions, or predicates, and one action, similar to an if-then statement: if {predicate(s)} then {action}. A predicate consists of a metric, operator, and value (e.g. rows_scanned > 1000000). If all of the predicates for any rule are met, that rule's action is triggered
    • Common QMR use cases: guard against wasteful resource utilization, runaway costs, etc.; log resource-intensive queries; rein in "that user": every DB has that user who loves to execute queries with unnecessarily expensive behavior (e.g. a Cartesian product)
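  To review what QMR has done, the STL_WLM_RULE_ACTION table named above can be queried directly; a small sketch:

    -- Most recent rule firings: who, which query, which rule, and the action taken
    SELECT userid, query, rule, action, recordtime
    FROM stl_wlm_rule_action
    ORDER BY recordtime DESC
    LIMIT 50;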
  • 50. Concurrency Scaling:
    • Concurrency Scaling is a Redshift feature that automatically adds transient clusters to your cluster within seconds, to handle concurrent requests with consistently fast performance
    • Free for over 97% of Redshift customers: for every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling; beyond that, customers are billed on a per-second basis per transient cluster
    • Applies to Redshift local & Spectrum queries
    • Email notifications are issued when concurrency scaling occurs
  • 51. Resizing a cluster is easily achieved with a few clicks on the Redshift console, and there are two resizing approaches to choose from:
    • Elastic Resize: the existing cluster is modified to add or remove nodes. During the actual resize, existing connections to the Redshift cluster are put on hold, no new connections are accepted until the resize finishes, and the cluster is unavailable for querying. Typically completes within ~15 minutes or less
    • Classic Resize: the Redshift cluster can be reconfigured to a different node count and instance type. Involves streaming all data from the original Redshift cluster to a newly created Redshift cluster with the new configuration. During the resize, the original Redshift cluster is in read-only mode, and the customer is only charged for one cluster. Depending on data size, it may take several hours to complete
  • 52. Redshift provides a Postgres-compliant driver endpoint. There are two driver options for connecting to Redshift:
    • JDBC/ODBC Postgres driver
    • Proprietary Redshift driver: 35% faster than the Postgres driver, with support for IAM SSO
    Like other Postgres clients, you connect to Redshift as a database user, using a hostname, port, and database name (viewable on the Redshift console). Examples:
    jdbc:[redshift|postgresql]://endpoint:port/databaseName
    • jdbc:redshift://demo.dsi9zn4ccku4.us-east-1.redshift.amazonaws.com:8192/pocdb
    • jdbc:postgresql://demo.dsi9zn4ccku4.us-east-1.redshift.amazonaws.com:8192/pocdb
  • 53. Query Editor is a web-based query interface for running single-statement SQL queries in an Amazon Redshift cluster directly from the AWS Management Console, without having to install & set up an external JDBC/ODBC client:
    • Query results are viewable in the console & downloadable as a CSV file
    • Queries can be saved for convenient repeat execution
    • Query execution steps & times can be viewed to isolate bottlenecks & optimize queries
    Other considerations:
    • Max 50 Query Editor users at the same time per cluster
    • Query Editor is applicable for short queries (runtime < 10 min)
    • Query result sets are paginated with 100 rows per page
    • Transactions & Enhanced VPC Routing are not supported
    • Access to Query Editor requires specific IAM permissions
  • 54. Network access:
    • By default, Amazon Redshift clusters are locked down so nobody has access
    • To grant other users inbound access, you must associate the Redshift cluster with a security group
    • Use security groups to authorize other VPC security groups, or CIDR blocks, to connect
    • VPC security groups should be used for AWS service & EC2 connectivity, or cross-account access (recommended approach)
    • CIDR blocks should be used for connections from on-prem/the other side of a customer gateway
    • Having separate cluster security groups per application or cluster is a good practice
  • 55. Redshift Security:
    • Schemas: collections of database tables and other database objects (similar to namespaces)
    • Users: named user accounts that can connect to a database
    • Groups: collections of users that can be collectively assigned privileges for easier security maintenance
    In Amazon Redshift, schemas are similar to operating system directories, except that schemas cannot be nested. Users can be granted access to a single schema or to multiple schemas.
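  An illustrative end-to-end setup of these objects; all names and the password are placeholders:

    CREATE SCHEMA marketing;
    CREATE USER analyst1 PASSWORD 'Chang3MeS00n!';
    CREATE GROUP marketing_readers WITH USER analyst1;
    GRANT USAGE ON SCHEMA marketing TO GROUP marketing_readers;
    GRANT SELECT ON ALL TABLES IN SCHEMA marketing TO GROUP marketing_readers;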
  • 56. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift Security
Create a view to conceal rows or columns that a user or group of users is not authorized to access:
CREATE VIEW secure_view AS SELECT col1, col3 FROM underlying_table;
GRANT SELECT ON secure_view TO GROUP restricted_group;
REVOKE ALL ON underlying_table FROM GROUP restricted_group;
  • 57. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security has always been priority one at AWS, and Amazon Redshift is no exception:
• End-to-end data encryption
• IAM integration and integration with SAML IdPs for federation (SSO)
• Amazon VPC for network isolation
• Database security model (users, groups, privileges)
• Audit logging and notifications
• Certifications that include SOC 1/2/3, PCI-DSS, FedRAMP, and HIPAA
  • 58. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift Security
SSL encryption can be used with client connections to Amazon Redshift; setting the parameter require_ssl to true in the cluster's parameter group rejects unencrypted connections.
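A sketch of setting that parameter with boto3 (the parameter group name is hypothetical; associated clusters generally need a reboot before the change takes effect):

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Enforce SSL for all client connections; "demo-params" is a
# hypothetical parameter group associated with the cluster.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="demo-params",
    Parameters=[{
        "ParameterName": "require_ssl",
        "ParameterValue": "true",
    }],
)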
  • 59. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift Security
Redshift clusters can be configured to encrypt data at rest through a simple checkbox in the Redshift console:
• Redshift can encrypt data at rest (data stored locally or in S3 backups) using the AES algorithm with a 256-bit key
• Key management can be performed by Redshift, AWS KMS, or your HSM
• You control rotation of encryption keys via API
• Redshift blocks of data backed up to S3 are encrypted using the cluster's encryption key
• Redshift uses hardware-based crypto modules to keep the performance impact to roughly 20% or less
• Redshift clusters that need to comply with PCI, SOX, or HIPAA must be configured with encryption enabled
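Encryption can also be requested when creating a cluster programmatically. A minimal sketch with boto3 (all identifiers, credentials, and the KMS key ID are placeholders):

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Launch a cluster with at-rest encryption backed by a KMS key;
# every identifier and credential below is a placeholder.
redshift.create_cluster(
    ClusterIdentifier="demo-encrypted",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="Str0ngPassw0rd1",
    Encrypted=True,
    KmsKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",
)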
  • 60. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Audit Logs
• Stored in three log files: Connection log, User log, and User Activity log
• Must be explicitly enabled
• Stored indefinitely unless S3 lifecycle rules are in place to archive or delete files automatically
• Cluster restarts don't affect audit logs in S3
• Access to log files does not require access to the Redshift database
• S3 charges apply
System (STL) Tables
• Stored in multiple tables, including SVL_STATEMENTTEXT and STL_CONNECTION_LOG
• Automatically available on every node in the data warehouse cluster
• Log history is stored for two to five days, depending on log usage and available disk space
• Access to STL tables requires access to the Amazon Redshift database
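A sketch covering both sides (the cluster, bucket, endpoint, and credentials are placeholders): boto3's EnableLogging switches on the S3 audit logs, while the STL tables are queried in-database like any other table.

import boto3
import psycopg2

redshift = boto3.client("redshift", region_name="us-east-1")

# Turn on S3 audit logging; the cluster, bucket, and prefix are
# placeholders, and the bucket policy must allow Redshift writes.
redshift.enable_logging(
    ClusterIdentifier="demo-cluster",
    BucketName="demo-audit-logs",
    S3KeyPrefix="redshift/",
)

# STL tables live inside the database, so they are queried over
# a normal client connection (endpoint/credentials placeholders).
conn = psycopg2.connect(
    host="demo.dsi9zn4ccku4.us-east-1.redshift.amazonaws.com",
    port=8192, dbname="pocdb",
    user="awsuser", password="example-password",
)
with conn.cursor() as cur:
    cur.execute("""
        SELECT recordtime, username, dbname, remotehost
        FROM stl_connection_log
        ORDER BY recordtime DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)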
  • 61. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CloudTrail Logs
• Stored indefinitely in S3, unless S3 lifecycle rules are in place to archive or delete files automatically
• CloudTrail captures the last 90 days of management events by default without charge (available using the CloudTrail APIs or via the console)
• Maintaining a longer history of events is possible, but charges for additional trail deliveries may apply, including S3 charges
• Access to log files does not require access to the Redshift database
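As a sketch, recent Redshift management events can be pulled from CloudTrail's default 90-day event history with boto3:

import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

# List recent management events emitted by the Redshift service
# from CloudTrail's free 90-day event history.
resp = cloudtrail.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "EventSource",
        "AttributeValue": "redshift.amazonaws.com",
    }],
    MaxResults=20,
)
for event in resp["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))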
  • 62. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. [Architecture diagram: web app data, on-premises data, streaming data, and other databases are scanned by AWS Glue crawlers into the AWS Glue Data Catalog, which serves Amazon S3, Athena, EMR, Redshift, RDS, SageMaker, and QuickSight]
1. Crawlers scan your data sets and populate the Glue Data Catalog
2. The Glue Data Catalog serves as a central metadata repository
3. Once catalogued in Glue, your data is immediately available for analytics
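A minimal sketch of step 1 with boto3 (the crawler name, IAM role, catalog database, and S3 path are all hypothetical):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Define a crawler over an S3 prefix; the IAM role must grant
# Glue read access to the data. All names are placeholders.
glue.create_crawler(
    Name="demo-crawler",
    Role="GlueServiceRole",
    DatabaseName="demo_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://demo-data-lake/raw/"}]},
)

# Run the crawler; discovered tables land in the Glue Data Catalog
# and become queryable from Athena, Redshift Spectrum, EMR, etc.
glue.start_crawler(Name="demo-crawler")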
  • 63. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. [Architecture diagram: a marketing data source and other source systems land data in S3; an ETL process prepares it, and the data is queried through Redshift Spectrum and Athena]
  • 66. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. aws.amazon.com/
  • 67. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ricardo Serafim Analytics Specialist Solutions Architect rserafim@amazon.com