5. Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending the DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to the business
Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
8. Data Loading Options
• Parallel upload to Amazon S3
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
[Partner logos: Data Integration and Systems Integrators]
9. Amazon Redshift Architecture
• Leader Node
– SQL endpoint (JDBC/ODBC)
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH (see the sketch below)
• Two hardware platforms, optimized for data processing
– DW1: HDD; scale from 2 TB to 1.6 PB
– DW2: SSD; scale from 160 GB to 256 TB
[Architecture diagram: clients connect to the leader node over JDBC/ODBC; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]
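Below is a minimal sketch of a load from Amazon DynamoDB (the table names and credential placeholders are illustrative, not from the deck); READRATIO caps the share of the DynamoDB table's provisioned read throughput the load may consume:

copy movies
from 'dynamodb://Movies'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
readratio 50;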
10. Amazon Redshift Node Types
DW1 (HDD)
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 1.6 PB
– DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
– DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
DW2 (SSD)
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 256 TB
– DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
– DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
11. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
With row storage you do unnecessary I/O: to get the total Amount, you have to read everything.
ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
12. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression (see the sketch below)
• Zone maps
• Direct-attached storage
With column storage, you only read the data you need (e.g., just the Amount column).
ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
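As a minimal sketch of the data-compression point: ANALYZE COMPRESSION reports a recommended encoding per column for an existing table (the table name is illustrative), and COPY with COMPUPDATE on applies encodings automatically on an initial load:

analyze compression sales;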
17. Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Load in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection (see the split-file sketch below)
• Data automatically distributed and sorted according to DDL
• Scales linearly with number of nodes
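A minimal sketch of the parallel-load point, with illustrative bucket and table names: split the input into multiple files sharing a common prefix (ideally a multiple of the number of slices), then issue a single COPY against that prefix so every compute node loads in parallel:

-- input split as s3://mybucket/data/venue.txt.1 ... venue.txt.4
copy venue
from 's3://mybucket/data/venue.txt'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|';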
18. Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Backups to Amazon S3 are automatic, continuous, and incremental
• Configurable system snapshot retention period; take user snapshots on demand
• Cross-region backups for disaster recovery
• Streaming restores enable you to resume querying faster
19. Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for the source cluster
20. Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API
21. Amazon Redshift is priced to let you analyze all your data
• Number of nodes x cost per hour
• No charge for leader node
• No upfront costs
• Pay as you go

DW1 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB
On-Demand            $0.850                                $3,723
1-Year Reservation   $0.500                                $2,190
3-Year Reservation   $0.228                                $999

DW2 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB
On-Demand            $0.250                                $13,688
1-Year Reservation   $0.161                                $8,794
3-Year Reservation   $0.100                                $5,498
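As a worked example of the pricing model (using the on-demand DW1.XL rate above): one node costs $0.850/hour x 8,760 hours/year = $7,446/year; spread over its 2 TB of compressed storage, that is the $3,723/TB effective annual price shown in the table, and a 10-node cluster simply scales to 10 x $0.850 = $8.50/hour.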
22. Amazon Redshift has security built-in
• SSL to secure data in transit; load encrypted files from S3; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
– AES-256, hardware accelerated
– All blocks on disk & in Amazon S3 encrypted
– On-premises HSM & AWS CloudHSM support
• No direct access to compute nodes
• Audit logging & AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP
[Architecture diagram: JDBC/ODBC clients reach the leader node in the customer VPC; compute nodes sit in an internal VPC with 10 GigE (HPC) interconnect; ingestion, backup, and restore via Amazon S3]
23. Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery
24. 50+ new features since launch in Feb 2013
• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
• Certifications – SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others
• Security – Load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy
• Manageability – Snapshot sharing; backup/restore/resize progress indicators; cross-region backups
• Query – Regex, cursors, MD5, SHA1, time zone, workload queue timeout, HLL, concurrency to 50 slots
• Ingestion – S3 manifest, LZOP/LZO, JSON built-ins, UTF-8 4-byte, invalid-character substitution, CSV, automatic datetime format detection, epoch, ingest from SSH, JSON, EMR
27. Deriving Big Insights from Big Data
Trends in business analytics
Popular use cases
Agile Business Intelligence
Governed Self-Service
Information Driven Mobile Apps
Redshift Certified
Customer Success
28. Popular Use Cases
Customer Analytics
• Information-driven purchasing
– reviews
– searches
– pricing
– comparisons
– social networks
– recommendations
• Omni-channel
• Customer 360
Sales Enablement
• Improve sales efficiency and effectiveness
• Data blending
– sales
– product
– marketing
• CRM integration
• Mobility
– BYOD
– caching
Retail
• Information-driven experience
– interaction
– videos
– documents
• In-store apps
– analytics
– customer 360
– personal shopper
• Store of the future
• Real-time decisions
• Inventory management
29. Self-Service Analytics Revolutionizes Traditional BI
Boost user satisfaction while massively increasing productivity
• More Productive – more content per creator (5-10x more content)
• More Producers – more users can create content (5-10x more content creators)
• More Collaborative – peer-to-peer sharing (5-10x more sharing)
Net effect: >100x more content creation and consumption
31. World-Class Information-Driven Mobile Apps
The future of dashboards
• More than graphs
• Multi-media
• Transaction enabled
• Live data
• Comprehensive data
• Intuitive
• Easy to use
• Guided workflow for a consistent user experience
• Personalized for each user
32. Business User Access to 1000s of Data Sources
Faster access to your data, either through MicroStrategy modeled data (an enterprise-certified single version of the truth) or directly:
• Enterprise Applications – SAP, Oracle e-Business, Siebel, PeopleSoft, etc.
• Relational Databases – Redshift (certified integration), Oracle, SQL Server, MySQL, Teradata, Netezza, etc.
• Cloud-Based Data – Salesforce.com, NetSuite, Facebook, Eloqua, Google Docs, etc.
• Personal or Departmental – spreadsheets, Access databases, CSV, public data downloads, etc.
• Big Data & Hadoop – EMR, MapR, Cloudera, Hortonworks, etc.
33. Customer Success
Netflix has deployed the MicroStrategy business intelligence software platform on top of Amazon Elastic MapReduce (Amazon EMR) for interactive insights. Netflix analysts use advanced visualizations to explore the performance of its streaming service closer to recorded time, directly accessing Hadoop data on an ad-hoc basis and without writing code.
34. MicroStrategy Analytics Enterprise – Business Agility with Trusted, Governed Data
A comprehensive analytics platform, #1 in mobile analytics, and an AWS Partner
• Big Data Analytics – transform your growing Big Data resources into insight and profit
• Self-Service Analytics – see and understand data in minutes; no IT needed
• Enterprise-Grade Business Intelligence – produce and publish trusted analytics to improve your business operations
36. Ingestion – Best Practices
• Goal
– 1 leader node & n compute nodes: leverage all the compute nodes and minimize overhead
• Best Practices
– Preferred method: COPY from S3, which loads data in sorted order through the compute nodes
– Use a single COPY command; split the data into multiple files
– Strongly recommend that you gzip large datasets:
copy time
from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
• If you must ingest through SQL
– Use multi-row inserts
– Avoid large numbers of singleton insert/update/delete operations
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
• To copy from another table
– Use CREATE TABLE AS or INSERT INTO SELECT (see the sketch below)
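A minimal sketch of both table-to-table options (table names are illustrative):

-- CREATE TABLE AS: create and populate a new table in one step
create table event_backup as
select * from event;

-- INSERT INTO ... SELECT: append into an existing table
insert into category
select * from category_stage;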
37. Ingestion – Best Practices (cont.)
• Best Practices
– Verify load data files: for US East, S3 provides eventual consistency, so confirm the files are in S3 by listing object keys
– Query Redshift after the load; this query returns entries for loading the tables in the TICKIT database:
select query, trim(filename), curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;
query | btrim | curtime | status
-------+---------------------------+----------------------------+--------
22475 | tickit/allusers_pipe.txt | 2013-02-08 20:58:23.274186 | 1
22478 | tickit/venue_pipe.txt | 2013-02-08 20:58:25.070604 | 1
22480 | tickit/category_pipe.txt | 2013-02-08 20:58:27.333472 | 1
22482 | tickit/date2008_pipe.txt | 2013-02-08 20:58:28.608305 | 1
22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489 | 1
22487 | tickit/listings_pipe.txt | 2013-02-08 20:58:37.632939 | 1
22593 | tickit/allusers_pipe.txt | 2013-02-08 21:04:08.400491 | 1
22596 | tickit/venue_pipe.txt | 2013-02-08 21:04:10.056055 | 1
22598 | tickit/category_pipe.txt | 2013-02-08 21:04:11.465049 | 1
22600 | tickit/date2008_pipe.txt | 2013-02-08 21:04:12.461502 | 1
22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 | 1
38. Ingestion – Best Practices (cont.)
• Best Practices
– There is no native upsert statement. Use a staging table to perform an upsert by joining the staging table with the target: update, then insert (see the sketch below).
– Primary key constraints are NOT enforced; if you COPY the same data twice, you get a duplicate copy.
– Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count:
set wlm_query_slot_count to 3;
– Run the ANALYZE command whenever you've made a non-trivial number of changes to your data, to keep table statistics current
– A system table that can help in troubleshooting data loads: STL_LOAD_ERRORS records the errors that occurred during specific loads. Adjust MAXERROR as needed.
– Check the character set: only UTF-8 is supported
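A minimal upsert sketch following the update-then-insert pattern described above; the staging and target table names and the id/amount columns are illustrative:

begin;
-- update rows that already exist in the target
update target
set amount = s.amount
from staging s
where target.id = s.id;
-- insert rows that are new
insert into target
select s.*
from staging s
left join target t on s.id = t.id
where t.id is null;
commit;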
39. Choose a Sort Key
• Goal
– Skip over data blocks to minimize I/O
• Best Practices
– Sort based on range or equality predicates (WHERE clause)
– If you access recent data frequently, sort on a TIMESTAMP column (see the sketch below)
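A minimal DDL sketch (illustrative table and columns) that sorts on a TIMESTAMP column so range predicates on recent data can skip blocks via zone maps:

create table events (
event_id integer,
event_ts timestamp,
amount decimal(8,2)
)
sortkey (event_ts);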
40. Choose a Distribution Key
• Goals
– Distribute data evenly across nodes
– Minimize data movement among nodes: co-located joins and co-located aggregates
• Best Practices
– Consider using the join key as the distribution key (JOIN clause)
– With multiple joins, use the foreign key of the largest dimension as the distribution key
– Consider using a GROUP BY column as the distribution key (see the sketch below)
• Avoid
– Keys used only as equality filters as your distribution key
– For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin
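A minimal sketch of a co-located join (illustrative tables): distributing both tables on the join key means matching rows land on the same node, so the join runs without data movement:

create table customers (
customer_id integer,
name varchar(64)
)
distkey (customer_id);

create table sales (
sale_id integer,
customer_id integer,
amount decimal(8,2)
)
distkey (customer_id)
sortkey (sale_id);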
41. Query Performance – Best Practices
• Encode date and time using the TIMESTAMP data type instead of CHAR
• Specify constraints
– Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
– Loading and/or applications need to be aware of this (see the DDL sketch below)
• Specify redundant predicates on the sort column:
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '2013-01-01'
AND tab2.timestamp > '2013-01-01';
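A minimal DDL sketch of the first two points (table and columns are illustrative): a TIMESTAMP column rather than CHAR, and a declared-but-unenforced primary key the optimizer can still exploit:

create table orders (
order_id integer primary key,  -- not enforced; loads must avoid duplicates
order_ts timestamp             -- timestamp, not char
);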
42. Workload Manager
• Allows you to manage and adjust query concurrency
• WLM allows you to
– Increase query concurrency up to 50
– Define user groups and query groups (see the sketch below)
– Segregate short- and long-running queries
– Help improve performance of individual queries
• Be aware that query workload is distributed to every compute node
– Increasing concurrency may not always help, due to resource contention (CPU, memory, and I/O)
– Total throughput may increase by letting one query complete first while other queries wait
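A minimal sketch of routing a session's queries through a query group; the group name is illustrative and must match a queue configured in the cluster's WLM configuration:

set query_group to 'short_queries';
-- queries issued here run in the matching WLM queue
reset query_group;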
43. Workload Best Practices
• Organizing and keeping your load files in S3 allows for re-runs or scenario testing as you evolve your workflow on the platform
– Keep them in S3 or Glacier for fiscal/legal reasons
• For data that is updated in the short term, consider a short-term version of the table for staging and a long-term version once the data stabilizes
• Round-robin distribution (no distribution key)
– Use when you don't have a good distribution key
– Check Part 1 for a query to check for distribution skew (a minimal version is sketched below)
– Trade-off against co-located joins
• Loading the target (final) table
– Use a chronological date/timestamp column as the first sort key; VACUUM is needed less often and runs faster
– When the first sort column has low cardinality/resolution (e.g., date instead of timestamp), subsequent columns should match common filters and/or grouping columns
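A hedged sketch of a skew check (the fuller version is in Part 1): count blocks per slice for one table; a heavily uneven distribution signals skew ('sales' is an illustrative table name):

select slice, count(*) as blocks
from stv_blocklist
where tbl = (select distinct id from stv_tbl_perm where name = 'sales')
group by slice
order by slice;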
44. Workload Best Practices cont.
• Use the UNLOAD command to archive data that is not needed for business reasons (see the sketch below)
– Data that needs to exist only for fiscal/legal reasons can be re-loaded as needed
• Consider applying retention policies less often than the regular workflow
– Run a weekly/monthly process during a less busy time
– Provision space for data growth
– Make sure all queries have date/timestamp range filters (> and <)
– Keep a sliding window of data to minimize block re-writes during VACUUM
• Take manual snapshots to save state at specific mileposts (e.g., year-end)
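A minimal UNLOAD sketch for the archive point above (bucket, table, and date filter are illustrative); note the doubled single quotes inside the query string:

unload ('select * from sales where sale_date < ''2013-01-01''')
to 's3://mybucket/archive/sales_2012_'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|' allowoverwrite;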
45. Workload Best Practices cont.
• Ratio between load and query performance needs
– Low ratio: consider Load -> Snapshot -> spin up "query" clusters -> tear down
– High ratio: prioritize performance over space needs when choosing the number of nodes
• Normalization rule of thumb
– De-normalize only to avoid large non-co-located joins
– Slowly Changing Dimensions (Type II): keep normalized; match the distkey with the fact table
46. Space Management
• Redshift has a single pool of space used for tables and temporary segments
– Loads need 2.5 times the space of the data being loaded if the table has a sort key
– VACUUM may need 2.5 times the size of the table
• Monitor free space via
– the Performance tab in the console
– CloudWatch alarms
– SQL (see the queries on the next slide)
47. Space Management cont.
• Table sizes:
select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from (select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
• Free space:
select sum(capacity)/1024 as capacity_gbytes,
sum(used)/1024 as used_gbytes,
(sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin = 0;
• Redshift allows you to resize your cluster up and down and across node types while online; during the resize, the source cluster remains available for read-only access.
48. For more best practices, search YouTube for "Amazon Redshift Best Practices"
Thank You!