This document provides an overview and use cases for Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service from Amazon Web Services. It summarizes Redshift's features including columnar storage, data compression, and massively parallel query processing. It also provides examples of how Redshift is used by companies to reduce costs, improve query performance, and scale their data warehousing needs. Specific use cases and customers of Redshift are highlighted.
5. Common Customer Use Cases
Traditional Enterprise DW:
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to the business
Companies with Big Data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
8. AWS Marketplace
• Find software to use with Amazon Redshift
• One-click deployments
• Flexible pricing options
http://aws.amazon.com/marketplace/redshift
9. Data Loading Options
• Parallel upload to Amazon S3 (see the COPY sketch below)
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
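A minimal sketch of the most common path, loading delimited, compressed files from Amazon S3 in parallel with a single COPY (table name, bucket path, and credentials are placeholders):

-- Load pipe-delimited, gzip-compressed files from an S3 prefix.
-- All files under the prefix are loaded in parallel across slices.
COPY events
FROM 's3://mybucket/events/2014/06/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|'
GZIP;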
10. Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, and restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Architecture diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]
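Because the leader node is the SQL endpoint and holds the catalog, table metadata can be inspected with an ordinary query; a small sketch (table name is a placeholder):

-- Columns, encodings, and key settings for one table,
-- answered from leader-node metadata.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'sales';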
11. Amazon Redshift Node Types
DW1 (HDD):
• Optimized for I/O intensive workloads
• High disk density
• On-demand at $0.85/hour; as low as $1,000/TB/year
• Scale from 2TB to 1.6PB
– DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
– DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
DW2 (SSD):
• High performance at smaller storage size
• High compute and memory density
• On-demand at $0.25/hour; as low as $5,500/TB/year
• Scale from 160GB to 256TB
– DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
– DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
12. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
13. Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• With column storage, you only read the data you need

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
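To make that concrete, suppose the table above were a Redshift table named sales; an aggregate over one column then scans only that column's blocks (a hypothetical sketch):

-- Reads only the Amount column, not ID/Age/State;
-- a row store would read every column of every row.
SELECT SUM(amount) FROM sales;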
18. Amazon Redshift parallelizes and distributes everything: Query, Load, Backup/Restore, Resize
• Load in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or any SSH connection
• Data automatically distributed and sorted according to DDL
• Scales linearly with the number of nodes
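For instance, a load straight from Amazon DynamoDB is also a single COPY; the names and credentials below are placeholders, and READRATIO caps the share of the table's provisioned read throughput the load may use:

COPY users
FROM 'dynamodb://users'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
READRATIO 50;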
19. Amazon Redshift parallelizes and distributes everything: Query, Load, Backup/Restore, Resize
• Backups to Amazon S3 are automatic, continuous, and incremental
• Configurable system snapshot retention period; take user snapshots on demand
• Cross-region backups for disaster recovery
• Streaming restores enable you to resume querying faster
20. Amazon Redshift parallelizes and distributes everything: Query, Load, Backup/Restore, Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for the source cluster
21. Amazon Redshift parallelizes and distributes everything: Query, Load, Backup/Restore, Resize
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API
22. Amazon Redshift is priced to let you analyze all your data
• Number of nodes x cost per hour
• No charge for leader node
• No upfront costs
• Pay as you go

DW1 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB
On-Demand            $0.850                                $3,723
1-Year Reservation   $0.500                                $2,190
3-Year Reservation   $0.228                                $999

DW2 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB
On-Demand            $0.250                                $13,688
1-Year Reservation   $0.161                                $8,794
3-Year Reservation   $0.100                                $5,498
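The effective annual price per TB follows directly from the hourly rate and per-node storage: DW1.XL on-demand is $0.850/hour x 8,760 hours/year = $7,446/year, and at 2 TB of compressed storage per node that is $7,446 / 2 = $3,723/TB/year, matching the table. The DW2.L figure works out the same way: $0.250 x 8,760 = $2,190/year over 0.16 TB ≈ $13,688/TB/year.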
23. Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built-in security
• Automatic backups
24. Amazon Redshift has security built in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM support so you control the keys
• Audit logging & AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI DSS Level 1, FedRAMP, and more
[Architecture diagram: JDBC/ODBC access from the customer VPC; compute nodes on 10 GigE (HPC) in an internal VPC; ingestion, backup, and restore paths]
25. Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery
26. 60+ new features since launch
• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
• Certifications – SOC 1/2/3, PCI DSS Level 1, FedRAMP, others
• Security – Load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy
• Manageability – Snapshot sharing, backup/restore/resize progress indicators, cross-region backups
• Query – Regex, cursors, MD5, SHA1, time zone, workload queue timeout, HLL, concurrency to 50
• Ingestion – S3 manifest, LZOP/LZO, JSON built-ins, 4-byte UTF-8, invalid character substitution, CSV, auto datetime format detection, epoch, ingest from SSH, JSON, EMR (see the COPY sketch below)
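As one sketch of the ingestion additions, the JSON built-ins let COPY map JSON attributes to columns automatically (bucket, table, and credentials are placeholders):

COPY events
FROM 's3://mybucket/events/2014/06/01.json'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
JSON 'auto';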
27. Amazon Redshift Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC 1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4-byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource-Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross-Region Backup (11/13)
Distributed Tables, Single-Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
28. New Features
• UNLOAD to single file
• COPY from multiple regions
• Percentile_cont & percentile_disc window
functions
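Minimal sketches of the three features (table, bucket, and column names are placeholders):

-- UNLOAD to a single file instead of one file per slice:
UNLOAD ('SELECT col1, col2 FROM events')
TO 's3://mybucket/unload/events_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
PARALLEL OFF;

-- COPY from a bucket in a different region:
COPY events
FROM 's3://mybucket-eu/events/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
REGION 'eu-west-1';

-- Median amount per state with the new window function:
SELECT DISTINCT state,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)
    OVER (PARTITION BY state) AS median_amount
FROM sales;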
29. Try Amazon Redshift with BI & ETL for Free!
• http://aws.amazon.com/redshift/free-trial
• 2 months, 750 hours/month of free DW2.Large usage
• Also try BI & ETL for free from nine partners
31. What’s Upworthy
• We’ve been called:
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me, optimistically
32. What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
34. When We Started
• Building a data warehouse from scratch
• One engineer on the project
• Object data in MongoDB
• Had discovered MoSQL
• Knew which two of the three we’d choose:
– Comprehensive
– Ad Hoc
– Real-Time
36. Building it out initially
• ~50 events/second
• Snowflake or denormalized? Both.
[Pipeline diagram: Browser → Web Service → S3 Drain → raw events in Amazon S3 → EMR → processed objects → Redshift; MongoDB → MoSQL → PostgreSQL objects master → Redshift]
37. Our system now
• Stats:
– ~5 TB of compressed data.
– Two main tables = 13 billion rows
– Average: ~1085 events/second
– Peak: ~2500 events/second
• 5-10 minute ETL cycle (Kinesis session later)
• Lots of rollup tables
• COPY happens quickly (5-10s) every 1-2 minutes
39. Initial Lessons
• Columnar can be disorienting:
– What’s in the SELECT matters. A lot.
– SELECT COUNT(*) is really, really fast.
– SELECT * is really, really slow.
– Wide, tall tables with lots of NULLs = Fine.
• Sortkeys are hugely powerful (see the DDL sketch after this list).
• Bulk Operations (COPY, not INSERT) FTW
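A minimal DDL sketch of that lesson (column choices are illustrative): sorting on the column most queries filter by lets zone maps skip blocks, and a distribution key collocates rows that join together.

CREATE TABLE events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type VARCHAR(64),
  created_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (created_at);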
40. Staging to Master Pattern
-- Append only the staging rows whose col1 is not already in master.
INSERT INTO master
WITH to_insert AS (
  SELECT col1, col2, col3
  FROM staging
)
SELECT s.*
FROM to_insert s
LEFT JOIN master m ON m.col1 = s.col1
WHERE m.col1 IS NULL;  -- anti-join: keep rows with no match in master
41. Hash multiple join keys into one
-- Instead of a multi-column join predicate...
...
FROM table1 t1
LEFT JOIN table2 t2
  ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3

-- ...precompute one hashed key in each table and join on that:
md5(col1 || col2 || col3) AS hashed_join_key
...
FROM table1 t1
LEFT JOIN table2 t2 ON t1.hashed_join_key = t2.hashed_join_key
42. Pain Points
• Data loading can be a bit tricky at first
– Actually read the documentation. It’ll help.
• COUNT(DISTINCT col1) can be painful
– Use APPROXIMATE COUNT(DISTINCT col1) instead
• No null-safe operator. NULL = NULL returns NULL
– NVL(t1.col1, 0) = NVL(t2.col1, 0) is null-safe
• Error messages can be less than informative
– The AWS Redshift forum is fantastic. Someone else has probably seen that error before you.
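Two of the workarounds above as tiny sketches (table and column names are placeholders):

-- HyperLogLog approximation instead of an exact distinct count:
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

-- Null-safe join: NULL = NULL evaluates to NULL and drops the row,
-- but NVL maps NULLs to a sentinel so they compare equal.
SELECT t1.id
FROM table1 t1
LEFT JOIN table2 t2 ON NVL(t1.col1, 0) = NVL(t2.col1, 0);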
44. In The Next Few Months
• More structure to our ETL (implementing luigi)
• Need to avoid Serializable Isolation Violations
• Dozens of people querying a cluster of 4 dw1.xlarge nodes can get ugly
• A “write” cluster of dw1 nodes feeding a “read” cluster of dw2 “dense compute” nodes