This document provides an overview of Amazon Redshift presented by Pavan Pothukuchi and Chris Liu. The agenda includes an introduction to Redshift, its benefits, use cases, and Coursera's experience using Redshift. Some key benefits highlighted are that Redshift is fast, inexpensive, fully managed, secure, and innovates quickly. Example use cases from NTT Docomo and Nasdaq are discussed. Chris Liu then discusses Coursera's experience moving from no data warehouse to using Redshift over three years, including their current ecosystem involving Redshift, other AWS services, and business intelligence applications. Lessons learned around thinking in Redshift, communicating with users, surprises, and reflections are also shared.
1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pavan Pothukuchi, Principal Product Manager, AWS
Chris Liu, Data Infrastructure Engineer, Coursera
July 13, 2016
Getting Started with
Amazon Redshift
3. AWS big data portfolio
Collect: Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Import/Export Snowball, AWS Direct Connect
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Aurora
Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon CloudSearch, Amazon QuickSight
Orchestration and migration: AWS Data Pipeline, AWS Database Migration Service
4. Amazon Redshift: relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper.
5. The Amazon Redshift view of data warehousing (enterprise, big data, SaaS)
• 10x faster; 10x cheaper
• Easy to provision; higher DBA productivity
• No programming
• Easily leverage BI tools, Hadoop, machine learning, streaming
• Analysis inline with process flows
• Pay as you go, grow as you need
• Managed availability and disaster recovery
6. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ enterprise data warehouse, Q4 '15
10. Benefit #1: Amazon Redshift is fast
Parallel and distributed: query, load, export, backup, restore, resize
11. Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of autopatched improvements
12. Benefit #1: Amazon Redshift is fast
New dense storage (HDD) instance type (Jun 2015)
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan 2016)
Performance improvement: 35%
Memory allocation improvements (May 2016)
Performance improvement: 60%
13. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), price per hour for a DW1.XL single node / effective annual price per TB compressed:
• On demand: $0.850 / $3,725
• 1-year reservation: $0.500 / $2,190
• 3-year reservation: $0.228 / $999
DC1 (SSD), price per hour for a DW2.L single node / effective annual price per TB compressed:
• On demand: $0.250 / $13,690
• 1-year reservation: $0.161 / $8,795
• 3-year reservation: $0.100 / $5,500
Pricing is simple:
• Number of nodes x price/hour
• No charge for the leader node
• No upfront costs
• Pay as you go
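The pricing table above can be sanity-checked with a short back-of-the-envelope calculation. This is a sketch, not official pricing math; it assumes a dense-storage XL node holds 2 TB of compressed data and a flat 8,760-hour year:

```python
# Back-of-the-envelope check of "number of nodes x price/hour" pricing.
# Assumptions (not from the slide): 2 TB compressed per dense-storage XL
# node, 24 x 365 = 8,760 hours per year.

HOURS_PER_YEAR = 24 * 365
TB_PER_NODE = 2  # assumed compressed capacity of one dense-storage XL node

def annual_price_per_tb(price_per_node_hour, tb_per_node=TB_PER_NODE):
    """Effective annual price per TB for a single-node cluster."""
    return price_per_node_hour * HOURS_PER_YEAR / tb_per_node

# 3-year reserved at $0.228/hour comes out to ~$999/TB/year,
# matching the table above.
print(round(annual_price_per_tb(0.228)))  # 999
```

The on-demand figure lands within a dollar or two of the table's $3,725, so the small gap there is presumably just rounding in the published numbers.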
14. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups:
• Multiple copies within the cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions
• Streaming restore
[Diagram: backups replicated to Amazon S3 in Region 1 and Region 2.]
15. Benefit #3: Amazon Redshift is fully managed
Fault tolerance:
• Disk failures
• Node failures
• Network failures
• Availability Zone/region level disasters
[Diagram: cross-region replication through Amazon S3 between Region 1 and Region 2.]
16. Benefit #4: Security is built in
• Load encrypted from Amazon S3
• SSL to secure data in transit
• ECDHE for perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
  • All blocks on disks and in Amazon S3 encrypted
  • Block key, cluster key, master key (AES-256)
  • On-premises HSM and AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
[Diagram: JDBC/ODBC traffic enters the customer VPC; ingestion, backup, and restore run over 10 GigE (HPC) networking in an internal VPC.]
17. Benefit #5: We innovate quickly
Well over 125 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress
(8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy,
Distributed Tables, Audit
Logging/CloudTrail, Concurrency, Resize
Perf., Approximate Count Distinct, SNS
Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor
Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and
diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch
size support for single node clusters, new
system tables with commit stats,
row_number(), strtol(), and query termination (2/13)
Resize progress indicator & Cluster
Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE
ciphers (4/22)
3 new regex features, Unload to single
file, FedRAMP(5/6)
Rename Cluster (6/2)
Copy from multiple regions,
percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
18. Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User-defined functions
• Machine learning
• Data science
19. Benefit #7: Amazon Redshift has a large ecosystem
Data integration, business intelligence, systems integrators
21. Recent launches
Areas: performance, ease of use, security, analytics and functionality, SOA
• Dynamic WLM parameters
• Queue hopping for timed-out queries
• Merge rows from staging to prod. table
• 2x improvement in query throughput
• 10x latency improvement for UNION ALL queries
• Bzip2 format for ingestion
• Table level restore
• 10x improvement in vacuum perf.
• Default access privileges
• Tag-based AWS IAM access
• IAM roles for COPY/UNLOAD
• SAS connector enhancements, implicit conversion of SAS queries to Amazon Redshift
• DMS support from OLTP sources
• Enhanced data ingestion from Kinesis Firehose
• Improved data schema conversion to Amazon ML
23. NTT Docomo: Japan's largest mobile service provider
• 68 million customers
• Tens of TBs per day of data across a mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, and so on
Previously Greenplum on premises:
• Scaling challenges
• Performance issues
• Need for the same level of security
• Need for a hybrid environment
24. NTT Docomo: Japan's largest mobile service provider
• 125 node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 2 PB compressed
• 10x faster analytic queries
• 50% reduction in time for new BI application deployment
• Significantly less operations overhead
[Architecture diagram: data sources flow through ETL over AWS Direct Connect into Amazon S3 and Amazon Redshift, with client, forwarder, loader, state management, and sandbox components.]
25. Nasdaq: powering 100 marketplaces in 50 countries
• Orders, quotes, trade executions, market "tick" data from 7 exchanges
• 7 billion rows/day
• Analyze market share, client activity, surveillance, billing, and so on
Previously Microsoft SQL Server on premises:
• Expensive legacy DW ($1.16 M/yr.)
• Limited capacity (1 yr. of data online)
• Needed lower TCO
• Must satisfy multiple security and regulatory requirements
• Similar performance
26. Nasdaq: powering 100 marketplaces in 50 countries
• 23 node DS2.8XL cluster
• 828 vCPUs, 5 TB RAM
• 368 TB compressed
• 2.7 T rows, 900 B derived
• 8 tables with 100 B rows
• 7 man-month migration
• ¼ the cost, 2x storage, room to grow
• Faster performance, very secure
33. Outline
• Moving from no data warehouse to the Amazon Redshift ecosystem
  • No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS
  • First Amazon Redshift cluster: 1 ds1.xl node – 2 CPUs, 16 GB RAM
• The Amazon Redshift ecosystem at Coursera
  • Current day: 9 dc1.8xl nodes – 288 CPUs, 2.4 TB RAM
• Learnings from 3 years on Amazon Redshift
  • Lessons in communication, surprises, reflections
34. Starting point
• Querying production read replica
• Makeshift libraries providing a thin abstraction layer
• 45 minutes to provide aggregate metrics over all classes running on Coursera =(
38. Move in progress
• Risk-free deployment
  • "Let's try it out"
  • A few clicks to deploy a cluster, connect to it, and resize
• AWS ecosystem integration
  • COPY from S3/EMR/SSH
  • UNLOAD to S3
  • UNLOAD(COPY(data)) == COPY(UNLOAD(data)) == data
• Minimal administration
  • In aggregate, less than 1 full-time employee for administration
  • Automation and tooling for monitoring usage and performance
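The round-trip identity above can be illustrated with a toy model. This is only a sketch of the lossless-round-trip property, using CSV text as a stand-in interchange format; real COPY and UNLOAD move data between Amazon S3 and the cluster:

```python
import csv
import io

# Toy model of the COPY/UNLOAD round trip: "copy" parses CSV text into
# rows (loading into the warehouse), "unload" serializes rows back to
# CSV (exporting to S3). Actual Redshift semantics are richer; this only
# demonstrates that a faithful load/export pair composes losslessly.

def copy(csv_text):
    return [tuple(row) for row in csv.reader(io.StringIO(csv_text))]

def unload(rows):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

data = "1,intro-to-ml\n2,calculus\n"
assert copy(unload(copy(data))) == copy(data)  # round trip preserves rows
assert unload(copy(data)) == data              # and the serialized form
```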
40. Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Deriving value from data
• Democratizing data access
43. Amazon Redshift ecosystem at Coursera
• Data flow in and out of Amazon Redshift
• Business insights and reporting
• Data products
• Democratizing data access
44. Business insights and reporting
• Provide directional insights and aggregate metrics
  • Aggregate metrics over all classes on Coursera: < 5 seconds
• Goal: insight at the speed of thought
  • Results¹: 0.8s median, 28s p95, 120s p99
• Companywide goal tracking
• Scheduled reports to internal and external stakeholders
• Crucial part of a data-informed culture
¹ Results for ad hoc queries run in the last 90 days
46. Data products
• A/B experimentation
  • 300 M impression table joined with 1.8 B events table in 12 minutes
• Recommendations model
  • Amazon Redshift for relational transformation
  • Unload to S3 for model training
• Providing university partners with analytical dashboards and research exports
50. Learnings from 3 years on Amazon Redshift
• Thinking in Amazon Redshift
• Communicating with users
• Surprises
• Reflections
53. Thinking in Amazon Redshift
• Columnar
  • SELECT * considered harmful in most cases
• Nodes, slices, blocks
  • 1 MB blocks per slice, n slices per node depending on node type
• Sorting and distribution
  • Shared-nothing massively parallel processing => data is sorted per slice
  • Up to 2 orders of magnitude speedup in JOIN/GROUP BY when a merge join replaces a hash join
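The merge-join vs. hash-join point above can be sketched in miniature. This is an illustrative model, not Redshift internals: a merge join streams two pre-sorted inputs in a single pass, while a hash join must first build an in-memory table, which is why co-sorted, co-located data pays off:

```python
# Minimal sketch of the two join strategies over (key, value) pairs.
# Assumes unique keys on each side, purely for brevity.

def merge_join(left, right):
    """Join two lists pre-sorted by key with one synchronized pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out

def hash_join(left, right):
    """Join without relying on sort order: build a hash table, then probe."""
    built = {k: v for k, v in right}
    return [(k, v, built[k]) for k, v in left if k in built]

users = [(1, "ada"), (2, "bob"), (4, "cy")]
events = [(1, "login"), (3, "view"), (4, "quiz")]
assert merge_join(users, events) == hash_join(users, events)
```

Both produce the same rows; the difference is that `merge_join` never materializes a build table, which is the advantage Redshift gets when sort keys and distribution keys line up with the join keys.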
56. Communicating to users
• Prefer the scientific method over gut feel
  • Investigate how many rows were materialized with svl_query_report
  • Understand the EXPLAIN plan for data distribution, join strategy, predicate order
• SQL style guide for readability
  • Leading commas, capitalized SQL keywords, conventions for handling dates/timestamps, conventions for table names, mapping tables
• Use the right tool for the task
  • Amazon Redshift is not for online traffic serving
  • Amazon Redshift is not for stream processing
59. Surprises
• "Fundamental theorem of Redshift at Coursera"
  • Most queries involve full table scans
  • 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column
  • Store ~75 M integer values and still maintain 1 block per slice
• Features may behave in unexpected ways
  • Sort key compression
  • Primary and foreign keys
• Features may be unexpectedly expensive
  • COMMIT – batch work, monitor with stl_commit_stats
  • VACUUM – prefer TRUNCATE where possible, monitor with stl_vacuum and stl_query
  • Your mileage may vary
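The block-allocation arithmetic behind the "fundamental theorem" above checks out; as a quick sketch (the 4-byte integer width is an assumption for a plain Redshift INT column):

```python
# Verifying the slide's arithmetic: with 1 MB blocks and at least one
# block per slice, even a tiny column has a fixed minimum footprint.

NODES = 9
SLICES_PER_NODE = 32        # per the slide, for this cluster's node type
BLOCK_BYTES = 1024 * 1024   # Redshift stores column data in 1 MB blocks

min_bytes_per_column = NODES * SLICES_PER_NODE * BLOCK_BYTES
print(min_bytes_per_column // (1024 * 1024))  # 288 (MB per column, minimum)

# A 4-byte integer column can hold this many values before any slice
# needs a second block:
values_at_one_block_per_slice = NODES * SLICES_PER_NODE * (BLOCK_BYTES // 4)
print(values_at_one_block_per_slice)  # 75497472, i.e. ~75 M
```

In other words, a table with a handful of rows and a table with 75 million integer rows can occupy the same 288 MB per column on this cluster, which is why full table scans dominate and why many small tables are surprisingly costly.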
63. Reflections
• Simplicity – relational model, Postgres 8.0-compliant SQL, things just work. Minimal administration. Minimal tuning.
• Scalability – scaled the cluster up 5 times in the last 3 years as data volume and usage increased.
• Flexibility – no strict requirement on data modeling; the tuning knobs stay dusty in the majority of cases. Handles both a heavily normalized data model and denormalized clickstream data.
• Extensibility – standard APIs (JDBC/ODBC/libpq) and integration points.
64. Resources
Pavan Pothukuchi | pavanpo@amazon.com
Chris Liu | cliu@coursera.org
Detail pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Related breakout sessions
• Deep Dive on Amazon QuickSight (2:15–3:15 pm)
• Getting Started with Amazon QuickSight (2:15–3:15 pm)
• Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with Minimal Downtime (4:45–5:45 pm)