2. What is Redshift?
“Redshift is a fast, fully managed, petabyte-scale
data warehouse service”
-Amazon
With Redshift, Monetate is able to generate all of our
analytics data for a day in ~2 hours,
a process that consumes billions of rows and yields millions
3. What isn’t Redshift?
warehouse=# insert into fact_page_view values
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4);
INSERT 0 1
Time: 4600.094 ms
warehouse=# select fact_time from fact_page_view
warehouse-# where fact_date = '2014-10-02';
fact_time
---------------------
2014-10-02 18:30:00
(1 row)
Time: 618.303 ms
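The timings above show why Redshift is not an OLTP database: a single-row INSERT takes seconds. When rows must be inserted with SQL rather than loaded via COPY, batching many rows into one statement amortizes that cost. A hedged sketch reusing the fact_page_view table from the session above (the column meanings are inferred from that example):

```sql
-- One round trip for many rows instead of one INSERT per row.
-- (COPY from S3 remains the preferred bulk-load path.)
INSERT INTO fact_page_view VALUES
    ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4),
    ('2014-10-02', 1, '2014-10-02 18:31', 5, 6, 7),
    ('2014-10-02', 2, '2014-10-02 18:32', 8, 9, 10);
```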
4. Who am I?
Jeff Patti
jeffpatti@gmail.com
Backend Engineer at Monetate
Monetate was in Redshift's beta in late 2012
and has been actively developing on it since.
We’re hiring - monetate.com/jobs/
5. Leaving Hive For Redshift
Why we left Hive:
● Unusual failure modes
● Slower and pricier than Redshift, at least in our configuration
● Custom query language
○ Didn't play nicely with our SQL libraries
Why we moved to Redshift:
● Fully managed
● Performant & scalable
● Excellent integration with other AWS offerings
● PostgreSQL interface
○ Command line interface
○ Libraries for PostgreSQL work against Redshift
6. Fully Managed
● Easy to deploy
● Easy to scale out
● Software updates - handled
● Hardware failures - taken care of
● Automatic backups - baked in
12. Automatic Backups
● Periodically taken as delta from prior backup
● Easy to create new cluster from backup, or
overwrite existing cluster
● Queryable during recovery, after short delay
○ Preferentially recovers needed blocks to perform
commands
● This is how Monetate keeps our
development cluster in sync with production
14. Maintenance Window
● Required half-hour window once a week for
routine maintenance, such as software
updates
● During this time the cluster is unresponsive
● You pick when it happens
15. Scaling Out
You: Change cluster size through AWS console
AWS:
1. Existing cluster put into read only state
2. New cluster caught up with existing cluster
3. Swapped during maintenance window,
unless specified as immediate
a. Immediate swap causes temporary unavailability
during the canonical name (CNAME) record swap (a few minutes)
16. Monetate
● Core products are merchandising, web &
email personalization, testing
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of US ecommerce spend
each holiday season for the past 3 years
running
17. Monetate Data Collection
To compute analytics and reports on our clients'
experiments, we collect a lot of data:
● Billions of page views a week
● Billions of experiment views a week
● Millions of purchases a week
● etc.
This is where Redshift comes in handy
18. Redshift In Monetate
[Architecture diagram: many App servers (Monetate is multi-region & multi-AZ in AWS) → Amazon S3 → Amazon Redshift → our clients; labels: Data Warehousing, Analytics & Reporting]
19. Under The Covers
● Fork of PostgreSQL 8.0.2, so we get nice things like
○ Common Table Expressions
○ Window Functions
● Column oriented database
● Clusters can have many machines
○ Each machine has many slices
○ Queries run in parallel on all slices
● Concurrent query support & memory limiting
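The PostgreSQL lineage means standard analytic SQL works. A hedged sketch combining a common table expression with a window function (the fact_page_view table is reused from the earlier example; the metric is illustrative):

```sql
-- Daily page views per account, plus a running total per account.
WITH daily AS (
    SELECT account_id, fact_date, COUNT(*) AS views
    FROM fact_page_view
    GROUP BY account_id, fact_date
)
SELECT account_id,
       fact_date,
       views,
       SUM(views) OVER (PARTITION BY account_id
                        ORDER BY fact_date
                        ROWS UNBOUNDED PRECEDING) AS running_views
FROM daily;
```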
22. Example Redshift Table
CREATE TABLE fact_url (
fact_date DATE NOT NULL ENCODE lzo,
account_id INT NOT NULL ENCODE lzo,
fact_time TIMESTAMP NOT NULL ENCODE lzo,
mid BIGINT NOT NULL ENCODE lzo,
uri VARCHAR(2048) ENCODE lzo,
referer_uri VARCHAR(2048) ENCODE lzo,
PRIMARY KEY (account_id, fact_time, mid)
)
DISTKEY (mid)
SORTKEY (fact_date, account_id, fact_time, mid);
23. Per Column Compression
● Used to fit more rows in each 1MB block
● Trade off between CPU and IO
● Allows Redshift to read rows from disk faster
● Has to use more CPU to decompress data
● Our Redshift queries are IO bound
○ We use compression extensively
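Rather than guessing encodings, Redshift can recommend them from a sample of existing data. A sketch against the fact_url table defined earlier:

```sql
-- Sample rows and report the encoding Redshift would pick for each
-- column, with the estimated reduction in disk usage.
ANALYZE COMPRESSION fact_url;
```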
24. Constraints
“Uniqueness, primary key, and foreign key
constraints are informational only; they are not
enforced by Amazon Redshift.”
However, “If your application allows invalid
foreign keys or primary keys, some queries
could return incorrect results.” [emphasis added]
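Since the primary key is not enforced, it is worth checking for violations yourself after loads. A hedged sketch using the primary key columns of the fact_url table from earlier:

```sql
-- Find primary-key values that appear more than once.
SELECT account_id, fact_time, mid, COUNT(*) AS copies
FROM fact_url
GROUP BY account_id, fact_time, mid
HAVING COUNT(*) > 1;
```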
25. Distribution Style
Controls how Redshift distributes rows
● Styles
○ Even - round robin rows (default)
○ Key - data with the same key goes to same slice
■ Based on a single column from the table
○ All - data is copied to all slices
■ Good for small tables
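The style is declared when the table is created. A hedged sketch (fact_example and dim_account are illustrative names, not tables from this deck):

```sql
-- Big fact table: co-locate rows that join on mid.
CREATE TABLE fact_example (
    mid BIGINT NOT NULL,
    account_id INT NOT NULL
)
DISTKEY (mid);

-- Small dimension table: copied to every node, so joins against it
-- never need to redistribute rows.
CREATE TABLE dim_account (
    account_id INT NOT NULL,
    name VARCHAR(256)
)
DISTSTYLE ALL;
```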
26. DISTKEY impacts Joins
DS_DIST_NONE
No redistribution is required, because
corresponding slices are collocated on the
compute nodes. You will typically have only one
DS_DIST_NONE step, the join between the fact
table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner
join table used DISTSTYLE ALL. The entire
table is located on every node.
These two are very performant
DS_DIST_INNER
The inner table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all
the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single
slice because the outer table uses DISTSTYLE
ALL.
DS_DIST_BOTH
Both tables are redistributed.
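These DS_* labels appear in the join steps of the query plan, so EXPLAIN shows which redistribution a join will trigger. A hedged sketch (dim_account is an illustrative dimension table, not from this deck):

```sql
-- Inspect the plan to see which DS_* step the join uses.
EXPLAIN
SELECT f.account_id, COUNT(*)
FROM fact_url f
JOIN dim_account d ON d.account_id = f.account_id
GROUP BY f.account_id;
```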
28. Sort Key
● Data is stored on disk in sorted order
○ After being inserted into an empty table, or vacuumed
● Sort Key impacts vacuum performance
● Columnar data stored in 1MB blocks
○ min/max data stored as metadata
● Metadata used to improve query performance
○ Allows Redshift to skip unnecessary blocks
29. Sort Key Take 1
SORTKEY (account_id, fact_time, mid)
[Block layout: account 1 (time ordered) | account 2 (time ordered) | ... | account n (time ordered), with new facts for all accounts appended at the end]
● As we added new facts, bad things started happening
● Resorting rows for vacuuming had to reorder almost all the rows :(
● This made vacuuming unreasonably slow, affecting how often we could vacuum and therefore query performance
30. Sort Key Take 2
SORTKEY (fact_time, account_id, mid)
[Block layout: 00:00 (account ordered) | 00:01 (account ordered) | ... | now (account ordered)]
● Now our table is like an append-only log, but had poor query performance
● For many of our queries, we only look at one account at a time
● Redshift blocks are 1MB each; each spanned many accounts
● When querying a single account, we had to read from disk and ignore many rows from other accounts
31. Sort Key Take 3
SORTKEY (fact_date, account_id, fact_time, mid)
[Block layout: Jan 1st (account ordered) | Jan 2nd (account ordered) | ... | today (account ordered)]
● Append-only log ✓
○ Cheap vacuuming ✓
● Single or few accounts per block ✓
○ Significantly improved query performance ✓
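With this final sort key, our typical per-account query restricts on the leading sort key columns, so the per-block min/max metadata lets Redshift skip most blocks. A sketch against the fact_url table from earlier:

```sql
-- fact_date and account_id lead the sort key, so Redshift can skip
-- every 1MB block whose min/max metadata excludes these predicates.
SELECT uri, COUNT(*)
FROM fact_url
WHERE fact_date = '2014-10-02'
  AND account_id = 1
GROUP BY uri;
```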
32. Redshift ⇔ S3
Redshift & S3 have excellent integration
● Unload from Redshift to S3 via UNLOAD
○ Each slice unloads separately to S3
○ We unload into a CSV format
● Load into Redshift from S3 via COPY
○ Applies all as inserts
○ Primary keys aren’t enforced by Redshift
■ Use staging table to detect duplicate keys
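The staging-table pattern above can be sketched as follows, using the fact_url table and its primary key columns from earlier (the bucket path and credentials are placeholders, matching the UNLOAD example's style):

```sql
-- 1. Load the new batch into an empty staging table.
CREATE TEMP TABLE stage (LIKE fact_url);
COPY stage FROM 's3://mybucket/batch/'
    CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    DELIMITER ',';

-- 2. Drop staged rows whose primary key already exists in the fact table.
DELETE FROM stage
USING fact_url
WHERE stage.account_id = fact_url.account_id
  AND stage.fact_time = fact_url.fact_time
  AND stage.mid = fact_url.mid;

-- 3. Append only the genuinely new rows.
INSERT INTO fact_url SELECT * FROM stage;
```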
33. Redshift UNLOAD
unload ('select * from venue order by venueid')
to 's3://mybucket/tickit/venue/reload_'
credentials 'aws_access_key_id=<access-key-id>;
aws_secret_access_key=<secret-access-key>'
manifest
delimiter ',';
34. Redshift UNLOAD Tip
unload ('select * from venue order by venueid')
● The query passed to UNLOAD is quoted, which wreaks havoc with
quotes inside it, e.g. fact_time <= '2014-10-02'
● Instead of escaping the quotes around the datetimes, use dollar quoting:
○ unload ($$ select * from venue order by venueid $$)
36. Try it Yourself! For Free!!!
The Amazon Redshift documentation is well written
and contains great tutorials with pricing estimates.
Amazon offers a 750-hour free trial of Redshift on
DW2.Large nodes.