Amazon Redshift 
Jeff Patti
What is Redshift? 
“Redshift is a fast, fully managed, petabyte-scale 
data warehouse service” 
-Amazon 
With Redshift, Monetate is able to generate all of our 
analytics data for a day in about 2 hours, 
a process that consumes billions of rows and yields millions.
What isn’t Redshift? 
warehouse=# insert into fact_page_view values 
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4); 
INSERT 0 1 
Time: 4600.094 ms 
warehouse=# select fact_time from fact_page_view 
warehouse-# where fact_date = '2014-10-02'; 
fact_time 
--------------------- 
2014-10-02 18:30:00 
(1 row) 
Time: 618.303 ms
Who am I? 
Jeff Patti 
jeffpatti@gmail.com 
Backend Engineer at Monetate 
Monetate was in Redshift's beta in late 2012 
and has been actively developing on it since. 
We’re hiring - monetate.com/jobs/
Leaving Hive For Redshift 
Hive 
● Unusual failure modes 
● Slower and pricier than Redshift, at least in our configuration 
● Custom query language 
○ Didn't play nicely with our SQL libraries 
Redshift 
● Fully managed 
● Performant & scalable 
● Excellent integration with other AWS offerings 
● PostgreSQL interface 
○ Command line interface 
○ Libraries for PostgreSQL work against Redshift
Fully Managed 
● Easy to deploy 
● Easy to scale out 
● Software updates - handled 
● Hardware failures - taken care of 
● Automatic backups - baked in
Automatic Backups 
● Periodically taken as delta from prior backup 
● Easy to create new cluster from backup, or 
overwrite existing cluster 
● Queryable during recovery, after short delay 
○ Preferentially recovers needed blocks to perform 
commands 
● This is how Monetate keeps our 
development cluster in sync with production
Maintenance Window 
● Required half hour window once a week for 
routine maintenance, such as software 
updates 
● During this time the cluster is unresponsive 
● You pick when it happens
Scaling Out 
You: Change cluster size through AWS console 
AWS: 
1. Existing cluster put into read only state 
2. New cluster caught up with existing cluster 
3. Swapped during maintenance window, 
unless specified as immediate 
a. Immediate swap causes temporary unavailability 
during the canonical name (CNAME) record swap (a few minutes)
Monetate 
● Core products are merchandising, web & 
email personalization, testing 
● A/B & Multivariate testing to determine 
impact of experiments 
● Involved with >20% of US ecommerce spend 
each holiday season for the past 3 years 
running
Monetate Data Collection 
To compute analytics and reports on our clients' 
experiments, we collect a lot of data 
● Billions of page views a week 
● Billions of experiment views a week 
● Millions of purchases a week 
● etc. 
This is where Redshift comes in handy
Redshift In Monetate 
[Architecture diagram: App servers (Monetate is multi-region and multi-AZ in AWS), Amazon S3, Amazon Redshift (data warehousing), our clients (analytics & reporting)]
Under The Covers 
● Fork of PostgreSQL 8.0.2, get nice things like 
○ Common Table Expressions 
○ Window Functions 
● Column oriented database 
● Clusters can have many machines 
○ Each machine has many slices 
○ Queries run in parallel on all slices 
● Concurrent query support & memory limiting
Instance Types
Query Concurrency
Example Redshift Table 
CREATE TABLE fact_url ( 
    fact_date   DATE          NOT NULL ENCODE lzo, 
    account_id  INT           NOT NULL ENCODE lzo, 
    fact_time   TIMESTAMP     NOT NULL ENCODE lzo, 
    mid         BIGINT        NOT NULL ENCODE lzo, 
    uri         VARCHAR(2048) ENCODE lzo, 
    referer_uri VARCHAR(2048) ENCODE lzo, 
    PRIMARY KEY (account_id, fact_time, mid) 
) 
DISTKEY (mid) 
SORTKEY (fact_date, account_id, fact_time, mid);
Per Column Compression 
● Used to fit more rows in each 1MB block 
● Trade off between CPU and IO 
● Allows Redshift to read rows from disk faster 
● Has to use more CPU to decompress data 
● Our Redshift queries are IO bound 
○ We use compression extensively
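
The CPU-for-IO trade can be illustrated with a rough sketch. It assumes zlib as a stand-in codec (Python's standard library has no LZO) and a made-up, sorted account_id column; the helper name values_per_block is hypothetical and the numbers are illustrative, not Redshift measurements.

```python
import struct
import zlib

BLOCK_BYTES = 1024 * 1024  # column data is stored in 1MB blocks


def values_per_block(values, compress):
    """Estimate how many 4-byte integers fit in one block."""
    raw = struct.pack(f"<{len(values)}i", *values)
    # zlib is only a stand-in for a real column encoding such as lzo.
    stored = zlib.compress(raw) if compress else raw
    bytes_per_value = len(stored) / len(values)
    return int(BLOCK_BYTES / bytes_per_value)


# Hypothetical account_id column: sorted, so it has long runs of equal values.
account_ids = [account for account in range(100) for _ in range(1000)]

plain = values_per_block(account_ids, compress=False)
packed = values_per_block(account_ids, compress=True)
print(f"uncompressed: {plain} values per block")
print(f"compressed:   {packed} values per block")
```

More values per block means fewer blocks to read from disk for the same rows, at the price of decompressing each block after it is read.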
Constraints 
“Uniqueness, primary key, and foreign key 
constraints are informational only; they are not 
enforced by Amazon Redshift.” 
However, “If your application allows invalid 
foreign keys or primary keys, some queries 
could return incorrect results.” [emphasis added]
Distribution Style 
Controls how Redshift distributes rows 
● Styles 
○ Even - round robin rows (default) 
○ Key - data with the same key goes to same slice 
■ Based on a single column from the table 
○ All - a full copy of the table is stored on every node 
■ Good for small tables
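
A toy model of the three styles, as a minimal sketch. Assumptions: a hypothetical four-slice cluster, crc32 standing in for Redshift's internal hash, and each slice treated as its own node so that "all" simply copies every row everywhere. The function name place and the sample rows are made up for illustration.

```python
import zlib

NUM_SLICES = 4  # hypothetical cluster with four slices


def place(rows, style, key=None):
    """Return {slice_number: [rows]} for one distribution style."""
    slices = {s: [] for s in range(NUM_SLICES)}
    for i, row in enumerate(rows):
        if style == "even":
            # Round robin: row i goes to slice i mod N.
            slices[i % NUM_SLICES].append(row)
        elif style == "key":
            # crc32 is a stand-in; the real hash function is internal to Redshift.
            target = zlib.crc32(str(row[key]).encode()) % NUM_SLICES
            slices[target].append(row)
        elif style == "all":
            # Every slice (treated here as its own node) gets a full copy.
            for members in slices.values():
                members.append(row)
        else:
            raise ValueError(f"unknown style: {style}")
    return slices


mids = [101, 102, 103, 101, 102, 101, 104, 101]
rows = [{"mid": mid, "uri": f"/page/{n}"} for n, mid in enumerate(mids)]

even = place(rows, "even")
by_key = place(rows, "key", key="mid")
everywhere = place(rows, "all")

for name, layout in [("even", even), ("key", by_key), ("all", everywhere)]:
    print(name, {s: len(members) for s, members in layout.items()})
```

With the key style, all rows sharing a mid land on one slice, which is what lets a join on that column run without moving data. It can also skew slice sizes when one key value dominates.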
DISTKEY impacts Joins 
DS_DIST_NONE 
No redistribution is required, because 
corresponding slices are collocated on the 
compute nodes. You will typically have only one 
DS_DIST_NONE step, the join between the fact 
table and one dimension table. 
DS_DIST_ALL_NONE 
No redistribution is required, because the inner 
join table used DISTSTYLE ALL. The entire 
table is located on every node. 
These two are very performant 
DS_DIST_INNER 
The inner table is redistributed. 
DS_BCAST_INNER 
A copy of the entire inner table is broadcast to all 
the compute nodes. 
DS_DIST_ALL_INNER 
The entire inner table is redistributed to a single 
slice because the outer table uses DISTSTYLE 
ALL. 
DS_DIST_BOTH 
Both tables are redistributed.
Query Plan From Explain 
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84) 
   Hash Cond: ("outer".venueid = "inner".venueid) 
   -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47) 
      Hash Cond: ("outer".eventid = "inner".eventid) 
      -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30) 
         Merge Cond: ("outer".listid = "inner".listid) 
         -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14) 
         -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
Sort Key 
● Data is stored on disk in sorted order 
○ After being inserted into an empty table, or vacuumed 
● Sort Key impacts vacuum performance 
● Columnar data stored in 1MB blocks 
○ min/max data stored as metadata 
● Metadata used to improve query performance 
○ Allows Redshift to skip unnecessary blocks
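
Block skipping can be sketched in a few lines. This is a simplified model, not Redshift's implementation: blocks hold 4 values instead of 1MB, and the names build_blocks and scan_equal are hypothetical.

```python
BLOCK_ROWS = 4  # tiny stand-in for a 1MB block


def build_blocks(values, block_rows=BLOCK_ROWS):
    """Split a column into blocks and record min/max metadata for each."""
    blocks = []
    for start in range(0, len(values), block_rows):
        chunk = values[start:start + block_rows]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks


def scan_equal(blocks, target):
    """Return matching values and the number of blocks actually read."""
    blocks_read = 0
    hits = []
    for block in blocks:
        # The metadata alone rules this block out; no need to read it.
        if not block["min"] <= target <= block["max"]:
            continue
        blocks_read += 1
        hits.extend(v for v in block["rows"] if v == target)
    return hits, blocks_read


sorted_column = list(range(1, 21))
# Same values, interleaved so that every block spans a wide range.
unsorted_column = [v for offset in range(5) for v in range(1 + offset, 21, 5)]

sorted_hits, sorted_reads = scan_equal(build_blocks(sorted_column), 7)
unsorted_hits, unsorted_reads = scan_equal(build_blocks(unsorted_column), 7)
print("sorted:  ", sorted_hits, "blocks read:", sorted_reads)
print("unsorted:", unsorted_hits, "blocks read:", unsorted_reads)
```

On the sorted column only one of five blocks is read; on the interleaved column every block's min/max range covers the target, so all five are read.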
Sort Key Take 1 
SORTKEY (account_id, fact_time, mid) 
● As we added new facts, bad things started happening 
● Re-sorting rows for vacuuming had to reorder almost all the rows :( 
● This made vacuuming unreasonably slow, affecting how often we could 
vacuum and therefore query performance 
[Diagram: table laid out as account 1 (time ordered), account 2 (time ordered), ... account n (time ordered); new facts for all accounts arrive as a separate batch]
Sort Key Take 2 
SORTKEY (fact_time, account_id, mid) 
● Now our table is like an append-only log, but it had poor query performance 
● For many of our queries, we only look at one account at a time 
● Redshift blocks are 1MB each, and each spanned many accounts 
● When querying a single account, Redshift had to read from disk and ignore many 
rows from other accounts 
[Diagram: table laid out as 00:00 (account ordered), 00:01 (account ordered), ... now (account ordered)]
Sort Key Take 3 
SORTKEY (fact_date, account_id, fact_time, mid) 
● Append only log ✓ 
○ Cheap vacuuming ✓ 
● Single or few accounts per block ✓ 
○ Significantly improved query performance ✓ 
[Diagram: table laid out as Jan 1st (account ordered), Jan 2nd (account ordered), ... today (account ordered)]
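
The three sort keys can be compared on a toy data set. This is a minimal sketch under stated assumptions: 8-row blocks stand in for 1MB blocks, fact_time is modeled as a (day, tick) tuple, and "rows displaced" (existing rows that sort after the first newly loaded row) is only a rough proxy for vacuum cost. All function names are hypothetical.

```python
from itertools import product

BLOCK_ROWS = 8  # tiny stand-in for a 1MB block

SORT_KEYS = {
    "take 1": ("account_id", "fact_time"),
    "take 2": ("fact_time", "account_id"),
    "take 3": ("fact_date", "account_id", "fact_time"),
}


def make_rows(days, accounts=4, ticks=8):
    """One fact per (day, account, tick); fact_time orders like a timestamp."""
    return [
        {"fact_date": d, "account_id": a, "fact_time": (d, t)}
        for d, a, t in product(days, range(accounts), range(ticks))
    ]


def sort_value(row, key_cols):
    return tuple(row[c] for c in key_cols)


def blocks_read(rows, key_cols, fact_date, account_id):
    """Count blocks whose min/max metadata cannot rule out the query."""
    ordered = sorted(rows, key=lambda r: sort_value(r, key_cols))
    count = 0
    for start in range(0, len(ordered), BLOCK_ROWS):
        block = ordered[start:start + BLOCK_ROWS]
        dates = [r["fact_date"] for r in block]
        accounts = [r["account_id"] for r in block]
        if (min(dates) <= fact_date <= max(dates)
                and min(accounts) <= account_id <= max(accounts)):
            count += 1
    return count


def rows_displaced(existing, new, key_cols):
    """Rough vacuum-cost proxy: existing rows sorting after the first new row."""
    first_new = min(sort_value(r, key_cols) for r in new)
    return sum(1 for r in existing if sort_value(r, key_cols) > first_new)


existing = make_rows(days=[0, 1])  # already loaded and sorted
new = make_rows(days=[2])          # the next day's facts

results = {}
for name, cols in SORT_KEYS.items():
    results[name] = (
        blocks_read(existing, cols, fact_date=0, account_id=2),
        rows_displaced(existing, new, cols),
    )
    print(name, "blocks read:", results[name][0], "rows displaced:", results[name][1])
```

In this toy model take 1 reads one block but displaces 48 of 64 existing rows, take 2 displaces none but reads four blocks, and take 3 reads one block and displaces none.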
Redshift ⇔ S3 
Redshift & S3 have excellent integration 
● Unload from Redshift to S3 via UNLOAD 
○ Each slice unloads separately to S3 
○ We unload into a CSV format 
● Load into Redshift from S3 via COPY 
○ Applies all as inserts 
○ Primary keys aren’t enforced by Redshift 
■ Use staging table to detect duplicate keys
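
One way to sketch the staging-table pattern, using SQLite from Python as a stand-in for Redshift. Assumptions: the table fact_url_staging and the sample rows are made up, plain INSERTs stand in for COPY, and no primary key is declared so that, as in Redshift, nothing stops duplicates from being inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    INSERT INTO fact_url VALUES (1, '2014-10-02 18:30:00', 7, '/home');
""")

# Stand-in for COPY: load the incoming batch into the staging table first.
conn.executemany(
    "INSERT INTO fact_url_staging VALUES (?, ?, ?, ?)",
    [
        (1, "2014-10-02 18:30:00", 7, "/home"),  # key already in fact_url
        (1, "2014-10-02 18:31:00", 7, "/cart"),  # new key
        (2, "2014-10-02 18:30:00", 9, "/home"),  # new key
    ],
)

# Detect staged rows whose (account_id, fact_time, mid) key already exists.
duplicates = conn.execute("""
    SELECT s.account_id, s.fact_time, s.mid
    FROM fact_url_staging s
    JOIN fact_url f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
""").fetchall()

# Move over only the rows whose key is not present yet.
conn.execute("""
    INSERT INTO fact_url
    SELECT s.* FROM fact_url_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_url f
        WHERE f.account_id = s.account_id
          AND f.fact_time = s.fact_time
          AND f.mid = s.mid
    )
""")
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
print("duplicate keys:", duplicates)
print("rows in fact_url:", total)
```

This sketch only catches duplicates against the target table; duplicates within a single staged batch would need a separate check.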
Redshift UNLOAD 
unload ('select * from venue order by venueid') 
to 's3://mybucket/tickit/venue/reload_' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' 
manifest 
delimiter ',';
Redshift UNLOAD Tip 
unload ('select * from venue order by venueid') 
● The query passed to unload is itself a quoted string, which wreaks havoc 
with quoted literals inside it, e.g. fact_time <= '2014-10-02' 
● Instead of escaping the quotes around the datetimes, use dollar quoting 
○ unload ($$ select * from venue order by 
venueid $$)
Redshift COPY 
copy venue 
from 's3://mybucket/tickit/venue/reload_manifest' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' 
manifest 
delimiter ',';
Try it Yourself! For Free!!! 
Amazon Redshift documentation is well written. 
It contains great tutorials with pricing estimates. 
Amazon offers a 750 hour free trial of Redshift 
DW2.Large nodes.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2Paulraj Pappaiah
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAmazon Web Services
 
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesDeep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesAmazon Web Services
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAmazon Web Services
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWSCaserta
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftAmazon Web Services
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기Amazon Web Services Korea
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013Amazon Web Services
 
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB DayGetting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB DayAmazon Web Services Korea
 
Dynamo DB & RDS Deep Dive - AWS India Summit 2012
Dynamo DB & RDS Deep Dive - AWS India Summit 2012Dynamo DB & RDS Deep Dive - AWS India Summit 2012
Dynamo DB & RDS Deep Dive - AWS India Summit 2012Amazon Web Services
 
Amazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech TalksAmazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech TalksAmazon Web Services
 

Was ist angesagt? (20)

BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2BigData: AWS RedShift with S3, EC2
BigData: AWS RedShift with S3, EC2
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesDeep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon Redshift
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
SmugMug: From MySQL to Amazon DynamoDB (DAT204) | AWS re:Invent 2013
 
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB DayGetting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
Getting Strated with Amazon Dynamo DB (Jim Scharf) - AWS DB Day
 
Dynamo DB & RDS Deep Dive - AWS India Summit 2012
Dynamo DB & RDS Deep Dive - AWS India Summit 2012Dynamo DB & RDS Deep Dive - AWS India Summit 2012
Dynamo DB & RDS Deep Dive - AWS India Summit 2012
 
Amazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech TalksAmazon Redshift Deep Dive - February Online Tech Talks
Amazon Redshift Deep Dive - February Online Tech Talks
 

Ähnlich wie Amazon Redshift

Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Amazon Web Services
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...Rob Skillington
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1MariaDB plc
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1MariaDB plc
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceHBaseCon
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce Argus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce HBaseCon
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with RedshiftAmazon Web Services
 
Deep Dive into DynamoDB
Deep Dive into DynamoDBDeep Dive into DynamoDB
Deep Dive into DynamoDBAWS Germany
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationVolodymyr Rovetskiy
 
Application Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseApplication Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Future of Data Meetup
 
2017 AWS DB Day | Amazon Redshift 소개 및 실습
2017 AWS DB Day | Amazon Redshift  소개 및 실습2017 AWS DB Day | Amazon Redshift  소개 및 실습
2017 AWS DB Day | Amazon Redshift 소개 및 실습Amazon Web Services Korea
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters MongoDB
 

Ähnlich wie Amazon Redshift (20)

Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
 
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
What's New in MariaDB Server 10.2 and MariaDB MaxScale 2.1
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce Argus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with Redshift
 
Deep Dive into DynamoDB
Deep Dive into DynamoDBDeep Dive into DynamoDB
Deep Dive into DynamoDB
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
 
Application Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseApplication Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouse
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
 
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
 
2017 AWS DB Day | Amazon Redshift 소개 및 실습
2017 AWS DB Day | Amazon Redshift  소개 및 실습2017 AWS DB Day | Amazon Redshift  소개 및 실습
2017 AWS DB Day | Amazon Redshift 소개 및 실습
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
 

Kürzlich hochgeladen

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Kürzlich hochgeladen (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Amazon Redshift

  • 2. What is Redshift?
    “Redshift is a fast, fully managed, petabyte-scale data warehouse service” -Amazon
    With Redshift, Monetate is able to generate all of our analytics data for a day in ~2 hours, a process that consumes billions of rows and yields millions.
  • 3. What isn’t Redshift?
        warehouse=# insert into fact_page_view values
        warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4);
        INSERT 0 1
        Time: 4600.094 ms
        warehouse=# select fact_time from fact_page_view
        warehouse-# where fact_date = '2014-10-02';
              fact_time
        ---------------------
         2014-10-02 18:30:00
        (1 row)
        Time: 618.303 ms
  • 4. Who am I?
    Jeff Patti
    jeffpatti@gmail.com
    Backend Engineer at Monetate
    Monetate joined Redshift's beta in late 2012 and has been actively developing on it since.
    We’re hiring - monetate.com/jobs/
  • 5. Leaving Hive For Redshift
    Hive:
    ● Unusual failure modes
    ● Slower and pricier than Redshift, at least in our configuration
    ● Custom query language
      ○ Didn’t play nicely with our SQL libraries
    Redshift:
    ● Fully Managed
    ● Performant & Scalable
    ● Excellent integration with other AWS offerings
    ● PostgreSQL interface
      ○ command line interface
      ○ libraries for PostgreSQL work against Redshift
  • 6. Fully Managed
    ● Easy to deploy
    ● Easy to scale out
    ● Software updates - handled
    ● Hardware failures - taken care of
    ● Automatic backups - baked in
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12. Automatic Backups
    ● Periodically taken as delta from prior backup
    ● Easy to create new cluster from backup, or overwrite existing cluster
    ● Queryable during recovery, after short delay
      ○ Preferentially recovers needed blocks to perform commands
    ● This is how Monetate keeps our development cluster in sync with production
  • 13.
  • 14. Maintenance Window
    ● Required half hour window once a week for routine maintenance, such as software updates
    ● During this time the cluster is unresponsive
    ● You pick when it happens
  • 15. Scaling Out
    You: Change cluster size through AWS console
    AWS:
    1. Existing cluster put into read only state
    2. New cluster caught up with existing cluster
    3. Swapped during maintenance window, unless specified as immediate
       a. Immediate swap causes temporary unavailability during canonical name record swap (a few minutes)
  • 16. Monetate
    ● Core products are merchandising, web & email personalization, testing
    ● A/B & Multivariate testing to determine impact of experiments
    ● Involved with >20% of US ecommerce spend each holiday season for the past 3 years running
  • 17. Monetate Data Collection
    To compute analytics and reports on our clients' experiments, we collect a lot of data
    ● Billions of page views a week
    ● Billions of experiment views a week
    ● Millions of purchases a week
    ● etc.
    This is where Redshift comes in handy
  • 18. Redshift In Monetate
    Monetate is Multi-region & Multi-AZ in AWS
    [Diagram: several App instances, Amazon S3, Amazon Redshift; labels: Our Clients, Data Warehousing, Analytics & Reporting]
  • 19. Under The Covers
    ● Fork of PostgreSQL 8.0.2, get nice things like
      ○ Common Table Expressions
      ○ Window Functions
    ● Column oriented database
    ● Clusters can have many machines
      ○ Each machine has many slices
      ○ Queries run in parallel on all slices
    ● Concurrent query support & memory limiting
  • 22. Example Redshift Table
        CREATE TABLE fact_url (
            fact_date   DATE NOT NULL ENCODE lzo,
            account_id  INT NOT NULL ENCODE lzo,
            fact_time   TIMESTAMP NOT NULL ENCODE lzo,
            mid         BIGINT NOT NULL ENCODE lzo,
            uri         VARCHAR(2048) ENCODE lzo,
            referer_uri VARCHAR(2048) ENCODE lzo,
            PRIMARY KEY (account_id, fact_time, mid)
        )
        DISTKEY (mid)
        SORTKEY (fact_date, account_id, fact_time, mid);
  • 23. Per Column Compression
    ● Used to fit more rows in each 1MB block
    ● Trade off between CPU and IO
    ● Allows Redshift to read rows from disk faster
    ● Has to use more CPU to decompress data
    ● Our Redshift queries are IO bound
      ○ We use compression extensively
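The CPU versus IO trade-off above can be illustrated with a rough sketch. It assumes a made-up `uri` column and uses Python's `zlib` as a stand-in for Redshift's lzo encoding, so the numbers are only indicative; it also compresses the whole column at once rather than block by block.

```python
import zlib

BLOCK_BYTES = 1024 * 1024  # column data is stored in 1MB blocks


def rows_per_block(values, compress):
    """Estimate how many column values fit in one 1MB block."""
    raw = "\n".join(values).encode("utf-8")
    # zlib is an assumption for illustration; Redshift would apply lzo here.
    stored = zlib.compress(raw) if compress else raw
    bytes_per_row = len(stored) / len(values)
    return int(BLOCK_BYTES / bytes_per_row)


# Hypothetical uri column with the kind of repetition page view data tends to have.
uris = [f"https://example.com/products/{i % 500}" for i in range(100_000)]

plain = rows_per_block(uris, compress=False)
packed = rows_per_block(uris, compress=True)
print(f"uncompressed: ~{plain} rows per block")
print(f"compressed:   ~{packed} rows per block")
```

More rows per block means fewer blocks to read for the same scan, at the cost of the CPU time spent in decompression.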
  • 24. Constraints
    “Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift.”
    However, “If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.” [emphasis added]
  • 25. Distribution Style
    Controls how Redshift distributes rows
    ● Styles
      ○ Even - round robin rows (default)
      ○ Key - data with the same key goes to same slice
        ■ Based on a single column from the table
      ○ All - data is copied to every node
        ■ Good for small tables
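A minimal sketch of the three styles, assuming rows are plain dictionaries and using `crc32` as an illustrative hash (Redshift's actual hash function is not described in these slides). The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def distribute_even(rows, n_slices):
    """EVEN: deal rows out round robin, ignoring their contents."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return dict(slices)


def distribute_key(rows, n_slices, key):
    """KEY: rows sharing a key value always land on the same slice."""
    slices = defaultdict(list)
    for row in rows:
        # crc32 is an illustrative hash, not the one Redshift uses.
        slices[zlib.crc32(str(row[key]).encode()) % n_slices].append(row)
    return dict(slices)


def distribute_all(rows, n_nodes):
    """ALL: every node keeps a full copy of the table."""
    return {node: list(rows) for node in range(n_nodes)}


rows = [{"mid": i % 10, "uri": f"/page/{i}"} for i in range(40)]
by_key = distribute_key(rows, n_slices=4, key="mid")
for slice_id, held in sorted(by_key.items()):
    print(slice_id, sorted({r["mid"] for r in held}))
```

Even distribution balances row counts regardless of content, while key distribution can skew if a few key values dominate.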
  • 26. DISTKEY impacts Joins
    DS_DIST_NONE: No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
    DS_DIST_ALL_NONE: No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
    (These two are very performant)
    DS_DIST_INNER: The inner table is redistributed.
    DS_BCAST_INNER: A copy of the entire inner table is broadcast to all the compute nodes.
    DS_DIST_ALL_INNER: The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
    DS_DIST_BOTH: Both tables are redistributed.
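Why matching distribution keys avoid redistribution can be sketched as follows. This is a toy model under stated assumptions: tables are lists of dictionaries, `crc32` stands in for the real hash, and the table and column names are invented for the example.

```python
import zlib
from collections import defaultdict


def slice_for(value, n_slices):
    # Illustrative hash only; not Redshift's.
    return zlib.crc32(str(value).encode()) % n_slices


def distribute(rows, key, n_slices):
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key], n_slices)].append(row)
    return slices


def local_join(left_slices, right_slices, key, n_slices):
    """Join each slice on its own; no row ever moves between slices."""
    out = []
    for s in range(n_slices):
        index = defaultdict(list)
        for r in right_slices[s]:
            index[r[key]].append(r)
        for l in left_slices[s]:
            for r in index[l[key]]:
                out.append({**l, **r})
    return out


N = 4
views = [{"mid": m, "uri": f"/p/{m}"} for m in range(100)]
purchases = [{"mid": m, "account_id": m % 7, "amount": m} for m in range(0, 100, 5)]

# Both tables distributed on the join column: every match is already co-located.
colocated = local_join(distribute(views, "mid", N), distribute(purchases, "mid", N), "mid", N)
# Purchases distributed on another column: slice-local joins miss matches,
# so a real system would have to redistribute or broadcast rows first.
mismatched = local_join(distribute(views, "mid", N), distribute(purchases, "account_id", N), "mid", N)
print(len(colocated), len(mismatched))
```

The first case corresponds to the idea behind DS_DIST_NONE; the second shows the gap that the redistribution steps exist to close.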
  • 27. Query Plan From Explain
        -> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
             Hash Cond: ("outer".venueid = "inner".venueid)
             -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
                  Hash Cond: ("outer".eventid = "inner".eventid)
                  -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                       Merge Cond: ("outer".listid = "inner".listid)
                       -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                       -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
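One way to use such a plan is to scan it for join steps that are not one of the two cheap labels. The helper names below are hypothetical and the parsing is a simple regular expression over EXPLAIN text, not an official plan format.

```python
import re

# The two labels the slides call out as very performant.
CHEAP = {"DS_DIST_NONE", "DS_DIST_ALL_NONE"}


def join_distribution_steps(plan_text):
    """Return the DS_* label of each join step in EXPLAIN output, in order."""
    return re.findall(r"\bDS_[A-Z_]+\b", plan_text)


def needs_redistribution(plan_text):
    """Return only the labels that imply moving rows between nodes or slices."""
    return [step for step in join_distribution_steps(plan_text) if step not in CHEAP]


plan = """
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
  -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
    -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
"""
print(join_distribution_steps(plan))
print(needs_redistribution(plan))
```

A non-empty result from `needs_redistribution` is a hint to revisit the DISTKEY or DISTSTYLE of the tables involved.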
  • 28. Sort Key
    ● Data is stored on disk in sorted order
      ○ After being inserted into an empty table, or vacuumed
    ● Sort Key impacts vacuum performance
    ● Columnar data stored in 1MB blocks
      ○ min/max data stored as metadata
    ● Metadata used to improve query performance
      ○ Allows Redshift to skip unnecessary blocks
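The block skipping described above can be sketched with a toy model: a column is cut into fixed-size blocks, each block records its min and max, and a range predicate only reads blocks whose range overlaps it. Block size and data are assumptions chosen for readability.

```python
import random


def build_zone_map(values, rows_per_block):
    """Split a column into blocks and record each block's (min, max)."""
    blocks = [values[i:i + rows_per_block]
              for i in range(0, len(values), rows_per_block)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_read(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the predicate [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


random.seed(0)
sorted_col = list(range(10_000))
shuffled_col = sorted_col[:]
random.shuffle(shuffled_col)

sorted_map = build_zone_map(sorted_col, rows_per_block=100)
shuffled_map = build_zone_map(shuffled_col, rows_per_block=100)

# Same predicate, very different amounts of IO.
print("sorted:  ", blocks_to_read(sorted_map, 2000, 2099), "blocks")
print("unsorted:", blocks_to_read(shuffled_map, 2000, 2099), "blocks")
```

When the column is sorted, each block covers a narrow range and most blocks can be skipped; when it is unsorted, nearly every block's range overlaps the predicate.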
  • 29. Sort Key Take 1
    SORTKEY (account_id, fact_time, mid)
    ● As we added new facts, bad things started happening
    [Diagram: table laid out as account 1 (time ordered), account 2 (time ordered), ..., account n (time ordered); new facts for all accounts arrive alongside it]
    ● Re-sorting rows during a vacuum had to reorder almost all the rows :(
    ● This made vacuuming unreasonably slow, which limited how often we could vacuum and therefore hurt query performance
  • 30. Sort Key Take 2
    SORTKEY (fact_time, account_id, mid)
    ● Now our table is like an append only log, but it had poor query performance
    [Diagram: 00:00 (account ordered), 00:01 (account ordered), ..., Now (account ordered)]
    ● For many of our queries, we only look at one account at a time
    ● Redshift blocks are 1MB each, and each block spanned many accounts
    ● When querying a single account, Redshift had to read from disk and ignore many rows from other accounts
  • 31. Sort Key Take 3
    SORTKEY (fact_date, account_id, fact_time, mid)
    [Diagram: Jan 1st (account ordered), Jan 2nd (account ordered), ..., Today (account ordered)]
    ● Append only log ✓
      ○ Cheap vacuuming ✓
    ● Single or few accounts per block ✓
      ○ Significantly improved query performance ✓
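The three attempts can be compared with a small simulation. It is a sketch under simplifying assumptions: synthetic facts where every account is active at the same times, 100 rows per block instead of 1MB, "rows displaced" as a proxy for vacuum cost, and "blocks holding the account" as a proxy for query IO.

```python
from itertools import product

DAY = 86_400
ROWS_PER_BLOCK = 100  # stand-in for however many rows fit in a 1MB block


def make_facts(days, accounts=20, events=50):
    """One row per (day, account, event): (fact_date, account_id, fact_time, mid)."""
    return [(d, a, d * DAY + e * 60, (d * accounts + a) * events + e)
            for d, a, e in product(days, range(accounts), range(events))]


SORT_KEYS = {
    "take 1 (account_id, fact_time, mid)": lambda r: (r[1], r[2], r[3]),
    "take 2 (fact_time, account_id, mid)": lambda r: (r[2], r[1], r[3]),
    "take 3 (fact_date, account_id, fact_time, mid)": lambda r: (r[0], r[1], r[2], r[3]),
}


def rows_displaced(existing, new, key):
    """Existing rows that sort after the first new row, so a re-sort must move them."""
    first_new = min(map(key, new))
    return sum(1 for r in existing if key(r) > first_new)


def blocks_for_account(rows, key, account_id):
    """Blocks holding at least one row for the account once the table is sorted."""
    ordered = sorted(rows, key=key)
    return len({i // ROWS_PER_BLOCK for i, r in enumerate(ordered) if r[1] == account_id})


existing = make_facts(range(4))   # four days already loaded and sorted
new = make_facts(range(4, 5))     # one new day of facts for every account
for name, key in SORT_KEYS.items():
    moved = rows_displaced(existing, new, key)
    blocks = blocks_for_account(existing + new, key, account_id=7)
    print(f"{name}: {moved} rows displaced, {blocks} blocks read for one account")
```

In this model take 1 reads few blocks per account but displaces most existing rows on every load, take 2 displaces nothing but touches every block, and take 3 displaces nothing while keeping each account's rows for a day together.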
  • 32. Redshift ⇔ S3
    Redshift & S3 have excellent integration
    ● Unload from Redshift to S3 via UNLOAD
      ○ Each slice unloads separately to S3
      ○ We unload into a CSV format
    ● Load into Redshift from S3 via COPY
      ○ Applies all rows as inserts
      ○ Primary keys aren’t enforced by Redshift
        ■ Use staging table to detect duplicate keys
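The staging table idea can be sketched as below. This is not Monetate's actual pipeline: it uses Python's built-in `sqlite3` as a local stand-in for Redshift, a hypothetical `fact_url_staging` table, and plain inserts where a real load would use COPY. Rows that share a key but differ in other columns would need an extra tie-breaking rule that is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY clause, mimicking a warehouse that does not enforce one.
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);
""")
conn.execute("INSERT INTO fact_url VALUES (1, '2014-10-02 18:30:00', 2, '/home')")
conn.executemany(
    "INSERT INTO fact_url_staging VALUES (?, ?, ?, ?)",
    [
        (1, "2014-10-02 18:30:00", 2, "/home"),  # already in the target table
        (1, "2014-10-02 18:31:00", 2, "/cart"),  # genuinely new
        (1, "2014-10-02 18:31:00", 2, "/cart"),  # duplicated within the batch
    ],
)

# Move only rows whose key is absent from the target, collapsing identical rows.
conn.execute("""
    INSERT INTO fact_url
    SELECT DISTINCT s.account_id, s.fact_time, s.mid, s.uri
    FROM fact_url_staging AS s
    LEFT JOIN fact_url AS f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
    WHERE f.mid IS NULL
""")
conn.execute("DELETE FROM fact_url_staging")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
print(row_count)
```

The anti-join on the key columns does the work the database would otherwise do by enforcing the primary key.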
  • 33. Redshift UNLOAD
        unload ('select * from venue order by venueid')
        to 's3://mybucket/tickit/venue/reload_'
        credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
        manifest delimiter ',';
  • 34. Redshift UNLOAD Tip
        unload ('select * from venue order by venueid')
    ● The query passed to unload is itself a quoted string, which wreaks havoc with quoted literals such as dates, e.g. fact_time <= '2014-10-02'
    ● Instead of escaping the quotes around the date times, dollar-quote the query
      ○ unload ($$ select * from venue order by venueid $$)
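A small helper can apply the dollar-quoting tip when statements are built in application code. The function name and parameters are hypothetical, the credentials value is a placeholder, and the sketch assumes the inner query does not itself contain `$$`.

```python
def unload_statement(query, s3_prefix, credentials):
    """Build an UNLOAD statement, dollar-quoting the inner query."""
    if "$$" in query:
        # Dollar quoting would terminate early, so refuse rather than guess.
        raise ValueError("query already contains $$ and cannot be dollar-quoted")
    return (
        f"unload ($$ {query} $$)\n"
        f"to '{s3_prefix}'\n"
        f"credentials '{credentials}'\n"
        "manifest delimiter ',';"
    )


statement = unload_statement(
    "select * from fact_page_view where fact_time <= '2014-10-02'",
    "s3://mybucket/tickit/venue/reload_",
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>",
)
print(statement)
```

The date literal keeps its ordinary single quotes, so the same query text can be run directly or wrapped in an unload without editing.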
  • 35. Redshift COPY
        copy venue
        from 's3://mybucket/tickit/venue/reload_manifest'
        credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
        manifest delimiter ',';
  • 36. Try it Yourself! For Free!!!
    Amazon Redshift documentation is well written
    It contains great tutorials with pricing estimates
    Amazon offers a 750 hour free trial of Redshift DW2.Large nodes