Amazon Redshift is a cloud-hosted data warehouse service from AWS that supports petabyte-scale analytics on large datasets using massively parallel processing. It stores data in a column-oriented format and integrates with other AWS services such as S3, DynamoDB, and EMR. Redshift provides columnar storage, parallel query processing across multiple nodes, automated backups and restores, encryption, and compatibility with SQL and BI tools. The document demonstrates using Redshift alongside S3, Data Pipeline, EC2/MySQL, and Qlik Sense to build a scalable data warehouse solution in the cloud.
2. What is Redshift?
• Cloud-hosted data warehouse service from AWS
• Massively parallel processing (MPP)
• Analytics workloads on large-scale datasets
• Data stored on column-oriented DBMS principles
• Scales to large datasets, up to petabytes
9. Data growing fast!
• Enterprise data is growing at an exponential rate
• Structured and unstructured data
• Data requirements change rapidly
• Cost to maintain data is prohibitive
• Hardware not scalable
• Expensive to support
• Business agility suffers
• Reporting unable to change with the pace of business
• Data silos create bottlenecks
10. Solution Proposal
• Leverage the flexibility of Amazon Web Services
• Scalable
• Flexible
• Cost-Effective
• AWS Redshift
• Data Warehouse
• AWS S3
• Persistent Storage
• AWS Data Pipeline
• Data Orchestration and ETL
• AWS EC2 / MySQL
• Transaction Processing
• Qlik Sense Desktop
• Business Intelligence Reporting
11. AWS Redshift
Petabyte-Scale Data Warehouse
• Optimized for data warehousing
• Columnar Storage
• Data Compression
• Zone Maps to reduce I/O
• Scalable
• Easily change the number of nodes
• 1–32 node configurations
• Cost-Efficient
• On-Demand pricing starts at $0.25/hr.
• Runs as low as $1,000 per TB/yr.
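As a sketch of that elasticity, resizing a cluster is a single CLI call. The cluster identifier, node type, and credentials below are illustrative placeholders, not values from the deck:

```shell
# Launch a 2-node dc1.large cluster (a dense-compute node type).
aws redshift create-cluster \
    --cluster-identifier demo-dw \
    --node-type dc1.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'ChangeMe123'

# Later, scale out to 8 nodes; Redshift redistributes the data automatically.
aws redshift modify-cluster \
    --cluster-identifier demo-dw \
    --number-of-nodes 8
```

The resize runs in the background while the cluster stays readable, which is what makes the pay-as-you-grow model practical.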
12. AWS Redshift
Petabyte-Scale Data Warehouse
• Get Started in Minutes
• Web Console
• CLI
• Fully Managed
• Fault Tolerant
• Automated Backups / Fast Restores
• Encryption
• Data at Rest – AES-256
• Can manage own keys
• Compatible
• SQL
• Data Integrations
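Because Redshift speaks the PostgreSQL wire protocol on port 5439, standard SQL clients work alongside JDBC/ODBC tools. A minimal connection sketch (the cluster endpoint shown is a placeholder):

```shell
# Connect with the stock psql client over SSL; any PostgreSQL-compatible
# client (SQL Workbench via JDBC, ODBC tools) connects the same way.
psql "host=demo-dw.xxxxxxxx.us-east-1.redshift.amazonaws.com \
      port=5439 dbname=dev user=admin sslmode=require"
```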
13. AWS Simple Storage Service (S3)
Online File/Object Storage
• Durable
• Data redundantly stored across multiple facilities/devices
• Available
• 99.99% availability
• Choose from different AWS regions
• Secure
• SSL – Data Transfer
• At Rest – Auto-Encrypted
• Scalable
• Flexible capacity based on data demands
• Low Cost
• Pay for what you use
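A sketch of the S3 side of the solution: create the ETL staging bucket and push an extract into it. The bucket and file names are assumptions for illustration:

```shell
# Create a bucket for the Pipeline ETL staging area (names are globally unique).
aws s3 mb s3://demo-etl-staging --region us-east-1

# Upload a CSV extract; no capacity planning needed, S3 scales with demand.
aws s3 cp songs_extract.csv s3://demo-etl-staging/input/

# Request AES-256 server-side encryption at rest for the object.
aws s3 cp songs_extract.csv s3://demo-etl-staging/input/ --sse AES256
```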
14. AWS Data Pipeline
Data Processing and Transfer Platform
• Reliable
• Distributed infrastructure ensures activity completion
• Integrated with SNS for event notifications
• Simple
• Drag-and-drop console
• Pre-built templates for other AWS services
• Visual pipeline editor
• Scalable
• Dispatch work to one machine or many
• Serial and/or parallel processing
• Low Cost
• Charged per pipeline
• Frequency
• Volume
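The pipeline lifecycle is register, attach a definition, activate. A CLI sketch (the pipeline name, definition file, and `df-EXAMPLE` ID, which `create-pipeline` actually returns, are placeholders):

```shell
# Register a pipeline; the command returns the pipeline's df- identifier.
aws datapipeline create-pipeline --name mysql-to-s3 --unique-id mysql-to-s3-1

# Attach the JSON definition describing the activities and schedule.
aws datapipeline put-pipeline-definition \
    --pipeline-id df-EXAMPLE \
    --pipeline-definition file://mysql-to-s3.json

# Start the pipeline on its defined schedule.
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE
```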
15. AWS Elastic Compute Cloud (EC2) + MySQL
Cloud Infrastructure for Applications & Development
• Flexible
• Linux and Windows virtual machines
• Supports multiple instance types, software packages, resource configs
• Elastic
• Increase/Decrease capacity within minutes
• Commission any number of server instances simultaneously
• Secure
• Security Groups / Network ACLs
• VPC / VPN
• Low Cost
• On-Demand / Reserved / Spot Instance options
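Provisioning the MySQL host from the CLI is a two-step sketch: open the MySQL port in a security group, then launch the instance. The group ID, AMI ID, key name, and CIDR below are placeholders; in practice the CIDR should be as narrow as possible:

```shell
# Allow inbound MySQL (TCP 3306) from the VPC range to the instance's group.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 3306 --cidr 10.0.0.0/16

# Launch an Amazon Linux instance to host MySQL.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t2.micro \
    --key-name demo-key \
    --security-group-ids sg-0123456789abcdef0
```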
16. Qlik Sense Desktop
Data Visualization / BI Tool
• Drag-and-drop visualizations
• Smart Search
• Explore multiple data sources in a single dashboard/report
• Access analytics on multiple device types
• Collaborate and share insights within reports
• Enables self-service simplicity
19. Tech Demo
• During this demonstration, we will discuss the setup and execution of using Amazon Redshift as an on-demand, cloud-based data warehouse solution.
• Our sample data comes from the “Million Song Dataset” available from Columbia University - http://labrosa.ee.columbia.edu/millionsong/
• The BI tool used to create a business-focused dashboard is Qlik Sense Desktop, a Windows-based desktop application - http://www.qlik.com/us/explore/products/sense
• In addition, the following services in the Amazon Web Services stack are used: Amazon Redshift, Amazon S3, Data Pipeline, and EC2 (a Linux AMI running MySQL serves as the transactional database for the demo).
20. Demo Steps
1. Create a new Linux AMI that will host MySQL for transaction data processing.
• Start a new Linux instance and update security groups for MySQL accessibility
• Install MySQL
• Create new MySQL users and a database, and populate it with the demonstration dataset (using MySQL Workbench)
2. Create a new S3 bucket for Pipeline ETL processes
3. Create the Redshift cluster (data warehouse)
• Instantiate the cluster
• Connect using SQL Workbench (via JDBC)
• Create the initial data table
4. Create AWS Data Pipeline(s) for data processing
• MySQL -> S3: activate the Pipeline for the initial ETL from MySQL to S3
• S3 -> Redshift: activate the Pipeline for the initial ETL from S3 to Redshift
5. Install Qlik Sense Desktop
• Install the Redshift ODBC drivers locally on the desktop
• Create the Qlik Sense “Report” (included in the FP submission for simplicity); verify the initial data in the report
6. Solution Demonstration (using the Amazon CLI – Command Line Interface)
• Simulate a transactional data load in MySQL
• Verify the new data (record count) in MySQL using MySQL Workbench
• Delete the initial data in the S3 bucket (from Round 1)
• Trigger the AWS Pipeline that loads data to S3 from MySQL
• Verify the data load (CSV file) in the S3 bucket
• Trigger the AWS Pipeline that loads data to Redshift from S3
• Verify the data load in Redshift (using SQL Workbench)
• Refresh the Qlik report to view analytics of the new data load
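Steps 3 and 4 above reduce to a CREATE TABLE plus a bulk COPY from S3. A minimal sketch via psql; the `$REDSHIFT_CONN` connection string, the table schema (illustrative, not the actual Million Song Dataset schema), the bucket path, and the IAM role ARN are all assumptions:

```shell
# Step 3: create the initial table in the Redshift cluster.
psql "$REDSHIFT_CONN" <<'SQL'
CREATE TABLE songs (
    track_id     VARCHAR(32),
    title        VARCHAR(256),
    artist_name  VARCHAR(256),
    year         INTEGER
)
DISTSTYLE EVEN            -- spread rows evenly across the nodes
SORTKEY (year);           -- the sort key feeds the zone maps that prune I/O
SQL

# Step 4 equivalent: bulk-load the CSV that Data Pipeline staged in S3.
psql "$REDSHIFT_CONN" <<'SQL'
COPY songs
FROM 's3://demo-etl-staging/input/songs_extract.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV;
SQL
```

COPY is the idiomatic Redshift load path: it parallelizes the read across the cluster's slices rather than funneling rows through a single INSERT connection.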
28. Results
• Amazon Web Services provides a powerful platform to extend on-premises infrastructure to the cloud
• Enables massive data consolidation
• Efficient ETL orchestration and workflow
• Simplifies resource management and drives down computing costs across multiple services
• Changes requested by business executives can be made quickly and efficiently
• AWS supports industry-standard data source connections
• Existing reporting/dashboards can consume AWS Redshift data with no code changes