Amazon Redshift is a cloud-hosted data warehouse service from AWS that supports petabyte-scale analytics on large datasets using massively parallel processing. It stores data in a column-oriented format and integrates with other AWS services such as S3, DynamoDB, and EMR. Redshift provides columnar storage, parallel query processing across multiple nodes, automated backups and restores, encryption, and compatibility with SQL and BI tools. The document demonstrates using Redshift alongside S3, Data Pipeline, EC2/MySQL, and Qlik Sense to build a scalable data warehouse solution in the cloud.
2. What is Redshift?
• Cloud-hosted data warehouse service from AWS
• Massively parallel processing (MPP)
• Analytics workloads on large-scale datasets
• Data stored on column-oriented DBMS principles
• Scales to large datasets, up to petabytes
9. Data growing fast!
• Enterprise data is growing at an exponential rate
• Structured and unstructured data
• Data requirements change rapidly
• Cost to maintain data is prohibitive
• Hardware not scalable
• Expensive to support
• Business agility suffers
• Reporting unable to change with the pace of business
• Data silos create bottlenecks
10. Solution Proposal
• Leverage the flexibility of Amazon Web Services
• Scalable
• Flexible
• Cost-Effective
• AWS Redshift
• Data Warehouse
• AWS S3
• Persistent Storage
• AWS Data Pipeline
• Data Orchestration and ETL
• AWS EC2 / MySQL
• Transaction Processing
• Qlik Sense Desktop
• Business Intelligence Reporting
11. AWS Redshift
Petabyte-Scale Data Warehouse
• Optimized for data warehousing
• Columnar Storage
• Data Compression
• Zone Maps to reduce I/O
• Scalable
• Easily change the number of nodes
• 1–32 node configurations
• Cost-Efficient
• On-Demand pricing starts at $0.25/hr.
• Runs as low as $1,000 per TB/yr.
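As a sketch of that elasticity, resizing a cluster is a single CLI call. The cluster identifier, node type, and credentials below are illustrative placeholders, not values from the deck:

```shell
# Launch a 2-node dc1.large cluster (a dense-compute node type).
aws redshift create-cluster \
    --cluster-identifier demo-dw \
    --node-type dc1.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'ChangeMe123'

# Later, scale out to 8 nodes; Redshift redistributes the data automatically.
aws redshift modify-cluster \
    --cluster-identifier demo-dw \
    --number-of-nodes 8
```

The resize runs in the background while the cluster stays readable, which is what makes the pay-as-you-grow model practical.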
12. AWS Redshift
Petabyte-Scale Data Warehouse
• Get Started in Minutes
• Web Console
• CLI
• Fully Managed
• Fault Tolerant
• Automated Backups / Fast Restores
• Encryption
• Data at Rest – AES-256
• Can manage own keys
• Compatible
• SQL
• Data Integrations
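Because Redshift speaks the PostgreSQL wire protocol on port 5439, standard SQL clients work alongside JDBC/ODBC tools. A minimal connection sketch (the cluster endpoint shown is a placeholder):

```shell
# Connect with the stock psql client over SSL; any PostgreSQL-compatible
# client (SQL Workbench via JDBC, ODBC tools) connects the same way.
psql "host=demo-dw.xxxxxxxx.us-east-1.redshift.amazonaws.com \
      port=5439 dbname=dev user=admin sslmode=require"
```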
13. AWS Simple Storage Service (S3)
Online File/Object Storage
• Durable
• Data redundantly stored across multiple facilities/devices
• Available
• 99.99% availability
• Choose from different AWS regions
• Secure
• SSL – Data Transfer
• At Rest – Auto-Encrypted
• Scalable
• Flexible capacity based on data demands
• Low Cost
• Pay for what you use
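A sketch of the S3 side of the solution: create the ETL staging bucket and push an extract into it. The bucket and file names are assumptions for illustration:

```shell
# Create a bucket for the Pipeline ETL staging area (names are globally unique).
aws s3 mb s3://demo-etl-staging --region us-east-1

# Upload a CSV extract; no capacity planning needed, S3 scales with demand.
aws s3 cp songs_extract.csv s3://demo-etl-staging/input/

# Request AES-256 server-side encryption at rest for the object.
aws s3 cp songs_extract.csv s3://demo-etl-staging/input/ --sse AES256
```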
14. AWS Data Pipeline
Data Processing and Transfer Platform
• Reliable
• Distributed infrastructure ensures activity completion
• Integrated with SNS for event notifications
• Simple
• Drag-and-drop console
• Pre-built templates for other AWS services
• Visual pipeline editor
• Scalable
• Dispatch work to one machine or many
• Serial and/or parallel processing
• Low Cost
• Charged per pipeline
• Frequency
• Volume
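The pipeline lifecycle is register, attach a definition, activate. A CLI sketch (the pipeline name, definition file, and `df-EXAMPLE` ID, which `create-pipeline` actually returns, are placeholders):

```shell
# Register a pipeline; the command returns the pipeline's df- identifier.
aws datapipeline create-pipeline --name mysql-to-s3 --unique-id mysql-to-s3-1

# Attach the JSON definition describing the activities and schedule.
aws datapipeline put-pipeline-definition \
    --pipeline-id df-EXAMPLE \
    --pipeline-definition file://mysql-to-s3.json

# Start the pipeline on its defined schedule.
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE
```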
15. AWS Elastic Compute Cloud (EC2) + MySQL
Cloud Infrastructure for Applications & Development
• Flexible
• Linux and Windows virtual machines
• Supports multiple instance types, software packages, resource configs
• Elastic
• Increase/Decrease capacity within minutes
• Commission any number of server instances simultaneously
• Secure
• Security Groups / Network ACLs
• VPC / VPN
• Low Cost
• On-Demand / Reserved / Spot Instance options
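Provisioning the MySQL host from the CLI is a two-step sketch: open the MySQL port in a security group, then launch the instance. The group ID, AMI ID, key name, and CIDR below are placeholders; in practice the CIDR should be as narrow as possible:

```shell
# Allow inbound MySQL (TCP 3306) from the VPC range to the instance's group.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 3306 --cidr 10.0.0.0/16

# Launch an Amazon Linux instance to host MySQL.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t2.micro \
    --key-name demo-key \
    --security-group-ids sg-0123456789abcdef0
```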
16. Qlik Sense Desktop
Data Visualization / BI Tool
• Drag-and-drop visualizations
• Smart Search
• Explore multiple data sources in a single dashboard/report
• Access analytics on multiple device types
• Collaborate and share insights within reports
• Enables self-service simplicity
19. Tech Demo
• During this demonstration, we will discuss the setup and execution of using Amazon Redshift as an on-demand, cloud-based data warehouse solution.
• Our sample data comes from the “Million Song Dataset” available from Columbia University - http://labrosa.ee.columbia.edu/millionsong/
• The BI tool used to create a business-focused dashboard is Qlik Sense Desktop, a Windows-based desktop application - http://www.qlik.com/us/explore/products/sense
• In addition, the following services in the Amazon Web Services stack are used: Amazon Redshift, Amazon S3, Data Pipeline, and EC2 (a Linux AMI running MySQL serves as the transactional database for the demo).
20. Demo Steps
1. Create a new Linux AMI that will host MySQL for transaction data processing.
• Start a new Linux instance and update security groups for MySQL accessibility
• Install MySQL
• Create new MySQL users and a database, and populate it with the demonstration dataset (using MySQL Workbench)
2. Create a new S3 bucket for Pipeline ETL processes
3. Create the Redshift cluster (data warehouse)
• Instantiate the cluster
• Connect using SQL Workbench (via JDBC)
• Create the initial data table
4. Create AWS Data Pipeline(s) for data processing
• MySQL -> S3: activate the Pipeline for the initial ETL from MySQL to S3
• S3 -> Redshift: activate the Pipeline for the initial ETL from S3 to Redshift
5. Install Qlik Sense Desktop
• Install the Redshift ODBC drivers locally on the desktop
• Create the Qlik Sense “Report” (included in the FP submission for simplicity); verify the initial data in the report
6. Solution Demonstration (using the Amazon CLI – Command Line Interface)
• Simulate a transactional data load in MySQL
• Verify the new data (record count) in MySQL using MySQL Workbench
• Delete the initial data in the S3 bucket (from Round 1)
• Trigger the AWS Pipeline that loads data to S3 from MySQL
• Verify the data load (CSV file) in the S3 bucket
• Trigger the AWS Pipeline that loads data to Redshift from S3
• Verify the data load in Redshift (using SQL Workbench)
• Refresh the Qlik report to view analytics of the new data load
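Steps 3 and 4 above reduce to a CREATE TABLE plus a bulk COPY from S3. A minimal sketch via psql; the `$REDSHIFT_CONN` connection string, the table schema (illustrative, not the actual Million Song Dataset schema), the bucket path, and the IAM role ARN are all assumptions:

```shell
# Step 3: create the initial table in the Redshift cluster.
psql "$REDSHIFT_CONN" <<'SQL'
CREATE TABLE songs (
    track_id     VARCHAR(32),
    title        VARCHAR(256),
    artist_name  VARCHAR(256),
    year         INTEGER
)
DISTSTYLE EVEN            -- spread rows evenly across the nodes
SORTKEY (year);           -- the sort key feeds the zone maps that prune I/O
SQL

# Step 4 equivalent: bulk-load the CSV that Data Pipeline staged in S3.
psql "$REDSHIFT_CONN" <<'SQL'
COPY songs
FROM 's3://demo-etl-staging/input/songs_extract.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV;
SQL
```

COPY is the idiomatic Redshift load path: it parallelizes the read across the cluster's slices rather than funneling rows through a single INSERT connection.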
28. Results
• Amazon Web Services provides a powerful platform to extend on-premises infrastructure to the cloud
• Enables massive data consolidation
• Efficient ETL orchestration and workflow
• Simplifies resource management and drives down computing costs across multiple services
• Changes requested by business executives can be made quickly and efficiently
• AWS supports industry-standard data source connections
• Existing reporting/dashboards can consume AWS Redshift data with no code changes