Migrating your traditional Data Warehouse to a Modern Data Lake

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Level 200
Osemeke Isibor
Solutions Architect

Agenda
• Challenges of Traditional Data Architectures
• Data Lake Architecture
• Benefits of a Data Lake
• AWS Approach to Data Lake
• Extending your Data Lake with the Amazon Redshift data warehouse service
• Redshift Spectrum
• Customer story
• Recent and upcoming launches
• Conclusion and Q&A

Unlocking Data
Most companies and organizations are embarking on
ambitious innovation initiatives to unlock their data.
The data already exists but goes unused or is locked away
from complementary data sets in isolated data silos.

Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t deal with dynamic data and real-time processing
• Can’t deal with format diversity and change rate
• Complex ETL processes
• Difficult to find people with adequate skills to configure and manage these systems
• Can’t integrate with the explosion of available social and behavior tracking data

Legacy Data Architectures Exist as Isolated and Monolithic Data Silos
• Hadoop Cluster
• SQL Database
• Data Warehouse Appliance

Enter Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing
and analyzing massive volumes of heterogeneous data.

Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data, from all of your sources, in one
centralized location.
“Why is the data distributed in many locations? Where is the single
source of truth?”

Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”

Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component
as required.
“How can I scale up with the volume of data being generated?”

Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
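A minimal sketch of schema on read, assuming raw events are kept as newline-delimited JSON exactly as they arrived. The field names, the sample records, and the `read_with_schema` helper are hypothetical and exist only for illustration; the point is that each consumer applies its own schema while reading the same stored data.

```python
import json

# Raw records are stored as-is; no schema was enforced when they were written.
# (Hypothetical sample data.)
RAW_EVENTS = "\n".join([
    '{"user": "a1", "page": "/home", "ms": "120", "ts": "2017-10-13T14:05:00"}',
    '{"user": "b2", "page": "/products", "ms": "340", "ts": "2017-10-13T14:06:00"}',
    '{"user": "a1", "page": "/contact", "ts": "2017-10-13T14:07:00"}',
])


def read_with_schema(raw, schema):
    """Apply a schema (field name -> converter) while reading raw records."""
    for line in raw.splitlines():
        record = json.loads(line)
        # A field missing from a record becomes None instead of blocking ingest.
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two consumers project different schemas onto the same stored data.
latency_schema = {"page": str, "ms": int}
session_schema = {"user": str, "ts": str}

latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))
session_rows = list(read_with_schema(RAW_EVENTS, session_schema))
```

The third record has no `ms` field, yet it was stored without error; only the latency reader has to decide how to treat the gap.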

AWS Approach to Data Lake
Amazon S3 is the Data Lake

AWS Approach to Data Lake
Processing & Analytics
• Real-time: AWS Lambda; Apache Storm on EMR; Apache Flink on EMR; Spark Streaming on EMR; Elasticsearch Service; Kinesis Analytics, Kinesis Streams
• Batch: EMR (Hadoop, Spark, Presto); Redshift (Data Warehouse); Athena (Query Service)
• AI & Predictive: Amazon Lex (Speech recognition); Amazon Rekognition; Amazon Polly (Text to speech); Machine Learning (Predictive analytics)
• BI & Data Visualization
• Transactional & RDBMS: DynamoDB (NoSQL DB); Aurora (Relational Database)
• Kinesis Streams & Firehose

Redshift
Data Warehouse

Amazon Redshift overview

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor,
product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
“Amazon Redshift has the largest adoption of BDW in the cloud.”
“With more than 5,000 deployments, Amazon Redshift has the largest data
warehouse deployments in the cloud – some over 10 petabytes in size.”
AWS received a score of 5/5 (the highest score possible) in the customer
base, market awareness, ability to execute, road map, support, and
partners criteria.
Source: Forrester Wave Big Data Warehouse Q2 2017

Amazon Redshift – Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scales from gigabytes to exabytes
• Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
• Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
• Open file formats: Analyze optimized data formats on direct-attached disks, and all open data formats in Amazon S3
• Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance

Amazon Redshift is widely available
Currently available:
Ireland
Frankfurt
London
Beijing
Mumbai
Seoul
Singapore
Sydney
Tokyo
Sao Paulo
US East – N Virginia
US East – Ohio
US West – Oregon
US West – N California
GovCloud
Canada – Central, Montreal

Redshift Spectrum
Extend the data warehouse to your S3 data lake
Diagram: Amazon Redshift data, S3 data lake, Redshift Spectrum query engine
• Exabyte-scale Amazon Redshift SQL queries against S3
• Join data across Amazon Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• Parquet, ORC, Grok, Avro, & CSV data formats
• Pay only for the amount of data scanned
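Because cost follows the amount of data scanned, laying data out in partitions and reading only the partitions a query needs matters. The sketch below illustrates that idea in plain Python under stated assumptions: the in-memory catalog, the object keys, and the byte sizes are hypothetical, and this is not how Redshift Spectrum is implemented.

```python
from datetime import date

# Hypothetical catalog: partition date -> list of (object key, size in bytes).
PARTITIONS = {
    date(2017, 10, 1): [("orders/day=2017-10-01/part-0.parquet", 400)],
    date(2017, 10, 2): [
        ("orders/day=2017-10-02/part-0.parquet", 500),
        ("orders/day=2017-10-02/part-1.parquet", 300),
    ],
    date(2017, 10, 3): [("orders/day=2017-10-03/part-0.parquet", 200)],
}


def objects_to_scan(partitions, start, end):
    """Return only the objects whose partition date falls in [start, end)."""
    return [
        obj
        for day, objs in sorted(partitions.items())
        if start <= day < end
        for obj in objs
    ]


def bytes_scanned(objects):
    """Total size of the objects a query would have to read."""
    return sum(size for _key, size in objects)


# A query with no date filter touches every partition.
all_objects = objects_to_scan(PARTITIONS, date.min, date.max)
# A query filtered to one day touches only that day's partition.
pruned = objects_to_scan(PARTITIONS, date(2017, 10, 2), date(2017, 10, 3))
```

With these made-up sizes, the filtered query reads 800 bytes instead of 1,400; the same reasoning applies at any scale.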

Redshift Spectrum: Query your data lake
• Amazon Redshift: JDBC/ODBC
• Nodes: 1, 2, 3, 4, ... N
• Redshift Spectrum: Scale-out serverless compute
• AWS Glue: Data Catalog
• Amazon S3: Exabyte-scale object storage
Query:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …

Spectrum: Exabyte query in less than three minutes
SELECT
    P.ASIN,
    P.TITLE,
    R.POSTAL_CODE,
    P.RELEASE_DATE,
    SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
    s3.d_customer_order_item_details D,
    asin_attributes A,
    products P,
    regions R
WHERE
    D.ASIN = P.ASIN AND
    P.ASIN = A.ASIN AND
    D.REGION_ID = R.REGION_ID AND
    A.EDITION LIKE '%FIRST%' AND
    P.TITLE LIKE '%Potter%' AND
    P.AUTHOR = 'J. K. Rowling' AND
    R.COUNTRY_CODE = 'US' AND
    R.CITY = 'Seattle' AND
    R.STATE = 'WA' AND
    D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
    D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
• Roughly 140 TB of customer item order detail records for each day over the past 20 years
• 190 million files across 15,000 partitions in S3
• One partition per day for USA and rest of world
• Total data size is over an exabyte
Optimization:
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Amazon Redshift query optimizer: 40X
Query time:
• Hive (1000 nodes): 5 years
• Redshift Spectrum: 155 seconds
* Estimated using 20 node Hive cluster & 1.4TB, assuming linear scaling
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data; generated for this demo based on data format used by Amazon Retail.
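The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 EB = 1,000,000 TB), 365-day years, and that the listed optimization factors are independent and simply multiply; that last assumption is ours, since the slide does not say how the factors combine.

```python
from math import prod

TB_PER_DAY = 140
YEARS = 20
DAYS_PER_YEAR = 365  # leap days ignored for a rough estimate

# Data volume: 140 TB per day over 20 years.
total_tb = TB_PER_DAY * DAYS_PER_YEAR * YEARS
total_eb = total_tb / 1_000_000  # decimal units: 1 EB = 1,000,000 TB

# Partition count: one partition per day for each of two regions
# (USA and rest of world).
partitions = DAYS_PER_YEAR * YEARS * 2

# Optimization factors as listed on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}
combined_factor = prod(factors.values())  # assumes the factors multiply

# Quoted timings: 5 years on Hive versus 155 seconds on Redshift Spectrum.
hive_seconds = 5 * DAYS_PER_YEAR * 24 * 3600
speedup_vs_hive = hive_seconds / 155
```

Under these assumptions the data volume comes to about 1.02 EB and the partition count to 14,600, both consistent with the slide's "over an exabyte" and "15,000 partitions".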

• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• More than U.S. $1 trillion notional value is tied to our library of more than 41,000 global indexes
• Nasdaq technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies

• Nasdaq implements an S3 data lake + Redshift data warehouse architecture
• Most recent two years of data is kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries on data in S3
• Transitioned from an on-premises data warehouse to an Amazon Redshift & S3 data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set (~1100 tables)
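The age-based split described above (two most recent years in Redshift, two to five years in S3) can be expressed as a simple routing rule. This is a hypothetical sketch: the function name, the tier labels, and the 365-day year are illustrative assumptions, and the slides do not say what happens to data older than five years, so that case is returned as "unspecified".

```python
from datetime import date

HOT_YEARS = 2   # most recent two years: Redshift data warehouse
COLD_YEARS = 5  # two to five years old: S3


def storage_tier(record_date, today):
    """Pick the tier that serves queries for a record of the given date."""
    age_days = (today - record_date).days
    if age_days < 0:
        raise ValueError("record_date is in the future")
    if age_days <= HOT_YEARS * 365:
        return "redshift"
    if age_days <= COLD_YEARS * 365:
        return "s3"
    # The source does not describe data older than five years.
    return "unspecified"


today = date(2017, 11, 1)
recent = storage_tier(date(2017, 1, 1), today)
older = storage_tier(date(2014, 6, 1), today)
oldest = storage_tier(date(2010, 1, 1), today)
```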

High Level Architecture Overview

Can Handle Many Data Sources
• Internal DBs, CSV files, stream captures, etc.
• Data from all 7 exchanges operated by Nasdaq
• Orders, quotes, trade executions
• Market “tick” data
• Security master
• Membership
• All highly structured and consistent row-oriented data

AWS Architecture Passed Numerous Security Audits
Internal
• Information Security
• Internal Audit
• Nasdaq Risk Committee
External
• SOX
• US Securities & Exchange Commission (SEC)

Selected Amazon Redshift Customers

Selected Amazon Redshift Partners
Data Integration, Systems Integrators, Business Intelligence

Recent and upcoming launches

New Dense Compute Node - DC2
2X Performance @ Same Price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end reporting time with Amazon Redshift
dc2 nodes as compared to dc1” - Bradley Todd, Technical Architect, Liberty Mutual
Hardware: NVMe SSD, DDR4 Memory, Intel E5-2686 v4 (Broadwell)

Result Caching - Sub-second query response times
Analytics and BI / Dashboard tools query Amazon Redshift:
1. Queries go to leader node
2. If cache contains query result, it’s returned with no processing
3. If query is not in cache, it’s executed and result is cached
Results cache: QUERY_ID, RESULT
• In-memory leader node cache, resulting in sub-second response
• Transparent – it just works
• Skip WLM, Skip processing, Skip optimization
• Cache persists across sessions
• Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries
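The three-step flow on this slide (look up the query, return the cached result on a hit, execute and store on a miss) can be sketched as follows. This is an illustrative sketch, not Redshift's implementation: the `ResultCache` class and `slow_backend` function are hypothetical, the cache key is simply the query text, and invalidation when underlying data changes is not modeled.

```python
class ResultCache:
    """Toy result cache keyed by query text."""

    def __init__(self, execute):
        self._execute = execute  # callable that actually runs a query
        self._results = {}
        self.hits = 0
        self.misses = 0

    def query(self, sql):
        key = sql.strip()
        if key in self._results:
            # Cache hit: return the stored result with no processing.
            self.hits += 1
            return self._results[key]
        # Cache miss: execute the query, then cache the result.
        self.misses += 1
        result = self._execute(sql)
        self._results[key] = result
        return result


executed = []


def slow_backend(sql):
    """Stand-in for the cluster; records each query it really executes."""
    executed.append(sql)
    return [("result for", sql)]


cache = ResultCache(slow_backend)
first = cache.query("SELECT COUNT(*) FROM sales")
second = cache.query("SELECT COUNT(*) FROM sales")  # served from the cache
```

The repeated dashboard query reaches the backend only once, which is why repetitive workloads free up capacity for other queries.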

Result Caching: From the lab
• Higher is better! (Queries per hour)
• Read-write workload with a mix of small and large queries, Inserts, Copy and Vacuum
• 4-node ds2.8xL cluster
Chart: Query throughput (QPH) with result caching; workloads: Dashboard, Heavy Reporting; series: No Caching, Caching; plotted values: 138, 8, 2979, 117

Result Caching: A customer perspective
• Lower is better! (Query Latency)
• 4-node dc2.8xL cluster
• Tableau dashboard; 10-user test
Chart: query latency for various dashboard queries (names removed for confidentiality); series: Caching, No Caching
“That’s not a mistake...the results for average execution time on the
caching test run were sub-second and so don't show up on the y-axis at
this scale”

Short Query Acceleration – Express Lane for Short Queries
Diagram: Analytics and BI / Dashboard tools, Machine Learning Classifier, Amazon Redshift
• Short queries do not get stuck behind long running queries
• Higher throughput, less variability
• Customized for your workload
• Transparent – it just works!
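To see why an express lane helps, the toy simulation below compares one shared first-in, first-out lane with a setup in which queries predicted to be short run in their own lane. Everything here is an assumption for illustration: the workload, the 5-second threshold, and the `PREDICTED_SECONDS` table that stands in for the machine learning classifier. It does not describe how Redshift schedules queries internally.

```python
def fifo_waits(queries):
    """Wait time of each query when all share one first-in, first-out lane."""
    waits, clock = {}, 0
    for name, runtime in queries:
        waits[name] = clock
        clock += runtime
    return waits


def express_lane_waits(queries, predicted, threshold):
    """Wait times when predicted-short queries get a lane of their own."""
    short = [(n, r) for n, r in queries if predicted[n] <= threshold]
    long_running = [(n, r) for n, r in queries if predicted[n] > threshold]
    waits = fifo_waits(short)
    waits.update(fifo_waits(long_running))
    return waits


def average_wait(waits, names):
    return sum(waits[n] for n in names) / len(names)


# Hypothetical workload: (query name, runtime in seconds), in arrival order.
QUERIES = [
    ("long_report", 300),
    ("dash_1", 1),
    ("dash_2", 1),
    ("etl", 600),
    ("dash_3", 1),
]
# Stand-in for the classifier: here the prediction equals the true runtime.
PREDICTED_SECONDS = dict(QUERIES)
SHORT_THRESHOLD_SECONDS = 5  # hypothetical cut-off
SHORT = ["dash_1", "dash_2", "dash_3"]

shared = fifo_waits(QUERIES)
with_express = express_lane_waits(QUERIES, PREDICTED_SECONDS, SHORT_THRESHOLD_SECONDS)
```

In this made-up workload the average wait of the three dashboard queries drops from 501 seconds to 1 second once they no longer queue behind the report and the ETL job.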

Short Query Acceleration: Results
Chart series: No SQA, 5 concurrency; SQA, 5 concurrency
“This configuration showed a distinct improvement in short query
runtimes with the SQA feature enabled. Many of the shortest queries saw
a 5x or greater improvement while the longer running queries saw a
corresponding increase. This is exactly how we expect the feature to
work.”
• Average wait time reduces from 36 seconds to 0 for queries that execute under a second
• P90 wait time on a very busy cluster reduces from 370 seconds to 32.1 seconds

Coming soon: Nested data support
• Analyze nested and semi-structured data in Amazon S3 with Spectrum
• Allows easy ETL of nested data into Amazon Redshift using CTAS
• Support for open file formats: Parquet, ORC, JSON, Ion and Avro
• Uses dot notation to extend your existing SQL
s3data.clickStream: <<
  { "session_time": "20171013 14:05:00",
    "clicks": [ {"page": "/home", "referrer": ""},
                {"page": "/products", "referrer": "/home"} ]
  },
  { "session_time": "20171013 14:06:00",
    "clicks": [ {"page": "/contact", "referrer": "/home"} ]
  } >>
Example: Find click frequency for links on "/home":
SELECT c.page,
       COUNT(*) AS count
FROM s3data.clickStream s,
     s.clicks c
WHERE s.session_time > '2017-10-01 00:00:00'
  AND c.referrer = '/home'
GROUP BY c.page;
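For readers who want to check what the click-frequency query computes, here is a plain-Python equivalent run over the two sample sessions from the slide. It is only a sketch of the query's logic (flatten the nested clicks, filter, group, count); the function name is hypothetical, and parsing `session_time` with the format `%Y%m%d %H:%M:%S` is an assumption based on the sample values.

```python
from collections import Counter
from datetime import datetime

# The two sample sessions shown on the slide.
click_stream = [
    {
        "session_time": "20171013 14:05:00",
        "clicks": [
            {"page": "/home", "referrer": ""},
            {"page": "/products", "referrer": "/home"},
        ],
    },
    {
        "session_time": "20171013 14:06:00",
        "clicks": [{"page": "/contact", "referrer": "/home"}],
    },
]


def home_click_frequency(sessions, since):
    """Count clicks per page where the referrer is "/home"."""
    counts = Counter()
    for session in sessions:
        started = datetime.strptime(session["session_time"], "%Y%m%d %H:%M:%S")
        if started <= since:
            continue  # mirrors: WHERE s.session_time > ...
        # Iterating the nested list mirrors: FROM ... s, s.clicks c
        for click in session["clicks"]:
            if click["referrer"] == "/home":
                counts[click["page"]] += 1
    return counts


frequency = home_click_frequency(click_stream, datetime(2017, 10, 1))
```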

Coming soon: Nested data support
Improve query performance by analyzing nested data
Before: Orders and OrderItems are separate tables joined on OrderID
Orders (OrderID, CustomerID, OrderTime, ShipMode), with rows for OrderID 5 and 8
OrderItems
OrderID  ItemID  Quantity  Price
5        23      10.00     12.50
8        32      1.00      5.60
5        16      1.00      1.99
8        24      5.00      26.50
After: OrdersWithItems (OrderID, CustomerID, OrderTime, ShipMode, nested OrderItems)
Nested OrderItems column
ItemID  Quantity  Price
23      10.00     12.50
16      1.00      1.99
32      1.00      5.60
24      5.00      26.50
To improve query performance, the new OrdersWithItems table includes
OrderItems as a nested column, eliminating join processing.
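The effect of nesting can be sketched with the OrderItems rows from the slide: once each order carries its own items, a per-order calculation no longer needs a join against a separate table. The helper names below are hypothetical, and the dictionary layout is just one way to model a nested column in Python.

```python
from collections import defaultdict

# OrderItems rows from the slide: (OrderID, ItemID, Quantity, Price).
ORDER_ITEMS = [
    (5, 23, 10.00, 12.50),
    (8, 32, 1.00, 5.60),
    (5, 16, 1.00, 1.99),
    (8, 24, 5.00, 26.50),
]


def nest_items(order_items):
    """Group flat item rows into a nested list per order."""
    nested = defaultdict(list)
    for order_id, item_id, quantity, price in order_items:
        nested[order_id].append(
            {"ItemID": item_id, "Quantity": quantity, "Price": price}
        )
    return dict(nested)


orders_with_items = nest_items(ORDER_ITEMS)


def order_total(order_id):
    """Sum Quantity * Price using only the nested items of one order."""
    return sum(
        item["Quantity"] * item["Price"] for item in orders_with_items[order_id]
    )
```

Reading one order now touches a single nested record, for example `order_total(5)` only looks at the two items stored with order 5.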

Coming Soon: Enhanced Monitoring
• Optimize your Amazon Redshift cluster for peak performance by using query throughput metrics
• Get greater insights into your cluster performance by accessing database and workload metrics
• Get alerts and notifications via Amazon SNS
Monitor query latency and throughput to optimize your workload

Conclusion
• Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse.
• Decoupled storage and compute is cheaper and more efficient to operate.
• Use Redshift as your DW and Redshift Spectrum to scale to exabytes.
• Gain flexibility to use all the analytics tools in the ecosystem around S3 and future-proof the architecture.

Try Amazon Redshift for FREE
Get 2 months free
aws.amazon.com/redshift/free-trial/

THANK YOU!
