© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Migrating Your Traditional Data
Warehouse to a Modern Data Lake
Level 200
Osemeke Isibor
Solutions Architect
Agenda
• Challenges of Traditional Data Architectures
• Data lake architecture
• Benefits of a data lake
• AWS approach to the data lake
• Extending your data lake with the Amazon Redshift data warehouse service
• Redshift Spectrum
• Customer story
• Recent and upcoming launches
• Conclusion and Q&A
Unlocking Data
Most companies and organizations are embarking on
ambitious innovation initiatives to unlock their data.
The data already exists but goes unused or is locked away
from complementary data sets in isolated data silos.
Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t deal with dynamic data and real-time processing
• Can’t deal with format diversity and change rate
• Complex ETL processes
• Difficult to find people with adequate skills to configure and manage these systems
• Can’t integrate with the explosion of available social and behavior tracking data
Legacy Data Architectures Exist as Isolated and Monolithic Data Silos
• Hadoop cluster
• SQL database
• Data warehouse appliance
Enter Data Lake Architectures
A data lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogeneous
types of data.
Benefits of a Data Lake – All Data in One
Place
Store and analyze all of your data,
from all of your sources, in one
centralized location.
“Why is the data distributed in
many locations? Where is the
single source of truth?”
Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”
Benefits of a Data Lake – Storage vs
Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
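Schema on read can be illustrated with a minimal Python sketch. It is not taken from the deck: the sample records, the `read_with_schema` helper, and both schemas are hypothetical. The idea it shows is that the raw data is stored once with no schema enforced, and each consumer applies its own column names and types at read time.

```python
import csv
import io

# Raw records as they might land in the lake: stored once, no schema enforced
# at write time. (Hypothetical sample data.)
RAW = "2017-10-13,5,12.50\n2017-10-14,8,5.60\n"


def read_with_schema(raw, schema):
    """Apply a schema while reading.

    `schema` maps a column name to (position in the raw record, converter).
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        rows.append(
            {name: convert(fields[pos]) for name, (pos, convert) in schema.items()}
        )
    return rows


# Two hypothetical consumers project different schemas onto the same bytes.
sales_schema = {"order_id": (1, int), "price": (2, float)}
audit_schema = {"day": (0, str)}

sales = read_with_schema(RAW, sales_schema)
audit = read_with_schema(RAW, audit_schema)

print(sales)
print(audit)
```

Neither consumer required the data to be reshaped or reloaded, which is what lets several analytics frameworks share one copy.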
AWS Approach to Data Lake
Amazon S3
is the Data Lake
AWS Approach to Data Lake
Processing & Analytics
• Real-time: AWS Lambda; Apache Storm on EMR; Apache Flink on EMR; Spark Streaming on EMR; Elasticsearch Service; Kinesis Analytics, Kinesis Streams
• Batch: EMR (Hadoop, Spark, Presto); Redshift (data warehouse); Athena (query service)
• AI & Predictive: Amazon Lex (speech recognition); Amazon Rekognition; Amazon Polly (text to speech); Machine Learning (predictive analytics)
• BI & Data Visualization
• Transactional & RDBMS: DynamoDB (NoSQL DB); Aurora (relational database)
• Kinesis Streams & Firehose
Redshift
Data Warehouse
Amazon Redshift overview
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor,
product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
“Amazon Redshift has the largest adoption
of BDW in the cloud.”
“With more than 5,000 deployments, Amazon
Redshift has the largest data warehouse
deployments in the cloud – some over 10
petabytes in size.”
AWS received a score of 5/5 (the highest
score possible) in the: customer base,
market awareness, ability to execute, road
map, support, and partners criteria
Forrester Wave Big Data Warehouse Q2 2017
Amazon Redshift – Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scales from gigabytes to exabytes
• Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance
• Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
• Open file formats: analyze optimized data formats on direct-attached disks, and all open data formats in Amazon S3
• Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
Amazon Redshift is widely available
Ireland
Frankfurt
London
Beijing
Mumbai
Seoul
Singapore
Sydney
Tokyo
Sao Paulo
US East – N Virginia
US East – Ohio
US West – Oregon
US West – N California
GovCloud
Canada – Central, Montreal
Currently Available
Redshift Spectrum
Extend the data warehouse to your S3 data lake
Diagram labels: S3 data lake; Amazon Redshift data; Redshift Spectrum query engine
Exabyte-scale Amazon Redshift SQL queries against S3
Join data across Amazon Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
Parquet, ORC, Grok, Avro, & CSV data formats
Pay only for the amount of data scanned
Redshift Spectrum: Query your data lake
Diagram labels:
• Amazon Redshift (JDBC/ODBC)
• Redshift Spectrum: scale-out serverless compute (nodes 1, 2, 3, 4, ... N)
• AWS Glue Data Catalog
• Amazon S3: exabyte-scale object storage
Query:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
Spectrum: Exabyte query in less than three minutes
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
• Roughly 140 TB of customer item order detail records for each day over the past 20 years
• 190 million files across 15,000 partitions in S3
• One partition per day for USA and rest of world
• Total data size is over an exabyte
Optimization:
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Amazon Redshift query optimizer: 40X
Query time:
• Hive (1000 nodes): 5 years
• Redshift Spectrum: 155 seconds
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data format
used by Amazon Retail.
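The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted on the slide. Three things are assumptions for illustration, not stated in the deck: a 365-day year, decimal units (1 EB = 1,000,000 TB), and exactly two partitions per day (USA and rest of world).

```python
# Figures quoted on the slide.
TB_PER_DAY = 140
YEARS = 20
FILES = 190_000_000
PARTITIONS = 15_000
HIVE_ESTIMATE_YEARS = 5
SPECTRUM_SECONDS = 155

DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Total volume: daily volume times number of days.
total_tb = TB_PER_DAY * DAYS_PER_YEAR * YEARS
total_eb = total_tb / 1_000_000  # assumption: decimal units

# One partition per day for each of two regions (USA, rest of world).
expected_partitions = DAYS_PER_YEAR * YEARS * 2

# Average number of files behind each partition.
files_per_partition = FILES / PARTITIONS

# Ratio between the estimated Hive runtime and the measured Spectrum runtime.
hive_seconds = HIVE_ESTIMATE_YEARS * DAYS_PER_YEAR * 24 * 3600
speedup = hive_seconds / SPECTRUM_SECONDS

print(f"total data: {total_eb:.3f} EB")
print(f"expected partitions: {expected_partitions}")
print(f"files per partition: {files_per_partition:.0f}")
print(f"Hive estimate / Spectrum runtime: {speedup:,.0f}x")
```

Under these assumptions the totals come out at about 1.02 EB and 14,600 partitions, consistent with the slide's "over an exabyte" and "15,000 partitions". The two runtimes differ by a factor of roughly one million.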
• Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes
• Nasdaq technology is used to power more than 100 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds
• We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes and geographies
• Nasdaq implements an S3 data lake + Redshift data warehouse architecture
• Most recent two years of data is kept in the Redshift data warehouse and
snapshotted into S3 for disaster recovery
• Data between two and five years old is kept in S3
• Presto on EMR is used for ad hoc queries on data in S3
• Transitioned from an on-premises data warehouse to Amazon Redshift & S3
data lake architecture
• Over 1,000 tables migrated
• Average daily ingest of over 7B rows
• Migrated off legacy DW to AWS (start to finish) in 7 man-months
• AWS costs were 43% of legacy budget for the same data set (~1100 tables)
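The age-based split described above (most recent two years in Redshift, two to five years in S3) can be sketched as a small routing function. This is an illustrative sketch, not Nasdaq's implementation: the function names, the handling of exact boundaries, and the "unspecified" result for data older than five years (which the slide does not cover) are all assumptions.

```python
from datetime import date


def years_before(day, years):
    """Return the calendar date `years` years before `day` (Feb 29 maps to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def storage_tier(record_day, today):
    """Pick a storage tier from the record's age.

    Assumption: a record exactly two years old already counts as S3 data.
    """
    if record_day > years_before(today, 2):
        return "redshift"  # most recent two years
    if record_day > years_before(today, 5):
        return "s3"  # two to five years old
    return "unspecified"  # the slide does not say where older data lives


today = date(2017, 11, 1)
for day in (date(2017, 1, 15), date(2014, 6, 1), date(2010, 1, 1)):
    print(day, "->", storage_tier(day, today))
```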
High Level Architecture Overview
Can Handle Many Data Sources
• Internal DBs, CSV files, stream captures, etc.
• Data from all 7 exchanges operated by Nasdaq
• Orders, quotes, trade executions
• Market “tick” data
• Security master
• Membership
• All highly structured and consistent row-oriented data
AWS Architecture Passed Numerous Security Audits
Internal
• Information Security
• Internal Audit
• Nasdaq Risk Committee
External
• SOX
• US Securities & Exchange Commission (SEC)
Selected Amazon Redshift Customers
Selected Amazon Redshift Partners
Data Integration, Systems Integrators, Business Intelligence
Recent and upcoming launches
New Dense Compute Node - DC2
2X Performance @ Same Price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end reporting time
with Amazon Redshift dc2 nodes as compared to dc1”
- Bradley Todd,
Technical Architect, Liberty Mutual
NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell)
Result Caching - Sub-second query response times
Analytics and BI / dashboard tools send queries to Amazon Redshift:
1. Queries go to the leader node
2. If the cache contains the query result, it’s returned with no processing
3. If the query is not in the cache, it’s executed and the result is cached
The results cache maps QUERY_ID to RESULT.
• In-memory leader node cache, resulting in sub-second response
• Transparent – it just works
• Skip WLM, skip processing, skip optimization
• Cache persists across sessions
• Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries
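The lookup-then-execute flow described on this slide can be sketched in a few lines of Python. This is a simplified illustration, not how Amazon Redshift implements it: the `ResultCache` class and `run_on_cluster` function are hypothetical, the cache is keyed on the exact query text, and invalidation when underlying data changes is left out entirely.

```python
class ResultCache:
    """Return a cached result when one exists; otherwise execute and cache."""

    def __init__(self, execute):
        self._execute = execute  # callable that actually runs the query
        self._results = {}       # query text -> result
        self.executions = 0      # how many queries reached the executor

    def query(self, sql):
        if sql in self._results:
            # Cache hit: no execution at all.
            return self._results[sql]
        # Cache miss: execute, then remember the result.
        self.executions += 1
        result = self._execute(sql)
        self._results[sql] = result
        return result


def run_on_cluster(sql):
    """Hypothetical stand-in for real query execution."""
    return f"rows for: {sql}"


cache = ResultCache(run_on_cluster)
first = cache.query("SELECT COUNT(*) FROM sales")
second = cache.query("SELECT COUNT(*) FROM sales")  # served from the cache
print(first == second, cache.executions)
```

A repeated dashboard query reaches the executor only once, which is why repetitive workloads benefit most.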
Result Caching: From the lab
• Higher is better! (Queries per hour)
• Read-write workload with a mix of small and large queries, inserts, COPY, and VACUUM
• 4-node ds2.8xL cluster
Query throughput (QPH) with result caching:
Workload          No Caching   Caching
Dashboard         138          2979
Heavy Reporting   8            117
Result Caching: A customer perspective
• Lower is better! (Query Latency)
• 4-node dc2.8xL cluster
• Tableau dashboard; 10-user test
Chart legend: Caching, No Caching
Chart x-axis: various dashboard queries (names removed for confidentiality)
“That’s not a mistake... the results for average
execution time on the caching test run were
sub-second and so don't show up on the y-axis
at this scale”
Short Query Acceleration – Express lane for short queries
Diagram labels: Analytics and BI / Dashboard tools; Machine Learning Classifier; Amazon Redshift
• Short queries do not get stuck behind long-running queries
• Higher throughput, less variability
• Customized for your workload
• Transparent – it just works!
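The express-lane idea can be sketched as routing on a predicted runtime. Everything below is hypothetical and only illustrates the concept: the one-second cutoff, the `route` function, and the lookup table standing in for the machine learning classifier are assumptions, not Redshift's actual mechanism.

```python
from collections import deque

SHORT_QUERY_SECONDS = 1.0  # hypothetical cutoff for the express lane


def route(queries, predict_seconds, threshold=SHORT_QUERY_SECONDS):
    """Split queries into an express queue and a regular queue.

    `predict_seconds` stands in for the classifier: it returns an estimated
    runtime for a query before the query runs.
    """
    express, regular = deque(), deque()
    for query in queries:
        if predict_seconds(query) <= threshold:
            express.append(query)
        else:
            regular.append(query)
    return express, regular


# Hypothetical predictions, keyed by query name.
predicted = {"dashboard tile": 0.2, "month-end report": 900.0, "lookup": 0.05}

express, regular = route(predicted, predicted.get)
print(list(express))
print(list(regular))
```

Short queries get their own queue, so they no longer wait behind the long-running report.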
Short Query Acceleration: Results
Chart series: No SQA (concurrency 5) and SQA (concurrency 5)
“This configuration showed a distinct
improvement in short query runtimes
with the SQA feature enabled. Many
of the shortest queries saw a 5x or
greater improvement while the longer
running queries saw a corresponding
increase. This is exactly how we
expect the feature to work.”
• Average wait time drops from 36 seconds to 0 for queries that execute in under a second
• P90 wait time on a very busy cluster drops from 370 seconds to 32.1 seconds
Coming soon: Nested data support
• Analyze nested and semi-structured data in Amazon S3 with Spectrum
• Allows easy ETL of nested data into Amazon Redshift using CTAS
• Support for open file formats: Parquet, ORC, JSON, Ion, and Avro
• Uses dot notation to extend your existing SQL
Example: Find click frequency for links on “/home”:
s3data.clickStream: <<
  { "session_time": "20171013 14:05:00",
    "clicks": [ {"page": "/home", "referrer": ""},
                {"page": "/products", "referrer": "/home"} ]
  },
  { "session_time": "20171013 14:06:00",
    "clicks": [ {"page": "/contact", "referrer": "/home"} ]
  } >>
SELECT c.page,
       COUNT(*) AS count
FROM s3data.clickStream s,
     s.clicks c
WHERE s.session_time > '2017-10-01 00:00:00'
  AND c.referrer = '/home'
GROUP BY c.page;
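To make the semantics of the example query concrete, the same logic can be traced in plain Python over the two sample sessions from the slide. This sketch only mirrors what the query computes (un-nest the clicks, filter, group, count). It says nothing about how Spectrum executes it, and parsing `session_time` with `strptime` is an assumption about the intended comparison.

```python
from collections import Counter
from datetime import datetime

# The two sample sessions shown on the slide.
sessions = [
    {
        "session_time": "20171013 14:05:00",
        "clicks": [
            {"page": "/home", "referrer": ""},
            {"page": "/products", "referrer": "/home"},
        ],
    },
    {
        "session_time": "20171013 14:06:00",
        "clicks": [{"page": "/contact", "referrer": "/home"}],
    },
]

cutoff = datetime(2017, 10, 1)  # WHERE s.session_time > '2017-10-01 00:00:00'

counts = Counter()
for s in sessions:
    when = datetime.strptime(s["session_time"], "%Y%m%d %H:%M:%S")
    if when <= cutoff:
        continue
    # "FROM s3data.clickStream s, s.clicks c": one row per nested click.
    for c in s["clicks"]:
        if c["referrer"] == "/home":  # AND c.referrer = '/home'
            counts[c["page"]] += 1    # GROUP BY c.page, COUNT(*)

for page, count in sorted(counts.items()):
    print(page, count)
```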
Coming soon: Nested data support
Improve query performance by analyzing nested data
Before: two tables, joined at query time
Orders
OrderID  CustomerID  OrderTime  ShipMode
5        23          10.00      12.50
8        32          1.00       5.60
OrderItems
OrderID  ItemID  Quantity  Price
5        23      10.00     12.50
8        32      1.00      5.60
5        16      1.00      1.99
8        24      5.00      26.50
After: Orders with OrdersWithItems as a nested column
OrderID  CustomerID  OrderTime  ShipMode  OrdersWithItems (ItemID, Quantity, Price)
5        23          10.00      12.50     23, 10.00, 12.50
                                          16, 1.00, 1.99
8        32          1.00       5.60      32, 1.00, 5.60
                                          24, 5.00, 26.50
To improve query performance, the new Orders table includes OrdersWithItems as a nested column, eliminating join processing.
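The effect of nesting can be shown with the item rows from the slide. The Python sketch below is illustrative only, with hypothetical variable names. It computes each order's total twice: once by matching orders against a separate flat items table (a join), and once by reading the items directly from a nested column. `Decimal` is used so the totals are exact.

```python
from decimal import Decimal

# Flat OrderItems rows from the slide: (OrderID, ItemID, Quantity, Price).
order_items = [
    (5, 23, Decimal("10.00"), Decimal("12.50")),
    (8, 32, Decimal("1.00"), Decimal("5.60")),
    (5, 16, Decimal("1.00"), Decimal("1.99")),
    (8, 24, Decimal("5.00"), Decimal("26.50")),
]
order_ids = [5, 8]

# Flat layout: every order has to be matched against the items table.
joined_totals = {
    oid: sum((qty * price for o, _, qty, price in order_items if o == oid), Decimal("0"))
    for oid in order_ids
}

# Nested layout: each order row already carries its own items.
orders_nested = [
    {"OrderID": 5, "OrdersWithItems": [
        {"ItemID": 23, "Quantity": Decimal("10.00"), "Price": Decimal("12.50")},
        {"ItemID": 16, "Quantity": Decimal("1.00"), "Price": Decimal("1.99")},
    ]},
    {"OrderID": 8, "OrdersWithItems": [
        {"ItemID": 32, "Quantity": Decimal("1.00"), "Price": Decimal("5.60")},
        {"ItemID": 24, "Quantity": Decimal("5.00"), "Price": Decimal("26.50")},
    ]},
]
nested_totals = {
    order["OrderID"]: sum(
        (item["Quantity"] * item["Price"] for item in order["OrdersWithItems"]),
        Decimal("0"),
    )
    for order in orders_nested
}

print(joined_totals == nested_totals)
```

Both layouts give the same totals. The nested one gets there without a matching step across two tables.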
Coming Soon: Enhanced Monitoring
Optimize your Amazon Redshift cluster for
peak performance by using query throughput
metrics
Get greater insights into your cluster
performance by accessing database and
workload metrics
Get alerts and notifications via Amazon SNS
Monitor query latency and throughput to optimize your workload
Conclusion
• Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse.
• Decoupled storage and compute is cheaper and more efficient to operate.
• Use Redshift as your DW and Redshift Spectrum to scale to exabytes.
• Gain flexibility to use all the analytics tools in the ecosystem around S3, and future-proof the architecture.
Try Amazon Redshift for FREE
Get 2 months free
aws.amazon.com/redshift/free-trial/
THANK YOU!

OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Migrating your traditional Data Warehouse to a Modern Data Lake

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Migrating Your Traditional Data Warehouse to a Modern Data Lake Level 200 Osemeke Isibor Solutions Architect
  • 2. Agenda • Challenges of Traditional Data Architectures • Data lake Architecture • Benefits of Data lake • AWS Approach to Data lake • Extending your Data lake with Amazon Redshift Data warehouse service • Redshift Spectrum • Customer story • Recent and upcoming launches • Conclusion and Q&A
  • 3. Unlocking Data: Most companies and organizations are embarking on ambitious innovation initiatives to unlock their data. The data already exists but goes unused or is locked away from complementary data sets in isolated data silos.
  • 4. Challenges with Legacy Data Architectures • Can’t move data across silos • Can’t deal with dynamic data and real-time processing • Can’t deal with format diversity and change rate • Complex ETL processes • Difficult to find people with adequate skills to configure and manage these systems • Can’t integrate with the explosion of available social and behavior tracking data
  • 5. Legacy Data Architectures Exist as Isolated and Monolithic Data Silos: Hadoop Cluster, SQL Database, Data Warehouse Appliance
  • 6. Enter Data Lake Architectures: A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data.
  • 7. Benefits of a Data Lake – All Data in One Place: Store and analyze all of your data, from all of your sources, in one centralized location. “Why is the data distributed in many locations? Where is the single source of truth?”
  • 8. Benefits of a Data Lake – Quick Ingest: Quickly ingest data without needing to force it into a pre-defined schema. “How can I collect data quickly from various sources and store it efficiently?”
  • 9. Benefits of a Data Lake – Storage vs Compute: Separating your storage and compute allows you to scale each component as required. “How can I scale up with the volume of data being generated?”
  • 10. Benefits of a Data Lake – Schema on Read: “Is there a way I can apply multiple analytics and processing frameworks to the same data?” A data lake enables ad-hoc analysis by applying schemas on read, not on write.
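As a minimal sketch of schema on read (not an AWS API; the sample data, field names, and the `read_with_schema` helper are all hypothetical), the following Python stores raw text untouched and lets two consumers apply different schemas to the same bytes at read time:

```python
import csv
import io

# Raw data is kept exactly as it arrived; no schema was enforced on write.
# (Hypothetical sample: day, page, view count.)
RAW_EVENTS = """2017-10-13,/home,120
2017-10-13,/products,45
2017-10-14,/home,98
"""

def read_with_schema(raw_text, schema):
    """Apply a schema, given as (name, converter) pairs, while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a schema may ignore trailing fields.
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, fields)})
    return rows

# Two consumers interpret the same stored data differently.
TRAFFIC_SCHEMA = [("day", str), ("page", str), ("views", int)]
PAGES_ONLY_SCHEMA = [("day", str), ("page", str)]

traffic = read_with_schema(RAW_EVENTS, TRAFFIC_SCHEMA)
pages = read_with_schema(RAW_EVENTS, PAGES_ONLY_SCHEMA)

total_home_views = sum(r["views"] for r in traffic if r["page"] == "/home")
distinct_pages = sorted({r["page"] for r in pages})
print(total_home_views)  # 218
print(distinct_pages)    # ['/home', '/products']
```

Because nothing was fixed at write time, adding a third consumer only requires a new schema, not a reload of the data.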
  • 11. AWS Approach to Data Lake
  • 12. AWS Approach to Data Lake: Amazon S3 is the Data Lake
  • 13. Processing & Analytics. Categories: Real-time; Batch; AI & Predictive; BI & Data Visualization; Transactional & RDBMS. Services shown: AWS Lambda; Apache Storm on EMR; Apache Flink on EMR; Spark Streaming on EMR; Elasticsearch Service; Kinesis Analytics, Kinesis Streams; DynamoDB (NoSQL DB); Aurora (Relational Database); EMR (Hadoop, Spark, Presto); Redshift (Data Warehouse); Athena (Query Service); Amazon Lex (Speech recognition); Amazon Rekognition; Amazon Polly (Text to speech); Machine Learning (Predictive analytics); Kinesis Streams & Firehose
  • 14. Redshift Data Warehouse
  • 15. Amazon Redshift overview
  • 16. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. “Amazon Redshift has the largest adoption of BDW in the cloud.” “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud – some over 10 petabytes in size.” AWS received a score of 5/5 (the highest score possible) in the customer base, market awareness, ability to execute, road map, support, and partners criteria. Source: Forrester Wave Big Data Warehouse Q2 2017
  • 17. Amazon Redshift – Data Warehousing: Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost. Massively parallel; scales from gigabytes to exabytes. • Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance • Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour • Open file formats: analyze optimized data formats on direct-attached disks, and all open data formats in Amazon S3 • Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
  • 18. Amazon Redshift is widely available. Currently available: Ireland; Frankfurt; London; Beijing; Mumbai; Seoul; Singapore; Sydney; Tokyo; São Paulo; US East – N Virginia; US East – Ohio; US West – Oregon; US West – N California; GovCloud; Canada – Central, Montreal
  • 19. Redshift Spectrum: Extend the data warehouse to your S3 data lake. Diagram: the Redshift Spectrum query engine spans Amazon Redshift data and the S3 data lake. • Exabyte-scale Amazon Redshift SQL queries against S3 • Join data across Amazon Redshift and S3 • Scale compute and storage separately • Stable query performance and unlimited concurrency • Parquet, ORC, Grok, Avro, & CSV data formats • Pay only for the amount of data scanned
  • 20. Redshift Spectrum: Query your data lake. Diagram: a query (SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …) reaches Amazon Redshift over JDBC/ODBC; Redshift Spectrum (scale-out serverless compute, nodes 1, 2, 3, 4 ... N) reads Amazon S3 (exabyte-scale object storage) using the AWS Glue Data Catalog.
  • 21. Spectrum: Exabyte query in less than three minutes. Query: SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20; Data set: • Roughly 140 TB of customer item order detail records for each day over the past 20 years • 190 million files across 15,000 partitions in S3 • One partition per day for USA and rest of world • Total data size is over an exabyte. Optimization: • Compression: 5X • Columnar file format: 10X • Scanning with 2500 nodes: 2500X • Static partition elimination: 2X • Dynamic partition elimination: 350X • Amazon Redshift query optimizer: 40X. Result: Hive (1000 nodes): 5 years; Redshift Spectrum: 155 seconds. * Estimated using 20 node Hive cluster & 1.4TB, assume linear * Query used a 20 node DC1.8XLarge Amazon Redshift cluster * Not actual sales data - generated for this demo based on data format used by Amazon Retail.
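The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the decimal unit convention (1 EB = 1,000,000 TB) and the 365-day year are assumptions made for the calculation:

```python
# Data volume: roughly 140 TB per day over 20 years.
TB_PER_DAY = 140
YEARS = 20
total_tb = TB_PER_DAY * 365 * YEARS   # assumes 365-day years
total_eb = total_tb / 1_000_000       # assumes decimal units

# Optimization factors as listed on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}
combined = 1
for factor in factors.values():
    combined *= factor

# Quoted runtimes: 5 years (Hive estimate) versus 155 seconds (Spectrum).
hive_seconds = 5 * 365 * 24 * 3600
speedup = hive_seconds / 155

print(total_tb, total_eb)   # 1022000 1.022
print(combined)             # 3500000000
print(round(speedup))       # 1017290
```

The volume works out to about 1.02 EB, consistent with "over an exabyte". The product of the listed factors (3.5 billion) is much larger than the ratio of the two quoted runtimes (about one million); the slide does not state the baseline against which the individual factors are measured, so the two numbers should not be compared directly.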
  • 22. Nasdaq lists 3,600 global companies worth $9.6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands. More than U.S. $1 trillion in notional value is tied to our library of more than 41,000 global indexes. Nasdaq technology is used to power more than 100 marketplaces in 50 countries. Our global platform can handle more than 1 million messages/second at sub-40 microsecond average speeds. We own and operate 26 markets, including 1 clearinghouse and 5 central securities depositories, across asset classes & geographies.
  • 23. • Nasdaq implements an S3 data lake + Redshift data warehouse architecture • Most recent two years of data is kept in the Redshift data warehouse and snapshotted into S3 for disaster recovery • Data between two and five years old is kept in S3 • Presto on EMR is used to ad-hoc query data in S3 • Transitioned from an on-premises data warehouse to Amazon Redshift & S3 data lake architecture • Over 1,000 tables migrated • Average daily ingest of over 7B rows • Migrated off legacy DW to AWS (start to finish) in 7 man-months • AWS costs were 43% of legacy budget for the same data set (~1100 tables)
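The age-based placement described on this slide can be sketched as a small routing function. This is an illustration only, not Nasdaq's implementation: the function name, the tier labels, and the 365-day year approximation are assumptions, and the slide does not say what happens to data older than five years.

```python
from datetime import date

def storage_tier(record_date, today):
    """Pick a storage tier from the record's age (hypothetical helper)."""
    age_days = (today - record_date).days
    if age_days <= 2 * 365:
        # Most recent two years: Redshift (also snapshotted to S3 for DR).
        return "redshift"
    if age_days <= 5 * 365:
        # Two to five years old: S3, queried ad hoc with Presto on EMR.
        return "s3"
    # Older data is not covered by the slide.
    return "unspecified"

today = date(2017, 11, 1)
print(storage_tier(date(2017, 1, 1), today))  # redshift
print(storage_tier(date(2014, 6, 1), today))  # s3
print(storage_tier(date(2010, 1, 1), today))  # unspecified
```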
  • 24. High Level Architecture Overview
  • 25. Can Handle Many Data Sources • Internal DBs, CSV files, stream captures, etc. • Data from all 7 exchanges operated by Nasdaq • Orders, quotes, trade executions • Market “tick” data • Security master • Membership • All highly structured and consistent row-oriented data
  • 26. AWS Architecture Passed Numerous Security Audits. Internal: • Information Security • Internal Audit • Nasdaq Risk Committee. External: • SOX • US Securities & Exchange Commission (SEC)
  • 27. Selected Amazon Redshift Customers
  • 28. Selected Amazon Redshift Partners: Data Integration, Systems Integrators, Business Intelligence
  • 29. Recent and upcoming launches
  • 30. New Dense Compute Node – DC2: 2X performance at the same price as DC1; 3x more I/O with 30% better storage utilization than DC1. Hardware: NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell). “We saw a 9x reduction in month-end reporting time with Amazon Redshift dc2 nodes as compared to dc1” - Bradley Todd, Technical Architect, Liberty Mutual
  • 31. Result Caching – Sub-second query response times. Diagram: BI / dashboard tools and analytics send queries to Amazon Redshift, where a results cache maps QUERY_ID to RESULT. Flow: 1. Queries go to the leader node. 2. If the cache contains the query result, it’s returned with no processing. 3. If the query is not in the cache, it’s executed and the result is cached. • In-memory leader node cache, resulting in sub-second response • Transparent – it just works • Skip WLM, skip processing, skip optimization • Cache persists across sessions • Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries
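The three-step flow above can be illustrated with a minimal in-memory cache. This is a conceptual sketch, not Redshift's implementation: the `ResultCache` class and `fake_execute` function are hypothetical, the cache key is simply the exact query text, and invalidation when the underlying data changes is deliberately left out.

```python
class ResultCache:
    """Toy leader-node cache: query text -> previously computed result."""

    def __init__(self, execute):
        self._execute = execute   # function that actually runs a query
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, sql):
        if sql in self._results:
            # Step 2: cached result is returned with no processing.
            self.hits += 1
            return self._results[sql]
        # Step 3: not cached, so execute and remember the result.
        self.misses += 1
        result = self._execute(sql)
        self._results[sql] = result
        return result

executed = []

def fake_execute(sql):
    """Stand-in for the cluster; records each query it really runs."""
    executed.append(sql)
    return f"rows for: {sql}"

cache = ResultCache(fake_execute)
first = cache.run("SELECT COUNT(*) FROM sales")
second = cache.run("SELECT COUNT(*) FROM sales")  # served from cache
print(cache.hits, cache.misses, len(executed))    # 1 1 1
```

Repeated dashboard queries hit the cache and never reach the executor, which is why caching frees the cluster for non-repetitive work.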
  • 32. Result Caching: From the lab • Higher is better! (Queries per hour) • Read-write workload with a mix of small and large queries, Inserts, Copy and Vacuum • 4-node ds2.8xL cluster. Chart: query throughput (QPH) with result caching, for Dashboard and Heavy Reporting workloads, No Caching vs. Caching; values shown: 138, 8, 2979, 117.
  • 33. Result Caching: A customer perspective • Lower is better! (Query latency) • 4-node dc2.8xL cluster • Tableau dashboard; 10-user test. Chart: Caching vs. No Caching across various dashboard queries (names removed for confidentiality). “That’s not a mistake...the results for average execution time on the caching test run were sub-second and so don't show up on the y-axis at this scale”
  • 34. Short Query Acceleration – Express lane for short queries. Diagram: queries from BI / dashboard tools and analytics pass through a machine learning classifier into Amazon Redshift. • Short queries do not get stuck behind long running queries • Higher throughput, less variability • Customized for your workload • Transparent – it just works!
  • 35. Short Query Acceleration: Results. Chart: No SQA, 5 concurrency vs. SQA, 5 concurrency. “This configuration showed a distinct improvement in short query runtimes with the SQA feature enabled. Many of the shortest queries saw a 5x or greater improvement while the longer running queries saw a corresponding increase. This is exactly how we expect the feature to work.” • Average wait time reduces from 36 seconds to 0 for queries that execute under a second • P90 wait time on a very busy cluster reduces from 370 seconds to 32.1 seconds
  • 36. Coming soon: Nested data support • Analyze nested and semi-structured data in Amazon S3 with Spectrum • Allows easy ETL of nested data into Amazon Redshift using CTAS • Support for open file formats: Parquet, ORC, JSON, Ion and Avro • Uses dot notation to extend your existing SQL. Sample data, s3data.clickStream: << { "session_time": "20171013 14:05:00", "clicks": [ {"page": "/home", "referrer": ""}, {"page": "/products", "referrer": "/home"} ] }, { "session_time": "20171013 14:06:00", "clicks": [ {"page": "/contact", "referrer": "/home"} ] } >> Example: Find click frequency for links on "/home": SELECT c.page, COUNT(*) AS count FROM s3data.clickStream s, s.clicks c WHERE s.session_time > '2017-10-01 00:00:00' AND c.referrer = '/home' GROUP BY c.page;
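To make the semantics of the example query concrete, the sketch below evaluates the same logic over the slide's sample data in plain Python. It is an emulation for illustration, not Spectrum itself; parsing `session_time` with the format string `%Y%m%d %H:%M:%S` is an assumption based on the sample values.

```python
from collections import Counter
from datetime import datetime

# Sample data copied from the slide (s3data.clickStream).
click_stream = [
    {"session_time": "20171013 14:05:00",
     "clicks": [{"page": "/home", "referrer": ""},
                {"page": "/products", "referrer": "/home"}]},
    {"session_time": "20171013 14:06:00",
     "clicks": [{"page": "/contact", "referrer": "/home"}]},
]

cutoff = datetime(2017, 10, 1)
counts = Counter()

# FROM s3data.clickStream s, s.clicks c: each session is un-nested
# into one row per click.
for s in click_stream:
    session_time = datetime.strptime(s["session_time"], "%Y%m%d %H:%M:%S")
    if session_time <= cutoff:          # WHERE s.session_time > cutoff
        continue
    for c in s["clicks"]:
        if c["referrer"] == "/home":    # AND c.referrer = '/home'
            counts[c["page"]] += 1      # GROUP BY c.page, COUNT(*)

print(dict(counts))  # {'/products': 1, '/contact': 1}
```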
  • 37. Coming soon: Nested data support. Improve query performance by analyzing nested data. Before: an Orders table (OrderID, CustomerID, OrderTime, ShipMode) joined to an OrderItems table (OrderID, ItemID, Quantity, Price) with rows (5, 23, 10.00, 12.50), (8, 32, 1.00, 5.60), (5, 16, 1.00, 1.99), (8, 24, 5.00, 26.50). After: an Orders table (OrderID, CustomerID, OrderTime, ShipMode) with a nested OrdersWithItems column (ItemID, Quantity, Price) holding (23, 10.00, 12.50), (16, 1.00, 1.99), (32, 1.00, 5.60), (24, 5.00, 26.50). To improve query performance, the new Orders table includes the OrdersWithItems as a nested column, eliminating join processing.
  • 38. Coming Soon: Enhanced Monitoring • Optimize your Amazon Redshift cluster for peak performance by using query throughput metrics • Get greater insights into your cluster performance by accessing database and workload metrics • Get alerts and notifications via Amazon SNS • Monitor query latency and throughput to optimize your workload
  • 39. Conclusion • Use S3 as the storage repository for your data lake, instead of a Hadoop cluster or data warehouse. • Decoupled storage and compute is cheaper and more efficient to operate. • Use Redshift as your DW and Redshift Spectrum to scale to exabytes. • Gain flexibility to use all the analytics tools in the ecosystem around S3 and future-proof the architecture.
  • 40. Try Amazon Redshift for FREE: Get 2 months free at aws.amazon.com/redshift/free-trial/
  • 41. THANK YOU!