1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06
2
(Most) Data is Dark
http://bit.ly/2k4fDJQ
3
4
Big Data
Enormous Data
5
Warehouse (e.g. Amazon Redshift)
Vast Object Storage Domain (e.g. Data Lake on Amazon S3)
6
Warehouse (e.g. Amazon Redshift)
Vast Object Storage Domain (e.g. Data Lake on Amazon S3)
The Virtual Warehouse
7
The Emerging Analytics Architecture (AWS)
Layers: Storage | Serverless Compute | Data Processing
Amazon S3: Datalake Storage
AWS Glue Data Catalog: Hive-compatible Metastore
Amazon Kinesis: Streaming
Amazon Redshift Spectrum: Warehouse-Datalake Bridge
AWS Lambda: Triggered Code
Amazon Redshift: PB-scale MPP Warehouse
Amazon Athena: SQL as a Service
Amazon EMR: Hadoop as a Service
AWS Glue: ETL
8
The Emerging Analytics Architecture (AWS)
Amazon S3: Datalake Storage
Amazon Redshift Spectrum: Warehouse-Datalake Bridge
Amazon Redshift: PB-scale MPP Warehouse
Amazon EMR: Hadoop as a Service
9
Pick one …
Hadoop (e.g. EMR)
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
SQL-Based Warehousing (e.g. Amazon Redshift)
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows
11
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages Amazon Redshift advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automated parallelization of query execution against S3 data
• Efficient join processing with the Amazon Redshift cluster
[Diagram labels: App / SQL Client, JDBC/ODBC, Amazon Redshift (MPP Warehouse), Spectrum Bridge, HTTP, Amazon S3 Storage (Data Lake Object Storage)]
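A minimal sketch of what partition pruning and pushdown mean in practice, written in plain Python. It is an illustration under stated assumptions, not Spectrum's implementation: the year-keyed `PARTITIONS` dict and the `prune` and `scan` helpers are hypothetical stand-ins for partitioned S3 data and the scan layer.

```python
from collections import Counter

# Hypothetical data lake: partition key (year) -> rows stored under that prefix.
# Rows are (userid, artist_name, track_name); this layout is an assumption.
PARTITIONS = {
    1994: [("Mike", "Spice Girls", "Wannabe")],
    1999: [("Mike", "Vanilla Ice", "Ice Ice Baby")],
    2013: [("Mike", "Justin Timberlake", "Suit and Tie")],
    2015: [("Alice", "Kenny G", "Songbird")],
}


def prune(partitions, min_year, max_year):
    """Partition pruning: pick partitions from metadata alone, before reading any rows."""
    return [year for year in sorted(partitions) if min_year <= year <= max_year]


def scan(partitions, years, userid):
    """Pushdown: filter, project and pre-aggregate while scanning each partition."""
    plays = Counter()
    rows_read = 0
    for year in years:
        for row_user, artist, _track in partitions[year]:
            rows_read += 1
            if row_user == userid:   # filter applied during the scan
                plays[artist] += 1   # only the artist column is kept, then counted
    return plays, rows_read


years = prune(PARTITIONS, 1990, 1999)
plays, rows_read = scan(PARTITIONS, years, "Mike")
print(years, dict(plays), rows_read)
```

Only two of the four partitions are read here, which is the point of pruning: the less data scanned, the less work is left for the later stages.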
12
Amazon Redshift Spectrum
Run SQL queries against S3
The Enormous Virtual Warehouse
13
Query Flow
[Diagram labels: JDBC/ODBC, Amazon Redshift, nodes 1 2 3 4 ... N, Spectrum, HTTP, Amazon S3 (exabyte-scale object storage), Data Catalog (Apache Hive Metastore)]
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query optimized & compiled. Plan sent to all compute nodes
3. Compute nodes dynamically prune partitions based on Catalog info
4. Spectrum nodes scan S3, then project, filter and aggregate
5. Final aggregations and joins on local tables done in-cluster
6. Results sent back to client
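The split between the scan layer and the cluster in the query flow can be sketched as a two-phase aggregation. This is a simplified illustration, not the actual engine: `S3_OBJECTS`, `LOCAL_USERS`, `scan_object` and `finalize` are hypothetical names, and the "workers" run sequentially here.

```python
from collections import Counter

# Hypothetical S3 objects, each a list of (userid, track_name) event rows.
S3_OBJECTS = [
    [("Mike", "Wannabe"), ("Mike", "Ice Ice Baby")],
    [("Alice", "Songbird"), ("Mike", "Suit and Tie")],
]

# Hypothetical local dimension table held in the warehouse: userid -> country.
LOCAL_USERS = {"Mike": "USA", "Jack": "Finland"}


def scan_object(rows):
    """Scan phase (sketch): one worker reads one object and returns partial counts per userid."""
    return Counter(userid for userid, _track in rows)


def finalize(partials, users):
    """In-cluster phase (sketch): merge partial counts, join to the local table, group by country."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    plays_by_country = Counter()
    for userid, plays in merged.items():
        if userid in users:  # inner join on userid
            plays_by_country[users[userid]] += plays
    return dict(plays_by_country)


partials = [scan_object(rows) for rows in S3_OBJECTS]  # would run in parallel in a real system
result = finalize(partials, LOCAL_USERS)
print(result)
```

Only the small per-object counters cross from the scan phase to the final phase, which is why pushing aggregation down reduces the data the cluster has to handle.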
Demo
21
Domain Model
[Diagram: Data Pond and Data Lake, with labels Dimensions, Facts (Online), Facts (Archive), Data (RAW), Data (Other)]
22
LastFM Music Streaming Events
Horizontal Partitioning
User Profile
Datetime | User_ID | Country
2007 | Mike | USA
2008 | Jack | Finland
Streaming Events (RECENT)
Datetime | User_ID | Track | Artist
2015 5:00pm | Alice | Songbird | Kenny G
2013 11:14pm | Mike | Suit and Tie | Justin Timberlake
Streaming Events (ARCHIVE)
Datetime | User_ID | Track | Artist
1999 5:15pm | Mike | Ice Ice Baby | Vanilla Ice
1994 4:48pm | Mike | Wannabe | Spice Girls
Colder
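A small sketch of horizontal partitioning by age, using the sample streaming events above. The cutoff year of 2010 and the `split_by_age` helper are assumptions made for illustration; the slides do not say where the boundary between recent and archived data lies.

```python
# Hypothetical event rows: (year, userid, track, artist), taken from the sample tables above.
EVENTS = [
    (2015, "Alice", "Songbird", "Kenny G"),
    (2013, "Mike", "Suit and Tie", "Justin Timberlake"),
    (1999, "Mike", "Ice Ice Baby", "Vanilla Ice"),
    (1994, "Mike", "Wannabe", "Spice Girls"),
]


def split_by_age(events, cutoff_year):
    """Route each row to the recent (warehouse) or archive (data lake) partition."""
    recent = [e for e in events if e[0] >= cutoff_year]
    archive = [e for e in events if e[0] < cutoff_year]
    return recent, archive


# The cutoff is an assumed value, chosen only so the sample rows split two and two.
recent, archive = split_by_age(EVENTS, cutoff_year=2010)
print(len(recent), len(archive))
```

Both partitions keep the same columns; only the rows differ, which is what lets a single query later span the two stores.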
23
Colder
24
Dimensions
FACTS (Online)
Facts
(ARCHIVE)
Amazon S3
Redshift
Spectrum
Glue Catalog
Athena
25
Register EXTERNAL S3 Table
create external table lastfm_music_streaming_events
(
    userid      string,
    datetime    timestamp,
    artist_id   string,
    artist_name string,
    track_id    string,
    track_name  string
)
stored as parquet
location 's3://my-archived-facts/lastfm/parquet/events/';
26
Query Redshift ONLINE Data
SELECT
    u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
GROUP BY
    u.country
27
28
Query Redshift ONLINE + ARCHIVED S3 Data
-- Local Redshift Tables
SELECT …
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
…
UNION
-- External S3 Data
SELECT …
FROM
    lastfm_users u,
    datalake.lastfm_music_streaming_events dl
WHERE
    u.userid = dl.userid
…
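To try the shape of this query without a cluster, here is a runnable stand-in using Python's built-in sqlite3 module. It is a sketch under assumptions, not Redshift: an attached in-memory database named `datalake` plays the role of the external schema, the tables are trimmed to two columns, and the elided select lists are filled in to mirror the online query on slide 26, with a `'S3'` source label that is a guess.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stand-in for the external schema: a second in-memory database attached as "datalake".
con.execute("ATTACH DATABASE ':memory:' AS datalake")

events_ddl = "CREATE TABLE {}lastfm_music_streaming_events (userid TEXT, track_name TEXT)"
con.execute("CREATE TABLE lastfm_users (userid TEXT, country TEXT)")
con.execute(events_ddl.format(""))           # local (online) facts
con.execute(events_ddl.format("datalake."))  # archived facts

con.executemany("INSERT INTO lastfm_users VALUES (?, ?)",
                [("Mike", "USA"), ("Jack", "Finland")])
con.executemany("INSERT INTO main.lastfm_music_streaming_events VALUES (?, ?)",
                [("Mike", "Suit and Tie")])
con.executemany("INSERT INTO datalake.lastfm_music_streaming_events VALUES (?, ?)",
                [("Mike", "Ice Ice Baby"), ("Mike", "Wannabe")])

# The select lists below are assumed; the slide elides them.
QUERY = """
SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM lastfm_users u, lastfm_music_streaming_events s
WHERE u.userid = s.userid
GROUP BY u.country
UNION
SELECT u.country, COUNT(*) AS plays, 'S3' AS source
FROM lastfm_users u, datalake.lastfm_music_streaming_events dl
WHERE u.userid = dl.userid
GROUP BY u.country
ORDER BY source
"""
rows = con.execute(QUERY).fetchall()
print(rows)
```

The same dimension table joins against both fact tables, and the `source` column keeps the online and archived counts distinguishable in the combined result.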
29
TL;DR
31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared among
    • Hadoop workloads (EMR)
    • Serverless SQL as a Service (Athena)
    • SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, and throughput via Amazon Redshift Spectrum
mike@agilisium.com
2629 Townsgate Road Suite 235
Westlake Village, CA 91361
Thank You
contact@agilisium.com
careers@agilisium.com
