Big Data, the cloud-native way:
Serverless Data Lake with IBM Cloud
Torsten Steinbach
Cloud Data Lake Lead Architect | IBM
Cloud Data Lake Evolutionary Context
The 1990s: Enterprise Data Warehouses
Tightly integrated and optimized systems
The 2000s: Hadoop
Introduced open data formats & easy scaling on commodity hardware
Today: Cloud-Native Serverless Analytics-aaS
• Elasticity
• Pay-per-query
• Data in object store
• Disaggregated architecture
• Increasingly real-time first
IBM Cloud Data Lake – Big Picture
Diagram labels: Telemetry Data, Streaming, ETL or CDC Replication, Explore, Prep, Enrich, Optimize, Batch Query, ETL, Analytics, Interactive Query, Transactional Consistency, DWH
Cloud Data Lake:
✓ Seamless Elasticity
✓ Seamless Scalability
✓ Highly Cost Effective
✓ Long Term Retention
✓ Any data formats
Databases:
✓ Response Time SLAs
✓ Warm High-quality Data only
Cloud Data Lakehouse
IBM Serverless Stack for Analytics
Serverless Storage (Object Storage): Only pay for the volume of data that you really store
Serverless Runtimes (Cloud Functions): Only pay for the CPU that you really consume
Serverless Analytics (Query): Only pay for the amount of data that you really scan
Blog Article
Properties of Serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
IBM SQL Query – The Central Cloud Data Lake Service
Serverless SQL Query Service
Cloud Data: Object Storage, RDBMS
Capabilities: Data Ingestion, Data Transformation, Data Management, Analytics
Users: Developers, Data Engineers, Data Analysts, Data Scientists
✓ Supports ad-hoc and unknown data structures
✓ Ingestion & ELT support
✓ 100% pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic big data scale-out with Spark
✓ 100% self-service, no setup
✓ Built-in database catalog & data skipping
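The pay-as-you-go figure lends itself to a quick back-of-the-envelope estimate. The sketch below assumes that billing is strictly proportional to the bytes a query scans and that 1 TB means 10^12 bytes; neither detail is stated on the slide, and the function name is purely illustrative.

```python
# Rough cost estimate for a pay-per-scan SQL service.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, linear billing,
# no minimum charge and no rounding.
PRICE_PER_TB_USD = 5.0   # the per-TB price quoted on the slide
BYTES_PER_TB = 10**12


def scan_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated charge for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    full_scan = scan_cost_usd(10 * BYTES_PER_TB)   # scan a whole 10 TB data set
    pruned_scan = scan_cost_usd(250 * 10**9)       # scan only 250 GB of it
    print(f"full scan:   ${full_scan:.2f}")
    print(f"pruned scan: ${pruned_scan:.2f}")
```

Under these assumptions, anything that reduces the bytes scanned (partitioning, columnar formats, data skipping) reduces the bill in the same proportion.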
IBM SQL Query Architecture
Flow between Application, SQL Query and Cloud Data Services:
1. Submit SQL
2. Read data
3. Write data
4. Read results
SQL Query features: Geospatial SQL, Timeseries SQL, Data Skipping, Hive Metastore
Cloud Data Services: Cloud Object Storage, Event Streams, Db2 on Cloud
• Uses the IBM Analytics Engine service (Spark aaS)
• Large farm of Spark clusters, auto-provisioned & auto-managed in the background
• Manages a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
IBM SQL Query – Access Patterns
Patterns: Create, Query, Explore, Integrate, Deploy
SQL Console
Watson Studio Notebooks
Cloud Functions
Python SDK
REST API
JDBC
Object Store Console
Event Streams Console
IBM Cloud Data Lake – Separating Out Responsibilities
Processing:
Serverless SQL (IBM SQL Query)
Serverless Spark (IBM Analytics Engine)
Serverless Containers (IBM Cloud Code Engine)
Meta Data:
Schema, Partitioning, Statistics: Hive Metastore, Kafka Schema Registry
Data Skipping Indexes: Xskipper
ACID: Iceberg, Delta Lake
Governance Policies & Lineage: Watson Knowledge Catalog
State (Cloud Data):
IBM Cloud Object Storage
IBM Event Streams
RDBMS: IBM Cloud Databases
Data Lakehouse Architecture in IBM Cloud
Engines and services: SQL Query, BigSQL, Dremio, …, Event Streams, IBM Cloud Databases, COS
Meta Data: Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg)
Real-Time Queries
Batch Queries
Stream Transformation & Joins
Stream data landing
Schema management & enforcement
ETL & Data Preparation
CDC
Interactive & DWH Queries
Streaming Data Lakes – Event Streams–COS Integration with SQL Query
New: Stream Landing
Event Streams: Real-time event feeds in Kafka topics
SQL Query: Serverless stream landing ingests Kafka topics into tables in COS
COS: Cost-effective permanent storage and analytics for real-time data
Real-Time Serverless Data Lakes
• Turn topics into tables with a few clicks
• Fully managed ingestion of message feeds into Parquet at $0.10/hour for 1 MB/s capacity
• Infinite storage of all your message data in COS
• Run DWH-style SQL on your message data in a serverless manner
Common Ingest Fabric to Data Lakes
• Publish to Kafka to create your specialised domain COS lakehouse:
  • Log records
  • Click stream data
  • IoT data
• Combine with Change Data Capture for real-time replication of all your systems into the data lake for analytics
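The stream landing price can be turned into a monthly estimate with simple arithmetic. The sketch below is only illustrative: it assumes the price scales linearly with the provisioned MB/s capacity, that the stream runs continuously, and that 1 GB equals 1000 MB. None of these details is stated on the slide.

```python
# Back-of-the-envelope stream landing cost.
# Assumptions (not from the slides): linear pricing per MB/s of capacity,
# continuous operation, 1 GB = 1000 MB.
PRICE_PER_MBPS_HOUR_USD = 0.10   # price per hour for 1 MB/s capacity


def landing_cost_usd(capacity_mb_per_s: float, hours: float) -> float:
    """Estimated charge for running a landing job with the given capacity."""
    return PRICE_PER_MBPS_HOUR_USD * capacity_mb_per_s * hours


def max_landed_gb(capacity_mb_per_s: float, hours: float) -> float:
    """Upper bound on the volume landed if the capacity is fully used."""
    return capacity_mb_per_s * 3600 * hours / 1000


if __name__ == "__main__":
    hours_per_month = 24 * 30
    print(f"cost:   ${landing_cost_usd(1, hours_per_month):.2f} per 30 days")
    print(f"volume: up to {max_landed_gb(1, hours_per_month):.0f} GB per 30 days")
```

With these assumptions, a single 1 MB/s landing job running for 30 days costs about $72 and can land at most roughly 2.6 TB of messages.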
Real-Time Data Lake Solutions
IBM Cloud Data Lake
Data sources: Audit Trails, Cloud Platform Logs, Application Logs, Network Logs, User Behavior, IoT Feeds
Solutions: IoT Lakes, Log Lakes, AIOps Lakes, Compliance Lakes
IBM Solution for Data & AI
Integrated IBM Solution for Cloud Data Lakes
Cloud Pak for Data as a Service: built on IBM Cloud, uses IBM Cloud Data Lake
IBM Cloud Data Lake components:
COS (Storage)
SQL Query (Analytics)
Event Streams (Streaming)
Spark (Transformation)
Cloud Databases (Databases)
Integrated IBM Solution for Cloud Data Lakes
IBM Cloud Data Lake
Manage
Explore & Prepare
Govern
Data Catalogs, Projects & Connections
Automate
DataStage & Kubeflow Pipelines
Consume
Watson Studio, BigSQL
Cloud Pak for Data aaS
Ingest
CDC
Ad-hoc
Application Logs
IoT Streams
User Behavior
ETL
JDBC
Python
Dremio
Presto
ML
Tableau
Data Virtualization
Kafka
Power BI
Cognos
Infuse
Analyze
Organize
Collect
Ladder to AI
Outlook
IBM’s Serverless 2.0 Initiative
State:
Data: COS, Event Streams (Kafka)
Meta Data: Common Hive Metastore
Temp Data: NVMe, RAM
Stateless Compute:
Containers: IBM Cloud Code Engine
Runtimes: Apache Spark, Others
Shuffle
100% elastic with hyperscale & scale down to zero
AI & ML, DataOps & BI
Petabytes
Backup
I/O Optimization for Analytics
Analytic-Friendly Data Formats
Blog Article: Data Layout
Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning with object-level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
  • Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
  • New index types can easily be supported
• Enables data skipping on SQL UDFs
  • e.g. ST_Contains, ST_Distance etc.
  • UDFs are mapped to indexes
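To make "aggregate metadata per object" concrete, here is a minimal sketch of how MinMax and ValueList metadata could be collected and consulted. It is not the implementation of the data skipping library; the class, the function names, the column names and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class ObjectMetadata:
    """Aggregate metadata for one object (all names are hypothetical)."""
    object_name: str
    temp_min: float        # MinMax index on the numeric column "temp"
    temp_max: float
    cities: frozenset      # ValueList index on the column "city"


def build_metadata(object_name: str, rows: Iterable[Mapping]) -> ObjectMetadata:
    """Scan an object's rows once and keep only the aggregates."""
    rows = list(rows)  # assumes at least one row per object
    temps = [row["temp"] for row in rows]
    cities = frozenset(row["city"] for row in rows)
    return ObjectMetadata(object_name, min(temps), max(temps), cities)


def may_match_temp_greater_than(meta: ObjectMetadata, threshold: float) -> bool:
    """MinMax check: False means the object certainly has no matching row."""
    return meta.temp_max > threshold


def may_match_city(meta: ObjectMetadata, city: str) -> bool:
    """ValueList check: False means the object certainly has no matching row."""
    return city in meta.cities


# Hypothetical sample data: two small objects.
OBJECTS = {
    "part-a.parquet": [{"temp": 12.5, "city": "Raleigh"}, {"temp": 30.1, "city": "Durham"}],
    "part-b.parquet": [{"temp": 41.0, "city": "Phoenix"}, {"temp": 38.2, "city": "Tucson"}],
}
INDEX: List[ObjectMetadata] = [build_metadata(name, rows) for name, rows in OBJECTS.items()]

if __name__ == "__main__":
    print([m.object_name for m in INDEX if may_match_temp_greater_than(m, 40)])
    print([m.object_name for m in INDEX if may_match_city(m, "Durham")])
```

Both checks are conservative: they may keep an object that turns out to contain no matching row, but they never drop one that does.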
How Data Skipping Works
Spark SQL Query Execution Flow (uses the Catalyst optimizer and the session extensions API)
Without data skipping: Query -> Prune partitions -> Read data
With data skipping: Query -> Prune partitions -> Optional file filter (metadata filter) -> Read data
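The two-stage flow can be sketched in plain Python, outside of Spark. This is an illustration of the idea only, not the Catalyst integration: the object listing, the `MAX_TEMP` index and all function names are hypothetical, and objects without index metadata are kept because they cannot be skipped safely.

```python
from typing import Callable, List, Optional


def partition_value(object_name: str, key: str) -> Optional[str]:
    """Extract the value of a Hive-style 'key=value' path segment."""
    for segment in object_name.split("/"):
        if segment.startswith(key + "="):
            return segment[len(key) + 1:]
    return None


def prune_partitions(objects: List[str], key: str, wanted: str) -> List[str]:
    """Stage 1: keep only objects in the requested partition."""
    return [o for o in objects if partition_value(o, key) == wanted]


def plan_scan(objects: List[str], key: str, wanted: str,
              file_filter: Optional[Callable[[str], bool]] = None) -> List[str]:
    """Stage 1 plus the optional stage 2 (metadata-based file filter)."""
    candidates = prune_partitions(objects, key, wanted)
    if file_filter is not None:
        candidates = [o for o in candidates if file_filter(o)]
    return candidates


# Hypothetical listing and MinMax metadata (maximum temperature per object).
LISTING = [
    "Weather/dt=2020-08-17/part-1.parquet",
    "Weather/dt=2020-08-17/part-2.parquet",
    "Weather/dt=2020-08-17/part-3.parquet",   # not indexed
    "Weather/dt=2020-08-18/part-1.parquet",
]
MAX_TEMP = {
    "Weather/dt=2020-08-17/part-1.parquet": 26.8,
    "Weather/dt=2020-08-17/part-2.parquet": 41.0,
    "Weather/dt=2020-08-18/part-1.parquet": 45.0,
}


def temp_above_40(object_name: str) -> bool:
    """Keep the object unless its metadata proves that no row has temp > 40."""
    return MAX_TEMP.get(object_name, float("inf")) > 40


if __name__ == "__main__":
    print(plan_scan(LISTING, "dt", "2020-08-17"))                 # partitions only
    print(plan_scan(LISTING, "dt", "2020-08-17", temp_above_40))  # plus file filter
```

Making the file filter an optional argument mirrors the "optional" step in the flow: without it the plan falls back to partition pruning alone.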
Data Skipping Example
Data (object listing):
Weather/dt=2020-08-17/part-00085.parquet
Weather/dt=2020-08-17/part-00086.parquet
Weather/dt=2020-08-17/part-00087.parquet
Weather/dt=2020-08-17/part-00088.parquet
Weather/dt=2020-08-18/part-00001.parquet
Weather/dt=2020-08-18/part-00002.parquet
Example Query:
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Metadata:
Object Name                 Temp Min   Temp Max
...
dt=2020-08-17/part-00085    7.97       26.77
dt=2020-08-17/part-00086    2.45       23.71
dt=2020-08-17/part-00087    6.46       18.62
dt=2020-08-17/part-00088    23.67      41.02
...
Objects marked red on the slide are not relevant to this query. Of the rows listed, that applies to part-00085, part-00086 and part-00087, whose Temp Max is below 40; only part-00088 can contain matching rows.
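The skipping decision for this example can be reproduced in a few lines. The sketch below hard-codes the four metadata rows listed above and applies the MinMax rule for `temp > 40`; the function name is illustrative and this is not the service's code.

```python
# (object name, temp min, temp max) for the four objects listed in the metadata
METADATA = [
    ("dt=2020-08-17/part-00085", 7.97, 26.77),
    ("dt=2020-08-17/part-00086", 2.45, 23.71),
    ("dt=2020-08-17/part-00087", 6.46, 18.62),
    ("dt=2020-08-17/part-00088", 23.67, 41.02),
]


def objects_to_read(metadata, threshold):
    """Objects whose maximum temperature allows a row with temp > threshold."""
    return [name for name, _temp_min, temp_max in metadata if temp_max > threshold]


if __name__ == "__main__":
    keep = objects_to_read(METADATA, 40)
    skip = [name for name, _, _ in METADATA if name not in keep]
    print("read:", keep)
    print("skip:", skip)
```

Only part-00088 survives the filter, so three of the four listed objects never have to be fetched from object storage.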
Geospatial Data Skipping Example
Example Query:
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
The polygon covers the Raleigh Research Triangle (US).
Metadata:
Object Name                 lat Min    lat Max
...
dt=2020-08-17/part-00085    35.02      36.17
dt=2020-08-17/part-00086    43.59      44.95
dt=2020-08-17/part-00087    34.86      40.62
dt=2020-08-17/part-00088    23.67      25.92
...
Objects marked red on the slide are not relevant to this query. Of the rows listed, that applies to part-00086 and part-00088, whose latitude ranges do not overlap the polygon's latitude range (35.78 to 36.00).
The ST_Contains UDF is mapped to necessary conditions on lat, long.
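One way to picture the mapping from `ST_Contains` to conditions on lat and long is a bounding-box test. The slide does not say how the service derives its conditions, so the sketch below is an assumption: it takes the polygon's bounding box and keeps an object only if its latitude range overlaps it. Only latitude is checked because the metadata shown lists lat only; all function names are illustrative.

```python
# Polygon vertices from the example query, as (long, lat) pairs.
POLYGON = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# (object name, lat min, lat max) for the four objects listed in the metadata
LAT_METADATA = [
    ("dt=2020-08-17/part-00085", 35.02, 36.17),
    ("dt=2020-08-17/part-00086", 43.59, 44.95),
    ("dt=2020-08-17/part-00087", 34.86, 40.62),
    ("dt=2020-08-17/part-00088", 23.67, 25.92),
]


def bounding_box(polygon):
    """Return (long_min, lat_min, long_max, lat_max) of the polygon."""
    longs = [point[0] for point in polygon]
    lats = [point[1] for point in polygon]
    return min(longs), min(lats), max(longs), max(lats)


def ranges_overlap(a_min, a_max, b_min, b_max):
    """True if the closed intervals [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max


def candidate_objects(metadata, polygon):
    """Necessary condition only: latitude ranges must overlap the bounding box."""
    _, lat_lo, _, lat_hi = bounding_box(polygon)
    return [name for name, lat_min, lat_max in metadata
            if ranges_overlap(lat_min, lat_max, lat_lo, lat_hi)]


if __name__ == "__main__":
    print(bounding_box(POLYGON))
    print(candidate_objects(LAT_METADATA, POLYGON))
```

The overlap test is necessary but not sufficient: part-00085 and part-00087 still have to be read and filtered row by row with the exact `ST_Contains` predicate, while the other two can be skipped outright.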
10x Acceleration with Data Skipping and Catalog
Query rewrite approach (yellow) is the baseline
• Using already optimized data formats: Parquet/ORC
For other formats the acceleration is much larger
• e.g. CSV/JSON/Avro
Experiment uses the Raleigh Research Triangle query
10x speedup on average
10 TB of weather data on COS
A Real-Life Cloud Data Lake
The COVID-19 Data Lake
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
  • https://accelerator.weather.com/bi
  • Watson Health Return to Work Advisor
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototype in hours, productized in 2-3 days
• Radically minimized resource and operational costs through IBM Cloud serverless services and full ops automation
Cloud Object Storage
- Persist
- Trigger
Watson Studio
- Static Content Creation
- Schema Management
- Pipeline PoCs
- Usage Tutorials
SQL Query
- Transformation
- Transport
- Table Catalog (Mart)
- Queries
- Export
Cloud Functions
- Pipeline Productization
- Automation
- Monitoring & Alerting
- Pull External Data
COVID-19 Data Lake Topology – High Level
Diagram labels, grouped:
Zones: Landing Zone (E) with Landing Buckets and Landing Namespace; Preparation Zone (T) with Preparation Buckets and Preparation Namespace; Integration Zone (L) with Integration Buckets, Integration Namespace and Data Mart Instance
Pipeline instances: TWC Scrapers & Pipeline, Collectors Sequences, Preparation Sequences, Mart Sequences, Delivery Sequences, Schema Management, Static Content Management
Projects: Pipeline PoC Project (Preliminary Pipeline Notebooks), Mart Management Project, Data Mart Access Project, Usage Notebooks, Table Catalog
External Data Sources: Pull, Push
Consumers: Users, Dashboarding (COGNOS), DWH
Flow labels: Upload, Update, Add Partitions, Query & Extract, Transform, Location Statistics, Reference Data
IBM Cloud Native Day April 2021: Serverless Data Lake

IBM Cloud Native Day April 2021: Serverless Data Lake

  • 1. Big Data, the cloud-native way: Serverless Data Lake with IBM Cloud Torsten Steinbach Cloud Data Lake Lead Architect | IBM
  • 2. Cloud Data Lake Evolutionary Context Enterprise Data Warehouses Tightly integrated and optimized systems Hadoop Introduced open data formats & easy scaling on commodity HW Cloud-Native: Serverless Analytics-aaS • Elasticity • Pay-per-query • Data in object store • Disaggregated architecture • Increasingly real-time first The 90-ies 2000 Today
  • 3. Telemetry Data Explore ETL or CDC Replication Prep Enrich Streaming Optimize Batch Query ü Seamless Elasticity ü Seamless Scalability ü Highly Cost Effective ü Long Term Retention ü Any data formats ETL IBM Cloud Data Lake – Big Picture Databases ü Response Time SLAs ü Warm High-quality Data only Cloud Data Lake Analytics Interactive Query Transactional Consistency DWH Cloud Data Lakehouse
  • 4. IBM Serverless Stack for Analytics Serverless Storage Serverless Runtimes Serverless Analytics Object Storage Cloud Functions Query Only pay for volume of data that you really store Only pay for amount of data that you really scan Only pay for CPU that you really consume Blog Article § Properties of Serverless: – No management of resources, hosts and processes – Auto-scaling and auto-provisioning based on actual load – Precise billing based on really consumed system resources (memory, storage, CPU, network, I/O) – High-Availability is always implicit
  • 5. IBM SQL Query – The Central Cloud Data Lake Service Cloud Data Data Transformation Serverless SQL Query Service Analytics Object Storage RDBMS + Developers Data Engineers Data Analysts ü Supports ad-hoc and unknown data structures ü Ingestion & ELT Support ü 100% Pay-as-you-go (5$/TB) ü 100% API enabled ü Automatic Big Data Scale- Out with Spark ü 100% Self service, No Setup Data Management + Data Scientists ü Built-In Database Catalog & Data Skipping Data Ingestion +
  • 6. IBM SQL Query Architecture 2. Read data 4. Read results Application 3. Write data Cloud Data Services 1. Submit SQL SQL Event Streams Query Db2 on Cloud Geospatial SQL Data Skipping Timeseries SQL Hive Metastore Video Cloud Object Storage • Using IBM Analytic Engine service (Spark aaS) • Large farm of Spark clusters auto- provisioned & auto-managed in background • Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway) • SQL grammar sandbox • Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation • Intrinsically HA (dispatching across Spark environments in each availability zone)
  • 7. IBM SQL Query – Access Patterns Create Query SQL Console Watson Studio Notebooks Cloud Functions Integrate Explore Deploy Python SDK REST API JDBC Object Store Console Event Streams Console
  • 8. Meta Data IBM Cloud Data Lake – Separating Out Responsibilities Cloud Data ACID Serverless Spark (IBM Analytic Engine) Data Skipping Indexes Governance Policies & Lineage Schema, Partitioning, Statistics Serverless SQL (IBM SQL Query) IBM Cloud Object Storage RDBMS Hive Metastore Kafka Schema Registry Xskipper Iceberg Watson Knowledge Catalog Deltalake Serverless Containers (IBM Cloud Code Engine) IBM Event Streams IBM Cloud Databases Processing State
  • 9. Data Lakehouse Architecture in IBM Cloud … BigSQL Dremio IBM Cloud Databases Event Streams SQL Query Meta Data Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg) Real-Time Queries COS Batch Queries Stream Xform & Joins Stream data landing Schema management & enforcement ETL & Data Preparation CDC Interactive & DWH Queries
  • 10. Streaming Data Lakes – EventStreams–COS Integration with SQL Query New Stream Landing Event Streams: Real time event feeds in Kafka topics SQL Query: Serverless stream landing ingests Kafka topics into tables in COS COS: Cost-effective permanent storage and analytics for real- time data. Real Time Serverless Data Lakes Turn Topics into Tables with a few clicks Fully managed ingestion of message feeds into parquet at $0,10/hour for 1MB/s capacity Infinite storage of all your message data in COS Run DWH-style SQL on your message data in serverless manner Publish to Kafka to create your specialised domain COS lake house • Log records • Click Stream data • IOT data Combine with Change Data Capture for real-time replication of all your systems into data lake for analytics Common Ingest Fabric to Data Lakes
  • 11. IBM Cloud Data Lake Real-Time Data Lake Solutions Audit Trails Cloud Platform Logs Application Logs Network Logs User Behavior IoT Feeds IoT Lakes Log Lakes AIOps Lakes Compliance Lakes
  • 12. IBM Solution for Data & AI
  • 13. Cloud Pak for Data as a Service Built On IBM Cloud Uses IBM Cloud Data Lake COS Storage Analytics SQL Query Event Streams Streaming Transformation Spark Cloud Databases Databases Integrated IBM Solution for Cloud Data Lakes
  • 14. Integrated IBM Solution for Cloud Data Lakes IBM Cloud Data Lake Manage Explore & Prepare Govern Data Catalogs, Projects & Connections Automate Data Stage & Kubeflow Pipelines Consume Watson Studio, BigSQL Cloud Pak for Data aaS Ingest CDC Ad-hoc Application Logs IoT Streams User Behavior ETL JDBC Python Dremio Presto ML Tableau Data Virtualization Kafka Power BI Cognos Infuse Analyze Organize Collect Ladder to AI
  • 16. IBM’s Serverless 2.0 Initiative Data COS EventStreams (Kafka) State Meta Data Common Hive Metastore Temp Data NVMe RAM Containers IBM Cloud Code Engine Runtimes Others Apache Spark Stateless Compute Shuffle 100% Elastic with Hyperscale & Scale down to Zero AI & ML DataOps & BI Petabytes
  • 18. Analytic-Friendly Data Formats Blog Article: Data Layout
  • 19. Data Skipping in IBM SQL Query. Avoid reading irrelevant objects using indexes: complements partition pruning with object-level pruning; stores aggregate metadata per object to enable skipping decisions; indexes are stored in COS. Supports multiple index types: currently MinMax, ValueList, BloomFilter, Geospatial. The underlying data skipping library is extensible: new index types can easily be supported. Enables data skipping on SQL UDFs (e.g. ST_Contains, ST_Distance): UDFs are mapped to indexes.
  • 20. How Data Skipping Works. Uses the Catalyst optimizer and session extensions API. Spark SQL query execution flow: Query -> Prune partitions -> Read data. With data skipping: Query -> Prune partitions -> Optional file filter (Metadata Filter) -> Read data.
  • 21. Data Skipping Example. Data object listing: Weather/dt=2020-08-17/part-00085.parquet, Weather/dt=2020-08-17/part-00086.parquet, Weather/dt=2020-08-17/part-00087.parquet, Weather/dt=2020-08-17/part-00088.parquet, Weather/dt=2020-08-18/part-00001.parquet, Weather/dt=2020-08-18/part-00002.parquet. Example query: SELECT * FROM cos://us-geo/twc/Weather STORED AS parquet WHERE temp > 40. Metadata (object name: temp min, temp max): ...; dt=2020-08-17/part-00085: 7.97, 26.77; dt=2020-08-17/part-00086: 2.45, 23.71; dt=2020-08-17/part-00087: 6.46, 18.62; dt=2020-08-17/part-00088: 23.67, 41.02; ... Red objects are not relevant to this query: of the objects listed, only part-00088 (temp max 41.02) can contain rows with temp > 40.
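The skipping decision in the slide 21 example can be sketched in a few lines. This is a minimal illustration of MinMax-based pruning using the metadata values shown on the slide, not the implementation used by IBM SQL Query; the `MinMax` type and the `objects_to_read` function are hypothetical names.

```python
from typing import List, NamedTuple


class MinMax(NamedTuple):
    """Per-object aggregate metadata for the temp column (hypothetical type)."""
    name: str
    temp_min: float
    temp_max: float


# Values copied from the metadata table on the slide.
METADATA = [
    MinMax("dt=2020-08-17/part-00085", 7.97, 26.77),
    MinMax("dt=2020-08-17/part-00086", 2.45, 23.71),
    MinMax("dt=2020-08-17/part-00087", 6.46, 18.62),
    MinMax("dt=2020-08-17/part-00088", 23.67, 41.02),
]


def objects_to_read(metadata: List[MinMax], threshold: float) -> List[str]:
    """Return objects that may hold a row with temp > threshold.

    An object can be skipped safely when its maximum is not above the
    threshold, because then no row in it can satisfy the predicate.
    """
    return [m.name for m in metadata if m.temp_max > threshold]


# Predicate from the example query: WHERE temp > 40
KEPT = objects_to_read(METADATA, 40.0)
print(KEPT)
```

The filter is conservative: a kept object is not guaranteed to contain matching rows, but a skipped object is guaranteed not to.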
  • 22. Geospatial Data Skipping Example. Example query: SELECT * FROM Weather STORED AS parquet WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat)) INTO cos://us-south/results STORED AS parquet. The polygon covers the Raleigh Research Triangle (US). Metadata (object name: lat min, lat max): ...; dt=2020-08-17/part-00085: 35.02, 36.17; dt=2020-08-17/part-00086: 43.59, 44.95; dt=2020-08-17/part-00087: 34.86, 40.62; dt=2020-08-17/part-00088: 23.67, 25.92; ... Red objects are not relevant to this query: of the objects listed, part-00086 and part-00088 have latitude ranges that do not overlap the polygon (lat 35.78 to 36.00). The ST_Contains UDF is mapped to necessary conditions on lat, long.
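The mapping of ST_Contains to necessary conditions on lat, long in slide 22 can be illustrated as follows: a point inside the polygon must lie within the polygon's bounding range, so any object whose latitude range does not overlap that range can be skipped. This is a simplified sketch under that assumption, restricted to latitude because the slide only shows lat metadata; the helper names are hypothetical and do not reflect the actual library API.

```python
from typing import Dict, List, Tuple

# Polygon vertices from the example query as (long, lat) pairs.
POLYGON: List[Tuple[float, float]] = [
    (-78.93, 36.00),
    (-78.67, 35.78),
    (-79.04, 35.90),
    (-78.93, 36.00),
]

# Per-object latitude ranges from the slide: name -> (lat_min, lat_max).
LAT_RANGES: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def lat_bounds(polygon: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Smallest and largest latitude of the polygon's vertices."""
    lats = [lat for _, lat in polygon]
    return min(lats), max(lats)


def may_intersect(obj: Tuple[float, float], query: Tuple[float, float]) -> bool:
    """True if the two closed intervals overlap (necessary, not sufficient)."""
    return obj[0] <= query[1] and obj[1] >= query[0]


QUERY_LAT = lat_bounds(POLYGON)
KEPT = [name for name, rng in LAT_RANGES.items() if may_intersect(rng, QUERY_LAT)]
print(QUERY_LAT, KEPT)
```

A full check would apply the same interval test to the longitude metadata as well; rows in the surviving objects are still evaluated against the exact ST_Contains predicate.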
  • 23. 10x Acceleration with Data Skipping and Catalog. The query rewrite approach (yellow) is the baseline, using an already optimized data format (Parquet/ORC); for other formats, e.g. CSV/JSON/Avro, the acceleration is much larger. The experiment uses the Raleigh Research Triangle query on 10 TB of weather data on COS. 10x speedup on average.
  • 24. A Real-Life Cloud Data Lake
  • 25. The COVID-19 Data Lake. Making trusted COVID-19 data available to a broad set of analytics, e.g. https://accelerator.weather.com/bi and Watson Health Return to Work Advisor. Properties: extensible with new data sources easily; maximized velocity and elasticity; full automation of all pipelines; new pipeline prototype in hours, productized in 2-3 days; radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation. Diagram labels: Cloud Functions; Cloud Object Storage; Persist; Trigger; Static Content Creation; Schema Management; Pipeline PoCs; Usage Tutorials; Watson Studio; SQL Query; Transformation; Transport; Table Catalog (Mart); Queries; Export; Pipeline Productization; Automation; Monitoring & Alerting; Pull External Data.
  • 26. COVID-19 Data Lake Topology – High Level. [Diagram] Zones: Landing Zone (E) with Landing Buckets and Landing Namespace; Preparation Zone (T) with Preparation Buckets and Preparation Namespace; Integration Zone (L) with Integration Buckets, Integration Namespace, Data Mart Instance and Dashboarding DWH. Other labels: External Data Sources (Pull, Push); TWC Scrapers & Pipeline; Collectors Sequences; Preparation Sequences; Mart Sequences; Delivery Sequences; Pipeline Instance; Schema Management; Static Content Management; Table Catalog; Mart Management Project; Data Mart Access Project; Pipeline PoC Project; Preliminary Pipeline Notebooks; Usage Notebooks; Location Statistics; Reference Data; Users; COGNOS. Flow labels: Upload, Update, Add Partitions, Query & Extract, Transform.