BASEL | BERN | BRUGG | BUCHAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENEVA
HAMBURG | COPENHAGEN | LAUSANNE | MANNHEIM | MUNICH | STUTTGART | VIENNA | ZURICH
http://guidoschmutz.wordpress.com
@gschmutz
Fundamentals of Big Data and AI Architecture
DOAG Data Centric Day, 25 September 2019, Cologne
Guido Schmutz
Guido Schmutz
Working at Trivadis for more than 22 years
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Oracle Groundbreaker Ambassador & Oracle ACE Director
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
169th edition
From Data Warehouse …
Data Warehouse Architecture
Enterprise Data
Warehouse
Extract, Transform
& Load (ETL)
Bulk Source
DB
Extract
File
DB
Consumer
RDBMS BI Tools
ETL Engine
high latency
Data Warehouse is an architecture
Layered model, controlled ETL, single point
of truth, query optimized data marts
Tested, optimized, quality assured, “operated”
Standard-reporting, adHoc-reporting on
DWH Base
Perfect and fast for new requirements to
known and prepared data and structures
Data Warehouse is not “agile”
No free definition and shaping of arbitrary
analytical questions
= Data Production
Source: https://www.flickr.com/photos/128950981@N04/15452926858
DWH Architecture – what about Streaming Data?
Enterprise Data
Warehouse
Extract, Transform
& Load (ETL)
Bulk Source
DB
Extract
File
DB
Consumer
RDBMS BI Tools
ETL Engine
Event Source
Location
Weather
IoT
Data
Mobile
Apps
Social
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
… to Big Data / Data Lake
Initial Idea of a Data Lake …
Adapted from Wikipedia.org
“Reporting,
visualization,
analytics and
machine
learning”
“Single store of
all data in the
enterprise” “Should put an
end to data
silos.”
“Example:
Distributed file
system used in
Apache
Hadoop”
Data
Lake
Data Lake is an Infrastructure
Permanently new Data and Structures
Schema on Read
Really large amounts of Data
Explorative Working (Research)
Established Error-Culture
New user groups ([Data] Scientists)
Freedom of data-choice
Freedom of source-choice
Self-Service Data Labs
adHoc- & One-Shot implementations
Query + Advanced Analytics
= Research & Development
Source: https://www.flickr.com/photos/ian-arlett/34233379390
Data-Lab Interpretation
Schema on Read instead of (only) Schema on Write
"Schema on Write"
• Data quality managed by formalized ETL process
• Data persisted in tabular, agreed and consistent
form
• Data integration happens in ETL
• Structure must be decided before writing
"Schema on Read"
• Interpretation of data captured in code for each
program accessing the data
• Data quality dependent on code quality
• Data integration happens in code
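The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the record layout, the field names `temp` and `temperature`, and both helper functions are assumptions made for the example.

```python
import json

# Hypothetical raw records as they might land in a data lake.
RAW_LINES = [
    '{"sensor": "s1", "temp": 21.5}',
    '{"sensor": "s2", "temperature": "19.0"}',
    '{"sensor": "s3"}',
]

def write_validated(record, table):
    """Schema on write: the structure is fixed before writing; mismatches are rejected."""
    if set(record) != {"sensor", "temp"} or not isinstance(record["temp"], float):
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)

def read_temperature(line):
    """Schema on read: the reading program decides how to interpret the raw data."""
    doc = json.loads(line)
    value = doc.get("temp", doc.get("temperature"))
    return None if value is None else float(value)

# Schema on write: only conforming records reach the table.
table = []
rejected = 0
for line in RAW_LINES:
    try:
        write_validated(json.loads(line), table)
    except ValueError:
        rejected += 1

# Schema on read: every raw record is kept; interpretation lives in the reader.
temps = [read_temperature(line) for line in RAW_LINES]
print(rejected, temps)
```

With schema on read, data quality depends on `read_temperature`: a second program with different interpretation rules could return different values for the same raw lines.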
EDWH
ETL
Data Source
Consumer
RDBMS BI Tools
Data Lake
Data Source
Consumer
Storage
Script
Data Science
Workbench
Data Science
Workbench
Transform
Transform
Bulk Source
Consumer
• Machine Learning
• Graph Algorithms
• Natural Language Processing
DB
Extract
File
DB
Big Data / Data Lake Architecture
Data Science
Workbench
File Import / SQL Import
“Native” Raw
Big Data Platform (Hadoop Cluster)
Parallel
Processing
Storage
Storage
Raw
Refined/
UsageOpt
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
high latency
Bulk Source
Consumer
DB
Extract
File
DB
Big Data / Data Lake Architecture
BI Tools
Data Science
Workbench
SQL
File Import / SQL Import
“Native” Raw
Big Data Platform (Hadoop Cluster)
Parallel
Processing
Storage
Storage
Raw
Refined/
UsageOpt
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Query
Engine
Enterprise Data
Warehouse
SQL
SQL Export
Data Lake & EDWH Architecture
Bulk Source
DB
Extract
File
DB
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
“Native” Raw
RDBMS
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Parallel
Processing
Query
Engine
Enterprise Data
Warehouse
SQL / Search
Data Lake & EDWH Architecture
Consumer
BI Apps
Data Science
Workbench
SQL
“Native” Raw
RDBMS
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
File Import / SQL Import
Bulk Source
DB
Extract
File
DB
SQL Export
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Parallel
Processing
Query
Engine
Bulk Source
Enterprise Data
Warehouse
SQL / Search
SQL Export
File Import / SQL Import
DB
Extract
File
DB
Data Lake & EDWH Architecture with Streaming Data
SQL
Event Source
Location
Weather
IoT
Data
Mobile
Apps
Social
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Consumer
BI Apps
Data Science
Workbench
Parallel
Processing
Query
Engine
“Native” Raw
Bulk Source
Enterprise Data
Warehouse
SQL / Search
SQL Export
File Import / SQL Import
DB
Extract
File
DB
Data Lake & EDWH Architecture with Streaming Data
Consumer
BI Apps
Data Science
Workbench
SQL
Event Source
Location
Weather
IoT
Data
Mobile
Apps
Social
Event
Hub
Event
Hub
Event
Hub
Event
Stream
Bulk Data Import
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
high latency
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Parallel
Processing
Query
Engine
“Native” Raw
Keep the data in motion …
Data at Rest: Store, (Re)Act, Visualize/Analyze
vs.
Data in Motion: Store, Act, Analyze, Visualize
Event
Hub
Event
Hub
Event Processing Architecture
Event
Hub
“SQL” / Search
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
Low(est) latency, no history
Consumer
Enterprise
App
Dashboard
Stream Processing Cluster
Stream
Processor
Model /
State
Event
Stream
Service
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Rules
Engine
• Complex Event Processing (CEP)
• Machine Learning Model
Execution (Inference)
• State Transition
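A stream processor that keeps state between events can be sketched as follows. This is only an illustration of the idea: the `OverheatDetector` class, the threshold of 80.0 and the rule "three consecutive hot readings" are assumptions for the example, not something the slides prescribe.

```python
from collections import defaultdict

class OverheatDetector:
    """Hypothetical stream processor: keeps per-sensor state and emits an alert
    when a sensor reports `limit` consecutive readings above `threshold`."""

    def __init__(self, threshold=80.0, limit=3):
        self.threshold = threshold
        self.limit = limit
        self.streak = defaultdict(int)  # state: consecutive hot readings per sensor

    def process(self, event):
        sensor, value = event["sensor"], event["value"]
        # State transition: extend the streak or reset it.
        if value > self.threshold:
            self.streak[sensor] += 1
        else:
            self.streak[sensor] = 0
        # Simple event pattern: alert exactly when the streak reaches the limit.
        if self.streak[sensor] == self.limit:
            return {"alert": "overheat", "sensor": sensor}
        return None

events = [
    {"sensor": "a", "value": 85.0},
    {"sensor": "b", "value": 90.0},
    {"sensor": "a", "value": 86.0},
    {"sensor": "b", "value": 70.0},
    {"sensor": "a", "value": 91.0},
]
detector = OverheatDetector()
alerts = [out for out in map(detector.process, events) if out is not None]
print(alerts)
```

Each event is handled once as it arrives, and only the small per-sensor counter is retained, which matches the "low(est) latency, no history" character of this architecture.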
Event
Stream
Event Processing & Data Lake
Service
Event
Stream
Data Flow
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
“SQL” / Search
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Dashboard
Stream Processing Cluster
Stream
Processor
Model /
State
Event
Hub
Rating criteria: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High)
Parallel
Processing
Query
Engine
Rules
Engine
Event
Stream
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
SQL
Export
Event Processing & Data Lake: Lambda Architecture
Event
Stream
Bulk
Data Flow
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Stream Processing Cluster
Stream
Processor
Model /
State
ML Inference
Server
Event
Hub
Consumer
BI Apps
Dashboard
Serving
API
(Merger)
Event Source
Location
Weather
IoT
Data
Mobile
Apps
Social
Event
Stream
Batch
Result
Speed
Result
{ }
Batch Layer
Speed Layer
Parallel
Processing
Query
Engine
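The role of the serving API as a merger of batch result and speed result can be sketched as below. The event layout, the `CUTOFF` timestamp and the function names are assumptions for the example; a real batch layer would run on the big data platform and a real speed layer in the stream processing cluster.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: recompute counts over all events up to the batch cutoff."""
    return Counter(e["key"] for e in events if e["ts"] <= cutoff)

def speed_view(events, cutoff):
    """Speed layer: incremental counts for events the batch run has not seen yet."""
    view = Counter()
    for e in events:
        if e["ts"] > cutoff:
            view[e["key"]] += 1
    return view

def serve(key, batch, speed):
    """Serving API (merger): combine the batch result and the speed result."""
    return batch[key] + speed[key]

events = [
    {"ts": 1, "key": "click"},
    {"ts": 2, "key": "view"},
    {"ts": 3, "key": "click"},
    {"ts": 4, "key": "click"},
]
CUTOFF = 2  # hypothetical timestamp of the last completed batch run
batch = batch_view(events, CUTOFF)
speed = speed_view(events, CUTOFF)
print(serve("click", batch, speed), serve("view", batch, speed))
```

The price of this design is that the same logic (here, counting) exists twice, once per layer, and both implementations have to be kept consistent.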
Event Processing & Data Lake: Kappa Architecture
Event
Stream
Stream Processing Cluster
Stream Processor V1.0 State V1.0
Event
Hub
Event Source
Location
Weather
IoT
Data
Mobile
Apps
Social
Replay
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Bulk
Data Flow
Consumer
BI Apps
Dashboard
Serving
Stream Processor V2.0 State V2.0
Result V1.0
Result V2.0
API
(Switcher)
{ }
Speed Layer
Parallel
Processing
Query
Engine
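In the Kappa variant there is only one processing path: when the logic changes, the retained event stream is run through the new processor version and the serving API switches to the new result. A minimal sketch, assuming a tiny in-memory log and two made-up processor versions (sum per key, then maximum per key):

```python
EVENT_LOG = [("a", 3), ("b", 5), ("a", 4)]  # retained event stream (hypothetical data)

def processor_v1(log):
    """V1.0 logic: sum of values per key."""
    state = {}
    for key, value in log:
        state[key] = state.get(key, 0) + value
    return state

def processor_v2(log):
    """V2.0 logic: maximum value per key (a changed requirement)."""
    state = {}
    for key, value in log:
        state[key] = max(state.get(key, value), value)
    return state

class Switcher:
    """Serving API that points consumers at one result version at a time."""

    def __init__(self):
        self.results = {}
        self.active = None

    def publish(self, version, result):
        self.results[version] = result

    def switch_to(self, version):
        self.active = version

    def get(self, key):
        return self.results[self.active][key]

api = Switcher()
api.publish("v1", processor_v1(EVENT_LOG))
api.switch_to("v1")
before = api.get("a")

# New logic: run the same log through V2.0, then switch consumers over.
api.publish("v2", processor_v2(EVENT_LOG))
api.switch_to("v2")
after = api.get("a")
print(before, after)
```

This only works if the event hub keeps the events long enough to reprocess them; the V1.0 result can be dropped once consumers have moved to V2.0.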
Integrate existing systems with CDC
Service
Event
Stream
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
“SQL” / Search
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Dashboard
Stream Processing Cluster
Stream
Processor
Model /
State
Event
Hub
Change Data
Capture
Parallel
Processing
Query
Engine
Rules
Engine
Bulk
Data Flow
Event
Stream
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
SQL
Export
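What Change Data Capture delivers to the event hub can be illustrated with a small sketch. Log-based CDC tools read the database's transaction log; the example below swaps that for a comparison of two table snapshots, purely to show the shape of the resulting change events. The function name and the event fields (`op`, `key`, `before`, `after`) are assumptions.

```python
def capture_changes(before, after):
    """Compare two snapshots keyed by primary key and return change events."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            events.append({"op": "insert", "key": key, "after": new})
        elif new is None:
            events.append({"op": "delete", "key": key, "before": old})
        elif old != new:
            events.append({"op": "update", "key": key, "before": old, "after": new})
    return events

# Hypothetical customer table at two points in time.
snapshot_1 = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
snapshot_2 = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}

changes = capture_changes(snapshot_1, snapshot_2)
for change in changes:
    print(change)
```

Once changes arrive as events, an existing system can feed the stream processing cluster and the data lake without being modified itself.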
Applications Participate in an Event-Driven Way
Service
Event
Stream
Bulk
Data Flow
Event
Stream
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
“SQL” / Search
Service
Event
Hub
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Microservice Platform
Stream Processing Platform
Stream
Processor
Model /
State
Change Data
Capture
Rules
Engine
Event
Stream
Microservice Data
{ }
API
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
SQL
Export
Move Processing to Edge
Service
Event
Stream
Bulk
Data
Flow
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
“SQL” / Search
Service
Event
Hub
Big Data Platform (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Microservice Cluster
Microservice Data
{ }
API
Stream Processing Cluster
Stream
Processor
Model /
State
Change Data
Capture
Edge Node
Rules
Event Hub
Storage
Parallel
Processing
Query
Engine
Rules
Engine
Event
Stream
Event
Stream
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
SQL
Export
Anyone does what they want
No (central?) documentation
No unique data structure
No unique transformations
No unique KPI definitions
No quality assurance
No data flow analysis
Silo-Thinking
Data availability? Security? Auditability?
= No Data Architecture
Data Swamp
Source: https://www.flickr.com/photos/82134796@N03/10603438015
But be careful ….
Data Lake Zones & Data
Catalog
Data Storage
Landing Zone
Archive Zone
Data Lake Zones
Object
Store
Tape
Raw Zone
Sandbox Zone
Usage-
Optimized Zone
Data Source Data Access
File
System
Event Hub
Object
Store
File
System
Event Hub
Object
Store
File
System
Object
Store
File
System
RDBMS
Object
Store
File
System
RDBMS/
NoSQL
Refined Zone
Object
Store
File
System
Event Hub
NoSQL
In-Memory
Grid
Event Hub/
Store
Disk Service
Disk Service
Data Catalog
Service
Event
Stream
Bulk
Data
Flow
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
“SQL” / Search
Service
Event
Hub
Big Data Platform (Hadoop Cluster)
SQL
Export
Storage
Storage
Raw
Refined/
UsageOpt
Microservice Cluster
Stream Processing Cluster
Stream
Processor
Model /
State
Change Data
Capture
Edge Node
Rules
Event Hub
Storage
Governance
Data Catalog
Rules
Engine
Parallel
Processing
Query
Engine
Microservice Data
{ }
API
Event
Stream
Event
Stream
(Machine Learning Augmented) Data Catalog
A data catalog creates and maintains an inventory
of data assets through discovery, description and
organization of distributed datasets.
It provides context to enable data stewards,
data/business analysts, data engineers, data
scientists and other line of business (LOB) data
consumers to find and understand relevant
datasets for the purpose of extracting business
value.
Modern machine-learning-augmented data
catalogs automate various tedious tasks involved
in data cataloging, including metadata discovery,
ingestion, translation, enrichment and the creation
of semantic relationships between metadata.
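The core of the definition above (an inventory of data assets that can be described, organized and found) can be sketched as a small in-memory structure. This is an illustration only: the class names, the entry fields and the example locations are assumptions, and the sketch does not attempt automated or ML-based discovery.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str
    description: str = ""
    tags: set = field(default_factory=set)

class DataCatalog:
    """Hypothetical minimal catalog: register, describe, tag and search datasets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def tag(self, name, *tags):
        self._entries[name].tags.update(tags)

    def search(self, term):
        """Return names of entries whose name, description or tags match the term."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )

catalog = DataCatalog()
catalog.register(CatalogEntry("orders_raw", "s3://lake/raw/orders", "Order events as delivered"))
catalog.register(CatalogEntry("customers", "jdbc:dwh/customers", "Cleansed customer master data"))
catalog.tag("orders_raw", "sales", "raw-zone")

hits = catalog.search("sales")
print(hits)
```

A machine-learning-augmented catalog would fill in descriptions, tags and relationships automatically instead of relying on the manual `register` and `tag` calls used here.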
Data Catalog
Data Catalog Features
Ranking on Utilization
Rate Catalog Objects
Maintain Multiple Versions
of Catalog Object
Search & Navigation for
Content
Content Check in/out
Certify Official Versions of
Metadata
Analyze and Audit Decision
Processes
Integrate Data Lineage
Levels of Access to Catalog
Objects
Impact Analysis
API for Search / Catalog /
Mgmt Functions
Track Usage of Catalog
Objects
Integration with IAM
Automated Crawling of
Source System
Catalog Cloud-Deployed
Sources
Catalog Hadoop-based
Sources
Catalog BI & Data
Visualization Tools
Catalog Databases
Integration with self-service
Tools
Classify Catalog Objects by
Business Glossary
Supports user-defined
Tagging
Integrates with Data
Profiling
Supports Data Sampling
Quality Metrics
Catalog Machine/IoT Data
Supports Discussion
Threads on Catalog Objects
Annotate & Comment on
Catalog Objects
Catalog Unstructured Data
with NLP functionality
Semantic Search
Classify Catalog Objects by
Domain
Publish/Subscribe on
Changes of Catalog Objects
AI/ML based
Recommendation
Detect Similar/Duplicate or
Related Data
Easy to use, intuitive GUI
Supports Manual Curation
Supports Automated (ML
based) tagging
Supports ongoing discovery
of new data sets
Natural Language Search
Faceted Search
Catalog Object Value
Estimation
Incentive-based
Participation
Encouragement
Assign Data Steward
Traditional vs. Cloud Native
Big Data Platforms
Traditional vs. Cloud Native Big Data
Data Local Compute
(traditional)
Separate Compute and Storage
(cloud native)
Worker #1
Disk
Processing
Master Node
Worker #2
Disk
Processing
Worker #3
Disk
Processing
Network
Storage
Disk Disk Disk
Compute #1
Processing
Compute #2
Processing
Compute #3
Processing
Network
Master Node
Network
Separation of compute
and storage – the
fundamental difference
• store data in Object
Storage instead of HDFS
• bring up Compute nodes
only for data processing
• multiple workloads on
separate clusters can
access same data
Traditional vs. Cloud Native Big Data
Criterion | Traditional | Cloud Native
Data Local Compute | Yes | No
Network Bandwidth Req. | Low | High
Scalable, shared-usage of Data | No (only within cluster) | Yes
Persistence | HDFS | Object Storage
Data Lifecycle | Tiered Storage | Built-in (cloud)
Compute | Hadoop, Spark | Hadoop, Spark
Serverless Processing | No | Yes
Infrastructure | Hadoop Cluster | Cloud, Container Orchestration
Entry Threshold | High | Low
Modern Data Platform
Data Platform
Service
Event
Stream
Bulk
Data
Flow
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
“SQL” / Search
Service
Event
Hub
Big Data Platform (Hadoop Cluster)
SQL
Export
Storage
Storage
Raw
Refined/
UsageOpt
Microservice Cluster
Stream Processing Cluster
Stream
Processor
Model /
State
Change Data
Capture
Edge Node
Rules
Event Hub
Storage
Governance
Data Catalog
Rules
Engine
Parallel
Processing
Query
Engine
Microservice Data
{ }
API
Event
Stream
Event
Stream
Modern Data Platform
[Diagram: Modern Data Platform reference architecture, repeated from above]
On-Premises – Traditional
Hadoop YARN
Pig
HDFS
HDFS
Kafka
Confluent
Hive
Kafka Streams
Spring Boot NoSQL
RDBMS
NoSQL
RDBMS
RDBMS
Atlas
Debezium Streamsets
Flume
Sqoop Flume
Impala
MapReduce
Spark
SparkSQL
Spark Streaming
Zeppelin
Jupyter
[Diagram: Modern Data Platform reference architecture, repeated from above]
Oracle Cloud
Kafka
Confluent
Streamsets
Nifi
Streamsets
Nifi
Object Storage
Archive Storage
Object Storage
Archive Storage
Data Science
Big Data Cloud Service
Machine
Learning
Streaming
Data Science
Functions
Visual Builder
Java
NoSQL DB
Data Catalog
Autonomous
Transaction Proc
NoSQL DB
Autonomous
DWH
Big Data SQL
Cloud Service
GoldenGate
Cloud Service
Kafka Streams/
KSQL
SOA Cloud Service
Container Engine for
Kubernetes
Zeppelin
Jupyter
Transfer Service
Container Pipelines
Container
Registry
[Diagram: Modern Data Platform reference architecture, repeated from above]
AWS Cloud
Kafka
Confluent
Streamsets
Nifi
Streamsets
Nifi
Zeppelin
Jupyter
S3
S3 Glacier
Deep Archive
S3
Dynamo DB
Redshift
Redshift
Spectrum
Spark on EMR Glue
Snowball
Data Sync
Athena
Presto on EMR
SageMaker
Deep Learning
Containers
Spark Streaming on EMR
Databricks on AWS
Kinesis Data Analytics
Lambda
Batch
Spring Boot
QuickSight
Zeppelin on EMR
Databricks on AWS
RStudio on EMR
API Gateway
Managed Streaming
for Kafka
Kinesis Data Firehose
Confluent Cloud
[Diagram: Modern Data Platform reference architecture, repeated from above]
On-Premises – Cloud Native
Istio
Kubernetes
Docker
Spark
MinIO
S3
MinIO
S3
Kafka
Confluent
NoSQL
Presto
Kafka Streams
Spring Boot NoSQL
RDBMS
NoSQL
RDBMS
RDBMS
Atlas
Debezium Streamsets
Nifi
Streamsets
Nifi
SparkSQL
Spark Streaming
Zeppelin
Jupyter
Physical Data Lake vs. Virtual
Data Lake
Physical Data Lake
Data Lake (Hadoop Cluster)
Parallel
Processing
Storage
Storage
Raw
Refined/
UsageOpt
Consumer
Query
Engine
BI Apps
Data Source 1
File
Data Source 2
RDBMS
Data Source 3
NoSQL
Data Source 4
Enterprise
App
Governance
Data Catalog | Data Lineage | Encryption | Policy Mgmt
Query
Data Ingest
Discovery
Catalog
Virtual Data Lake
Data Source 1
File
Data Source 2
RDBMS
Data Source 3
NoSQL
Data Source 4
Enterprise
App
Data Virtualization
Query
Engine
Consumer
BI Apps
Governance
Data Lineage | Logical Data Catalog | Encryption | Policy Mgmt
Discovery
Catalog
Catalog
Query
Query
Physical Data Lake as part of Virtual Data Lake
Data Source 1
File
Data Source 2
RDBMS
Data Source 3
NoSQL
Data Source 4
Enterprise
App
Data Virtualization
Query
Engine
Consumer
BI Apps
Governance
Data Lineage | Logical Data Catalog
Data Lake (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Encryption | Policy Mgmt
Parallel
Processing
Query
Engine
Query
Data Ingest
Query
Discovery
Catalog
Catalog
Query
Multiple Data Lakes form a Virtual Data Lake
Data Lake 1 (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Data Lake 2 (Hadoop Cluster)
Storage
Storage
Raw
Refined/
UsageOpt
Data Virtualization
Query
Engine
Consumer
BI Apps
Data Source 1
File
Data Source 2
RDBMS
Governance
Data Lineage | Logical Data Catalog | Encryption | Policy Mgmt
Parallel
Processing
Query
Engine
Parallel
Processing
Query
Engine
Query
Discovery
Catalog
Catalog
Query
Query
AI & Machine Learning
AI & Machine Learning: Training vs. Inference
© 2019 Gartner, Inc. ID: 354956
Raw Data
Logical Flow of Data
Trained Model
App or Service
Featuring
Capability
Inference
Applying This
Capability to
New Data
New
Data
“?”
“cat”
Deep-Learning
Framework
Training
Learning a New
Capability From
Existing Data
“cat”
Training
Dataset
“dog” “cat”
Logical Data
Warehouse
Edge Device, On-
Premises or
Cloud-Hosted
On-Premises or
Cloud-Hosted
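The split between training (learning a capability from existing data) and inference (applying it to new data) can be sketched without any framework. The slide refers to a deep-learning framework; the example below deliberately swaps in a much simpler nearest-centroid classifier, and the two-dimensional feature vectors are made-up values.

```python
def train(dataset):
    """Training: learn one centroid per label from existing, labeled data."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def infer(model, features):
    """Inference: apply the trained model to new, unlabeled data."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: dist2(model[label]))

# Hypothetical training dataset: (feature vector, label).
training_set = [
    ([1.0, 1.0], "cat"),
    ([1.2, 0.8], "cat"),
    ([4.0, 4.2], "dog"),
    ([3.8, 4.0], "dog"),
]

model = train(training_set)            # done once, where the training data lives
prediction = infer(model, [1.1, 0.9])  # done per request, wherever the app runs
print(prediction)
```

The trained `model` is the only artifact that has to travel from the training environment to the place where inference runs, which is why the two steps can be hosted in different locations.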
Data Platform
Service
Event
Stream
Bulk
Data
Flow
Bulk Source
Event Source
Location
DB
Extract
File
Weather
DB
IoT
Data
Mobile
Apps
Social
File Import / SQL Import
Consumer
BI Apps
Data Science
Workbench
Enterprise
App
Enterprise Data
Warehouse
SQL / Search
SQL
“Native” Raw
RDBMS
“SQL” / Search
Service
Event
Hub
Big Data Platform (Hadoop Cluster)
SQL
Export
Storage
Storage
Raw
Refined/
UsageOpt
Stream Processing Cluster
Stream
Processor
Model /
State
Change Data
Capture
Edge Node
Rules
Event Hub
Storage
Governance
Data Catalog
Parallel
Processing
Query
Engine
Event
Stream
Event
Stream
Modern Data Platform
ML Inference
Server
Microservice Cluster
Microservice Data
{ }
API
ML Inference
Server
AI & Machine Learning: Model Training & Deployment
Backing Service
Integration of Machine Learning Model in application
Trained ML
Model
Trained ML
Model
ML
Serving
ML
Serving
Application
Trained ML
Model
ML
Serving
Application
ML as an API
ML in Application
Trained ML
Model
ML
Serving
Trained ML
Model
ML
Serving
Application
Event Hub
ML and Stream Processing
Event Hub
Application
ML as a Cloud Service
Trained ML
Model
ML
Serving

Fundamentals Big Data and AI Architecture

  • 1. BASEL | BERN | BRUGG | BUCHAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENEVA | HAMBURG | COPENHAGEN | LAUSANNE | MANNHEIM | MUNICH | STUTTGART | VIENNA | ZURICH. http://guidoschmutz.wordpress.com, @gschmutz. Fundamentals of Big Data and AI Architecture. DOAG Data Centric Day, 25.9.2019 in Cologne. Guido Schmutz
  • 2. Guido Schmutz: Working at Trivadis for more than 22 years; Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data; Oracle Groundbreaker Ambassador & Oracle ACE Director; Head of Trivadis Architecture Board; Technology Manager @ Trivadis; more than 30 years of software development experience. Contact: guido.schmutz@trivadis.com; Blog: http://guidoschmutz.wordpress.com; Slideshare: http://www.slideshare.net/gschmutz; Twitter: gschmutz. 169th edition
  • 4. Data Warehouse Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); ETL Engine: Extract, Transform & Load (ETL), high latency; Enterprise Data Warehouse (RDBMS); Consumer (BI Tools).
  • 5. Data Warehouse is an architecture: layered model, controlled ETL, single point of truth, query-optimized data marts. Tested, optimized, quality assured, “operated”. Standard reporting and ad hoc reporting on the DWH base. Perfect and fast for new requirements on known and prepared data and structures. Data Warehouse is not “agile”: no free definition and shaping of arbitrary analytical questions. = Data Production. Source: https://www.flickr.com/photos/128950981@N04/15452926858
  • 6. DWH Architecture – what about Streaming Data? Diagram labels: Bulk Source (DB, Extract, File, DB); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); ETL Engine: Extract, Transform & Load (ETL); Enterprise Data Warehouse (RDBMS); Consumer (BI Tools). Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 7. … to Big Data / Data Lake
  • 8. Initial Idea of a Data Lake … “Reporting, visualization, analytics and machine learning”; “Single store of all data in the enterprise”; “Should put an end to data silos.”; “Example: Distributed file system used in Apache Hadoop”. Adapted from Wikipedia.org
  • 9. Data Lake. Data Lake is an infrastructure: permanently new data and structures; schema on read; really large amounts of data; explorative working (research); established error culture; new user groups ([Data] Scientists); freedom of data choice; freedom of source choice; self-service; data labs; ad hoc and one-shot implementations; query + advanced analytics. = Research & Development. Data-Lab interpretation. Source: https://www.flickr.com/photos/ian-arlett/34233379390
  • 10. Schema on Read instead of (only) Schema on Write. "Schema on Write": data quality is managed by a formalized ETL process; data is persisted in tabular, agreed and consistent form; data integration happens in ETL; the structure must be decided before writing. "Schema on Read": the interpretation of the data is captured in code for each program accessing the data; data quality depends on code quality; data integration happens in code. Diagram labels (Schema on Write): Data Source, ETL, EDWH (RDBMS), Consumer (BI Tools). Diagram labels (Schema on Read): Data Source, Data Lake (Storage), Script, Transform, Consumer (Data Science Workbench).
  • 11. Big Data / Data Lake Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt) and Parallel Processing; “Native” Raw; Consumer (Data Science Workbench: Machine Learning, Graph Algorithms, Natural Language Processing); high latency. Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 12. Big Data / Data Lake Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL; “Native” Raw; Consumer (BI Tools, Data Science Workbench). Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 13. Data Lake & EDWH Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL Export; Enterprise Data Warehouse (RDBMS); SQL; “Native” Raw; Consumer (BI Apps, Data Science Workbench). Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 14. Data Lake & EDWH Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL Export; Enterprise Data Warehouse (RDBMS); SQL / Search; SQL; “Native” Raw; Consumer (BI Apps, Data Science Workbench). Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 15. Data Lake & EDWH Architecture with Streaming Data. Diagram labels: Bulk Source (DB, Extract, File, DB); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL Export; Enterprise Data Warehouse; SQL / Search; SQL; “Native” Raw; Consumer (BI Apps, Data Science Workbench).
  • 16. Data Lake & EDWH Architecture with Streaming Data. Diagram labels: Bulk Source (DB, Extract, File, DB); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Bulk Data Import; File Import / SQL Import; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL Export; Enterprise Data Warehouse; SQL / Search; SQL; “Native” Raw; Consumer (BI Apps, Data Science Workbench); high latency. Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 17. Keep the data in motion … Data at Rest vs. Data in Motion. Diagram labels: Store, (Re)Act, Visualize/Analyze; Store, Act, Analyze, Visualize.
  • 18. Event Processing Architecture. Diagram labels: Bulk Source (DB, Extract, File, DB); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Stream Processing Cluster (Stream Processor, Rules Engine, Model / State); Service; “SQL” / Search; Consumer (Enterprise App, Dashboard); low(est) latency, no history. Capabilities listed: Complex Event Processing (CEP), Machine Learning Model Execution (Inference), State Transition. Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 19. Event Processing & Data Lake. Diagram labels: Bulk Source (DB, Extract, File, DB); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Rules Engine, Model / State); Service; Big Data Platform (Hadoop Cluster) with Storage (Raw, Refined/UsageOpt), Parallel Processing and Query Engine; SQL Export; Enterprise Data Warehouse (RDBMS); “SQL” / Search; SQL / Search; SQL; “Native” Raw; Consumer (BI Apps, Data Science Workbench, Enterprise App, Dashboard). Rating scales shown: Elasticity (Yes/No), End-to-End Latency (Low/High), Ad-Hoc (SQL) Queries (Yes/No), Storage Costs (Low/High), Supports Raw Data (Yes/No), Supports Streaming Data (Yes/No), Access Latency (Low/High).
  • 20. Event Processing & Data Lake: Lambda Architecture. [Diagram labels: Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Bulk Data Flow; Batch Layer: Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine), Batch Result; Speed Layer: Stream Processing Cluster (Stream Processor, Model / State), ML Inference Server, Speed Result; Serving: { } API (Merger); Consumer (BI Apps, Dashboard).]
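The "API (Merger)" box in the Lambda diagram combines the batch result with the speed result at query time. A minimal Python sketch of that idea follows. It assumes an additive metric (event counts per key) and a speed view that only covers events not yet absorbed by the batch run; the function name `merge_views` and the sample keys are hypothetical and do not come from the slides.

```python
from collections import Counter


def merge_views(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Combine a batch-layer view with a speed-layer view.

    Assumption: both views hold counts, and the speed view contains only
    events that arrived after the last batch run, so adding is correct.
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical sample data: event counts per sensor.
batch_result = {"sensor-a": 120, "sensor-b": 75}  # recomputed periodically from raw storage
speed_result = {"sensor-a": 3, "sensor-c": 1}     # incremental, low-latency counts

serving_view = merge_views(batch_result, speed_result)
print(serving_view)
```

For non-additive metrics (for example distinct counts or medians) a plain sum would be wrong, and the merger would need a metric-specific combination rule.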
  • 21. Event Processing & Data Lake: Kappa Architecture. [Diagram labels: Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Event Hub; Event Stream; Replay; Speed Layer: Stream Processing Cluster (Stream Processor V1.0 with State V1.0, Stream Processor V2.0 with State V2.0), Result V1.0, Result V2.0; Serving: { } API (Switcher); Bulk Data Flow; Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Consumer (BI Apps, Dashboard).]
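The Kappa diagram shows two stream processor versions producing separate results, with an "API (Switcher)" in front of them. A minimal Python sketch of that flow is given below, assuming the event hub retains the full event log so that a new processor version can reprocess it from the start. The log contents, the two processor functions, and the `Switcher` class are hypothetical illustrations, not part of the slides.

```python
from collections import defaultdict

# Hypothetical retained event log: (sensor id, temperature reading).
EVENT_LOG = [("sensor-a", 20.0), ("sensor-a", 22.0), ("sensor-b", 18.0)]


def processor_v1(events):
    """Version 1.0 of the logic: count events per sensor."""
    counts = defaultdict(int)
    for sensor, _ in events:
        counts[sensor] += 1
    return dict(counts)


def processor_v2(events):
    """Version 2.0 of the logic: average reading per sensor."""
    totals, counts = defaultdict(float), defaultdict(int)
    for sensor, value in events:
        totals[sensor] += value
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}


class Switcher:
    """Serves the result of exactly one processor version at a time."""

    def __init__(self):
        self.results = {}
        self.active = None

    def publish(self, version, result):
        self.results[version] = result

    def switch_to(self, version):
        if version not in self.results:
            raise KeyError(f"no result published for version {version}")
        self.active = version

    def query(self):
        return self.results[self.active]


api = Switcher()
api.publish("1.0", processor_v1(EVENT_LOG))
api.switch_to("1.0")
v1_view = api.query()

# Deploy V2.0: reprocess the same log, then switch consumers over.
api.publish("2.0", processor_v2(EVENT_LOG))
api.switch_to("2.0")
v2_view = api.query()
```

In this sketch both results stay available until the switch, so falling back to V1.0 is a single `switch_to` call.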
  • 22. Integrate existing systems with CDC. [Diagram labels: Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Model / State); Rules Engine; Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Consumer (BI Apps, Data Science Workbench, Enterprise App, Dashboard).]
  • 23. Applications participate Event-Driven. [Diagram labels: Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Platform (Stream Processor, Model / State); Rules Engine; Microservice Platform (Microservice: Data, { } API); Big Data Platform (Storage: Raw, Refined/UsageOpt); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Consumer (BI Apps, Data Science Workbench, Enterprise App).]
  • 24. Move Processing to Edge. [Diagram labels: Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Model / State); Rules Engine; Microservice Cluster (Microservice: Data, { } API); Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Consumer (BI Apps, Data Science Workbench, Enterprise App).]
  • 25. But be careful … Data Swamp
      • Anyone does what they want
      • No (central?) documentation
      • No unique data structure
      • No unique transformations
      • No unique KPI definitions
      • No quality assurance
      • No data flow analysis
      • Silo thinking
      • Data availability? Security? Auditability?
      = No Data Architecture
      Source: https://www.flickr.com/photos/82134796@N03/10603438015
  • 26. Data Lake Zones & Data Catalog
  • 27. Data Lake Zones. [Diagram labels: Data Source; Data Storage; Data Access; zones: Landing Zone, Raw Zone, Refined Zone, Usage-Optimized Zone, Sandbox Zone, Archive Zone; storage technologies shown across the zones: File System, Object Store, Event Hub, Event Hub/Store, RDBMS, RDBMS/NoSQL, NoSQL, In-Memory Grid, Tape; Disk; Service.]
  • 28. Data Catalog. [Diagram labels: Governance (Data Catalog); Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Model / State); Rules Engine; Microservice Cluster (Microservice: Data, { } API); Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Consumer (BI Apps, Data Science Workbench, Enterprise App).]
  • 29. (Machine Learning Augmented) Data Catalog A data catalog creates and maintains an inventory of data assets through discovery, description and organization of distributed datasets. It provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value. Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata.
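The definition above describes a catalog as an inventory of data assets that consumers can search to find and understand datasets. A minimal in-memory Python sketch of that idea follows. The class names, fields, and sample entries are hypothetical and only illustrate the concept; they do not correspond to any catalog product, and the sketch leaves out automated discovery and ML-based enrichment.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset (hypothetical minimal schema)."""
    name: str
    location: str
    description: str = ""
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """In-memory inventory of datasets with simple keyword search."""

    def __init__(self):
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of datasets whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or term in {tag.lower() for tag in entry.tags}
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="weather_raw",
    location="raw-zone/weather/",  # hypothetical path
    description="Unprocessed weather readings",
    tags={"iot", "raw"},
))
catalog.register(CatalogEntry(
    name="sales_orders",
    location="refined-zone/sales/",  # hypothetical path
    description="Cleansed order data",
    tags={"refined"},
))

weather_hits = catalog.search("weather")
raw_hits = catalog.search("raw")
```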
  • 30. Data Catalog Features
      • Ranking on Utilization Rate Catalog Objects
      • Maintain Multiple Versions of Catalog Object
      • Search & Navigation for Content
      • Content Check in/out
      • Certify Official Versions of Metadata
      • Analyze and Audit Decision Processes
      • Integrate Data Lineage
      • Levels of Access to Catalog Objects
      • Impact Analysis
      • API for Search / Catalog / Mgmt Functions
      • Track Usage of Catalog Objects
      • Integration with IAM
      • Automated Crawling of Source System
      • Catalog Cloud-Deployed Sources
      • Catalog Hadoop-based Sources
      • Catalog BI & Data Visualization Tools
      • Catalog Databases
      • Integration with self-service Tools
      • Classify Catalog Objects by Business Glossary
      • Supports user-defined Tagging
      • Integrates with Data Profiling
      • Supports Data Sampling
      • Quality Metrics
      • Catalog Machine/IoT Data
      • Supports Discussion Threads on Catalog Objects
      • Annotate & Comment on Catalog Objects
      • Catalog Unstructured Data with NLP functionality
      • Semantic Search
      • Classify Catalog Objects by Domain
      • Publish/Subscribe on Changes of Catalog Objects
      • AI/ML based Recommendation
      • Detect Similar/Duplicate or Related Data
      • Easy to use, intuitive GUI
      • Supports Manual Curation
      • Supports Automated (ML based) tagging
      • Supports ongoing discovery of new data sets
      • Natural Language Search
      • Faceted Search
      • Catalog Object Value Estimation
      • Incentive-based Participation Encouragement
      • Assign Data Steward
  • 31. Traditional vs. Cloud Native Big Data Platforms
  • 32. Traditional vs. Cloud Native Big Data
      Data Local Compute (traditional): Master Node and Workers #1 to #3, each with its own Disk and Processing, connected by the Network.
      Separate Compute and Storage (cloud native): Master Node and Compute nodes #1 to #3 (Processing only), connected over the Network to shared Storage (Disks).
      Separation of compute and storage – the fundamental difference
      • store data in Object Storage instead of HDFS
      • bring up Compute nodes only for data processing
      • multiple workloads on separate clusters can access same data
  • 33. Traditional vs. Cloud Native Big Data (Criterion: Traditional / Cloud Native)
      • Data Local Compute: Yes / No
      • Network Bandwidth Req.: Low / High
      • Scalable, shared usage of Data: No (only within cluster) / Yes
      • Persistence: HDFS / Object Storage
      • Data Lifecycle: Tiered Storage / Built-in (cloud)
      • Compute: Hadoop, Spark / Hadoop, Spark
      • Serverless Processing: no / yes
      • Infrastructure: Hadoop Cluster / Cloud, Container Orchestration
      • Entry Threshold: high / low
  • 35. Modern Data Platform. [Diagram labels: Data Platform; Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Model / State); Rules Engine; Microservice Cluster (Microservice: Data, { } API); Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Governance (Data Catalog); Consumer (BI Apps, Data Science Workbench, Enterprise App).]
  • 36. On-Premises – Traditional. [Diagram: the Modern Data Platform architecture annotated with technologies: Hadoop YARN, Pig, HDFS, MapReduce, Hive, Impala, Spark, SparkSQL, Spark Streaming, Kafka, Confluent, Kafka Streams, Spring Boot, NoSQL, RDBMS, Atlas, Debezium, Streamsets, Flume, Sqoop, Zeppelin, Jupyter.]
  • 37. Oracle Cloud. [Diagram: the Modern Data Platform architecture annotated with technologies: Kafka, Confluent, Streamsets, Nifi, Object Storage, Archive Storage, Data Science, Big Data Cloud Service, Machine Learning, Streaming, Functions, Visual Builder, Java, NoSQL DB, Data Catalog, Autonomous Transaction Proc, Autonomous DWH, Big Data SQL Cloud Service, GoldenGate Cloud Service, Kafka Streams/KSQL, SOA Cloud Service, Container Engine for Kubernetes, Zeppelin, Jupyter, Transfer Service, Container Pipelines, Container Registry.]
  • 38. AWS Cloud. [Diagram: the Modern Data Platform architecture annotated with technologies: Kafka, Confluent, Confluent Cloud, Managed Streaming for Kafka, Kinesis Data Firehose, Kinesis Data Analytics, Streamsets, Nifi, S3, S3 Glacier Deep Archive, Dynamo DB, Redshift, Redshift Spectrum, Spark on EMR, Spark Streaming on EMR, Presto on EMR, Zeppelin on EMR, RStudio on EMR, Databricks on AWS, Glue, Snowball, Data Sync, Athena, SageMaker, Deep Learning Containers, Lambda, Batch, Spring Boot, QuickSight, API Gateway, Zeppelin, Jupyter.]
  • 39. On-Premises – Cloud Native. [Diagram: the Modern Data Platform architecture annotated with technologies: Istio, Kubernetes, Docker, Spark, SparkSQL, Spark Streaming, MinIO S3, Kafka, Confluent, Kafka Streams, Presto, Spring Boot, NoSQL, RDBMS, Atlas, Debezium, Streamsets, Nifi, Zeppelin, Jupyter.]
  • 40. Physical Data Lake vs. Virtual Data Lake
  • 41. Physical Data Lake. [Diagram labels: Data Source 1 (File), Data Source 2 (RDBMS), Data Source 3 (NoSQL), Data Source 4 (Enterprise App); Data Ingest; Data Lake (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Query; Consumer (BI Apps); Governance (Data Catalog, Data Lineage, Encryption, Policy Mgmt); Discovery; Catalog.]
  • 42. Virtual Data Lake. [Diagram labels: Data Source 1 (File), Data Source 2 (RDBMS), Data Source 3 (NoSQL), Data Source 4 (Enterprise App); Data Virtualization (Query Engine); Query; Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt); Discovery; Catalog.]
  • 43. Physical Data Lake as part of Virtual Data Lake. [Diagram labels: Data Source 1 (File), Data Source 2 (RDBMS), Data Source 3 (NoSQL), Data Source 4 (Enterprise App); Data Ingest; Data Lake (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Data Virtualization (Query Engine); Query; Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt); Discovery; Catalog.]
  • 44. Multiple Data Lakes form a Virtual Data Lake. [Diagram labels: Data Lake 1 and Data Lake 2 (each with Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); Data Source 1 (File), Data Source 2 (RDBMS); Data Virtualization (Query Engine); Query; Consumer (BI Apps); Governance (Logical Data Catalog, Data Lineage, Encryption, Policy Mgmt); Discovery; Catalog.]
  • 45. AI & Machine Learning
  • 46. AI & Machine Learning: Training vs. Inference. Training: Learning a New Capability From Existing Data. Inference: Applying This Capability to New Data. [Diagram labels: Logical Flow of Data; Raw Data; Logical Data Warehouse; Training Dataset ("cat", "dog", "cat"); Deep-Learning Framework; Trained Model; App or Service Featuring Capability; New Data ("?") yielding "cat"; deployment labels: Edge Device, On-Premises or Cloud-Hosted; On-Premises or Cloud-Hosted.] Source: © 2019 Gartner, Inc. ID: 354956
  • 47. Modern Data Platform. [Diagram labels: Data Platform; ML Inference Server (shown twice); Bulk Source (DB, Extract, File); Event Source (Location, Weather, IoT Data, Mobile Apps, Social); Edge Node (Rules, Event Hub, Storage); Change Data Capture; Event Hub; Event Stream; Bulk Data Flow; File Import / SQL Import; Stream Processing Cluster (Stream Processor, Model / State); Microservice Cluster (Microservice: Data, { } API); Big Data Platform (Storage: Raw, Refined/UsageOpt; Parallel Processing; Query Engine); SQL Export; "Native" Raw; Enterprise Data Warehouse (RDBMS; SQL / Search); Service; "SQL" / Search; Governance (Data Catalog); Consumer (BI Apps, Data Science Workbench, Enterprise App).]
  • 48. AI & Machine Learning: Model Training & Deployment
  • 49. Integration of Machine Learning Model in application. Patterns shown: ML in Application; ML as an API; ML as a Cloud Service; ML and Stream Processing. [Diagram labels: Application; Trained ML Model; ML Serving; Backing Service; Event Hub.]