SlideShare ist ein Scribd-Unternehmen logo
1 von 55
Downloaden Sie, um offline zu lesen
Big Data Integration Patterns
Michael Häusler
Jun 12, 2017
The social network gives scientists new tools
to connect, collaborate, and keep up with the
research that matters most to them.
ResearchGate is built
for scientists.
Our mission is to connect the world of science
and make research open to all.
12+ million
Members
100+ million
Publications
1,500+ million
Citations
operational systems
end users
12+ M scientists
“Analytics” Cluster
(batch)
“Live” Cluster
(near realtime)
internal users
HBase
replication
transactional load
(HBase reads / writes)
continuous updates
(Flink streaming results)
batch updates
(MR / Hive / Flink results)
data ingestion
Big Data
65+ 370+ 3,000+
Engineers Data Ingestion Jobs per Day
Big Data
Yarn Applications per Day
65+ 3,000+
Engineers Yarn Applications per Day
Developer Productivity
Ease of Maintenance
Ease of Operations
Big Data Architecture
Integration Patterns & Pricinicples
Patterns & Principles
Integration patterns should be strategic, but also ...
should be driven by use cases
should tackle real world pain points
should not be dictated by a single technology
Patterns & Principles
Big data is still a fast moving space
Big data batch processing today is quite different compared to 5 years ago
Big data stream processing is evolving heavily right now
Big data architecture
must evolve over time
First Big Data Use Case
Early 2011, Author Analysis
Author Analysis – Clustering and Disambiguation
Author Analysis – High Product Impact
Enriching User Generated Content
Users and batch flows continuously enrich an evolving dataset
Both user actions and batch flow results ultimately affect the same live database
Users Live Database Batch flow
Bibliographic Metadata – Data Model
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
community publication service asset service
Bibliographic Metadata – Services
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
Implementation
publication
service
author analysis
community
Implementation
publication
service
author analysis
community
data sources
data ingestion
input data
data processing
intermediate results
final results
export of results
differ
#1 Decouple Data Ingestion
Implementation
publication
service
author analysis
community
postgres reuse input data
mongodb
debug
Debugging an Error on Production
Your flow
has unit and integrations tests
but still breaks unexpectedly in production
You need to find the root cause
Is it a change in input data?
Is it a change on the cluster?
Is it a race condidition?
Crucial capabilities
Easy adhoc analysis of all involved data (input, intermediate, result)
Rerun current flow with current cluster configuration on yesterday’s data
Confirm hotfix by re-running on today’s data (exactly the same data that triggered the bug)
How to decouple?
Ingesting Data as Needed?
Publishing Data as Needed?
Dedicated Data Ingestion!
Platform Data Import
... Hive...
Adhoc Analytics
Platform Data Import
Dedicated component, but generic
Every team can onboard new data sources, as required by use cases
Every ingested source is immediately available for all consumers (incl. analytics)
Feature parity for all data sources (e.g., mounting everything in Hive)
#2 Speak a common format*
* have at least one copy of all data in a common format (e.g., avro)
Formats
Text SequenceFiles Avro ORC
X X +
schema evolution
self describing
reflect datum reader
flexible for batch & streaming
columnar
great for batch
Speak a common format
Have at least one copy of all data in a common format
Your choice of processing framework should not be limited by format of existing data
Every ingested source should be available for all consumers
When optimizing for a framework (e.g., ORC for Hive) consider a copy
#3 Speak a common language*
* continuously propagate schema changes
Structured or unstructured data?
... ...
mongodb (schemaless)
service knows structure
postgres (schema)
Data Warehouse vs. Data Lake
... ...
data lake
assume no schema
(defer schema to consumer)
data warehouse
enforce schema at ingestion
(schema on write)
X X
Can we have both?
Preserve schema information that is already present
some times at database level
many times at application level
Preserve full data – be truthful to our data source
continuously propagate schema changes
Can we have something like a Data Lakehouse?
Entities Define Schema
Code first
entities within owning service define schema
Auto conversion preferred
conversion to other representations via annotations
(JSON, BSON, Avro, ...)
Continuously propagate schema changes
Data ingestion process is generic and driven by avro schema
Changes in avro schema are continuously propagated to data ingestion process
Consumers with old schema can still read data due to avro schema evolution
Caveat: breaking changes still have to be dealt with by a change process
Everyone speaks the same language
Extra Benefit
service and batch processor
can share business logic
#4 Model Data Dependencies Explicitly
Model Data Dependencies Explicitly
Model Data Dependencies Explicitly
memento
publish
poll
1
2
3
4
Memento v2
memento publish
unique artifactId
memento poll <waiting-time>
Model Data Dependencies Explicitly
More flexible scheduling – run flows as early as possible
Allows multiple ingestion or processing attempts
Allows immutable data (repeatable read)
Allows analysis of dependency graph
which datasets are used by what flow
#5 Decouple export of results
Decouple export of results
Decouple export of results
Push results via HTTP to service
Export of results just becomes a client of the service
service does not have to be aware of big data technologies
Service can validate results, e.g.,
plausibility checks
optimistic locking
Makes testing much easier
Avro → Http
Part of the flow, but standardized component
Handles tracking of progress
treats input file as a “queue”
converts records to http calls
can be interrupted and resumed anytime
Sends standardized headers, e.g.,
X-rg-client-id: author-analysis
Handles backpressure signals from services
#6 Model Flow Orchestration Explicitly
Model Flow Orchestration Explicitly
Consider using an execution system like Azkaban, Luigi, or Airflow
Establish coding standards for orchestration, e.g.,
inject paths from outside – don’t construct them in your flow
inject calculation dates – never call now()
inject configuration settings – don’t hardcode -D mapreduce.map.java.opts=-Xmx4096m
foresee environment specific settings
Think about
ease of operations
tuning of settings
upgrades
What about Stream Processing?
Sources of Streaming Data
entity conveyor
timeseries data
non-timeseries data
(e.g., graph data)
kafka
kafka
Stream Processing
stream processor mqcom
#2 Speak a common format #5 Decouple export of results
#1 Decouple data ingestion
#3 Speak a common language
#6 Model flow execution explicitly
What about #4 ?
Model Data Dependencies Explicitly
We think about it
Depends on use cases and pain points
Potentially put Kafka topics into Memento
storing “offsets of interest” from producers
facilitate switching between incompatible versions of stream processors
Evolving Big Data Architecture
stream processor
batch processor
Thank you!
Michael Häusler, Head of Engineering
https://www.researchgate.net/profile/Michael_Haeusler
https://www.researchgate.net/careers

Weitere ähnliche Inhalte

Was ist angesagt?

TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
A High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeA High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeDatabricks
 
Backbone using Extensible Database APIs over HTTP
Backbone using Extensible Database APIs over HTTPBackbone using Extensible Database APIs over HTTP
Backbone using Extensible Database APIs over HTTPMax Neunhöffer
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Introducing to Datamining vs. OLAP - مقدمه و مقایسه ای بر داده کاوی و تحلیل ...
Introducing to Datamining vs. OLAP -  مقدمه و مقایسه ای بر داده کاوی و تحلیل ...Introducing to Datamining vs. OLAP -  مقدمه و مقایسه ای بر داده کاوی و تحلیل ...
Introducing to Datamining vs. OLAP - مقدمه و مقایسه ای بر داده کاوی و تحلیل ...y-asgari
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLCloudera, Inc.
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsshanker_uma
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Spark Summit
 
Quantopix analytics system (qas)
Quantopix analytics system (qas)Quantopix analytics system (qas)
Quantopix analytics system (qas)Al Sabawi
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorialtekslate1
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integrationDylan Wan
 

Was ist angesagt? (20)

TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)
 
A High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta LakeA High Performance Mutable Engagement Activity Delta Lake
A High Performance Mutable Engagement Activity Delta Lake
 
Backbone using Extensible Database APIs over HTTP
Backbone using Extensible Database APIs over HTTPBackbone using Extensible Database APIs over HTTP
Backbone using Extensible Database APIs over HTTP
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Introducing to Datamining vs. OLAP - مقدمه و مقایسه ای بر داده کاوی و تحلیل ...
Introducing to Datamining vs. OLAP -  مقدمه و مقایسه ای بر داده کاوی و تحلیل ...Introducing to Datamining vs. OLAP -  مقدمه و مقایسه ای بر داده کاوی و تحلیل ...
Introducing to Datamining vs. OLAP - مقدمه و مقایسه ای بر داده کاوی و تحلیل ...
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Sap business objects interview questions
Sap business objects interview questionsSap business objects interview questions
Sap business objects interview questions
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Quantopix analytics system (qas)
Quantopix analytics system (qas)Quantopix analytics system (qas)
Quantopix analytics system (qas)
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Datastage free tutorial
Datastage free tutorialDatastage free tutorial
Datastage free tutorial
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 

Ähnlich wie Integration Patterns for Big Data Applications

What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionEtu Solution
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoJusto Hidalgo
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDATAVERSITY
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)Amazon Web Services
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 

Ähnlich wie Integration Patterns for Big Data Applications (20)

What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Legacy Migration Overview
Legacy Migration OverviewLegacy Migration Overview
Legacy Migration Overview
 
Legacy Migration
Legacy MigrationLegacy Migration
Legacy Migration
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by Denodo
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 

Kürzlich hochgeladen

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Kürzlich hochgeladen (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Integration Patterns for Big Data Applications

  • 1. Big Data Integration Patterns Michael Häusler Jun 12, 2017
  • 2. The social network gives scientists new tools to connect, collaborate, and keep up with the research that matters most to them. ResearchGate is built for scientists.
  • 3. Our mission is to connect the world of science and make research open to all.
  • 5. operational systems end users 12+ M scientists “Analytics” Cluster (batch) “Live” Cluster (near realtime) internal users HBase replication transactional load (HBase reads / writes) continuous updates (Flink streaming results) batch updates (MR / Hive / Flink results) data ingestion Big Data
  • 6. 65+ 370+ 3,000+ Engineers Data Ingestion Jobs per Day Big Data Yarn Applications per Day
  • 7. 65+ 3,000+ Engineers Yarn Applications per Day Developer Productivity Ease of Maintenance Ease of Operations
  • 8. Big Data Architecture Integration Patterns & Pricinicples
  • 9. Patterns & Principles Integration patterns should be strategic, but also ... should be driven by use cases should tackle real world pain points should not be dictated by a single technology
  • 10. Patterns & Principles Big data is still a fast moving space Big data batch processing today is quite different compared to 5 years ago Big data stream processing is evolving heavily right now Big data architecture must evolve over time
  • 11. First Big Data Use Case Early 2011, Author Analysis
  • 12. Author Analysis – Clustering and Disambiguation
  • 13. Author Analysis – High Product Impact
  • 14. Enriching User Generated Content Users and batch flows continuously enrich an evolving dataset Both user actions and batch flow results ultimately affect the same live database Users Live Database Batch flow
  • 15. Bibliographic Metadata – Data Model Author Asset Derivative Publication Journal Institution Department Account Affiliation Citation Authorship Publication Link Affiliation Claiming
  • 16. community publication service asset service Bibliographic Metadata – Services Author Asset Derivative Publication Journal Institution Department Account Affiliation Citation Authorship Publication Link Affiliation Claiming
  • 18. Implementation publication service author analysis community data sources data ingestion input data data processing intermediate results final results export of results differ
  • 19. #1 Decouple Data Ingestion
  • 21. Debugging an Error on Production Your flow has unit and integrations tests but still breaks unexpectedly in production You need to find the root cause Is it a change in input data? Is it a change on the cluster? Is it a race condidition? Crucial capabilities Easy adhoc analysis of all involved data (input, intermediate, result) Rerun current flow with current cluster configuration on yesterday’s data Confirm hotfix by re-running on today’s data (exactly the same data that triggered the bug)
  • 23. Ingesting Data as Needed?
  • 26. Platform Data Import ... Hive... Adhoc Analytics
  • 27. Platform Data Import Dedicated component, but generic Every team can onboard new data sources, as required by use cases Every ingested source is immediately available for all consumers (incl. analytics) Feature parity for all data sources (e.g., mounting everything in Hive)
  • 28. #2 Speak a common format* * have at least one copy of all data in a common format (e.g., avro)
  • 29. Formats Text SequenceFiles Avro ORC X X + schema evolution self describing reflect datum reader flexible for batch & streaming columnar great for batch
  • 30. Speak a common format Have at least one copy of all data in a common format Your choice of processing framework should not be limited by format of existing data Every ingested source should be available for all consumers When optimizing for a framework (e.g., ORC for Hive) consider a copy
  • 31. #3 Speak a common language* * continuously propagate schema changes
  • 32. Structured or unstructured data? ... ... mongodb (schemaless) service knows structure postgres (schema)
  • 33. Data Warehouse vs. Data Lake ... ... data lake assume no schema (defer schema to consumer) data warehouse enforce schema at ingestion (schema on write) X X
  • 34. Can we have both? Preserve schema information that is already present some times at database level many times at application level Preserve full data – be truthful to our data source continuously propagate schema changes Can we have something like a Data Lakehouse?
  • 35. Entities Define Schema Code first entities within owning service define schema Auto conversion preferred conversion to other representations via annotations (JSON, BSON, Avro, ...)
  • 36. Continuously propagate schema changes Data ingestion process is generic and driven by avro schema Changes in avro schema are continuously propagated to data ingestion process Consumers with old schema can still read data due to avro schema evolution Caveat: breaking changes still have to be dealt with by a change process Everyone speaks the same language
  • 37. Extra Benefit service and batch processor can share business logic
  • 38. #4 Model Data Dependencies Explicitly
  • 40. Model Data Dependencies Explicitly memento publish poll 1 2 3 4
  • 41. Memento v2 memento publish unique artifactId memento poll <waiting-time>
  • 42. Model Data Dependencies Explicitly More flexible scheduling – run flows as early as possible Allows multiple ingestion or processing attempts Allows immutable data (repeatable read) Allows analysis of dependency graph which datasets are used by what flow
  • 43. #5 Decouple export of results
  • 46. Push results via HTTP to service Export of results just becomes a client of the service service does not have to be aware of big data technologies Service can validate results, e.g., plausibility checks optimistic locking Makes testing much easier
  • 47. Avro → Http Part of the flow, but standardized component Handles tracking of progress treats input file as a “queue” converts records to http calls can be interrupted and resumed anytime Sends standardized headers, e.g., X-rg-client-id: author-analysis Handles backpressure signals from services
  • 48. #6 Model Flow Orchestration Explicitly
  • 49. Model Flow Orchestration Explicitly Consider using an execution system like Azkaban, Luigi, or Airflow Establish coding standards for orchestration, e.g., inject paths from outside – don’t construct them in your flow inject calculation dates – never call now() inject configuration settings – don’t hardcode -D mapreduce.map.java.opts=-Xmx4096m foresee environment specific settings Think about ease of operations tuning of settings upgrades
  • 50. What about Stream Processing?
  • 51. Sources of Streaming Data entity conveyor timeseries data non-timeseries data (e.g., graph data) kafka kafka
  • 52. Stream Processing stream processor mqcom #2 Speak a common format #5 Decouple export of results #1 Decouple data ingestion #3 Speak a common language #6 Model flow execution explicitly
  • 53. What about #4 ? Model Data Dependencies Explicitly We think about it Depends on use cases and pain points Potentially put Kafka topics into Memento storing “offsets of interest” from producers facilitate switching between incompatible versions of stream processors
  • 54. Evolving Big Data Architecture stream processor batch processor
  • 55. Thank you! Michael Häusler, Head of Engineering https://www.researchgate.net/profile/Michael_Haeusler https://www.researchgate.net/careers