SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Kubernetes as
Data Platform
Riga DevOpsDays 2018-09-28
Eric Skoglund, Bonnier News
Lars Albertsson, Mimeria
1
2
3
4
5
Brand Scope Data Scope
➔ Behavioral Data
➔ Technical Data
No Content Data
Scoping the platform
Cloud Selection
6
Cloud Selection
7
The Pragmatic Choice
➔ Known to people in the dev teams
➔ New base platform for all other applications within
Bonnier News
Use Case Driven Development
➔ Use cases drive the development of the platform
➔ Focus on value and quality not on slurping in all data in the company
➔ Start with simple use cases!
8
9
FIND USE CASE
THAT PROVIDE
VALUE
NEW DATA INTO
THE PLATFORM
EVOLVE THE
PLATFORM
BASED ON
REQUIREMENTS
Use Case Driven Development
● Need data from teams
○ willing?
○ backlog?
○ collected?
○ useful?
○ extraction?
○ data governance?
○ history?
Data-centric innovation
10
A collaboration paradigm
11
Stream storage
Data lake
Data
democratised
Onboard driven by use case
12
Data lake
Data platform == collaboration platform
13
Data lake
Data platform overview
14
Data lake
Cold
store
Service
Service
Online
services
Offline
data platform
Batch
processing
Data platform overview
15
Data lake
Cold
store
Dataset
Job
Service
Service
Online
services
Offline
data platform
Batch
processing
Data platform overview
16
Data lake
Cold
store
Dataset
Pipeline
Service
Service
Online
services
Offline
data platform
Job
Batch
processing
Workflow
orchestration
Data platform overview
17
Data lake
Batch
processing
Online
services
Cold
store
Service
Data feature
Dataset
Pipeline
Service
Service
Online
services
Offline
data platform
Internal
services
Job
Life of a change, batch pipelines
18
● My pipeline, version 2!
○ Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Easy to test end-to-end
○ Upstream team can do the change
∆?
Egress target change
19
● Need output in different storage!
○ Adding egress target is easy
○ Egress target backfill is easy
● Facilitates cost limitation
○ Partially aggregate → BigQuery / Redshift
○ Limited retention in egress storage
Life of an error, batch pipelines
20
● My dataset, bad version!
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
Deployment example, on-premise
21
source
repo Luigi DSL, jars, config
my-pipe-7.tar.gz
Luigi
daemon
> pip install my-pipe-7.tar.gz
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule,
higher frequency
All that a pipeline needs, installed atomically
10 * * * * luigi --module mymodule MyDaily
Standard deployment artifact Standard artifact store
Deployment example, cloud native
22
source
repo Luigi DSL, jars, config
my-pipe:7
Luigi
daemon
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule,
higher frequency
kind: CronJob
spec:
schedule: "10 * * * *"
command: "luigi --module mymodule MyDaily"
Docker image Docker registry
S3 / GCS
Dataproc /
EMR
Deployment, one cluster less
23
source
repo Luigi DSL, jars, config
my-pipe:7
Luigi
daemon
Worker
Worker
Worker
Worker
Worker
Worker
Workerspark-submit
--master=local
Redundant cron schedule,
higher frequency
kind: CronJob
spec:
schedule: "10 * * * *"
command: "luigi --module mymodule MyDaily"
Docker image Docker registry
S3 / GCS
Continuous deployment
24
mono-
repo PR build,
affected
CI tests
mymodule/mypipe:revtag
Luigi
daemon
Worker
Worker
Worker
Worker
Worker
Worker
Workerspark-submit
--master=local
kind: CronJob
spec:
schedule: "10 * * * *"
command: "luigi --module mymodule MyDaily"
Openshift registry
S3
master
branch
pipeline tests
doc build
Some pipelines are straightforward
25
Some are twisted
26
Autoscaling
27
GDPR
Article 17.
“The data subject shall have the right to obtain from the controller the erasure of personal data concerning
him or her without undue delay and the controller shall have the obligation to erase personal data without
undue delay where one of the following grounds applies:“
➔ the personal data are no longer necessary in relation to the purposes for which they were collected
or otherwise processed - Data Retention
➔ the data subject withdraws consent on which the processing is based - Data Deletion Requests
28
GDPR
29
{
id: ….
pii: [...]
}
CREATE
KEY FOR ID
ENCRYPT PERSONAL
DATA WITH KEY
GDPR - Retention
30
{
id: ….
pii: [...]
}
CREATE
KEY FOR
ID
ENCRYPT
PERSONAL DATA
WITH KEY
➔ Each dataset has a retention time from
the owners of the data
➔ Create new keys each 30 days
➔ Destroy keys older than the retention
time
GDPR - Right to be forgotten
31
List of users
that have
requested
deletion
Find keys
for those
users
Destroy
keys
Use Cases in Use
➔ Machine Learning
◆ Built a system that tries to predict if a visitor will watch an ad in a video or not
➔ Creating Reports
◆ Daily reporting data for ad team
◆ Weekly report of ad viewing data for site team
➔ GDPR Registry Extract
◆ Collect data from multiple different sources
◆ Merge the data
◆ Send data to be viewed by the user
32
Lessons Learned
Cloud selection is influenced by data location
Most data for the use cases we started with was on Google Cloud Storage / BigQuery
incurring extra development time and cost to exfiltrate that data.
Kubernetes?
Same platform as other teams + great support from infrastructure platform team.
No Spark cluster maintenance, tweaking, debugging.
Autoscaling works, but some challenges for batch jobs.
33
Summary
Use case driven development == Short Time to Production
First pipeline in 3 weeks
Small team 2-4 People
Keep it simple
10-15 Pipelines
34

Weitere ähnliche Inhalte

Was ist angesagt?

Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j
 
empirical analysis modeling of power dissipation control in internet data ce...
 empirical analysis modeling of power dissipation control in internet data ce... empirical analysis modeling of power dissipation control in internet data ce...
empirical analysis modeling of power dissipation control in internet data ce...saadjamil31
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Institute e-Austria Timisoara
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsDatabricks
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingVianney FOUCAULT
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Taro L. Saito
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 
Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudRobert Sanders
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 

Was ist angesagt? (20)

Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Data democratised
Data democratisedData democratised
Data democratised
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
 
empirical analysis modeling of power dissipation control in internet data ce...
 empirical analysis modeling of power dissipation control in internet data ce... empirical analysis modeling of power dissipation control in internet data ce...
empirical analysis modeling of power dissipation control in internet data ce...
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
 
An Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization SystemsAn Approach to Data Quality for Netflix Personalization Systems
An Approach to Data Quality for Netflix Personalization Systems
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the Cloud
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 

Ähnlich wie Kubernetes as a Data Platform for Bonnier News

Gaming analytics on gcp
Gaming analytics on gcpGaming analytics on gcp
Gaming analytics on gcpMyunggeun Choi
 
GCP Gaming 2016 Seoul, Korea Gaming Analytics
GCP Gaming 2016 Seoul, Korea Gaming AnalyticsGCP Gaming 2016 Seoul, Korea Gaming Analytics
GCP Gaming 2016 Seoul, Korea Gaming AnalyticsChris Jang
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)bigdata trunk
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)Denodo
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and HerokuTapio Rautonen
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...DataStax Academy
 
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...Neo4j
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAlluxio, Inc.
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapNeo4j
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalAvere Systems
 
How DBAs can garner the power of the Oracle Public Cloud?
How DBAs can garner the  power of the Oracle Public  Cloud?How DBAs can garner the  power of the Oracle Public  Cloud?
How DBAs can garner the power of the Oracle Public Cloud?Gustavo Rene Antunez
 
Discover PostGIS: Add Spatial functions to PostgreSQL
Discover PostGIS: Add Spatial functions to PostgreSQLDiscover PostGIS: Add Spatial functions to PostgreSQL
Discover PostGIS: Add Spatial functions to PostgreSQLEDB
 
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024Cloud Native NoVA
 
The Fastest Way to Redis on Pivotal Cloud Foundry
The Fastest Way to Redis on Pivotal Cloud FoundryThe Fastest Way to Redis on Pivotal Cloud Foundry
The Fastest Way to Redis on Pivotal Cloud FoundryVMware Tanzu
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 

Ähnlich wie Kubernetes as a Data Platform for Bonnier News (20)

Gaming analytics on gcp
Gaming analytics on gcpGaming analytics on gcp
Gaming analytics on gcp
 
GCP Gaming 2016 Seoul, Korea Gaming Analytics
GCP Gaming 2016 Seoul, Korea Gaming AnalyticsGCP Gaming 2016 Seoul, Korea Gaming Analytics
GCP Gaming 2016 Seoul, Korea Gaming Analytics
 
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Getting started with GCP ( Google Cloud Platform)
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Introduction to PaaS and Heroku
Introduction to PaaS and HerokuIntroduction to PaaS and Heroku
Introduction to PaaS and Heroku
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
002 Introducing Neo4j 5 for Administrators - NODES2022 AMERICAS Beginner 2 - ...
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and Roadmap
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
How DBAs can garner the power of the Oracle Public Cloud?
How DBAs can garner the  power of the Oracle Public  Cloud?How DBAs can garner the  power of the Oracle Public  Cloud?
How DBAs can garner the power of the Oracle Public Cloud?
 
Discover PostGIS: Add Spatial functions to PostgreSQL
Discover PostGIS: Add Spatial functions to PostgreSQLDiscover PostGIS: Add Spatial functions to PostgreSQL
Discover PostGIS: Add Spatial functions to PostgreSQL
 
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024
A Love Story with Kubevirt and Backstage from Cloud Native NoVA meetup Feb 2024
 
The Fastest Way to Redis on Pivotal Cloud Foundry
The Fastest Way to Redis on Pivotal Cloud FoundryThe Fastest Way to Redis on Pivotal Cloud Foundry
The Fastest Way to Redis on Pivotal Cloud Foundry
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 

Mehr von Lars Albertsson

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 

Mehr von Lars Albertsson (13)

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 

Kürzlich hochgeladen

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Kürzlich hochgeladen (20)

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Kubernetes as a Data Platform for Bonnier News

  • 1. Kubernetes as Data Platform Riga DevOpsDays 2018-09-28 Eric Skoglund, Bonnier News Lars Albertsson, Mimeria 1
  • 2. 2
  • 3. 3
  • 4. 4
  • 5. 5 Brand Scope Data Scope ➔ Behavioral Data ➔ Technical Data No Content Data Scoping the platform
  • 7. Cloud Selection 7 The Pragmatic Choice ➔ Known to people in the dev teams ➔ New base platform for all other applications within Bonnier News
  • 8. Use Case Driven Development ➔ Use cases drive the development of the platform ➔ Focus on value and quality not on slurping in all data in the company ➔ Start with simple use cases! 8
  • 9. 9 FIND USE CASE THAT PROVIDE VALUE NEW DATA INTO THE PLATFORM EVOLVE THE PLATFORM BASED ON REQUIREMENTS Use Case Driven Development
  • 10. ● Need data from teams ○ willing? ○ backlog? ○ collected? ○ useful? ○ extraction? ○ data governance? ○ history? Data-centric innovation 10
  • 11. A collaboration paradigm 11 Stream storage Data lake Data democratised
  • 12. Onboard driven by use case 12 Data lake
  • 13. Data platform == collaboration platform 13 Data lake
  • 14. Data platform overview 14 Data lake Cold store Service Service Online services Offline data platform Batch processing
  • 15. Data platform overview 15 Data lake Cold store Dataset Job Service Service Online services Offline data platform Batch processing
  • 16. Data platform overview 16 Data lake Cold store Dataset Pipeline Service Service Online services Offline data platform Job Batch processing Workflow orchestration
  • 17. Data platform overview 17 Data lake Batch processing Online services Cold store Service Data feature Dataset Pipeline Service Service Online services Offline data platform Internal services Job
  • 18. Life of a change, batch pipelines 18 ● My pipeline, version 2! ○ Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Easy to test end-to-end ○ Upstream team can do the change ∆?
  • 19. Egress target change 19 ● Need output in different storage! ○ Adding egress target is easy ○ Egress target backfill is easy ● Facilitates cost limitation ○ Partially aggregate → BigQuery / Redshift ○ Limited retention in egress storage
  • 20. Life of an error, batch pipelines 20 ● My dataset, bad version! 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
  • 21. Deployment example, on-premise 21 source repo Luigi DSL, jars, config my-pipe-7.tar.gz Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency All that a pipeline needs, installed atomically 10 * * * * luigi --module mymodule MyDaily Standard deployment artifact Standard artifact store
  • 22. Deployment example, cloud native 22 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS Dataproc / EMR
  • 23. Deployment, one cluster less 23 source repo Luigi DSL, jars, config my-pipe:7 Luigi daemon Worker Worker Worker Worker Worker Worker Workerspark-submit --master=local Redundant cron schedule, higher frequency kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Docker image Docker registry S3 / GCS
  • 24. Continuous deployment 24 mono- repo PR build, affected CI tests mymodule/mypipe:revtag Luigi daemon Worker Worker Worker Worker Worker Worker Workerspark-submit --master=local kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" Openshift registry S3 master branch pipeline tests doc build
  • 25. Some pipelines are straightforward 25
  • 28. GDPR Article 17. “The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:“ ➔ the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed - Data Retention ➔ the data subject withdraws consent on which the processing is based - Data Deletion Requests 28
  • 29. GDPR 29 { id: …. pii: [...] } CREATE KEY FOR ID ENCRYPT PERSONAL DATA WITH KEY
  • 30. GDPR - Retention 30 { id: …. pii: [...] } CREATE KEY FOR ID ENCRYPT PERSONAL DATA WITH KEY ➔ Each dataset has a retention time from the owners of the data ➔ Create new keys each 30 days ➔ Destroy keys older than the retention time
  • 31. GDPR - Right to be forgotten 31 List of users that have requested deletion Find keys for those users Destroy keys
  • 32. Use Cases in Use ➔ Machine Learning ◆ Built a system that tries to predict if a visitor will watch an ad in a video or not ➔ Creating Reports ◆ Daily reporting data for ad team ◆ Weekly report of ad viewing data for site team ➔ GDPR Registry Extract ◆ Collect data from multiple different sources ◆ Merge the data ◆ Send data to be viewed by the user 32
  • 33. Lessons Learned Cloud selection is influenced by data location Most data for the use cases we started with was on Google Cloud Storage / BigQuery incurring extra development time and cost to exfiltrate that data. Kubernetes? Same platform as other teams + great support from infrastructure platform team. No Spark cluster maintenance, tweaking, debugging. Autoscaling works, but some challenges for batch jobs. 33
  • 34. Summary Use case driven development == Short Time to Production First pipeline in 3 weeks Small team 2-4 People Keep it simple 10-15 Pipelines 34