www.scling.com
DataOps in practice - Swedish style
Lars Albertsson (@lalleal)
Scling
Who’s talking?
...
Google - video conference, engineering productivity
...
Spotify - data engineering
...
Independent data engineering consultant
Banks, media, startups, heavy industry, telco
Founder @ Scling - data-value-as-a-service
Contents
Journey to DataOps
Experiences that shaped my data engineering
IMHO principles of successful DataOps
Toolbox
● Spotify information is old history
● Previously published
● Today is very different
Spotify data 2007-2013
● Hadoop installed 2007
● Use cases: reporting, insights, recommendations
● Cultural aspects:
○ Autonomous teams
○ Eliminate waste
○ Learn and adapt
Traditional systems
Mutation
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
Data lake
Transformation
Cold store
Mutation
Immutable, shareable
Early Hadoop:
● Weak indexing
● No transactions
● Weak security
● Batch transformations
DataOps workflows:
● Immutable, shared data
● Resilient to failure
● Quick error recovery
● Low-risk experiments
Data factories
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
Wrong conclusion, every day
● Downward trend every day!
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
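The reporting-lag effect behind these graphs can be illustrated with a minimal sketch. It assumes each record carries both the date of death and the date the death was reported; that record layout and the name `counts_at_lag` are hypothetical, not taken from the talk. Counting only reports that arrived within a fixed number of days makes recent and older days comparable, so the latest days no longer look like a downward trend.

```python
from collections import Counter
from datetime import date, timedelta


def counts_at_lag(records, lag_days):
    """Count deaths per day using only reports that arrived within lag_days.

    records: iterable of (death_date, report_date) pairs (hypothetical layout).
    """
    max_lag = timedelta(days=lag_days)
    counts = Counter()
    for death_date, report_date in records:
        # Ignore late reports so every day is measured at the same maturity.
        if report_date - death_date <= max_lag:
            counts[death_date] += 1
    return dict(counts)


records = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 6)),
    (date(2020, 4, 1), date(2020, 4, 9)),
    (date(2020, 4, 5), date(2020, 4, 6)),
    (date(2020, 4, 5), date(2020, 4, 7)),
]
normalised = counts_at_lag(records, lag_days=2)
```

With raw counts, 1 April shows three deaths against two for 5 April, although 5 April has simply had less time to collect reports. At an equal two-day lag the comparison is one against two.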
From craft to process
● Multiple time windows
● Assess ingress data quality
● Repair broken data from complementary source
● Intermediate datasets, reusable between pipelines
● Forecast based on history, multiple parameter settings
● Assess outcome data quality
● Assess forecast success, adapt parameters
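The "assess ingress quality, then repair from a complementary source" step might look roughly like the sketch below. The function name `repair_from_complement` and the dictionary layout are assumptions made for illustration. Keys that neither source can fill are reported instead of silently dropped.

```python
def repair_from_complement(primary, complement):
    """Fill missing (None) values in primary from a complementary source.

    Both arguments map a key (e.g. a date string) to a value or None.
    Returns the repaired mapping and the list of keys that stayed missing.
    """
    repaired = {}
    unresolved = []
    for key, value in primary.items():
        if value is None:
            # Fall back to the complementary source for broken records.
            value = complement.get(key)
        if value is None:
            unresolved.append(key)
        repaired[key] = value
    return repaired, unresolved


primary = {"2020-04-01": 12, "2020-04-02": None, "2020-04-03": None}
complement = {"2020-04-02": 9}
repaired, unresolved = repair_from_complement(primary, complement)
```

The `unresolved` list is the kind of signal an ingress quality assessment could publish for monitoring.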
Towards sustainable production ML
● Multiple models, parameters, features
● Assess ingress data quality
● Repair broken data from complementary source
● Choose model and parameters based on performance and input data
● Benchmark models
● Try multiple models, measure, A/B test
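Choosing a model from benchmark results, with a simple model to fall back on, could be sketched as follows. The function `choose_model`, the model names and the score threshold are all hypothetical. The only idea taken from the slide is that selection is driven by measured performance.

```python
def choose_model(benchmarks, fallback, min_score):
    """Pick the best-scoring model, or the fallback if none is good enough.

    benchmarks: dict mapping model name to a score where higher is better
    (hypothetical metric).
    """
    if not benchmarks:
        return fallback
    best_name = max(benchmarks, key=benchmarks.get)
    if benchmarks[best_name] < min_score:
        # No candidate is trustworthy: revert to the simpler model.
        return fallback
    return best_name


chosen = choose_model({"gradient_boost": 0.81, "linear": 0.74}, "moving_average", 0.7)
reverted = choose_model({"gradient_boost": 0.42}, "moving_average", 0.7)
```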
Risky operations
"How do I test the pipeline?"
"You temporarily change the output path and run manually."
"What if I forget to change the path?"
"Don’t do that."
2013
● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1)
● Folklore development cycle & operations
● Unsatisfied needs in other teams
luigid
Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop.
On-prem Hadoop production
Worker
10 * * * * luigi --module mymodule MyDaily
23 * * * * luigi --module other OtherDaily
Master
Executor
Worker
HDFS metadata
Data
Control
(+data)
Submit job
10 * ...
23 * ...
Ghost in the cluster
● Jobs were deployed with Debian packages + Puppet on pet machines.
○ Multiple pets for redundancy. Race to run job.
● "This monitor daemon is at 100%. Since 6 months. I'll kill it."
● "Data is wrong. But we fixed this bug 6 months ago?!?"
Start of a DataOps journey
Stateful → Stateless
Pets → Cattle
Folklore → Golden path
Test in prod → Local test, CI/CD
Weeks to learn → New pipeline < 1 day
Days to mend → Bug fix < 1 hour
On-prem pipeline deployment pipeline
source repo: Luigi DSL, jars, config
Standard deployment artifact: my-pipe-7.tar.gz
Standard artifact store
> pip install my-pipe-7.tar.gz
All that a pipeline needs, installed atomically
Luigi daemon
Workers
Redundant cron schedule, higher frequency
10 * * * * luigi --module mymodule MyDaily
Principle: Functional pipelines
● Raw source of truth + data refinement factory
● Immutable datasets & artifacts
● Deterministic, idempotent, reproducible deployment & processing
● Key success factor: workflow orchestration
○ Oozie, Rambo, Builder, Builder2, Luigi
○ Key properties:
1. Pure Python
2. Simplicity
3. All the features it lacks
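Idempotent processing is what makes a redundant, high-frequency cron schedule safe: a run that finds its output already present does nothing. A minimal sketch of that idea in plain Python follows. It is not Luigi's implementation, and `run_once` and the file names are hypothetical. Output is written to a temporary file and renamed into place, so readers never see a partial dataset.

```python
import os
import tempfile


def run_once(output_path, produce):
    """Run produce() only if output_path is missing; publish atomically.

    Returns True if this call produced the dataset, False if it already existed.
    """
    if os.path.exists(output_path):
        return False
    directory = os.path.dirname(output_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write(produce())
    # Rename within the same directory; atomic on POSIX file systems.
    os.replace(tmp_path, output_path)
    return True


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "daily-2013-05-01.txt")
first = run_once(target, lambda: "42 rows\n")
second = run_once(target, lambda: "different\n")
with open(target) as f:
    content = f.read()
```

If two workers race past the existence check, both produce the dataset and the last rename wins. With a deterministic job the content is the same either way.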
Big data - a collaboration paradigm
Stream storage
Data lake
Data democratised
Enabling teams
● Technically
○ Data available
○ Reusable QA
● Operationally
○ Continuous deployment
○ Hands off operations
○ Monitoring, debugging
● Bottom-up innovation
"The actual work that went into Discover Weekly was very little, because we're reusing things we already had."
https://youtu.be/A259Yo8hBRs
https://youtu.be/ZcmJxli8WS8
Principle: Small scope components
● Do one thing well. Less is more.
● Complex systems from replaceable bricks
○ Cloud/OSS over enterprise vendors
○ Simplicity over features
Solvable challenge
~2000 lines of code
Perpetual complexity
Cloud native deployment
source repo: Luigi DSL, jars, config
Docker image: my-pipe:7
Docker registry
Luigi daemon
Workers
Redundant cron schedule, higher frequency
kind: CronJob
spec:
schedule: "10 * * * *"
command: "luigi --module mymodule MyDaily"
S3 / GCS
Dataproc / EMR
Data platform gravitation
● Hadoop all the things.
● Data is there. Simple test, simple deploy, simple ops.
● Autonomous teams - no mandate. Natural gravity.
Data integration timescales
Online
● SOA / microservices
● Synchronous RPC
● 1-100 ms
Nearline
● Stream storage
● Asynchronous event processing
● 10 ms - 1 hour
Offline
● File storage
● Asynchronous batch processing
● 1 minute -
Operational manoeuvres - online
Upgrade
● Careful rollout
● Risk of user impact
● Proactive QA
Service failure
● User impact
● Data loss
● Cascading outage
Bug
● User impact
● Data corruption
● Cascading corruption
Operational manoeuvres - offline
Upgrade
● Instant rollout
● No user impact
● Reactive QA
Service failure
● Pipeline delay
● No data loss
● No downstream impact
Bug
● Temporary data corruption
● Downstream impact
Life of an error, batch pipelines
● Faulty job, emits bad data
1. Revert serving datasets to old
2. Fix bug
3. Remove faulty datasets
4. Backfill is automatic (Luigi)
Done!
● Low cost of error
○ Reactive QA
○ Production environment sufficient
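The recovery steps above rely on the orchestrator re-creating any dataset that is missing. By default Luigi treats a task as complete when its output exists, so removing faulty datasets is enough to trigger a rebuild. The sketch below mimics that decision in plain Python, with `missing_partitions` as a hypothetical helper rather than a Luigi API.

```python
from datetime import date, timedelta


def missing_partitions(existing, start, end):
    """Return the dates in [start, end] that have no dataset yet."""
    days = (end - start).days + 1
    wanted = [start + timedelta(days=i) for i in range(days)]
    return [d for d in wanted if d not in existing]


existing = {date(2013, 5, 1), date(2013, 5, 2), date(2013, 5, 3), date(2013, 5, 4)}
# Step 3: remove the faulty datasets.
existing -= {date(2013, 5, 2), date(2013, 5, 3)}
# Step 4: the next scheduled run finds the gaps and backfills them.
to_backfill = missing_partitions(existing, date(2013, 5, 1), date(2013, 5, 4))
```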
Production critical upgrade
● Dual datasets during transition
● Run downstream parallel pipelines
○ Cheap
○ Low risk
○ Easy rollback
● Testable end-to-end
No dev & staging environment needed!
∆?
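The ∆ between the outputs of the old and new pipeline versions could be computed along these lines. This is a sketch under assumptions: `dataset_delta`, the row layout and the key function are hypothetical, and real datasets would be compared with the processing framework rather than in memory.

```python
def dataset_delta(old_rows, new_rows, key):
    """Compare two dataset versions keyed by key(row)."""
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_rows = [{"user": "a", "plays": 3}, {"user": "b", "plays": 5}]
new_rows = [
    {"user": "a", "plays": 3},
    {"user": "b", "plays": 6},
    {"user": "c", "plays": 1},
]
delta = dataset_delta(old_rows, new_rows, key=lambda r: r["user"])
```

An empty delta, or one that contains only the intended changes, is the evidence needed before switching downstream consumers to the new dataset.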
Operational manoeuvres - nearline
Upgrade
● Swift rollout
● Parallel pipelines
● User impact, QA?
Service failure
● Pipeline delay
● No data loss
● Downstream impact?
Bug
● Data corruption
● Downstream impact
Life of an error, streaming
Reprocessing in Kafka Streams
● Works for a single job, not pipeline. :-(
Data processing tradeoff
Data speed vs. innovation speed
Online, Nearline, Offline
Separating online & offline
● Daily user DB dump. Cassandra can handle the load.
○ Load spike became 25 h long…
● New recommendation model! Cassandra can replicate to all regions.
○ Who saturated the Atlantic link?
● Batch jobs saturate one resource.
○ Bad neighbours.
Batch offline vs online
Raw
Fraud service
Fraud model
Orders
Replication / Backup
Standard procedures
Lightweight procedures
● QA driven by internal efficiency
● Continuous deployment
● New pipeline < 1 day
● Upgrade < 1 hour
● Bug recovery < 1 hour
Careful handover
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
Testing single batch job
Standard Scalatest harness
1. Generate input with f(): file://test_input/
2. Run job in local mode
3. Verify output with p(): file://test_output/
Runs well in CI / from IDE
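The three steps can be sketched as follows. The slide uses a Scalatest harness; this sketch swaps in plain Python, and the job `count_by_country` is a hypothetical stand-in for the job under test. Input and output live in a temporary directory, so the test never touches production paths.

```python
import json
import os
import tempfile


def count_by_country(input_path, output_path):
    """Job under test (hypothetical): count users per country."""
    counts = {}
    with open(input_path) as infile:
        for line in infile:
            country = json.loads(line)["country"]
            counts[country] = counts.get(country, 0) + 1
    with open(output_path, "w") as outfile:
        json.dump(counts, outfile, sort_keys=True)


def run_job_test():
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, "test_input.jsonl")
        output_path = os.path.join(workdir, "test_output.json")
        # 1. Generate input
        users = [
            {"id": 1, "country": "SE"},
            {"id": 2, "country": "SE"},
            {"id": 3, "country": "NO"},
        ]
        with open(input_path, "w") as infile:
            for user in users:
                infile.write(json.dumps(user) + "\n")
        # 2. Run in local mode
        count_by_country(input_path, output_path)
        # 3. Verify output
        with open(output_path) as outfile:
            return json.load(outfile)


result = run_job_test()
```

Because the job takes its paths as parameters, there is no output path to change by hand and no path to forget to change back.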
Testing batch pipelines - two options
A: Test job with sequence of jobs, in a standard Scalatest harness
1. Generate input with f(): file://test_input/
2. Run custom multi-job
3. Verify output with p(): file://test_output/
B: Customised workflow manager setup, with f() and p()
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
Hadoop / Spark counters DB
Standard graphing tools
Standard alerting service
Measuring correctness: counters
● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...
case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)
// Counts orders whose user is missing from the user dataset.
val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      // Odd code path: bump the counter and drop the order.
      orderNoUserCounter.add(1)
      None
  }
SQL: Nope
Data quality - high code vs low code
● 2013: Python MapReduce outdated
● Hive/SQL?
○ Not expressive enough
○ Data quality challenging
● Technical platform + multi-skilled teams!
○ Strong development processes
Low code / no code platform? Technical platform?
Measuring consistency: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
● Dedicated quality assessment pipelines
DB
Quality assessment job
Quality metadataset (tiny)
Standard graphing tools
Standard alerting service
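A dedicated quality assessment job reads a full dataset and emits a tiny quality metadataset that graphing and alerting tools can consume. The sketch below shows the shape of such a job. The metric names, `assess_quality`, `should_alert` and the threshold are assumptions for illustration.

```python
def assess_quality(rows, required_fields):
    """Summarise a dataset into one small quality record."""
    total = len(rows)
    incomplete = sum(
        1 for row in rows if any(row.get(f) is None for f in required_fields)
    )
    return {
        "row_count": total,
        "incomplete_rows": incomplete,
        "incomplete_ratio": incomplete / total if total else 0.0,
    }


def should_alert(quality, max_incomplete_ratio):
    """Alert on an empty dataset or too many incomplete rows."""
    return quality["row_count"] == 0 or quality["incomplete_ratio"] > max_incomplete_ratio


rows = [
    {"order": 1, "user": "a"},
    {"order": 2, "user": None},
    {"order": 3, "user": "c"},
    {"order": 4, "user": "d"},
]
quality = assess_quality(rows, ["order", "user"])
alert = should_alert(quality, max_incomplete_ratio=0.1)
```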
Machine learning operations, simplified
● Multiple trained models
○ Select at run time
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Ready to revert to
○ old models
○ simpler models
Measure interactions
Rendezvous
DB
Standard alerting service
Stream
Job
"The required surrounding
infrastructure is vast and
complex."
- Google
Not all things went well
● Autonomy → excessive heterogeneity
○ 25 ways to store a timestamp?
● Pipeline end-to-end tests
○ Culturally challenging
○ → difficult to change & retire pipelines
● Trial and error to learn
Data engineering in Scandinavia
● Stockholm region ranks 2nd in unicorns / capita
○ Media, games, fintech
● Critical mass of world class data engineering
○ Limited to a few companies
Mission: Spread data & AI superpowers
● There are companies to help
● Data & AI capabilities require culture & process change
○ Slow, very slow
Scandinavian minimalist design
● Lean, simple technology - focus on flow and business value
● Bonnier News data platform, 4-5 persons:
○ Zero to happy customer in 3 weeks.
○ Dozens of ROI pipelines in 8 months.
● Scling retail client, 1-3 persons, after 1 year:
○ 40 sources, 70 pipelines, 200 egress points
○ 3,400 datasets / day
● Typical enterprise numbers
○ Big data project: 6-24 months
○ Analytics department: 100-1000 datasets / day
○ Spotify: 100,000+ datasets / day
○ Google: 1.6B datasets / day (2016)
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses

Data ops in practice - Swedish style

  • 1. www.scling.com DataOps in practice - Swedish style Lars Albertsson (@lalleal) Scling 1
  • 2. www.scling.com Who’s talking? ... Google - video conference, engineering productivity ... Spotify - data engineering ... Independent data engineering consultant Banks, media, startups, heavy industry, telco Founder @ Scling - data-value-as-a-service 2
  • 3. www.scling.com Contents Journey to DataOps Experiences that shaped my data engineering IMHO principles of successful DataOps Toolbox 3 ● Spotify information is old history ● Previously published ● Today is very different
  • 4. www.scling.com Spotify data 2007-2013 ● Hadoop installed 2007 ● Use cases: reporting, insights, recommendations ● Cultural aspects: ○ Autonomous teams ○ Eliminate waste ○ Learn and adapt 4
  • 5. www.scling.com Traditional systems 5 Mutation Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations
  • 6. www.scling.com Data lake Transformation Cold store 6 Mutation Immutable, shareable Early Hadoop: ● Weak indexing ● No transactions ● Weak security ● Batch transformations DataOps workflows: ● Immutable, shared data ● Resilient to failure ● Quick error recovery ● Low-risk experiments Data factories
  • 7. www.scling.com What conclusion from this graph? COVID-19 fatalities / day in Sweden 7
  • 8. www.scling.com Wrong conclusion, every day ● Downward trend every day! 8
  • 9. www.scling.com Normalise data collection to compare 9Graph by Adam Altmejd, @adamaltmejd
  • 10. www.scling.com Normalise data collection to compare 10Graph by Adam Altmejd, @adamaltmejd
  • 11. www.scling.com Forecast for analytics with fresh data 11Graph by Adam Altmejd, @adamaltmejd
  • 13. www.scling.com From craft to process 13 Multiple time windows
  • 14. www.scling.com From craft to process 14 Multiple time windows Assess ingress data quality
  • 15. www.scling.com From craft to process 15 Multiple time windows Assess ingress data quality Assess outcome data quality
  • 16. www.scling.com From craft to process 16 Multiple time windows Assess ingress data quality Assess outcome data quality Repair broken data Intermediate datasets, reusable between pipelines
  • 17. www.scling.com From craft to process 17 Multiple time windows Assess ingress data quality Repair broken data from complementary source Assess outcome data quality
  • 18. www.scling.com From craft to process 18 Multiple time windows Assess ingress data quality Repair broken data from complementary source Forecast based on history Assess outcome data quality
  • 19. www.scling.com From craft to process 19 Multiple time windows Assess ingress data quality Repair broken data from complementary source Forecast based on history, multiple parameter settings Assess outcome data quality
  • 20. www.scling.com From craft to process 20 Multiple time windows Assess ingress data quality Repair broken data from complementary source Forecast based on history, multiple parameter settings Assess outcome data quality Assess forecast success, adapt parameters
  • 21. www.scling.com Towards sustainable production ML 21 Multiple models, parameters, features Assess ingress data quality Repair broken data from complementary source Choose model and parameters based on performance and input data Benchmark models Try multiple models, measure, A/B test
  • 22. Risky operations ● "How do I test the pipeline?" ● "You temporarily change the output path and run manually." ● Don’t do that. ● "What if I forget to change the path?"
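One way to avoid the "temporarily change the output path" routine is to make the output root a parameter of the job, so a test run can never touch production data. A minimal sketch, assuming hypothetical names (`run_daily_job`, `orders_clean`) and a toy transformation:

```python
import tempfile
from pathlib import Path

def output_path(root, dataset, day):
    """Build the partition path from an explicit root instead of a hard-coded one."""
    return Path(root) / dataset / f"date={day}" / "part-0.txt"

def run_daily_job(root, day, records):
    """Write one dataset partition under root; the job never names a production path."""
    out = output_path(root, "orders_clean", day)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(r.strip().lower() for r in records))
    return out

def run_in_sandbox(day, records):
    """Run the same job code against a throwaway directory and return its output lines."""
    with tempfile.TemporaryDirectory() as sandbox:
        out = run_daily_job(sandbox, day, records)
        return out.read_text().splitlines()
```

Production and test then differ only in the `root` argument, which removes the manual step that can be forgotten.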
  • 23. 2013 ● Teams: Analytics computation (AC), data collection (DC), recommend, reporting (1) ● Folklore development cycle & operations ● Unsatisfied needs in other teams
  • 24. On-prem Hadoop production ● Redundant "edge nodes" with Luigi workers, scheduled with cron. Compute + data in Hadoop. ● Crontab on each worker: ○ 10 * * * * luigi --module mymodule MyDaily ○ 23 * * * * luigi --module other OtherDaily ● Diagram labels: luigid, Worker, Master, Executor, HDFS, metadata, Data, Control (+data), Submit job
  • 25. Ghost in the cluster ● Jobs were deployed with Debian packages + Puppet on pet machines. ○ Multiple pets for redundancy. Race to run job. ● "This monitor daemon is at 100%. Since 6 months. I'll kill it." ● "Data is wrong. But we fixed this bug 6 months ago?!?"
  • 26. Start of a DataOps journey ● Stateful → Stateless ● Pets → Cattle ● Folklore → Golden path ● Test in prod → Local test, CI/CD ● Weeks to learn → New pipeline < 1 day ● Days to mend → Bug fix < 1 hour
  • 27. On-prem pipeline deployment pipeline ● Source repo: Luigi DSL, jars, config ● Standard deployment artifact: my-pipe-7.tar.gz ● Standard artifact store ● Install on workers: pip install my-pipe-7.tar.gz ● All that a pipeline needs, installed atomically ● Redundant cron schedule, higher frequency: 10 * * * * luigi --module mymodule MyDaily ● Diagram labels: Luigi daemon, Worker (x8)
  • 28. Principle: Functional pipelines ● Raw source of truth + data refinement factory ● Immutable datasets & artifacts ● Deterministic, idempotent, reproducible deployment & processing ● Key success factor: workflow orchestration ○ Oozie, Rambo, Builder, Builder2, Luigi ○ Key properties: 1. Pure Python 2. Simplicity 3. All the features it lacks
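Redundant cron schedules only work if running the same job twice is harmless. The sketch below illustrates the idempotency idea with the standard library only; it is not Luigi's implementation, and `run_once` is a hypothetical helper. It skips work when the target dataset exists and publishes the result with an atomic rename so readers never see a half-written file.

```python
import os
import tempfile
from pathlib import Path

def run_once(target, compute):
    """Produce target unless it already exists; publish with an atomic rename."""
    target = Path(target)
    if target.exists():
        return False  # an earlier or redundant run already produced this dataset
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(compute())
    os.replace(tmp_name, target)
    return True

with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "daily" / "2020-05-01.txt"
    first = run_once(target, lambda: "result")
    second = run_once(target, lambda: "different")  # skipped, dataset is immutable
    content = target.read_text()
```

If two workers race past the existence check, both compute and the last rename wins; with deterministic processing the two results are identical, so the race is harmless.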
  • 29. Big data - a collaboration paradigm ● Stream storage ● Data lake ● Data democratised
  • 30. Enabling teams ● Technically ○ Data available ○ Reusable QA ● Operationally ○ Continuous deployment ○ Hands-off operations ○ Monitoring, debugging ● Bottom-up innovation ● "The actual work that went into Discover Weekly was very little, because we're reusing things we already had." https://youtu.be/A259Yo8hBRs https://youtu.be/ZcmJxli8WS8
  • 31. Principle: Small scope components ● Do one thing well. Less is more. ● Complex systems from replaceable bricks ○ Cloud/OSS over enterprise vendors ○ Simplicity over features ● Diagram labels: Solvable challenge, ~2000 lines of code, Perpetual complexity
  • 32. Cloud native deployment ● Source repo: Luigi DSL, jars, config ● Docker image: my-pipe:7 ● Docker registry ● Redundant cron schedule, higher frequency: kind: CronJob spec: schedule: "10 * * * *" command: "luigi --module mymodule MyDaily" ● Diagram labels: Luigi daemon, Worker (x8), S3 / GCS, Dataproc / EMR
  • 33. Data platform gravitation ● Hadoop all the things. ● Data is there. Simple test, simple deploy, simple ops. ● Autonomous teams - no mandate. Natural gravity.
  • 34. Data integration timescales ● Nearline ○ Stream storage ○ Asynchronous event processing ○ 10 ms - 1 hour ● Offline ○ File storage ○ Asynchronous batch processing ○ 1 minute and up ● Online ○ SOA / microservices ○ Synchronous RPC ○ 1-100 ms
  • 35. Operational manoeuvres - online ● Upgrade ○ Careful rollout ○ Risk of user impact ○ Proactive QA
  • 36. Operational manoeuvres - online ● Upgrade ○ Careful rollout ○ Risk of user impact ○ Proactive QA ● Service failure ○ User impact ○ Data loss ○ Cascading outage
  • 37. Operational manoeuvres - online ● Upgrade ○ Careful rollout ○ Risk of user impact ○ Proactive QA ● Service failure ○ User impact ○ Data loss ○ Cascading outage ● Bug ○ User impact ○ Data corruption ○ Cascading corruption
  • 38. Operational manoeuvres - offline ● Upgrade ○ Instant rollout ○ No user impact ○ Reactive QA ● Service failure ○ Pipeline delay ○ No data loss ○ No downstream impact ● Bug ○ Temporary data corruption ○ Downstream impact
  • 39. Life of an error, batch pipelines ● Faulty job, emits bad data 1. Revert serving datasets to old 2. Fix bug 3. Remove faulty datasets 4. Backfill is automatic (Luigi) Done! ● Low cost of error ○ Reactive QA ○ Production environment sufficient
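Step 4 works because the scheduler rebuilds whatever datasets are missing, so removing the faulty ones is enough to trigger a backfill. The sketch below shows that idea with plain Python rather than Luigi itself; `missing_partitions` and the dates are hypothetical.

```python
from datetime import date, timedelta

def missing_partitions(existing, start, end):
    """Dates in [start, end] without a dataset; a scheduler would rebuild exactly these."""
    days = (end - start).days + 1
    wanted = [start + timedelta(days=i) for i in range(days)]
    return [d for d in wanted if d not in existing]

existing = {date(2020, 5, d) for d in range(1, 6)}
faulty = {date(2020, 5, 3), date(2020, 5, 4)}

existing -= faulty  # step 3: remove the faulty datasets
# Step 4: the next scheduled run sees the gaps and fills them with the fixed code.
to_rebuild = missing_partitions(existing, date(2020, 5, 1), date(2020, 5, 5))
```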
  • 40. Production-critical upgrade ● Dual datasets during transition ● Run downstream parallel pipelines ○ Cheap ○ Low risk ○ Easy rollback ● Testable end-to-end ● No dev & staging environment needed! ● Diagram label: ∆?
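The "∆?" on the slide suggests comparing the outputs of the old and new pipelines while both run in parallel. A minimal sketch of such a comparison, assuming the datasets can be loaded as key-to-number mappings; `diff_datasets`, the tolerance and the sample values are illustrative assumptions.

```python
def diff_datasets(old, new, tolerance=0.0):
    """Compare two keyed datasets: keys present on one side only, and keys whose values differ."""
    only_old = sorted(old.keys() - new.keys())
    only_new = sorted(new.keys() - old.keys())
    changed = sorted(
        k for k in old.keys() & new.keys() if abs(old[k] - new[k]) > tolerance
    )
    return {"only_old": only_old, "only_new": only_new, "changed": changed}

old_output = {"se": 100.0, "no": 40.0, "dk": 25.0}  # made-up values
new_output = {"se": 100.0, "no": 41.5, "fi": 10.0}
delta = diff_datasets(old_output, new_output, tolerance=0.5)
```

An empty `delta` is the signal that consumers can be switched to the new dataset; a non-empty one points at the records to inspect before rollout.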
  • 41. Operational manoeuvres - nearline ● Upgrade ○ Swift rollout ○ Parallel pipelines ○ User impact, QA? ● Service failure ○ Pipeline delay ○ No data loss ○ Downstream impact? ● Bug ○ Data corruption ○ Downstream impact
  • 42. Life of an error, streaming ● Reprocessing in Kafka Streams ● Works for a single job, not pipeline. :-(
  • 43. Data processing tradeoff ● Data speed ● Innovation speed ● Nearline ● Offline ● Online
  • 44. Separating online & offline ● Daily user DB dump. Cassandra can handle the load. ○ Load spike became 25 h long… ● New recommendation model! Cassandra can replicate to all regions. ○ Who saturated the Atlantic link? ● Batch jobs saturate one resource. ○ Bad neighbours.
  • 45. Batch offline vs online ● Diagram labels: Raw, Fraud service, Fraud model, Orders, Replication / Backup, Standard procedures, Lightweight procedures, Careful handover ● QA driven by internal efficiency ● Continuous deployment ● New pipeline < 1 day ● Upgrade < 1 hour ● Bug recovery < 1 hour
  • 46. Data quality dimensions ● Timeliness ○ E.g. the customer engagement report was produced at the expected time ● Correctness ○ The numbers in the reports were calculated correctly ● Completeness ○ The report includes information on all customers, using all information from the whole time period ● Consistency ○ The customer summaries are all based on the same time period
  • 47. Testing single batch job ● Standard Scalatest harness 1. Generate input (file://test_input/) 2. Run job in local mode 3. Verify output (file://test_output/) ● Diagram labels: Job, f(), p() ● Runs well in CI / from IDE
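The slide describes the harness with Scalatest; the same three steps are sketched here in Python for illustration. `count_orders_per_user` is a hypothetical job under test, and the file names and sample rows are assumptions.

```python
import tempfile
from pathlib import Path

def count_orders_per_user(input_dir, output_dir):
    """Hypothetical job under test: reads 'user,item' lines, writes 'user,count' lines."""
    counts = {}
    for path in sorted(Path(input_dir).glob("*.csv")):
        for line in path.read_text().splitlines():
            user, _item = line.split(",")
            counts[user] = counts.get(user, 0) + 1
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    lines = [f"{user},{n}" for user, n in sorted(counts.items())]
    (Path(output_dir) / "part-0.csv").write_text("\n".join(lines))

def run_job_test():
    with tempfile.TemporaryDirectory() as root:
        test_input = Path(root) / "test_input"
        test_output = Path(root) / "test_output"
        test_input.mkdir()
        # 1. Generate input
        (test_input / "orders.csv").write_text("alice,book\nbob,pen\nalice,lamp")
        # 2. Run the job locally, on local files
        count_orders_per_user(test_input, test_output)
        # 3. Verify output (returned here so the caller can assert on it)
        return (test_output / "part-0.csv").read_text().splitlines()
```

Because the job only sees directories, the same test runs unchanged in CI and from an IDE.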
  • 48. Testing batch pipelines - two options ● A: Test job with sequence of jobs ○ Standard Scalatest harness 1. Generate input (file://test_input/) 2. Run custom multi-job 3. Verify output (file://test_output/) ● B: Customised workflow manager setup ● Diagram labels: f(), p()
  • 49. Monitoring timeliness, examples ● Datamon - Spotify internal ● Twitter Ambrose (dead?) ● Airflow
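The tools listed are not described further in the slides, but the underlying timeliness check is simple: a dataset is late if its deadline has passed and it has not arrived. A minimal sketch under that assumption, with hypothetical dataset names and times:

```python
from datetime import datetime

def late_datasets(expected, arrivals, now):
    """expected: {name: deadline}; arrivals: {name: arrival time}.
    Returns the datasets whose deadline has passed without any arrival."""
    return sorted(
        name
        for name, deadline in expected.items()
        if now > deadline and name not in arrivals
    )

now = datetime(2020, 5, 2, 9, 0)
expected = {
    "engagement_report": datetime(2020, 5, 2, 6, 0),
    "orders_clean": datetime(2020, 5, 2, 12, 0),  # deadline not reached yet
    "users": datetime(2020, 5, 2, 3, 0),
}
arrivals = {"users": datetime(2020, 5, 2, 2, 30)}
late = late_datasets(expected, arrivals, now)
```

The result can be fed to a standard alerting service, in the same spirit as the counters on the following slides.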
  • 50. Measuring correctness: counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Diagram: Hadoop / Spark counters → DB → Standard graphing tools, Standard alerting service
  • 51. Measuring correctness: counters ● User-defined ● Technical from framework ○ Execution time ○ Memory consumption ○ Data volumes ○ ... ● SQL: Nope ● Code on slide:
      // Slide pseudocode: read, C[...] and longAccumulator stand in for the processing framework's API.
      case class Order(item: ItemId, userId: UserId)
      case class User(id: UserId, country: String)

      val orders = read(orderPath)
      val users = read(userPath)
      val orderNoUserCounter = longAccumulator("order-no-user")

      // Left join keeps every order, with None where no user matches.
      val joined: C[(Order, Option[User])] = orders
        .groupBy(_.userId)
        .leftJoin(users.groupBy(_.id))
        .values

      val orderWithUser: C[(Order, User)] = joined
        .flatMap {
          case (order, Some(user)) => Some((order, user))
          case (order, None) =>
            orderNoUserCounter.add(1) // odd code path => bump counter
            None
        }
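For readers more at home in Python, the same pattern (left join, bump a counter on the odd code path, drop the unmatched record) can be sketched with the standard library. This mirrors the slide's idea only; it does not use Spark, and `join_orders_with_users` and the sample records are assumptions.

```python
from collections import Counter
from typing import NamedTuple

class Order(NamedTuple):
    item: str
    user_id: str

class User(NamedTuple):
    id: str
    country: str

def join_orders_with_users(orders, users, counters):
    """Join orders to users; count orders without a matching user and drop them."""
    users_by_id = {u.id: u for u in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order.user_id)
        if user is None:
            counters["order-no-user"] += 1  # odd code path => bump counter
            continue
        joined.append((order, user))
    return joined

counters = Counter()
orders = [Order("book", "u1"), Order("pen", "u2"), Order("lamp", "u9")]
users = [User("u1", "SE"), User("u2", "NO")]
joined = join_orders_with_users(orders, users, counters)
```

Exporting `counters` after each run gives a correctness signal that can be graphed and alerted on.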
  • 52. Data quality - high code vs low code ● 2013: Python MapReduce outdated ● Hive/SQL? ○ Not expressive enough ○ Data quality challenging ● Technical platform + multi-skilled teams! ○ Strong development processes ● Low code / no code platform? Technical platform?
  • 53. Measuring consistency: pipelines ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ○ System metrics ● Dedicated quality assessment pipelines ● Diagram: DB → Quality assessment job → Quality metadataset (tiny) → Standard graphing tools, Standard alerting service
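A dedicated quality assessment job reads a dataset and emits a tiny metadataset of quality figures. The sketch below checks the consistency dimension from slide 46 (all customer summaries based on the same time period); the record layout and the function name are assumptions, and the input is assumed to be non-empty.

```python
from collections import Counter

def assess_consistency(summaries):
    """summaries: non-empty list of dicts with 'customer' and 'period' keys.
    Returns a tiny quality record suitable for graphing and alerting."""
    periods = Counter(s["period"] for s in summaries)
    majority_period, majority_count = periods.most_common(1)[0]
    return {
        "records": len(summaries),
        "distinct_periods": len(periods),
        "majority_period": majority_period,
        "inconsistent_records": len(summaries) - majority_count,
    }

summaries = [  # made-up records
    {"customer": "a", "period": "2020-04"},
    {"customer": "b", "period": "2020-04"},
    {"customer": "c", "period": "2020-03"},
]
quality = assess_consistency(summaries)
```

An alert on `inconsistent_records > 0` would flag reports that mix time periods.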
  • 54. Machine learning operations, simplified ● Multiple trained models ○ Select at run time ● Measure user behaviour ○ E.g. session length, engagement, funnel ● Ready to revert to ○ old models ○ simpler models ● Diagram labels: Measure interactions, Rendezvous DB, Standard alerting service, Stream, Job ● "The required surrounding infrastructure is vast and complex." - Google
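Selecting a model at run time from measured user behaviour, with an older or simpler model as fallback, can be sketched as follows. The metric name, the minimum sample size and the model names are hypothetical; the slides do not specify a selection rule.

```python
def select_model(metrics, baseline, min_sessions=100):
    """Pick the model with the best measured metric among those with enough data.
    Falls back to the baseline model when nothing has been measured sufficiently."""
    eligible = {
        name: m["avg_session_minutes"]
        for name, m in metrics.items()
        if m["sessions"] >= min_sessions
    }
    if not eligible:
        return baseline
    return max(eligible, key=eligible.get)

metrics = {  # made-up measurements
    "simple_v1": {"sessions": 5000, "avg_session_minutes": 21.0},
    "deep_v2": {"sessions": 4000, "avg_session_minutes": 19.5},
    "deep_v3": {"sessions": 20, "avg_session_minutes": 30.0},  # too little data
}
chosen = select_model(metrics, baseline="simple_v1")
```

Here the simpler model wins on measured behaviour, which is the "ready to revert" case from the slide.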
  • 55. Not all things went well ● Autonomy → excessive heterogeneity ○ 25 ways to store a timestamp? ● Pipeline end-to-end tests ○ Culturally challenging ○ → difficult to change & retire pipelines ● Trial and error to learn
  • 56. Data engineering in Scandinavia ● Stockholm region ranks 2nd in unicorns / capita ○ Media, games, fintech ● Critical mass of world class data engineering ○ Limited to a few companies
  • 57. Mission: Spread data & AI superpowers ● There are companies to help ● Data & AI capabilities require culture & process change ○ Slow, very slow
  • 58. Scandinavian minimalist design ● Lean, simple technology - focus on flow and business value ● Bonnier News data platform, 4-5 persons: ○ Zero to happy customer in 3 weeks. ○ Dozens of ROI pipelines in 8 months. ● Scling retail client, 1-3 persons, after 1 year: ○ 40 sources, 70 pipelines, 200 egress points ○ 3,400 datasets / day ● Typical enterprise numbers ○ Big data project: 6-24 months ○ Analytics department: 100-1000 datasets / day ○ Spotify: 100,000+ datasets / day ○ Google: 1.6B datasets / day (2016)
  • 59. Scling - data-value-as-a-service ● Data value through collaboration ● Diagram labels: Customer, Data factory, Data platform & lake, data domain expertise ● Value from data! ● www.scling.com/reading-list ● www.scling.com/presentations ● www.scling.com/courses