www.scling.com
Mortal Analytics - Covid-19 & the problem of data quality
Lars Albertsson (@lalleal)
Scling
Why this presentation?
● Non-goal: Argue for or against a particular strategy
○ We are already too polarised
● Goals:
○ What can go wrong with data quality?
○ What can we learn?
○ Data engineering as a solution
Imperial College: We saved the world!
https://www.bbc.com/news/health-52968523
Imperial College model predictions for Sweden
https://www.medrxiv.org/content/10.1101/2020.04.11.20062133v1.full.pdf
Model and reality
https://swprs.org/a-swiss-doctor-on-covid-19/
Imperial College model code
● Screenshots show only parts of functions...
● A couple of regression tests, but no tests validating correct functionality
● My impression: no chance of producing a high-confidence result
https://github.com/mrc-ide/covid-sim
Imperial College: bugs are not a problem
https://lockdownsceptics.org/code-review-of-fergusons-model/
Example Imperial College bug handling
https://github.com/mrc-ide/covid-sim/issues/330
Imperial College response
Bad predictions are harmful
● Each action has a health cost
○ Economic misery → social misery → health misery
○ Mental health
○ Drug / alcohol use
○ Domestic violence
● During the Ebola epidemic, 10x more people died from fear of hospitals than from Ebola
https://medium.com/@robert.munro/the-tech-communitys-response-to-ebola-44d2c8dbb5be
Ways to degrade data & analytics quality
● Deviating definitions
● Selection
● Deviating context
● Presentation
● Interpretation
● Data collection
● Data processing
● Lack of quality assessment
● Lack of quality improvement
Data engineering: add senior software engineers with production experience.
Define death
Observed Covid-19 death definitions:
● Infection confirmed, last 30 days
● Infection confirmed, any time
● Infection assumed
● Assumed cause
● Hospitalised
● Other disease complicated by Covid-19
● Excess mortality
Sweden on the rise?
https://youtu.be/4uTj96ZowCU
https://www.bbc.com/news/world-europe-53175459
https://sverigesradio.se/artikel/7503606
"New Covid-19 cases per day"
No, context is missing
Tests executed
Test positive rate
New cases
https://youtu.be/4uTj96ZowCU https://twitter.com/JacobGudiol/status/1283308826842759168 https://twitter.com/JacobGudiol/status/1283308817787293696
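The relationship between the three series on this slide can be sketched in a few lines. This is a minimal illustration, assuming weekly totals are available; the `WeekReport` type and all numbers are invented for the example and are not Swedish data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeekReport:
    """Hypothetical weekly summary; field names are assumptions for this sketch."""
    week: int
    tests_executed: int
    new_cases: int

    @property
    def positive_rate(self) -> float:
        """Share of executed tests that came back positive."""
        return self.new_cases / self.tests_executed


# Invented numbers, for illustration only.
reports = [
    WeekReport(week=20, tests_executed=30_000, new_cases=3_600),
    WeekReport(week=24, tests_executed=60_000, new_cases=6_000),
]

for r in reports:
    print(f"week {r.week}: cases={r.new_cases}, positive rate={r.positive_rate:.1%}")
```

In this invented example the case count rises while the positive rate falls, which is the kind of context a "new cases per day" chart leaves out.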
Death numbers, different views
https://twitter.com/HaraldofW/status/1270080232104624128
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
Data will confess to anything
● Absolute numbers mislead
○ Days since case x → time shift by country size
● Relative numbers mislead
○ Diluted in large countries
○ Small regions stand out
https://swprs.org/a-swiss-doctor-on-covid-19/
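A small sketch of how the same data supports opposite rankings, depending on whether absolute or relative numbers are used. The region names, populations and death counts are invented for illustration.

```python
# Invented populations and death counts, for illustration only.
regions = {
    "Largeland": {"population": 80_000_000, "deaths": 8_000},
    "Midland": {"population": 10_000_000, "deaths": 3_000},
    "Smallville": {"population": 35_000, "deaths": 40},
}


def deaths_per_100k(region: dict) -> float:
    """Relative measure: deaths per 100,000 inhabitants."""
    return region["deaths"] / region["population"] * 100_000


# Rank the same regions by absolute and by relative numbers.
by_absolute = sorted(regions, key=lambda name: regions[name]["deaths"], reverse=True)
by_relative = sorted(regions, key=lambda name: deaths_per_100k(regions[name]), reverse=True)

print("Ranked by absolute deaths:", by_absolute)
print("Ranked by deaths per 100k:", by_relative)
```

With these numbers the large country tops the absolute ranking and comes last in the relative one, while the small region stands out in the relative ranking on the basis of only 40 deaths.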
Granularity matters
● Outbreaks in regions
● Country aggregation - information loss
○ But debate assumes homogeneous countries
● Peak of Swedish outbreak
○ Major outbreak in Stockholm + surroundings
○ Rest of Sweden on par with Nordics
● Nothing is "obvious"
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
Swedish policy "obviously" terrible. Compare numbers with neighbours!
Data collection
"The last week is not complete, so it is difficult to determine if the trend continues."
https://youtu.be/4uTj96ZowCU
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-27-final.pdf
What conclusion from this graph?
COVID-19 fatalities / day in Sweden
https://www.folkhalsomyndigheten.se/
Comparing apples, oranges, bananas, ...
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
Naive data collection
● Gather the events that we have
● Put them in a database
● "Let us look at the latest data"
● You never want the latest data!
You want comparable data.
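The difference between "latest" and "comparable" data can be sketched as follows. This is a minimal illustration, assuming each fatality is stored with both its day of death and the day it was reported; the function names, the `make_records` helper and the data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def latest_counts(records, as_of):
    """Naive view: count every death reported up to `as_of`, per day of death."""
    counts = Counter()
    for death_day, reported_day in records:
        if reported_day <= as_of:
            counts[death_day] += 1
    return dict(counts)


def comparable_counts(records, as_of, lag_days):
    """Comparable view: per day of death, count only reports that arrived within
    `lag_days`, and only for days whose collection window has already closed."""
    counts = Counter()
    for death_day, reported_day in records:
        window_end = death_day + timedelta(days=lag_days)
        if window_end <= as_of and reported_day <= window_end:
            counts[death_day] += 1
    return dict(counts)


def make_records(death_day, lags):
    """Hypothetical helper: one (death_day, reported_day) pair per reporting lag."""
    return [(death_day, death_day + timedelta(days=lag)) for lag in lags]


# Invented data: both days have 4 deaths, reported 1, 2, 5 and 9 days later.
records = (make_records(date(2020, 4, 1), [1, 2, 5, 9])
           + make_records(date(2020, 4, 8), [1, 2, 5, 9]))
as_of = date(2020, 4, 11)

print(latest_counts(records, as_of))         # April 8 looks lower: reports are still arriving
print(comparable_counts(records, as_of, 2))  # same 2-day collection window for both days
```

The naive view shows an apparent decline between two identical days; the fixed-window view shows them as equal. Days too recent for their window to have closed are left out rather than shown as a dip.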
Wrong conclusion, every day
● Fatalities data as of April 6, April 15, April 19
Graph by Statistisk Opinion, @StatistiskO
Wrong conclusion, every day
● Downward trend every day!
https://www.bloomberg.com/amp/news/articles/2020-07-17/georgia-massaged-virus-data-to-reopen-then-voided-mask-orders
Normalise data collection to compare
Graph by Adam Altmejd, @adamaltmejd
Forecast for analytics with fresh data
Graph by Adam Altmejd, @adamaltmejd
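One simple way to forecast final counts from fresh, incomplete data is to scale each partial count by the share that had historically been reported at the same lag. The sketch below assumes that approach; it is not necessarily the method behind the graph, and the function names and numbers are invented.

```python
def completeness_by_lag(history, max_lag):
    """history: one list per past day, where day[k] is the cumulative count known
    k days after the day of death. day[max_lag] is treated as the final count."""
    final_total = sum(day[max_lag] for day in history)
    return [sum(day[lag] for day in history) / final_total
            for lag in range(max_lag + 1)]


def nowcast(partial_count, lag, fractions):
    """Scale a count observed `lag` days after the event up to an expected final
    count. Assumes fractions[lag] is non-zero."""
    return partial_count / fractions[lag]


# Invented history: two fully reported days, observed at lags 0..3.
history = [
    [20, 50, 80, 100],
    [10, 25, 40, 50],
]
fractions = completeness_by_lag(history, max_lag=3)

# A recent day shows 30 deaths one day after the fact.
print(fractions)
print(nowcast(30, lag=1, fractions=fractions))
```

The forecast is only as good as the assumption that reporting delays stay stable, so it should itself be assessed against the final counts once they arrive.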
Why aren't authorities doing that?
● Cost of processing data
● Manual handcraft, not industrial process
https://github.com/FohmAnalys/SEIR-model-Stockholm
We are not done processing the data yet. Since we do calculations quickly, some mistakes might happen.
Manual, mechanised, industrialised
Manual:
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised:
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised:
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
Muscle powered analytics & machine learning
● Use hand tools to
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools:
○ Excel
○ Matlab
○ Interactive SQL
○ Interactive BI tools
○ Jupyter
○ R
○ One-off Python scripts
"Dataset" - a data artifact of direct or indirect value
Mechanised analytics & machine learning
● Use machine tools to semi-automatically
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools: Muscle tools +
○ Databases
○ Data warehouses + ETL
○ Hadoop, Spark, Flink
○ Java, Scala, Python, SQL
○ Kafka
○ Similar cloud services
Datasets, produced monthly / hourly / daily / ..
From craft to process
● Multiple time windows
● Assess ingress data quality
● Repair broken data from complementary source
● Intermediate datasets, reusable between pipelines
● Forecast based on history, multiple parameter settings
● Assess outcome data quality
● Assess forecast success, adapt parameters
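The assess and repair steps can be sketched as small, testable functions. This is a minimal illustration under assumed inputs: daily counts keyed by date string, with a second source to fall back on. All names and values are hypothetical.

```python
from typing import Dict, List, Optional

Counts = Dict[str, Optional[int]]


def assess_ingress(counts: Counts) -> List[str]:
    """Return the days whose value is missing or negative."""
    return [day for day, value in counts.items() if value is None or value < 0]


def repair(primary: Counts, complementary: Counts) -> Counts:
    """Fill broken days in `primary` from `complementary` where it has a usable value."""
    repaired = dict(primary)
    for day in assess_ingress(primary):
        candidate = complementary.get(day)
        if candidate is not None and candidate >= 0:
            repaired[day] = candidate
    return repaired


def assess_outcome(counts: Counts) -> float:
    """Share of days with a usable value."""
    return (len(counts) - len(assess_ingress(counts))) / len(counts)


# Invented input: one missing value, one impossible value.
primary = {"2020-04-01": 12, "2020-04-02": None, "2020-04-03": -1, "2020-04-04": 9}
complementary = {"2020-04-02": 15, "2020-04-04": 10}

repaired = repair(primary, complementary)
print(assess_ingress(primary))   # broken days on the way in
print(repaired)                  # 2020-04-02 filled; 2020-04-03 still broken
print(assess_outcome(repaired))  # quality metric to monitor over time
```

Only broken values are replaced; valid primary values are left alone even where the complementary source disagrees. Days that cannot be repaired stay visible in the outcome metric instead of being silently dropped.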
Towards sustainable production ML
● Multiple models, parameters, features
● Assess ingress data quality
● Repair broken data from complementary source
● Choose model and parameters based on performance and input data
● Benchmark models
● Try multiple models, measure, A/B test
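Benchmarking several models and choosing by measured performance can be sketched with a simple backtest. The two forecasters, the series and the error metric below are assumptions chosen for brevity, not the models from the talk.

```python
def last_value(history):
    """Forecast the next value as the most recent observation."""
    return history[-1]


def mean_of_last(n):
    """Build a forecaster that averages the last `n` observations."""
    def forecast(history):
        window = history[-n:]
        return sum(window) / len(window)
    return forecast


def backtest(model, series, warmup):
    """Mean absolute error of one-step-ahead forecasts, each made using only
    the data available before the forecasted point."""
    errors = [abs(model(series[:i]) - series[i]) for i in range(warmup, len(series))]
    return sum(errors) / len(errors)


# Invented series, for illustration only.
series = [10, 12, 11, 13, 12, 14, 13, 15]
candidates = {"last_value": last_value, "mean_of_last_3": mean_of_last(3)}

scores = {name: backtest(model, series, warmup=3) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("chosen model:", best)
```

Running the same benchmark on every new batch of data lets the choice of model and parameters adapt without manual intervention.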
Industrialised analytics / machine learning
● Build resilient, automated processes that
○ Collect & process
○ Assess & improve quality
○ Create multiple artifacts, measure, adapt
● Typical tools: Mechanised tools +
○ Data lake
○ Workflow orchestration (Luigi, Airflow)
○ Quality assessment, monitoring
○ Testing, CI/CD
Costs down - ROI from data
● Hand-built
● Analyst team, < 10 datasets / day
● Semi-automated
● "The data team", 10-100 datasets / day
● Resilient data factory
● Every dev team, 100-1000s datasets / day per team
Spotify ~2014, 20K datasets/day
Becoming data industrialised
● Knowledge limited to leading tech companies + startups
● Change in processes & culture
○ Cf. agile, DevOps
○ Journey of many years
● Challenge is not technical
○ Can't buy a system or tool
○ Consultants can't help
Scling - data-value-as-a-service
Data value through collaboration
Customer
Data factory
Data platform & lake
data
domain expertise
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses

Weitere ähnliche Inhalte

Was ist angesagt?

Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practiceLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applicationsLars Albertsson
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup Omid Vahdaty
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Institute e-Austria Timisoara
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j
 
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Sri Ambati
 
Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureAndreas Schreiber
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache HadoopInfoFarm
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineTrieu Nguyen
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpJoseph Arriola
 
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationA Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationMichael Rainey
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integrationgluent.
 

Was ist angesagt? (20)

Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Protecting privacy in practice
Protecting privacy in practiceProtecting privacy in practice
Protecting privacy in practice
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
 
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
Monitoring in Big Data Frameworks @ Big Data Meetup, Timisoara, 2015
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
 
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
 
Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructure
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache Hadoop
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data IntegrationA Walk Through the Kimball ETL Subsystems with Oracle Data Integration
A Walk Through the Kimball ETL Subsystems with Oracle Data Integration
 
Offload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data IntegrationOffload, Transform, and Present - The New World of Data Integration
Offload, Transform, and Present - The New World of Data Integration
 

Ähnlich wie Mortal analytics - Covid-19 and the problem of data quality

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringDataRobot
 
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2Steen Rasmussen
 
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...Chris Hammerschmidt
 
Investing in ai driven startups
Investing in ai driven startupsInvesting in ai driven startups
Investing in ai driven startupsRoy Lowrance
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Mikhail Rozhkov
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceDatabricks
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning SystemsThoughtworks
 
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA Systems
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA SystemsUsing OPC-UA to Extract IIoT Time Series Data from PLC and SCADA Systems
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA SystemsInfluxData
 
India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015Kanwal Prakash Singh
 
India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015Kanwal Prakash Singh
 

Ähnlich wie Mortal analytics - Covid-19 and the problem of data quality (20)

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
Make Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature EngineeringMake Sense Out of Data with Feature Engineering
Make Sense Out of Data with Feature Engineering
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2The end of analytics as we know it   gauc 2020 - iih nordic - steen rasmussen v2
The end of analytics as we know it gauc 2020 - iih nordic - steen rasmussen v2
 
Data science guide
Data science guideData science guide
Data science guide
 
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...
Machine Learning for (DF)IR with Velociraptor: From Setting Expectations to a...
 
Investing in ai driven startups
Investing in ai driven startupsInvesting in ai driven startups
Investing in ai driven startups
 
Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning Reproducibility and experiments management in Machine Learning
Reproducibility and experiments management in Machine Learning
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Introduction to Six Sigma
Introduction to Six SigmaIntroduction to Six Sigma
Introduction to Six Sigma
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-Source
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning Systems
 
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA Systems
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA SystemsUsing OPC-UA to Extract IIoT Time Series Data from PLC and SCADA Systems
Using OPC-UA to Extract IIoT Time Series Data from PLC and SCADA Systems
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015
 
India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015India Analytics and Big Data Summit 2015
India Analytics and Big Data Summit 2015
 

Mehr von Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven productsLars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 

Mehr von Lars Albertsson (11)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 

Kürzlich hochgeladen

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Mortal analytics - Covid-19 and the problem of data quality

  • 1. Mortal Analytics - Covid 19 & the problem of data quality
      Lars Albertsson (@lalleal), Scling
      www.scling.com
  • 2. Why this presentation?
      ● Non-goal: Argue for or against a particular strategy
        ○ We are already too polarised
      ● Goals:
        ○ What can go wrong with data quality?
        ○ What can we learn?
        ○ Data engineering as a solution
  • 3. Imperial College: We saved the world!
      https://www.bbc.com/news/health-52968523
  • 4. Imperial College model predictions for Sweden
      https://www.medrxiv.org/content/10.1101/2020.04.11.20062133v1.full.pdf
  • 6. Imperial College model code
      ● Screenshots show only parts of functions...
      ● A couple of regression tests - no tests validating correct functionality
      ● My impression: No chance of producing a high-confidence result
      https://github.com/mrc-ide/covid-sim
  • 7. Imperial College: bugs are not a problem
      https://lockdownsceptics.org/code-review-of-fergusons-model/
  • 8. Example Imperial College bug handling
      https://github.com/mrc-ide/covid-sim/issues/330
      Imperial College response
  • 9. Bad predictions are harmful
      ● Each action has a health cost
        ○ Economic misery → social misery → health misery
        ○ Mental health
        ○ Drug / alcohol use
        ○ Domestic violence
      ● During Ebola pandemic, 10x more people died from fear of hospitals than from Ebola
      https://medium.com/@robert.munro/the-tech-communitys-response-to-ebola-44d2c8dbb5be
  • 10. Ways to degrade data & analytics quality
      ● Deviating definitions
      ● Selection
      ● Deviating context
      ● Presentation
      ● Interpretation
      ● Data collection
      ● Data processing
      ● Lack of quality assessment
      ● Lack of quality improvement
      Data engineering: Add senior software engineers with production experience.
  • 11. Define death
      Observed Covid-19 death definitions:
      ● Infection confirmed, last 30 days
      ● Infection confirmed, any time
      ● Infection assumed
      ● Assumed cause
      ● Hospitalised
      ● Other disease complicated by Covid-19
      ● Excess mortality
  • 12. Sweden on the rise?
      "New Covid-19 cases per day"
      https://youtu.be/4uTj96ZowCU
      https://www.bbc.com/news/world-europe-53175459
      https://sverigesradio.se/artikel/7503606
  • 13. No, context is missing
      [Graphs: Tests executed, Test positive rate, New cases]
      https://youtu.be/4uTj96ZowCU
      https://twitter.com/JacobGudiol/status/1283308826842759168
      https://twitter.com/JacobGudiol/status/1283308817787293696
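The relation between the three graphs on slide 13 can be sketched in a few lines: the positive rate is new cases divided by tests executed. The function name and the weekly figures below are made up for illustration and are not taken from the slides.

```python
def positive_rate(new_cases: int, tests_executed: int) -> float:
    """Share of executed tests that came back positive."""
    if tests_executed <= 0:
        raise ValueError("tests_executed must be positive")
    return new_cases / tests_executed


# Made-up weekly figures, for illustration only.
weeks = [
    {"week": 1, "tests": 1000, "cases": 100},
    {"week": 2, "tests": 4000, "cases": 200},
]

for w in weeks:
    rate = positive_rate(w["cases"], w["tests"])
    print(f"week {w['week']}: {w['cases']} cases, positive rate {rate:.1%}")
```

In this toy data the case count doubles while the positive rate halves, which is the kind of context a bare "new cases per day" headline leaves out.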
  • 14. Death numbers, different views
      https://twitter.com/HaraldofW/status/1270080232104624128
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
  • 15. Data will confess to anything
      ● Absolute numbers mislead
        ○ Days since case x → time shift by country size
      ● Relative numbers mislead
        ○ Diluted in large countries
        ○ Small regions stand out
      https://swprs.org/a-swiss-doctor-on-covid-19/
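A minimal sketch of how the choice between absolute and relative numbers can flip a ranking. The two regions, their figures, and the helper name `per_100k` are hypothetical.

```python
def per_100k(count: int, population: int) -> float:
    """Count per 100,000 inhabitants."""
    return count * 100_000 / population


# Hypothetical regions with made-up numbers.
regions = {
    "large_country": {"deaths": 5000, "population": 50_000_000},
    "small_region": {"deaths": 200, "population": 500_000},
}

# Worst region by absolute count.
by_absolute = max(regions, key=lambda name: regions[name]["deaths"])

# Worst region by count per 100,000 inhabitants.
by_relative = max(
    regions,
    key=lambda name: per_100k(regions[name]["deaths"], regions[name]["population"]),
)

print("worst by absolute count:", by_absolute)
print("worst by deaths per 100k:", by_relative)
```

Neither view is "the" correct one; each hides something the other shows, which is the point of the slide.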
  • 16. Granularity matters
      "Swedish policy "obviously" terrible. Compare numbers with neighbours!"
      ● Outbreaks in regions
      ● Country aggregation - information loss
        ○ But debate assumes homogeneous countries
      ● Peak of Swedish outbreak
        ○ Major outbreak in Stockholm + surroundings
        ○ Rest of Sweden on par with Nordics
      ● Nothing is "obvious"
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
  • 17. Data collection
      "The last week is not complete, so it is difficult to determine if the trend continues."
      https://youtu.be/4uTj96ZowCU
      https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-27-final.pdf
  • 18. What conclusion from this graph?
      [Graph: COVID-19 fatalities / day in Sweden]
      https://www.folkhalsomyndigheten.se/
  • 19. Comparing apples, oranges, bananas, ...
      [Graph: COVID-19 fatalities / day in Sweden]
      ● Fatalities collected during 2 days
      ● Fatalities collected during 4 days
      ● Fatalities collected during 10 days
  • 20. Naive data collection
      ● Gather the events that we have
      ● Put them in a database
      ● "Let us look at the latest data"
      ● You never want the latest data! You want comparable data.
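One way to read "comparable data": give every date the same collection window before counting. The sketch below assumes each fatality record carries both a date of death and a reporting date. The records and the function names `latest_counts` and `comparable_counts` are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records: (date of death, date the death was reported).
records = [
    (date(2020, 4, 1), date(2020, 4, 2)),
    (date(2020, 4, 1), date(2020, 4, 3)),
    (date(2020, 4, 1), date(2020, 4, 5)),
    (date(2020, 4, 5), date(2020, 4, 6)),
    (date(2020, 4, 5), date(2020, 4, 7)),
]


def latest_counts(records, as_of):
    """Naive view: everything known on `as_of`, whatever the reporting lag."""
    return Counter(died for died, reported in records if reported <= as_of)


def comparable_counts(records, as_of, lag_days):
    """Count only deaths reported within `lag_days` of the date of death,
    and only for dates whose collection window has closed by `as_of`."""
    lag = timedelta(days=lag_days)
    return Counter(
        died
        for died, reported in records
        if reported - died <= lag and died + lag <= as_of
    )


as_of = date(2020, 4, 7)
print("latest:    ", dict(latest_counts(records, as_of)))
print("comparable:", dict(comparable_counts(records, as_of, lag_days=2)))
```

With these made-up records the "latest data" view shows 3 deaths on April 1 and 2 on April 5, an apparent decline. With an equal two-day window for both dates, the counts are 2 and 2.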
  • 21. Wrong conclusion, every day
      ● Fatalities data as of April 6, April 15, April 19
      Graph by Statistisk Opinion, @StatistiskO
  • 22. Wrong conclusion, every day
      ● Downward trend every day!
      https://www.bloomberg.com/amp/news/articles/2020-07-17/georgia-massaged-virus-data-to-reopen-then-voided-mask-orders
  • 23. Normalise data collection to compare
      Graph by Adam Altmejd, @adamaltmejd
  • 24. Normalise data collection to compare
      Graph by Adam Altmejd, @adamaltmejd
  • 25. Forecast for analytics with fresh data
      Graph by Adam Altmejd, @adamaltmejd
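A minimal sketch of one simple way to forecast the final count for a recent date: scale the partial count by the fraction of deaths that, historically, had been reported after the same number of days. This is an assumed ratio method chosen for illustration, not necessarily the method behind the graphs on slides 23-25. The history figures and function names are made up.

```python
# Hypothetical history: for past dates, cumulative deaths known after
# 2, 4 and 10 days of collection, plus the final count.
history = [
    {"by_lag": {2: 40, 4: 60, 10: 80}, "final": 80},
    {"by_lag": {2: 50, 4: 75, 10: 100}, "final": 100},
]


def completion_fraction(history, lag_days):
    """Share of the final count that was known `lag_days` after the date."""
    known = sum(day["by_lag"][lag_days] for day in history)
    final = sum(day["final"] for day in history)
    return known / final


def nowcast(partial_count, lag_days, history):
    """Estimate the final count from a partial count collected for `lag_days`."""
    return partial_count / completion_fraction(history, lag_days)


# A date with 30 deaths known after 2 days and one with 45 known after 4 days
# both point to roughly the same final count in this toy history.
print(nowcast(30, 2, history))
print(nowcast(45, 4, history))
```

A real forecast would also report uncertainty, since the completion fraction itself varies from day to day.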
  • 26. Why aren't authorities doing that?
      ● Cost of processing data
      ● Manual handcraft, not industrial process
      https://github.com/FohmAnalys/SEIR-model-Stockholm
      We are not done processing the data yet. Since we do calculations quickly, some mistakes might happen.
  • 27. Manual, mechanised, industrialised
      Manual:
      ● Muscle-powered
      ● Few tools
      ● Human touch for every step
      Mechanised:
      ● Direct human control
      ● Machine tools
      ● Low investment, direct return
      Industrialised:
      ● Scaled processes
      ● Machine tools
      ● Challenges: scale, logistics, legal, organisation, faults, ...
  • 28. Muscle powered analytics & machine learning
      ● Use hand tools to
        ○ Collect data
        ○ Aggregate for analytics or
        ○ Train a model
      ● Typical tools:
        ○ Excel
        ○ Matlab
        ○ Interactive SQL
        ○ Interactive BI tools
        ○ Jupyter
        ○ R
        ○ One-off Python scripts
      "Dataset" - a data artifact of direct or indirect value
  • 29. Mechanised analytics & machine learning
      ● Use machine tools to semi-automatically
        ○ Collect data
        ○ Aggregate for analytics or
        ○ Train a model
      ● Typical tools: Muscle tools +
        ○ Databases
        ○ Data warehouses + ETL
        ○ Hadoop, Spark, Flink
        ○ Java, Scala, Python, SQL
        ○ Kafka
        ○ Similar cloud services
      Datasets, produced monthly / hourly / daily / ..
  • 31. From craft to process
      Multiple time windows
  • 32. From craft to process
      Multiple time windows; Assess ingress data quality
  • 33. From craft to process
      Multiple time windows; Assess ingress data quality; Assess outcome data quality
  • 34. From craft to process
      Multiple time windows; Assess ingress data quality; Assess outcome data quality; Repair broken data; Intermediate datasets, reusable between pipelines
  • 35. From craft to process
      Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Assess outcome data quality
  • 36. From craft to process
      Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history; Assess outcome data quality
  • 37. From craft to process
      Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality
  • 38. From craft to process
      Multiple time windows; Assess ingress data quality; Repair broken data from complementary source; Forecast based on history, multiple parameter settings; Assess outcome data quality; Assess forecast success, adapt parameters
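The steps built up on slides 31-38 can be sketched as a tiny pipeline: assess ingress quality, repair from a complementary source, forecast with several parameter settings, then pick the setting that would have forecast best on history. All names and data are hypothetical, and the moving-average forecast is a stand-in chosen for brevity, not a method from the slides.

```python
from statistics import mean


def assess_ingress(series):
    """Return the indices of missing or negative values."""
    return [i for i, v in enumerate(series) if v is None or v < 0]


def repair(series, complementary):
    """Replace broken values with values from a complementary source."""
    bad = set(assess_ingress(series))
    return [complementary[i] if i in bad else v for i, v in enumerate(series)]


def forecast(series, window):
    """Moving-average forecast of the next value."""
    return mean(series[-window:])


def best_window(series, windows):
    """Backtest each window on history and return the one with the
    lowest mean absolute error."""
    def error(window):
        return mean(
            abs(forecast(series[:t], window) - series[t])
            for t in range(window, len(series))
        )
    return min(windows, key=error)


# Made-up daily counts; the primary source has two broken values.
primary = [10, 12, None, 16, 18, -1, 22]
complementary = [10, 12, 14, 16, 18, 20, 22]

clean = repair(primary, complementary)
window = best_window(clean, windows=[1, 2, 3])
print("broken indices:", assess_ingress(primary))
print("chosen window:", window, "forecast:", forecast(clean, window))
```

In an industrialised setting each function would be a separate, monitored pipeline step producing its own dataset, so that the quality checks run on every execution rather than when someone remembers to look.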
  • 39. Towards sustainable production ML
      Multiple models, parameters, features; Assess ingress data quality; Repair broken data from complementary source; Choose model and parameters based on performance and input data; Benchmark models; Try multiple models, measure, A/B test
  • 40. Industrialised analytics / machine learning
      ● Build resilient, automated processes that
        ○ Collect & process
        ○ Assess & improve quality
        ○ Create multiple artifacts, measure, adapt
      ● Typical tools: Mechanised tools +
        ○ Data lake
        ○ Workflow orchestration (Luigi, Airflow)
        ○ Quality assessment, monitoring
        ○ Testing, CI/CD
  • 41. Costs down - ROI from data
      Manual:
      ● Hand-built
      ● Analyst team, < 10 datasets / day
      Mechanised:
      ● Semi-automated
      ● "The data team", 10-100 datasets / day
      Industrialised:
      ● Resilient data factory
      ● Every dev team, 100-1000s datasets / day per team
      Spotify ~2014, 20K datasets/day
  • 42. Becoming data industrialised
      ● Knowledge limited to leading tech companies + startups
      ● Change in processes & culture
        ○ Cf. agile, DevOps
        ○ Journey of many years
      ● Challenge is not technical
        ○ Can't buy a system or tool
        ○ Consultants can't help
  • 43. Scling - data-value-as-a-service
      Data value through collaboration; Customer; Data factory; Data platform & lake; data; domain expertise; Value from data!
      www.scling.com/reading-list
      www.scling.com/presentations
      www.scling.com/courses