Social media are full of Covid-19 graphs, each pointing to an "obvious" conclusion that fits the author's agenda. Unfortunately, even official sources publish analytics that point to incorrect conclusions. Bad data quality has become a matter of life and death.
We look at the quality problems with official Covid-19 data presentations. These problems are common in all domains, and the solutions are known but not widespread. We describe tools and patterns that data-mature companies use to assess and improve data quality in similar situations. Mastering data quality and data operations is a prerequisite for building sustainable AI solutions, and we will explain how these patterns fit into machine learning product development.
2. www.scling.com
Why this presentation?
● Non-goal: Argue for or against a particular strategy
○ We are already too polarised
● Goals:
○ What can go wrong with data quality?
○ What can we learn?
○ Data engineering as a solution
6. www.scling.com
Imperial College model code
● Screenshots are only part of functions...
● A couple of regression tests - no tests validating correct functionality
● My impression: no chance of producing a high-confidence result
https://github.com/mrc-ide/covid-sim
9. www.scling.com
Bad predictions are harmful
● Each action has a health cost
○ Economic misery → social misery → health misery
○ Mental health
○ Drug / alcohol use
○ Domestic violence
● During the Ebola epidemic, 10x more people died from fear of hospitals than from Ebola
https://medium.com/@robert.munro/the-tech-communitys-response-to-ebola-44d2c8dbb5be
10. www.scling.com
Ways to degrade data & analytics quality
● Deviating definitions
● Selection
● Deviating context
● Presentation
● Interpretation
● Data collection
● Data processing
● Lack of quality assessment
● Lack of quality improvement
Add senior software
engineers with
production experience.
Data engineering
11. www.scling.com
Define death
Observed Covid-19 death definitions:
● Infection confirmed, last 30 days
● Infection confirmed, any time
● Infection assumed
● Assumed cause
● Hospitalised
● Other disease complicated by Covid-19
● Excess mortality
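To make the effect of deviating definitions concrete, here is a minimal sketch that counts the same records under three of the definitions listed above. The record fields, the predicate names, and the four records are hypothetical and made up for illustration; they are not taken from any official dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class DeathRecord:
    """Hypothetical record layout, assumed for this sketch."""
    death_date: date
    positive_test_date: Optional[date]  # None if infection was never confirmed
    covid_on_certificate: bool          # Covid-19 stated as the assumed cause


# Made-up records, for illustration only.
RECORDS: List[DeathRecord] = [
    DeathRecord(date(2020, 5, 1), date(2020, 4, 20), True),
    DeathRecord(date(2020, 5, 2), date(2020, 3, 1), False),
    DeathRecord(date(2020, 5, 3), None, True),
    DeathRecord(date(2020, 5, 4), None, False),
]


def confirmed_within_30_days(r: DeathRecord) -> bool:
    """Infection confirmed at most 30 days before death."""
    if r.positive_test_date is None:
        return False
    return 0 <= (r.death_date - r.positive_test_date).days <= 30


def confirmed_any_time(r: DeathRecord) -> bool:
    """Infection confirmed at any time."""
    return r.positive_test_date is not None


def assumed_cause(r: DeathRecord) -> bool:
    """Covid-19 stated as the assumed cause of death."""
    return r.covid_on_certificate


DEFINITIONS: Dict[str, Callable[[DeathRecord], bool]] = {
    "confirmed_within_30_days": confirmed_within_30_days,
    "confirmed_any_time": confirmed_any_time,
    "assumed_cause": assumed_cause,
}


def count_by_definition(records: List[DeathRecord]) -> Dict[str, int]:
    """Count how many records qualify under each definition."""
    return {
        name: sum(1 for r in records if predicate(r))
        for name, predicate in DEFINITIONS.items()
    }


if __name__ == "__main__":
    for name, count in count_by_definition(RECORDS).items():
        print(f"{name}: {count}")
```

The same four records produce different totals depending on the definition, so numbers from sources that use different definitions are not directly comparable.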
12. www.scling.com
Sweden on the rise?
https://youtu.be/4uTj96ZowCU
https://www.bbc.com/news/world-europe-53175459
https://sverigesradio.se/artikel/7503606
"New Covid-19 cases per day"
13. www.scling.com
No, context is missing
Tests executed
Test positive rate
New cases
https://youtu.be/4uTj96ZowCU https://twitter.com/JacobGudiol/status/1283308826842759168 https://twitter.com/JacobGudiol/status/1283308817787293696
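A minimal sketch of why the number of tests executed is necessary context for new cases. The weekly figures and the `WeeklyTesting` type are hypothetical, chosen only to show the pattern; they are not the Swedish numbers.

```python
from typing import List, NamedTuple


class WeeklyTesting(NamedTuple):
    """Hypothetical weekly aggregate, assumed for this sketch."""
    week: int
    tests: int
    positives: int


def positive_rate(w: WeeklyTesting) -> float:
    """Fraction of executed tests that came back positive."""
    if w.tests <= 0:
        raise ValueError(f"week {w.week}: no tests executed")
    return w.positives / w.tests


# Made-up figures: testing is scaled up, so cases rise while the rate falls.
WEEKS: List[WeeklyTesting] = [
    WeeklyTesting(week=22, tests=30_000, positives=4_500),
    WeeklyTesting(week=23, tests=50_000, positives=6_000),
    WeeklyTesting(week=24, tests=80_000, positives=7_200),
]

if __name__ == "__main__":
    for w in WEEKS:
        print(f"week {w.week}: new cases {w.positives}, "
              f"tests {w.tests}, positive rate {positive_rate(w):.1%}")
```

In this made-up series, "new cases" rises every week while the test positive rate falls, which is the pattern a graph of new cases alone would hide.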
14. www.scling.com
Death numbers, different views
https://twitter.com/HaraldofW/status/1270080232104624128
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
15. www.scling.com
Data will confess to anything
● Absolute numbers mislead
○ Days since case x → time shift by country size
● Relative numbers mislead
○ Diluted in large countries
○ Small regions stand out
https://swprs.org/a-swiss-doctor-on-covid-19/
16. www.scling.com
Granularity matters
● Outbreaks in regions
● Country aggregation - information loss
○ But debate assumes homogeneous countries
● Peak of Swedish outbreak
○ Major outbreak in Stockholm + surroundings
○ Rest of Sweden on par with Nordics
● Nothing is "obvious"
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-25-final.pdf
Swedish policy "obviously"
terrible. Compare numbers
with neighbours!
17. www.scling.com
Data collection
"The last week is not complete, so it is
difficult to determine if the trend continues."
https://youtu.be/4uTj96ZowCU
https://www.folkhalsomyndigheten.se/globalassets/statistik-uppfoljning/smittsamma-sjukdomar/veckorapporter-covid-19/2020/covid-19-veckorapport-vecka-27-final.pdf
19. www.scling.com
Comparing apples, oranges, bananas, ...
COVID-19 fatalities / day in Sweden
Fatalities collected during 2 days
Fatalities collected during 4 days
Fatalities collected during 10 days
20. www.scling.com
Naive data collection
● Gather the events that we have
● Put them in a database
● "Let us look at the latest data"
● You never want the latest data!
You want comparable data.
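One way to get comparable data is to give every day the same collection window. The sketch below assumes each fatality report carries both a death date and a report date; the function names, the 4-day lag, and the sample reports are hypothetical and only illustrate the idea.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

# (death_date, report_date); this layout is an assumption for the sketch.
Report = Tuple[date, date]


def naive_latest_counts(reports: Iterable[Report], as_of: date) -> Dict[date, int]:
    """'Latest data': every report received so far, whatever its delay."""
    counts: Counter = Counter()
    for death_date, report_date in reports:
        if report_date <= as_of:
            counts[death_date] += 1
    return dict(counts)


def counts_with_fixed_lag(
    reports: Iterable[Report], lag_days: int, as_of: date
) -> Dict[date, int]:
    """Comparable data: each death date gets exactly lag_days of collection.

    Death dates younger than lag_days are left out, since their
    collection window has not closed yet.
    """
    counts: Counter = Counter()
    for death_date, report_date in reports:
        window_closed = (as_of - death_date).days >= lag_days
        in_window = 0 <= (report_date - death_date).days <= lag_days
        if window_closed and in_window:
            counts[death_date] += 1
    return dict(counts)


# Made-up reports, for illustration only.
REPORTS = [
    (date(2020, 7, 1), date(2020, 7, 2)),
    (date(2020, 7, 1), date(2020, 7, 4)),
    (date(2020, 7, 1), date(2020, 7, 9)),
    (date(2020, 7, 5), date(2020, 7, 6)),
    (date(2020, 7, 5), date(2020, 7, 8)),
    (date(2020, 7, 8), date(2020, 7, 9)),
]

if __name__ == "__main__":
    today = date(2020, 7, 10)
    print("latest:   ", naive_latest_counts(REPORTS, today))
    print("fixed lag:", counts_with_fixed_lag(REPORTS, lag_days=4, as_of=today))
```

With these made-up reports, the "latest" view shows a decline only because recent days have had less time to collect reports, while the fixed-lag view shows the same count for both days whose window has closed.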
22. www.scling.com
Wrong conclusion, every day
● Downward trend every day!
https://www.bloomberg.com/amp/news/articles/2020-07-17/georgia-massaged-virus-data-to-reopen-then-voided-mask-orders
26. www.scling.com
Why aren't authorities doing that?
● Cost of processing data
● Manual handcraft, not industrial process
https://github.com/FohmAnalys/SEIR-model-Stockholm
We are not done processing the data yet.
Since we do calculations quickly, some mistakes might happen.
27. www.scling.com
Manual, mechanised, industrialised
Manual:
● Muscle-powered
● Few tools
● Human touch for every step
Mechanised:
● Direct human control
● Machine tools
● Low investment, direct return
Industrialised:
● Scaled processes
● Machine tools
● Challenges: scale, logistics, legal, organisation, faults, ...
28. www.scling.com
Muscle powered analytics & machine learning
● Use hand tools to
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools:
○ Excel
○ Matlab
○ Interactive SQL
○ Interactive BI tools
○ Jupyter
○ R
○ One-off Python scripts
"Dataset" - a data artifact of direct or indirect value
29. www.scling.com
Mechanised analytics & machine learning
● Use machine tools to semi-automatically
○ Collect data
○ Aggregate for analytics
or
○ Train a model
● Typical tools: Muscle tools +
○ Databases
○ Data warehouses + ETL
○ Hadoop, Spark, Flink
○ Java, Scala, Python, SQL
○ Kafka
○ Similar cloud services
Datasets, produced monthly / hourly / daily / ..
33. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Assess outcome data quality
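A minimal sketch of an ingress data quality assessment: validate each incoming record, keep the valid ones, and fail the pipeline step if too many are broken. The record fields (`region`, `count`), the validity rules, and the 5% threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class QualityReport:
    """Summary of one ingress assessment run."""
    total: int
    invalid: int

    @property
    def invalid_fraction(self) -> float:
        return self.invalid / self.total if self.total else 0.0


def is_valid(record: Record) -> bool:
    """Hypothetical rules: non-empty region name, non-negative integer count."""
    region = record.get("region")
    count = record.get("count")
    region_ok = isinstance(region, str) and region != ""
    count_ok = (
        isinstance(count, int) and not isinstance(count, bool) and count >= 0
    )
    return region_ok and count_ok


def assess_ingress(records: Sequence[Record]) -> Tuple[List[Record], QualityReport]:
    """Split off valid records and measure how much of the input was broken."""
    valid = [r for r in records if is_valid(r)]
    report = QualityReport(total=len(records), invalid=len(records) - len(valid))
    return valid, report


def check_threshold(report: QualityReport, max_invalid_fraction: float = 0.05) -> None:
    """Stop the pipeline rather than publish numbers built on broken input."""
    if report.invalid_fraction > max_invalid_fraction:
        raise ValueError(
            f"{report.invalid} of {report.total} records invalid "
            f"({report.invalid_fraction:.0%}), above the allowed "
            f"{max_invalid_fraction:.0%}"
        )


if __name__ == "__main__":
    # Made-up input for illustration.
    incoming = [
        {"region": "Stockholm", "count": 12},
        {"region": "Skåne", "count": 3},
        {"region": "", "count": 5},
        {"region": "Uppsala", "count": -1},
    ]
    valid, report = assess_ingress(incoming)
    print(f"valid: {len(valid)}, invalid fraction: {report.invalid_fraction:.0%}")
```

Emitting the quality report as a dataset of its own, rather than only logging it, lets the measurement be tracked over time like any other pipeline output.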
34. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Assess outcome data quality
Repair broken data
Intermediate datasets, reusable between pipelines
35. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from
complementary source
Assess outcome data quality
36. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from
complementary source
Forecast based on history
Assess outcome data quality
37. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from
complementary source
Forecast based on history,
multiple parameter settings
Assess outcome data quality
38. www.scling.com
From craft to process
Multiple time windows
Assess ingress data quality
Repair broken data from
complementary source
Forecast based on history,
multiple parameter settings
Assess outcome data quality
Assess forecast success,
adapt parameters
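The last three steps (forecast with multiple parameter settings, assess forecast success, adapt parameters) can be sketched as a small backtest. Simple exponential smoothing stands in for the forecast model here purely as an illustration; the slides do not name a model, and the function names and sample series are hypothetical.

```python
from typing import Sequence


def exp_smoothing_forecast(history: Sequence[float], alpha: float) -> float:
    """One-step-ahead forecast: exponentially weighted level of the history."""
    if not history:
        raise ValueError("history must not be empty")
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def backtest_error(series: Sequence[float], alpha: float, min_history: int = 3) -> float:
    """Assess forecast success: mean absolute error of one-step forecasts,
    each made using only the data available before the forecast point."""
    if len(series) <= min_history:
        raise ValueError("series too short to backtest")
    errors = [
        abs(exp_smoothing_forecast(series[:t], alpha) - series[t])
        for t in range(min_history, len(series))
    ]
    return sum(errors) / len(errors)


def choose_alpha(series: Sequence[float], candidates: Sequence[float]) -> float:
    """Adapt parameters: pick the setting with the lowest backtest error."""
    return min(candidates, key=lambda alpha: backtest_error(series, alpha))


if __name__ == "__main__":
    # Made-up series for illustration.
    observed = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
    settings = [0.2, 0.5, 0.9]
    for alpha in settings:
        print(f"alpha={alpha}: backtest MAE {backtest_error(observed, alpha):.2f}")
    print("chosen alpha:", choose_alpha(observed, settings))
```

Running this on every pipeline execution, instead of tuning once by hand, is what turns parameter selection from craft into process.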
39. www.scling.com
Towards sustainable production ML
Multiple models,
parameters, features
Assess ingress data quality
Repair broken data from
complementary source
Choose model and parameters based
on performance and input data
Benchmark models
Try multiple models,
measure, A/B test
41. www.scling.com
Costs down - ROI from data
● Hand-built
● Analyst team, < 10 datasets / day
● Semi-automated
● "The data team", 10-100 datasets / day
● Resilient data factory
● Every dev team, 100-1000s datasets / day per team
Spotify ~2014, 20K datasets/day
42. www.scling.com
Becoming data industrialised
● Knowledge limited to leading tech companies + startups
● Change in processes & culture
○ Cf. agile, DevOps
○ Journey of many years
● Challenge is not technical
○ Can't buy a system or tool
○ Consultants can't help
43. www.scling.com
Scling - data-value-as-a-service
Data value through collaboration
[Diagram: Customer (data, domain expertise), Data factory, Data platform & lake]
Value from data!
www.scling.com/reading-list
www.scling.com/presentations
www.scling.com/courses