Data Science Pitfalls
Pedro Tabacof, co-founder of Datart
Data science is easy, right?
1. Get data
2. Clean data
3. Extract features
4. Train model
5. Test model
6. Deploy
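import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

df = pd.read_csv(...)
df = df.fillna(0)
df["dayofweek"] = pd.get_dummies(...)
clf = RandomForestClassifier()
clf = clf.fit(X, y)
# AUC needs scores, not hard labels, so use predict_proba rather than predict
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
clf.predict(X_new)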
Hundreds of frameworks
and APIs.
That is the easy part.
In this talk we will see the
real challenges.
Correlation vs. causation
Correlation is easy: Predictive models.
Causation is hard: Physical models, experiments, or strong
assumptions.
Causation in practice is determined by randomized controlled
experiments ("A/B testing").
Ronald Fisher
Father of modern statistics.
Founder of population genetics.
Thought smoking did not cause cancer.
James Lind
One of the first controlled clinical
trials, in 1747.
Showed citrus fruits cured scurvy, a
deadly disease at sea.
Lemon juice became a staple in the
British Navy.
Robert Falcon Scott
Great British explorer and hero.
Disastrous expedition to the South
Pole in 1912.
Crew showed symptoms of scurvy.
Frequentist hypothesis testing
1. z-test
2. Student's t-test
3. ANOVA
4. Chi-squared test
5. F-test
6. Paired, pooled,
interactions, etc.
z.test(x, y = NULL, ...)
t.test(x, y, ...)
fit <- aov(y ~ A, data=df)
chisq.test(df)
var.test(x, ...)
“Power lines cause leukemia.”
25-year study, 800 ailments: test enough outcomes and some will look significant by chance.
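A minimal sketch of why testing 800 ailments is risky, assuming a simple two-sample z-test with known variance and purely simulated data in which exposure has no effect at all. The function names, sample sizes, and seed are illustrative assumptions, not details of the study mentioned on the slide.

```python
import random
from statistics import NormalDist, mean

def two_sample_z_pvalue(a, b, sigma=1.0):
    """Two-sided z-test p-value for equal means, with known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def count_false_positives(n_tests=800, n=50, alpha=0.05, seed=0):
    """Run n_tests comparisons where the null hypothesis is true by construction."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        exposed = [rng.gauss(0, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        if two_sample_z_pvalue(exposed, control) < alpha:
            hits += 1
    return hits

false_positives = count_false_positives()
print(f"{false_positives} of 800 null comparisons have p < 0.05")
```

With a 5% threshold, roughly 40 of the 800 comparisons are expected to look "significant" even though nothing is going on.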
What is a p-value?
The probability of obtaining a result
equal to or more extreme than what
was actually observed, assuming the
null hypothesis is true.
Statistical significance does
not mean practical
significance.
Check the effect size.
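A small worked example of the gap between statistical and practical significance, assuming a two-sample z-test computed from summary statistics. All numbers (means, standard deviation, group size) are hypothetical and chosen only to make the point.

```python
from statistics import NormalDist

def z_test_from_summary(mean_a, mean_b, sd, n_per_group):
    """Return (p_value, cohens_d) for two equal-sized groups sharing one sd."""
    se = sd * (2 / n_per_group) ** 0.5
    z = (mean_a - mean_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    cohens_d = (mean_a - mean_b) / sd  # effect size in standard deviations
    return p_value, cohens_d

# Hypothetical A/B test: a difference of 0.1 on a metric with sd 10.
p_value, d = z_test_from_summary(mean_a=100.1, mean_b=100.0, sd=10.0,
                                 n_per_group=1_000_000)
print(f"p-value = {p_value:.2e}, Cohen's d = {d:.3f}")
```

With a million users per group the p-value is tiny, yet the effect is one hundredth of a standard deviation, which may not be worth acting on.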
What is a 95% confidence
interval?
A range generated by a
procedure that contains the
true value 95% of the time.
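The "procedure" reading of a confidence interval can be checked by simulation. This is a sketch under simplifying assumptions: normal data, known standard deviation, and a z-interval for the mean. The parameter values and the `coverage` helper are made up for illustration.

```python
import random
from statistics import NormalDist, mean

def coverage(n_experiments=2000, n=30, true_mean=5.0, sigma=2.0,
             level=0.95, seed=1):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = z * sigma / n ** 0.5
    covered = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_mean <= m + half_width:
            covered += 1
    return covered / n_experiments

rate = coverage()
print(f"Intervals containing the true mean: {rate:.1%}")
```

The 95% describes how often the procedure succeeds across repetitions; any single interval either contains the true value or does not.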
Biases and fallacies
Selection bias.
Confirmation bias.
Survivorship bias.
Winner's curse.
Hawthorne effect.
Shy Tory effect.
Availability heuristic.
Gambler's fallacy.
Regression to the mean.
Optimism bias.
Texas sharpshooter fallacy.
And many more.
Kidney transplant
Kidney donors: 1/3 the risk of kidney failure
of the general population.
But 8x the risk when compared to the
correct group (equally healthy people).
Roosevelt vs. Landon
1936 US presidential election.
The Literary Digest poll: 10M
questionnaires, 2.3M returned, predicted a
Landon victory.
Gallup poll: Far smaller but more representative sample,
about 50K respondents, predicted a Roosevelt victory.
Abraham Wald
Statistician doing research in WWII.
Experts thought they should reinforce
where the returning planes were most hit.
Wald thought the opposite: planes hit elsewhere never made it back.
Classic case of survivorship bias.
Baselines
1. Classification: Most frequent class.
2. Regression: Mean value.
3. Time series: Last value.
4. Simple models: Linear regression, decision trees, k-NN, etc.
5. Standard models: CNNs for computer vision, LDA for topic models,
ARIMA for time series, LSTMs for speech, etc.
6. XGBoost.
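The first three baselines in the list above fit in a few lines each. This is a minimal sketch with hypothetical helper names and toy data. Any real model should beat these numbers before it is taken seriously.

```python
from collections import Counter
from statistics import mean

def majority_class_baseline(y_train):
    """Classification: always predict the most frequent training class."""
    return Counter(y_train).most_common(1)[0][0]

def mean_baseline(y_train):
    """Regression: always predict the training mean."""
    return mean(y_train)

def last_value_forecast(series):
    """Time series: predict each point with the previous observation."""
    return series[:-1]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy data, for illustration only.
majority = majority_class_baseline(["ok"] * 8 + ["fraud"] * 2)
average = mean_baseline([1, 2, 3, 6])
series = [10, 12, 11, 13]
naive_mae = mean_absolute_error(series[1:], last_value_forecast(series))
print(majority, average, round(naive_mae, 3))
```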
Time series
Easy to be fooled by zoomed-out graphs.
Naive baseline: Repeat last value.
Anomaly detection
99% regular points, 1% anomalies.
Trivial to get 99% accuracy.
What about the AUC?
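To make the accuracy versus AUC contrast concrete, here is a sketch with a "detector" that calls every point regular. The AUC is computed from its rank definition in plain Python; the data are synthetic and the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, which matches the usual AUC definition.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 99% regular points (label 0), 1% anomalies (label 1).
y_true = [0] * 990 + [1] * 10

# A useless detector: the same score, and the same label, for everything.
constant_scores = [0.0] * len(y_true)
constant_labels = [0] * len(y_true)

acc = accuracy(y_true, constant_labels)
auc = roc_auc(y_true, constant_scores)
print(f"accuracy = {acc:.2f}, AUC = {auc:.2f}")
```

The detector reaches 99% accuracy while its AUC is 0.5, no better than chance.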
Overfitting
Learning noise instead of the signal.
Tuning hyperparameters on the test set.
Leakage.
Testing methodology: Stratified, hierarchical, temporal, etc.
Google Flu Trends
AUC: 100%
Buyers -> Personal Information -> X.
Visitors -> No Personal Information -> NaN -> Mean(X).
Perfect discrimination between X and its mean value.
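A sketch of how this kind of leak arises, using synthetic data: the feature is only recorded for buyers, missing values are filled with the mean, and a one-line rule then separates the classes perfectly. The field names, sample sizes, and distribution are assumptions for illustration.

```python
import math
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical data: buyers have feature x, visitors never filled it in.
rows = [{"x": rng.gauss(50, 10), "bought": 1} for _ in range(200)]
rows += [{"x": math.nan, "bought": 0} for _ in range(800)]

# Standard preprocessing step: replace NaN with the mean of observed values.
observed_mean = mean(r["x"] for r in rows if not math.isnan(r["x"]))
for r in rows:
    if math.isnan(r["x"]):
        r["x"] = observed_mean

# "Model": predict buyer whenever x is not exactly the imputed mean.
predictions = [int(r["x"] != observed_mean) for r in rows]
leaky_accuracy = sum(p == r["bought"] for p, r in zip(predictions, rows)) / len(rows)
print(f"accuracy of the one-line rule: {leaky_accuracy:.0%}")
```

The rule only detects whether the value was missing, which is a consequence of the label and not something available when predicting for a new visitor.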
Prostate Cancer
Dataset says which patients had
prostate surgery.
Completely useless predictor in
practice, perfect in the competition.
Obvious leakage.
Temporal evaluation
Train: Present, current issues (training data).
Test: Future, new issues.
Not only for time series.
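A temporal split can be sketched as follows, assuming each record carries a timestamp. The `temporal_split` helper, the `date` key, and the example records are hypothetical.

```python
from datetime import date

def temporal_split(records, cutoff, time_key="date"):
    """Train on everything before the cutoff, test on everything after."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test

# Illustrative records only.
records = [
    {"date": date(2016, 1, 5), "issue": "A"},
    {"date": date(2016, 2, 11), "issue": "B"},
    {"date": date(2016, 3, 2), "issue": "C"},
    {"date": date(2016, 4, 20), "issue": "D"},
]

train, test = temporal_split(records, cutoff=date(2016, 3, 1))
print(len(train), "training records,", len(test), "test records")
```

Unlike a random shuffle, this keeps every test example strictly later than the training data, which mirrors how the model will be used after deployment.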
Soft sensor wizard
Wizard for predictive model building.
Used by chemical engineers in the wild.
Next -> Back -> Next -> Back -> ....
“RNG optimization”.
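The effect of clicking Next and Back until the score looks good can be simulated. In this sketch the "model" is pure random guessing controlled by a seed, and the labels are coin flips, so any apparent skill comes from selecting the best seed on a small validation set. All names and sizes are illustrative assumptions.

```python
import random

def random_model_accuracy(labels, seed):
    """Accuracy of a 'model' that guesses at random, driven only by its seed."""
    rng = random.Random(seed)
    guesses = [rng.randint(0, 1) for _ in labels]
    return sum(g == y for g, y in zip(guesses, labels)) / len(labels)

data_rng = random.Random(0)
validation_labels = [data_rng.randint(0, 1) for _ in range(40)]
fresh_labels = [data_rng.randint(0, 1) for _ in range(5000)]

# Next -> Back -> Next -> Back: keep the seed with the best validation score.
best_seed = max(range(200),
                key=lambda s: random_model_accuracy(validation_labels, s))
best_validation = random_model_accuracy(validation_labels, best_seed)
fresh_accuracy = random_model_accuracy(fresh_labels, best_seed)

print(f"best validation accuracy: {best_validation:.2f}")
print(f"same seed on fresh data:  {fresh_accuracy:.2f}")
```

The selected seed looks clearly better than chance on the validation set and falls back to about 50% on fresh data.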
Visualization and
interpretability.
See your data.
Challenge your models.
Anscombe’s quartet
Same mean (x and y).
Same standard deviation (x and y).
Same correlation.
Same regression line.
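The summary statistics can be checked directly. The sketch below uses the values from Anscombe's 1973 paper and computes each statistic by hand; the `summarize` helper is an illustrative name.

```python
from statistics import mean, stdev

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, standard deviations, Pearson correlation, least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "sd_x": stdev(x), "sd_y": stdev(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four rows print nearly identical numbers, yet the scatter plots look nothing alike, which is the argument for plotting before modeling.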
Simpson’s paradox
                Treatment A               Treatment B
Small stones    Group 1: 93% (81/87)      Group 2: 87% (234/270)
Large stones    Group 3: 73% (192/263)    Group 4: 69% (55/80)
Both            78% (273/350)             83% (289/350)
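The reversal can be verified from the counts in the kidney stone table. This is a small sketch with illustrative helper names; the counts are the ones shown on the slide.

```python
from fractions import Fraction

# (successes, patients) per stone size and treatment, from the table.
counts = {
    ("small", "A"): (81, 87),
    ("small", "B"): (234, 270),
    ("large", "A"): (192, 263),
    ("large", "B"): (55, 80),
}

def success_rate(size, treatment):
    successes, patients = counts[(size, treatment)]
    return Fraction(successes, patients)

def overall_rate(treatment):
    successes = sum(s for (_, t), (s, _) in counts.items() if t == treatment)
    patients = sum(n for (_, t), (_, n) in counts.items() if t == treatment)
    return Fraction(successes, patients)

for size in ("small", "large"):
    print(size, float(success_rate(size, "A")), float(success_rate(size, "B")))
print("both", float(overall_rate("A")), float(overall_rate("B")))
```

Treatment A wins within each stone size but loses overall, because A was given mostly to the harder large-stone cases (263 of 350) and B mostly to the easier small-stone cases (270 of 350).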
LIME: Interpretability
Pneumonia screening
Rich Caruana’s work at Microsoft.
State-of-the-art models and simple
baselines.
Linear model showed that patients
with asthma would be sent home: in the data they fared
better only because they had received intensive care.
Criminal recidivism prediction.
Same arrest (drug possession).
Left: One non-violent prior offense. High risk.
Right: One violent prior offense. Low risk.
One of them was arrested three more times afterward, the other not at all.
Deployment
Covariate shift.
Technical debt.
Misuse of predictions.
Interaction with users.
Experimental validation.
Netflix prize
$1 million for best predictive model.
Winning solution: Ensemble of
hundreds of models.
What went into production: Not the
winning solution.
Hidden technical debt
Calibration
Does a 70% chance of positive mean I am
right 70% of the time?
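One way to answer the question is a reliability table: group predictions into bins and compare the average predicted probability with the observed positive rate in each bin. The sketch below uses simulated predictions. The `reliability_table` helper and the "overconfident" model are assumptions for illustration.

```python
import random

def reliability_table(probs, outcomes, n_bins=10):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            table.append((mean_p, rate, len(members)))
    return table

rng = random.Random(7)
probs = [rng.random() for _ in range(20000)]

# Calibrated: an event predicted with probability p happens with probability p.
calibrated = [int(rng.random() < p) for p in probs]
# Overconfident: the true rate is pulled halfway back toward 50%.
overconfident = [int(rng.random() < 0.5 + (p - 0.5) / 2) for p in probs]

calibrated_table = reliability_table(probs, calibrated)
overconfident_table = reliability_table(probs, overconfident)
for (mean_p, rate, _), (_, bad_rate, _) in zip(calibrated_table, overconfident_table):
    print(f"predicted {mean_p:.2f}  calibrated {rate:.2f}  overconfident {bad_rate:.2f}")
```

For the calibrated model the two columns agree in every bin; for the overconfident one, predictions near 95% come true only about 72% of the time.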
Online learning
Nobody used my online learning
algorithm for parameter tuning.
It worked in theory, in simulations,
everyone liked the idea.
But no one was comfortable with an
algorithm second-guessing human
judgement.
Historical anecdotes.
Those who don't know
history are doomed to
repeat it.
Vulcan
Predicted planet based on Mercury’s
orbit and Newtonian physics.
Same methodology that led to the
discovery of Neptune.
Many actual “sightings” of Vulcan.
Ignaz Semmelweis
Obstetricians did not wash their hands.
Mortality rate 3x higher than in the midwives' ward.
Showed washing hands greatly reduces
mortality.
Ignored by the establishment: “Doctors'
hands are clean”.
George Stigler
Nobel prize in economics.
He and other U. of Chicago economists
found a great arbitrage opportunity:
The ton of wheat.
The British ton is not the same as the
American ton.
LTCM
Long-Term Capital Management.
Two Nobel prize winners on the board.
Sophisticated models, high leverage.
Lost $4.6 billion in four months.
Sally Clark
Two of her babies died of SIDS.
SIDS: 1 in 8500.
“1 in 72M chance of two SIDS”: wrongly assumes the two deaths are independent.
Found guilty of murder; the conviction was later overturned.
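The arithmetic behind the "1 in 72M" figure is a single squaring, which is valid only if the two deaths are independent. The sketch below reproduces it and then shows how a dependence factor changes the answer. The factor of 10 is a purely illustrative assumption, not a figure from the talk or from the case.

```python
from fractions import Fraction

p_single = Fraction(1, 8500)

# The figure quoted in court: square the single-event probability.
p_double_if_independent = p_single ** 2

# Hypothetical dependence: shared genetic or environmental factors could make
# a second death more likely after a first one.
relative_risk_second = 10  # illustrative assumption only
p_double_with_dependence = p_single * (relative_risk_second * p_single)

print(f"independent: 1 in {p_double_if_independent.denominator:,}")
print(f"dependent:   1 in {p_double_with_dependence.denominator:,}")
```

Even the corrected number is the probability of the evidence given innocence, not the probability of innocence given the evidence. Treating the two as the same is the prosecutor's fallacy.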
Data science is hard
Machine learning goes way beyond training predictive models.
Statistics goes way beyond p-values and frequentist hypothesis tests.
A data scientist must understand their data, models, assumptions, production
environment, objectives, and business.
What can be automated by frameworks, tools, and APIs is the easy part.
The hard part is delivering actual value.
Thank you!
Questions?
ptabacof@datart.com.br
