2. Data science is easy, right?
1. Get data
2. Clean data
3. Extract features
4. Train model
5. Test model
6. Deploy
df = pd.read_csv(...)
df = df.fillna(0)
df = pd.get_dummies(df, columns=["dayofweek"])  # get_dummies returns a DataFrame, not a single column
clf = sklearn.ensemble.RandomForestClassifier()
clf = clf.fit(X, y)
sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # AUC needs scores, not hard labels
clf.predict(X_new)
4. That is the easy part.
In this talk we will see the
real challenges.
5. Correlation vs. Causation
Correlation is easy: Predictive models.
Causation is hard: Physical models, experiments, or strong
assumptions.
Causation in practice is determined by randomized controlled
experiments ("A/B testing").
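A minimal sketch of how an A/B test is usually read out, via a two-proportion z-test; the counts below are made up for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical A/B test results: conversions / visitors per arm.
conv_a, n_a = 120, 2400   # control
conv_b, n_b = 150, 2400   # treatment
p_a, p_b = conv_a / n_a, conv_b / n_b

# Two-proportion z-test under H0: both arms share one conversion rate.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
```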
6. Ronald Fisher
Father of modern statistics.
Founder of population genetics.
Thought smoking did not cause cancer.
7. Abraham Wald
Statistician in WWII.
Experts thought they should reinforce
where the planes were most hit.
Wald thought the opposite.
Classic case of survivorship bias.
James Lind
First randomized controlled trial in
1747.
Showed citrus fruits cured scurvy, a
deadly disease at sea.
Lemon juice became a staple in the
British navy.
8. Robert Falcon Scott
Great British explorer and hero.
Disastrous expedition to the South
Pole in 1912.
Crew showed symptoms of scurvy.
9. Frequentist hypothesis testing
1. z-test
2. Student's t-test
3. ANOVA
4. Chi-squared test
5. F-test
6. Paired, pooled, interactions, etc.
z.test(x, y = NULL, ...)   # not in base R, e.g. BSDA::z.test
t.test(x, y, ...)
fit <- aov(y ~ A, data = df)
chisq.test(df)
var.test(x, ...)
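The same tests are available in Python through scipy.stats; a quick sketch with made-up numbers:

```python
from scipy import stats

# Welch's t-test: do two small samples share a mean?
a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.9, 4.7]
t_stat, p_t = stats.ttest_ind(a, b, equal_var=False)

# Chi-squared test of independence on a 2x2 contingency table.
chi2, p_chi, dof, expected = stats.chi2_contingency([[30, 10], [20, 20]])

# One-way ANOVA across three groups.
f_stat, p_f = stats.f_oneway([1.1, 2.0, 2.9], [2.2, 3.1, 3.8], [5.0, 6.2, 6.9])
```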
13. What is a p-value?
The probability of obtaining a result
equal to or more extreme than what
was actually observed, assuming the
null hypothesis is true.
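One way to make that definition concrete is to simulate it: assume the null (a fair coin) and count how often a simulated result is at least as extreme as an observed 60 heads in 100 flips. The numbers here are illustrative.

```python
import random

random.seed(0)
observed, n_flips, n_sims = 60, 100, 20000
extreme = 0
for _ in range(n_sims):
    # One experiment under the null hypothesis: a fair coin.
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed - 50):  # at least as extreme, two-sided
        extreme += 1
p_value = extreme / n_sims  # close to the exact binomial answer, ~0.057
```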
16. What is a 95% confidence
interval?
A range generated by a
procedure that contains the
true value 95% of the time.
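That "procedure" framing can itself be checked by simulation: build many intervals from fresh samples and count how often they cover the true mean. A sketch with a normal population, using z intervals for simplicity:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
z = NormalDist().inv_cdf(0.975)            # ~1.96 for a 95% interval
true_mu, sigma, n, trials = 10.0, 2.0, 50, 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    m, se = mean(sample), stdev(sample) / n ** 0.5
    if m - z * se <= true_mu <= m + z * se:  # did this interval catch mu?
        covered += 1
coverage = covered / trials                  # close to 0.95
```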
17. Biases and fallacies
Selection bias.
Confirmation bias.
Survivorship bias.
Winner's curse.
Hawthorne effect.
Shy Tory effect.
Availability heuristic.
Gambler's fallacy.
Regression to the mean.
Optimism bias.
Texas sharpshooter fallacy.
And many more.
19. Kidney transplant
Kidney donors: 1/3 the risk of kidney
failure compared to the general population.
8x the risk when compared to the correct
control group: equally healthy people.
20. Roosevelt vs. Landon
1936 US presidential election.
The Literary Digest poll: 10M
questionnaires mailed, 2.3M returned;
predicted a Landon victory.
Gallup poll: random sample of 50K
respondents; predicted a Roosevelt victory.
21. Abraham Wald
Statistician doing research in WWII.
Experts thought they should reinforce
where the planes were most hit.
Wald thought the opposite.
Classical case of survivorship bias.
22. Baselines
1. Classification: Most frequent class.
2. Regression: Mean value.
3. Time-series: Last value.
4. Simple models: Linear regression, decision trees, k-NN, etc.
5. Standard models: CNNs for computer vision, LDA for topic models,
ARIMA for time series, LSTMs for speech, etc.
6. XGBoost.
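scikit-learn ships the trivial baselines directly; a sketch with synthetic, imbalanced labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # features the baseline ignores
y = (rng.random(200) < 0.7).astype(int)  # ~70% positive class

# "Most frequent class" baseline: any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = baseline.score(X, y)               # ~0.7 accuracy with zero learning
```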
23. Time series
Easy to be fooled by zoomed-out graphs.
Naive baseline: Repeat last value.
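The last-value (persistence) baseline is a one-liner, and it is surprisingly hard to beat on many series; the numbers below are toy values for illustration:

```python
# Persistence forecast: predict each value with the previous one.
series = [102.0, 101.5, 103.2, 103.0, 104.1, 104.4, 103.9]
preds, actuals = series[:-1], series[1:]
mae = sum(abs(a - p) for a, p in zip(actuals, preds)) / len(actuals)
```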
25. Overfitting
Learning noise instead of the signal.
Hyperparameter tuning on the test set.
Leakage.
Testing methodology: Stratified, hierarchical, temporal, etc.
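A sketch of the fix: tune hyperparameters with cross-validation inside the training set, and touch the test set exactly once (synthetic data, assumed setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Hold the test set out of every tuning decision.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [1, 2, 4, 8]}, cv=5)
search.fit(X_train, y_train)               # tuning sees training folds only
test_score = search.score(X_test, y_test)  # the test set is used once, here
```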
27. AUC: 100%
Buyers -> Personal Information -> X.
Visitors -> No Personal Information -> NaN -> Mean(X).
Perfect discrimination between X and its mean value.
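A sketch of the mechanism with synthetic data: mean-imputing a field that only buyers fill in turns "is this exactly the imputed value?" into a perfect label detector.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)   # 1 = buyer, 0 = visitor
x = rng.normal(size=1000)      # personal-information field
x[y == 0] = np.nan             # visitors never fill it in

fill = np.nanmean(x)           # mean imputation over buyers
x = np.where(np.isnan(x), fill, x)

# Any model can now learn "x equals the imputed mean means visitor".
leaky_score = (x != fill).astype(float)
auc = roc_auc_score(y, leaky_score)  # perfect, and perfectly useless
```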
28. Prostate Cancer
Dataset says which patients had
prostate surgery.
Completely useless predictor in
practice, perfect in the competition.
Obvious leakage.
30. Soft sensor wizard
Wizard for predictive model building.
Used by chemical engineers in the wild.
Next -> Back -> Next -> Back -> ....
“RNG optimization”.
32. Anscombe’s quartet
Same mean (x and y).
Same standard deviation (x and y).
Same correlation.
Same regression line.
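The first two of Anscombe's four datasets are enough to check the claim numerically (values from Anscombe's 1973 paper; plot them to see how different they really are):

```python
import numpy as np

# Datasets I and II of Anscombe's quartet (they share the x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Identical summary statistics, wildly different shapes when plotted:
mean1, mean2 = y1.mean(), y2.mean()   # both ~7.50
sd1 = y1.std(ddof=1)                  # ~2.03
sd2 = y2.std(ddof=1)                  # ~2.03
corr1 = np.corrcoef(x, y1)[0, 1]      # ~0.816
corr2 = np.corrcoef(x, y2)[0, 1]      # ~0.816
```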
33. Simpson’s paradox
              Treatment A             Treatment B
Small stones  Group 1: 93% (81/87)    Group 2: 87% (234/270)
Large stones  Group 3: 73% (192/263)  Group 4: 69% (55/80)
Both          78% (273/350)           83% (289/350)
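The paradox in the kidney-stone table is pure arithmetic and easy to verify:

```python
# Success rates from the kidney-stone table above.
a_small, b_small = 81 / 87, 234 / 270    # small stones
a_large, b_large = 192 / 263, 55 / 80    # large stones
a_total, b_total = 273 / 350, 289 / 350  # aggregated

better_in_each_stratum = a_small > b_small and a_large > b_large
better_overall = a_total > b_total
# Treatment A wins in every stratum yet loses in the aggregate,
# because the harder (large-stone) cases were mostly given Treatment A.
```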
36. Pneumonia screening
Rich Caruana’s work at Microsoft.
State-of-the-art models and simple
baselines.
The linear model revealed that asthma
patients would be sent home: they look
low-risk only because they already
receive aggressive treatment.
37. Criminal recidivism prediction.
Same arrest (drug possession).
Left: One non-violent prior offense. High risk.
Right: One violent prior offense. Low risk.
One of them was later arrested 3 more times; the other, not once.
39. Netflix prize
$1 million for best predictive model.
Winning solution: Ensemble of
hundreds of models.
What went into production: Not the
winning solution.
42. Online learning
Nobody used my online learning
algorithm for parameter tuning.
It worked in theory and in simulations,
and everyone liked the idea.
But no one was comfortable with an
algorithm second-guessing human
judgement.
44. Vulcan
Predicted planet based on Mercury’s
orbit and Newtonian physics.
Same methodology that led to the
discovery of Neptune.
Many actual “sightings” of Vulcan.
45. Ignaz Semmelweis
Obstetricians did not wash their hands.
Mortality rate 3x higher than in the midwives' ward.
Showed washing hands greatly reduces
mortality.
Ignored by the establishment: “Doctor
hands are clean”.
46. George Stigler
Nobel prize in economics.
He and other U. of Chicago economists
found a great arbitrage opportunity:
The ton of wheat.
The British ton is not the same as the
American ton.
48. Sally Clark
Two of her babies died of SIDS.
SIDS risk: 1 in 8500.
"1 in 72M chance of two SIDS deaths":
a figure that wrongly assumed the two
deaths were independent.
Found guilty of murder.
49. Data science is hard
Machine learning goes way beyond training predictive models.
Statistics goes way beyond p-values and frequentist hypothesis tests.
A data scientist must understand their data, models, assumptions, production
environment, objectives, and business.
What can be automated by frameworks, tools, and APIs is the easy part.
The hard part is delivering actual value.