Data science workflow
Andrew Gelman
Dept of Statistics and Dept of Political Science
Columbia University, New York
PyData, New York, 28 Nov 2017
The (abridged) model in Stan
parameters {
  real b;
  real<lower=0> sigma_a;
  real<lower=0> sigma_y;
  vector[nteams] a;
}
model {
  a ~ normal(b*prior_score, sigma_a);
  sqrt_dif ~ normal(a[team1] - a[team2], sigma_y);
}
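One way to check a model like this is to simulate fake data from it with known parameter values and see whether a fit recovers them. The following is a minimal Python sketch of that simulation step only, not the author's code. The function name `simulate_worldcup`, the team scores, and the schedule are made up for illustration; the parameter values are the posterior means printed in the fit summary.

```python
import random

def simulate_worldcup(prior_score, games, b, sigma_a, sigma_y, seed=0):
    """Draw team abilities and game outcomes from the two model statements.

    prior_score: one prior score per team (hypothetical inputs)
    games: (team1, team2) index pairs, 0-based here (Stan indexes from 1)
    """
    rng = random.Random(seed)
    # a ~ normal(b * prior_score, sigma_a)
    a = [rng.gauss(b * s, sigma_a) for s in prior_score]
    # sqrt_dif ~ normal(a[team1] - a[team2], sigma_y)
    sqrt_dif = [rng.gauss(a[t1] - a[t2], sigma_y) for t1, t2 in games]
    return a, sqrt_dif

# Made-up inputs, for illustration only.
prior_score = [1.0, 0.5, 0.0, -0.5, -1.0]
games = [(0, 1), (2, 3), (0, 4), (1, 2)]

a, sqrt_dif = simulate_worldcup(prior_score, games,
                                b=0.46, sigma_a=0.14, sigma_y=0.42)
print(a)
print(sqrt_dif)
```

Fitting the Stan model to `sqrt_dif` generated this way and comparing the estimates with the known `b`, `sigma_a`, and `sigma_y` is the fake-data check.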
Fit the model
Inference for Stan model: worldcup_first_try.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
         mean se_mean   sd  25%  50%  75% n_eff Rhat
b        0.46    0.00 0.09 0.40 0.46 0.52  1039 1.00
sigma_a  0.14    0.00 0.07 0.09 0.13 0.19   203 1.01
sigma_y  0.42    0.00 0.05 0.38 0.42 0.46   956 1.00
a[1]     0.35    0.00 0.13 0.27 0.36 0.44  4000 1.00
a[2]     0.39    0.00 0.12 0.31 0.38 0.46  4000 1.00
a[3]     0.43    0.01 0.15 0.33 0.42 0.52   756 1.00
a[4]     0.20    0.01 0.16 0.11 0.22 0.31   966 1.00
a[5]     0.29    0.00 0.13 0.21 0.29 0.36  4000 1.00
. . .
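The n_eff and Rhat columns can be screened mechanically. Below is a small Python sketch, assuming the values copied from the summary above. The function name `flag_parameters` and the two thresholds (effective sample size under 10% of the total draws, Rhat above 1.05) are illustrative assumptions, not rules stated in the talk.

```python
TOTAL_DRAWS = 4000  # from the header: 4 chains x 1000 post-warmup draws

# (n_eff, Rhat) copied from the printed summary.
summary = {
    "b": (1039, 1.00),
    "sigma_a": (203, 1.01),
    "sigma_y": (956, 1.00),
    "a[1]": (4000, 1.00),
    "a[2]": (4000, 1.00),
    "a[3]": (756, 1.00),
    "a[4]": (966, 1.00),
    "a[5]": (4000, 1.00),
}

def flag_parameters(summary, total_draws, min_ratio=0.1, max_rhat=1.05):
    """Return parameter names whose n_eff ratio is low or whose Rhat is high.

    The thresholds are hypothetical defaults chosen for this sketch.
    """
    flagged = []
    for name, (n_eff, rhat) in summary.items():
        if n_eff / total_draws < min_ratio or rhat > max_rhat:
            flagged.append(name)
    return flagged

print(flag_parameters(summary, TOTAL_DRAWS))
```

With these thresholds only `sigma_a` is flagged, because 203 effective draws is about 5% of the 4000 total.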
Graph the estimates
Compare to model fit without prior rankings
Compare model to predictions
After finding and fixing a bug
Data on putts in pro golf
[Figure: probability of success (0 to 1) against distance from hole (0 to 20 feet); 19 data points, each labeled with successes/attempts, listed here in order of increasing distance.]
1346/1443, 577/694, 337/455, 208/353, 149/272, 136/256, 111/240, 69/217, 67/200, 75/237, 52/202, 46/192, 54/174, 28/167, 27/201, 31/195, 33/191, 20/147, 24/152
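The plotted points are the raw success rates. Here is a short Python sketch that recomputes them from the successes/attempts labels, together with a binomial standard error for each. The counts are copied from the slide; the distances of 2 to 20 feet are an assumption based on the 19 points and the 0 to 20 axis, since the slide text does not list them.

```python
import math

# Successes and attempts as labeled on the slide, in the order shown.
y = [1346, 577, 337, 208, 149, 136, 111, 69, 67, 75,
     52, 46, 54, 28, 27, 31, 33, 20, 24]
n = [1443, 694, 455, 353, 272, 256, 240, 217, 200, 237,
     202, 192, 174, 167, 201, 195, 191, 147, 152]

# Assumption: one point per foot, from 2 to 20 feet.
x = list(range(2, 21))

def rate_and_se(successes, attempts):
    """Empirical success rate and its binomial standard error."""
    p = successes / attempts
    se = math.sqrt(p * (1 - p) / attempts)
    return p, se

for xi, yi, ni in zip(x, y, n):
    p, se = rate_and_se(yi, ni)
    print(f"{xi:2d} ft  {yi:4d}/{ni:4d}  p = {p:.3f}  se = {se:.3f}")
```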
What's the probability of making a golf putt?
[Figure: the putting data with a fitted logistic regression curve; x-axis: distance from hole (feet), 0 to 20; y-axis: probability of success, 0 to 1.]
Logistic regression, a = 2.2, b = −0.3
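The fitted curve can be evaluated directly from the two coefficients. This Python sketch assumes the usual parameterization Pr(success) = logit^-1(a + b*x), with x in feet; the slide gives only the rounded values of a and b, so the curve is approximate. The function names are chosen for this example.

```python
import math

def inv_logit(z):
    """Inverse logit: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_putt_prob(x_feet, a=2.2, b=-0.3):
    """Success probability at distance x_feet under the assumed form a + b*x."""
    return inv_logit(a + b * x_feet)

for x_feet in (2, 5, 10, 20):
    print(x_feet, round(logistic_putt_prob(x_feet), 3))
```

Under these rounded coefficients the curve crosses 50% where a + b*x = 0, at roughly 7.3 feet.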
Geometry-based model
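[Figure: geometry of a putt, with the ball (radius r) at distance x from the hole (radius R), and the distribution of the shot angle marked at −2σ, 0, 2σ.]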
Stan code
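data {
  int J;        // number of distances
  int n[J];     // attempts at each distance
  real x[J];    // distance from hole
  int y[J];     // successful putts at each distance
  real r;       // ball radius
  real R;       // hole radius
}
parameters {
  real<lower=0> sigma;
}
model {
  real p[J];
  // arrays do not support elementwise arithmetic, so loop over distances
  for (j in 1:J)
    p[j] = 2*Phi(asin((R-r)/x[j]) / sigma) - 1;
  y ~ binomial(n, p);
}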
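The probability expression in the model block can be evaluated outside Stan to see what a given sigma implies. The sketch below is a Python illustration, not the author's code: `Phi` is the standard normal CDF built from `math.erf`, the radii use the same inch-to-feet conversions as the R script in this deck, and sigma is taken to be in radians. The value 0.0267 passed in the loop is an assumed example input.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

r = (1.68 / 2) / 12   # ball radius in feet
R = (4.25 / 2) / 12   # hole radius in feet

def geometry_putt_prob(x_feet, sigma):
    """2*Phi(asin((R-r)/x) / sigma) - 1, valid for x_feet > R - r."""
    threshold_angle = math.asin((R - r) / x_feet)
    return 2.0 * Phi(threshold_angle / sigma) - 1.0

for x_feet in (2, 5, 10, 20):
    print(x_feet, round(geometry_putt_prob(x_feet, sigma=0.0267), 3))
```

The putt succeeds when the angular error is smaller than the threshold angle asin((R-r)/x), so the probability falls as x grows and rises as sigma shrinks.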
Fit the model
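library(rstan)

# Putting data: distance x, attempts n, successes y
golf <- read.table("golf.txt", header=TRUE, skip=2)
x <- golf$x
y <- golf$y
n <- golf$n
J <- length(y)
r <- (1.68/2)/12   # ball radius in feet (1.68-inch diameter)
R <- (4.25/2)/12   # hole radius in feet (4.25-inch diameter)
# No data argument is passed, so this relies on stan() finding
# J, n, x, y, r, R in the calling environment
fit1 <- stan("golf1.stan")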
Check convergence
> print(fit1)
Inference for Stan model: golf1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
               mean se_mean   sd  25%  50%  75% n_eff Rhat
sigma          0.03    0.00 0.00 0.03 0.03 0.03  1692    1
sigma_degrees  1.53    0.00 0.02 1.51 1.53 1.54  1692    1
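The two rows appear to be the same quantity in different units. The abridged Stan code does not show how `sigma_degrees` is defined, so the relation sigma_degrees = sigma * 180/pi used below is an assumption; the arithmetic is consistent with the printed values.

```python
import math

sigma_degrees = 1.53                          # posterior mean from the summary
sigma_radians = math.radians(sigma_degrees)   # assumed relation: degrees * pi / 180

print(round(sigma_radians, 4))   # about 0.0267
print(round(sigma_radians, 2))   # rounds to the 0.03 shown for sigma
```

So the 0.03 printed for sigma is a two-decimal rounding of roughly 0.0267 radians, which is why its reported sd shows as 0.00.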
What's the probability of making a golf putt?
[Figure: the putting data with the fitted geometry-based curve; x-axis: distance from hole (feet), 0 to 20; y-axis: probability of success, 0 to 1.]
Geometry-based model, sigma = 1.5
Two models fit to the golf putting data
[Figure: the putting data with both fitted curves; x-axis: distance from hole (feet), 0 to 20; y-axis: probability of success, 0 to 1.]
Logistic regression, a = 2.2, b = −0.3
Geometry-based model, sigma = 1.5
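The visual comparison can be backed by a rough numerical one. The Python sketch below scores both curves against the observed rates using mean absolute error and the binomial log-likelihood (up to a constant). It rests on several assumptions: distances of 2 to 20 feet, the logistic form logit^-1(a + b*x) with the rounded coefficients from the slide, and sigma = 1.53 degrees from the fit summary. Because the coefficients are rounded, the numbers are illustrative only and are not a substitute for refitting both models.

```python
import math

# Successes and attempts as labeled on the data slide.
y = [1346, 577, 337, 208, 149, 136, 111, 69, 67, 75,
     52, 46, 54, 28, 27, 31, 33, 20, 24]
n = [1443, 694, 455, 353, 272, 256, 240, 217, 200, 237,
     202, 192, 174, 167, 201, 195, 191, 147, 152]
x = list(range(2, 21))   # assumption: one point per foot, 2 to 20 feet

r = (1.68 / 2) / 12      # ball radius in feet
R = (4.25 / 2) / 12      # hole radius in feet

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_logistic(xi, a=2.2, b=-0.3):
    """Assumed logistic form with the rounded slide coefficients."""
    return 1.0 / (1.0 + math.exp(-(a + b * xi)))

def p_geometry(xi, sigma_degrees=1.53):
    """Geometry-based model, with sigma converted from degrees to radians."""
    sigma = math.radians(sigma_degrees)
    return 2.0 * Phi(math.asin((R - r) / xi) / sigma) - 1.0

def mean_abs_error(p_fn):
    """Average gap between observed rate and model probability."""
    return sum(abs(yi / ni - p_fn(xi)) for xi, yi, ni in zip(x, y, n)) / len(x)

def binomial_loglik(p_fn):
    """Binomial log-likelihood, dropping the constant binomial coefficients."""
    total = 0.0
    for xi, yi, ni in zip(x, y, n):
        p = p_fn(xi)
        total += yi * math.log(p) + (ni - yi) * math.log(1.0 - p)
    return total

for name, fn in (("logistic", p_logistic), ("geometry", p_geometry)):
    print(name, round(mean_abs_error(fn), 4), round(binomial_loglik(fn), 1))
```

The geometry-based model has one parameter and the logistic regression has two, so a better score for the geometry model is not explained by extra flexibility.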
Birthdays!
The published graphs show data from 30 days in the year
[Figure: relative number of births (y-axis 60 to 120), four panels.
 Trends, 1970 to 1988: slow trend, fast non-periodic component, mean.
 Day of week effect, Mon to Sun: curves for 1972, 1976, 1980, 1984, 1988.
 Seasonal effect, Jan to Dec: curves for 1972, 1976, 1980, 1984, 1988.
 Day of year effect, Jan to Dec, with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas.]
[Figure: relative number of births (y-axis 60 to 120), four panels.
 Day of week effect, Mon to Sun: curves for 2002, 2006, 2010, 2014.
 Seasonal effect, Jan to Dec: curves for 2002, 2006, 2010, 2014.
 Day of year effect, Jan to Dec, with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas.
 Trends, 2000 to 2014: slow trend, fast non-periodic component, mean.]
[Figure: day of year effect, Jan to Dec (y-axis 50 to 120), with labeled days: New year, Valentine's day, Leap day, April 1st, Memorial day, Independence day, Labor day, 9/11, Halloween, Thanksgiving, Christmas, and the 13th day of each month.]
Xbox estimates, adjusting for demographics
Xbox estimates, adjusting for demographics and partisanship
Data from 2016
Some ideas in data science workflow
Data and information
Replication
Fake-data simulation (or statistical theory)
Comparing predictions to data
The network of models
