Finding a Regression Model to Predict the
Distance of a Golf Shot
Scott Naleway
Actuarial Science Student, Illinois State University, Normal, Illinois, email: srnalew@ilstu.edu
Abstract Golf is becoming an increasingly
popular sport in America and around the
world. However, for many U.S. states and
foreign countries, golf is a sport that can
be enjoyed for only a few months out
of the year. The computer technology of
today has offered a solution to this
problem: indoor golf simulators. This paper
proposes a linear regression model that
would be used in these indoor simulators to
predict the distance of any golf shot. The
model uses six predictor variables: club loft,
ball speed, launch angle, spin RPMs, side
spin, and a wind factor, to predict the
forward distance of a golf shot. The results
indicate that the model is reasonably
accurate for shots up to around 200 yards.
However, due to many limiting factors, the
model would not be up to par with those of
today’s golf simulators.
Introduction Due to the ever-increasing
level of technology and innovation, the
quality and functionality of today’s golf
simulators can give a golfer a near-real-life
experience. This study aims to find a model
that can predict total forward distance of a
golf shot based on six predictor variables:
club loft, ball speed, launch angle, spin
RPMs, side spin, and a wind factor. Forward
distance is measured with a Bushnell Tour
Z6 golf rangefinder. The lofts of the Titleist
690.CB irons used are found at Titleist.com.
The two Titleist-Vokey wedges are labeled
54 and 60, for their respective lofts. Ball
speed, launch angle, spin RPMs, and side
spin are all measured using a Foresight
Sports GC2 Smart Camera System launch
monitor (see Figure 1) available for public
use at the All Seasons Golf Learning Center.
Finally, the wind factor is estimated and
derived from a simple formula discussed
later in this paper.
Figure 1 The GC2 Smart Camera Launch Monitor
From “Guide to Launch Monitors” by Lucy Locket,
Feb. 3, 2015, GolfALot. Available at
http://www.golfalot.com/equipment-news/guide-to-launch-monitors-3070.aspx
All of the data on the golf shots is
collected in one day. A Microsoft Excel
spreadsheet is used to record and organize
the data for statistical testing. All of the
statistical testing and the extraction of the
regression model is performed using the
statistical computing software R. The data is
split into two categories: Model-Building
data and a holdout sample used for Model
Validation. The initial regression model and
most of the testing are derived from the
Model-Building data. The model is then tested
on the Model-Validation data for consistency
and prediction ability. Finally, the two data
sets are combined and a final model is
produced.
Limitations Due to the limited
availability of resources for an ISU
undergraduate in Bloomington-Normal, the
goal of this study is not to find a model that
will exceed or improve those models used
in today’s golf simulators, but one that will
simply be able to reasonably predict
distance. Limiting factors and restrictions
for this study are as follows:
1. All data is collected on one day. The
temperature does not vary much,
the wind direction does not vary
much, and we are limited in the
number of balls we can hit because
we are deep into an Illinois winter,
the five-month off-season for
Midwestern golfers. In other words, my
brother and I are out of shape.
2. The range balls used in the study
were inconsistent. There are three
different types of range balls, some
brand new, some worn and without
dimples. These inconsistencies in
the balls account for a great deal of
error in the data.
3. The distance of each shot is
measured with a Bushnell Tour Z6
golf rangefinder. It uses an infrared
laser to measure distance. It is
accurate to within half of one yard
from 5 to 200 yards. It can give this
accurate reading of distance to an
object, pin, tree, wall, and even the
ground. However, the response (Y)
variable is total forward distance.
For example, one golfer hits a 200
yard shot at a 30-degree angle to
the right. Another golfer hits a 200
yard shot right down the middle of
the fairway. The first golfer is not as
close to the pin as the second golfer.
This is why forward distance is the
most crucial of the two distances.
The rangefinder can only accurately
measure the distance from tee to
ball; therefore, the total forward
distance needs to be estimated. If a
spotter could be out on the range,
this would remove the need for
estimation. However, due to safety
and liability reasons, the All Seasons
Golf Learning Center does not
permit anyone to walk out onto the
range for any reason.
4. The range itself has an upward slope
with negative convexity and many
small hills and valleys. This causes
two inconveniences. First, due to
the negative convexity of the range,
the accurate measurement of
distance beyond 200 yards becomes
nearly impossible. Due to this
hindrance, the study only includes
shots from the 3-iron to the 60-
degree lob wedge. No woods or
drivers are used. This is a serious
problem because woods are
constructed with different materials
and for different purposes than
irons. Generally speaking, the shot
distance between clubs for any
golfer rapidly increases as loft
decreases. The distance between a
9-degree driver and a 13-degree 3-
wood is much greater than the
distance between a 48-degree
pitching wedge and a 44-degree 9-
iron. This missing data from the
study will have a substantial,
negative impact on the overall
usability of the model. Second, any
ball that ends up in a valley behind a
small hill is no longer visible to the
rangefinder, and therefore its
distance is estimated.
5. Due to the inability to locate an
anemometer, there is no accurate
way to measure wind speed and
direction.
6. Only one set of irons is used in the
study. This could lead to a biased
data set.
7. Only two golfers are used in the data
collection. This could lead to a
biased data set.
Data Collection All of the data from 100
shots is recorded. Since all of the data has
to be collected in one day, data from the first
70 shots is used for the model-building data
set and the remaining 30 shots are held out
for model validation. The initial plan was to
hit 120 shots and have the model-building
data set and the validation data set be an
equal 60 each; however, encroaching fatigue
and soreness from hitting 100 golf balls for
the first time in months forced us to adjust
the experiment. In addition to the 100 golf
shots, a gentleman from the pro shop hit
five shots and I hit four more shots to be
used for prediction intervals after the full
and final model is acquired.
On the day of the data collection,
the weather was brisk. It was about 50
degrees and the wind averaged about 15
MPH WNW, according to the Weather
Channel App. The driving range faces due
North, so from a golfer’s standpoint, the
wind was blowing to the right and “a bit in
our faces.” The wind remained fairly
consistent with some gusts here and there
and for about the last 15 shots or so, the
wind speed decreased substantially. The
process of hitting golf balls, measuring and
collecting data took three hours to
complete. In order to conserve energy,
Mark and I took turns, alternating every
six shots. I always measured the
distance to the ball for consistency. Data
from the rangefinder, launch monitor and
wind estimation is collected for every shot.
A total of eleven different clubs are
used: Eight Titleist 690.CB forged cavity
back irons (3 – PW), two Titleist Vokey
wedges (54 and 60 degree), and an 8
degree Callaway X-Hot titanium composite
driver. The driver was only used for two
shots as part of the prediction data, due to
the inability to get consistent, accurate
distance measurements of longer shots at
the range.
A very important factor in the
interpretation of the final model is that two
of the seven total variables are estimated.
First, the response variable, forward
distance, is estimated (Limitation 3 above).
Since the rangefinder can only accurately
measure the total distance travelled by the
ball, the resulting forward distance is
estimated by the Pythagorean Theorem
(see Figure 2).
Figure 2 The Pythagorean Theorem
From “Pythagorean Theorem Calculator,”
NCalculators. Available at
http://ncalculators.com/number-conversion/pythagoras-theorem.htm
C is the distance measured by the
rangefinder to the ball. A is the lateral
distance from the ball to the center line of
the range. B is the resulting forward
distance, B = sqrt(C^2 - A^2). In order to solve
this equation for every shot, A needed to be
estimated. I estimated this value for every
shot for consistency. This factor will account
for some of the error in the model.
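The forward-distance calculation described above can be sketched in R as follows. This is only an illustration under the Figure 2 labeling (C measured, A estimated, B forward); the function name `forward_distance` and the example shot are hypothetical and do not come from the study's data.

```r
# Forward distance B from the rangefinder distance C and the estimated
# lateral offset A (both in yards), using B = sqrt(C^2 - A^2).
forward_distance <- function(measured, offset) {
  # The lateral offset can never exceed the measured (hypotenuse) distance.
  stopifnot(all(abs(offset) <= measured))
  sqrt(measured^2 - offset^2)
}

# Hypothetical shot: 200 yards to the ball, 30 yards off the center line.
forward_distance(200, 30)
```

Because A is estimated by eye, any error in A passes directly into B, and the effect grows as the shot lands farther from the center line.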
Second, the wind factor predictor
variable is a concocted value combining the
wind magnitude and direction. The final
value recorded as the sixth predictor
variable is:
sin(wind direction) x (wind speed)
This value has two sources of estimation
and, thus, two sources of error. The use of
an anemometer would have eliminated
these sources of error; however, due to
limited resources, one could not be
acquired. For each shot, I estimated both
wind speed and direction myself for
consistency.
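A minimal R sketch of the wind factor formula above is given here. The paper does not state the angle convention or units, so this sketch assumes the direction is entered in degrees and the speed in MPH; the function name `wind_factor` is hypothetical. Note that R's `sin()` works in radians, so the angle is converted first.

```r
# Wind factor as defined in the text: sin(wind direction) x (wind speed).
# ASSUMPTION: direction is given in degrees and speed in MPH; the reference
# axis for the angle is not specified in the paper.
wind_factor <- function(direction_deg, speed_mph) {
  sin(direction_deg * pi / 180) * speed_mph
}

# Hypothetical example: a 15 MPH wind at 90 degrees gives the full wind speed,
# while a wind at 0 degrees contributes nothing.
wind_factor(c(90, 0), 15)
```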
Any shot that goes out of the range
or does not make it into the measurable
range area is immediately discarded. This
is another possible source of error in
the model. In all, data on 100 measurable
shots is collected. An additional 9 shots,
hit by an employee of the facility and me,
are used to test the prediction ability of
the final model.
Data Analysis
1. Model Building The first 70 shots
comprise the Model-Building data set. The
statistical computing software, R, is used to
conduct statistical tests on the data to
determine the quality of the data and the
capacity to extract a useful regression
model.
The variables are as follows:
Y = Forward Distance
X1 = Loft
X2 = Ball Speed
X3 = Launch Angle
X4 = Spin RPM
X5 = Side Spin
X6 = Wind Factor
Model selection procedures are conducted
to extract the best model. Three procedures
are performed based on three criteria: AIC,
adjusted R-squared, and Mallows' Cp.
Step: AIC=303.01
Y = X2 + X3 + X4 + X5 + X6

         Df  Sum of Sq     RSS     AIC
+ X1      1     250.32  4222.7  300.98
<none>                  4473.0  303.01

Step: AIC=300.98
Y = X1 + X2 + X3 + X4 + X5 + X6
Figure 3 Forward Step-wise AIC Model Selection
Figure 4 Adjusted R-squared
Figure 5 Mallows' Cp
All three tests conclude that the best model
is the original hypothesized model with all
six predictors. Furthermore, all three tests
indicate that the second best model
involves removing X1. This suggests that
club loft may not significantly add to the
model and will need to be further
examined.
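Forward stepwise selection by AIC, as in Figure 3, can be sketched with base R's `step()`. The study's data is not reproduced here, so the sketch runs on synthetic stand-in data; the coefficients, ranges, and noise level used to simulate it are assumptions made purely for illustration.

```r
set.seed(1)
n <- 70
# Synthetic stand-in data; column names follow the paper's X1..X6 and Y.
d <- data.frame(X1 = runif(n, 20, 60),      X2 = runif(n, 60, 130),
                X3 = runif(n, 10, 35),      X4 = runif(n, 3000, 10000),
                X5 = runif(n, -1500, 1500), X6 = runif(n, 0, 15))
d$Y <- 20 + 1.5 * d$X2 - 0.5 * d$X3 - 0.002 * d$X4 + rnorm(n, sd = 8)

# Start from the intercept-only model and add predictors while AIC improves.
null_fit <- lm(Y ~ 1, data = d)
sel <- step(null_fit, scope = Y ~ X1 + X2 + X3 + X4 + X5 + X6,
            direction = "forward", trace = 0)
formula(sel)
```

Setting `trace = 1` instead prints each step in the same layout as Figure 3.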
A couple of tests are conducted on
the selected model to check the normality
and constancy of variance of the residuals. A
normal quantile-quantile plot of the
residuals shows that they appear to be
normally distributed with a slight deviation
in the tails. The Shapiro-Wilk test for
normality, with a p-value of 0.342, does not
reject the hypothesis that the residuals are
normally distributed.
Shapiro-Wilk normality test
data: residuals(reg)
W = 0.98041, p-value = 0.342
Figure 6 Normal Q-Q Plot with Shapiro-Wilk normality test
results
Since the test gives no evidence against
normally distributed residuals, no log or
exponential transformations are conducted.
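The normality check reported in Figure 6 corresponds to a call of the form `shapiro.test(residuals(reg))`. A minimal sketch on synthetic data follows; the single predictor and the simulated values are assumptions for illustration only, and only the object name `reg` is taken from the output above.

```r
set.seed(2)
x <- runif(70, 60, 130)                 # stand-in predictor
y <- 20 + 1.5 * x + rnorm(70, sd = 8)   # stand-in response
reg <- lm(y ~ x)

# Shapiro-Wilk test on the residuals; H0 is that they are normally distributed.
sw <- shapiro.test(residuals(reg))
sw$statistic
sw$p.value   # a p-value above 0.05 means normality is not rejected at the 5% level
```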
A residual plot against the fitted
values (see Figure 7) shows that the
residuals are relatively symmetrical about
zero, indicating that the model is most likely
unbiased. However, the slight megaphone
shape of the residuals implies that there
may be a non-constancy of variance issue.
Figure 7 Residual plot against fitted values
Breusch-Pagan test
H0: Error variance is constant
Ha: Error variance increases or decreases as the Xi’s
increase or decrease (non-constancy of error
variance)
data: Y = X1 + X2 + X3 + X4 + X5 + X6
BP = 9.6005, df = 6, p-value = 0.1425
Figure 8 B-P test results
With a p-value of 0.1425, the Breusch-Pagan
test does not reject constancy of error
variance, indicating that there is no
severe issue here.
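The paper does not say which package or variant of the Breusch-Pagan test produced Figure 8. As one possible illustration, the studentized (Koenker) form of the statistic can be computed in base R by regressing the squared residuals on the predictors and taking n times the R-squared of that auxiliary regression. The data below is synthetic, and the two-predictor model is an assumption made to keep the sketch short.

```r
set.seed(3)
n <- 70
d <- data.frame(x1 = runif(n, 60, 130), x2 = runif(n, 10, 35))
d$y <- 20 + 1.5 * d$x1 - 0.5 * d$x2 + rnorm(n, sd = 8)
fit <- lm(y ~ x1 + x2, data = d)

# Auxiliary regression: squared residuals on the same predictors.
d$e2 <- residuals(fit)^2
aux <- lm(e2 ~ x1 + x2, data = d)

# Studentized statistic n * R^2, compared with a chi-square distribution
# whose degrees of freedom equal the number of predictors (2 here, 6 in Figure 8).
bp <- n * summary(aux)$r.squared
p_value <- pchisq(bp, df = 2, lower.tail = FALSE)
c(BP = bp, p.value = p_value)
```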
A Scatter Plot Matrix of all of the
variables, included in the Appendix, shows
the correlation among all the variables. It is
evident that the predictor that has the
greatest influence on the distance is ball
speed. Launch angle and loft, intuitively,
have a fairly strong negative correlation
with distance. The plot of spin RPM
against distance vaguely resembles a
parabola. With a little more insight into this
data it becomes clear that this should be
the case. Weakly struck golf balls will
naturally have less spin than a more
powerful shot, suggesting that the distance
will usually be shorter. And the most typical
way to achieve extremely high spin is with a
medium-to-high-lofted club (say, 7-iron
through 60-degree lob wedge) combined
with a powerful swing, which generally
produces a shorter shot as well. It follows
that lower-lofted clubs struck by a powerful
swing will produce the longest shots and,
most likely, a mid-range spin level.
The Scatter Plot Matrix also suggests
that there may be a source for
multicollinearity between some of the
predictor variables. There seems to be a
strong, positive correlation between loft
and launch angle, with the exception of
about three points. These points
correspond to shots with the lob wedge,
the highest lofted club used in the study,
that were bladed, or thinned. In golf, to
blade the golf ball means to hit the ball off
the leading edge of the sole, producing a
much lower and more powerful shot than
intended. These 3 shots ended up travelling
over 160 yards, whereas the mean of the
other nine shots from the lob wedge is only
67.75 yards. Regardless, these points are not
treated as outliers because their residuals
are well within the bounds of the
influence measures. This is because the
other data collected by the launch monitor
for these shots match up very closely with
those of other shots of similar distance. In
order to confirm that
there is no significant multicollinearity
between launch angle and loft, a VIF
(Variance Inflation Factor) test is
performed.
  Variables  VIF
1 loft       3.058256
2 ballspeed  4.615723
3 launch     4.085240
4 spinrpm    1.755950
5 sidespin   1.363794
6 wind       1.141324
Figure 9 VIF test results
No VIF is greater than 10, indicating that
there is no serious multicollinearity
problem. All six predictors remain in the
model.
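The VIF for a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all of the others. The paper does not say which R function produced Figure 9, so the sketch below computes the values directly in base R on synthetic data; the helper name `vif_manual` and the simulated loft and launch values are hypothetical.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(4)
loft <- runif(70, 20, 60)
X <- data.frame(loft   = loft,
                launch = 0.5 * loft + rnorm(70, sd = 3),  # built to correlate with loft
                wind   = runif(70, 0, 15))                # unrelated to the others
vif_manual(X)
```

In this simulated example the two correlated predictors show elevated VIFs while the unrelated one stays near 1, which mirrors the pattern for loft and launch angle in Figure 9.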
A lesson learned from this small
piece of data is that, for future experiments,
while loft and launch angle are most likely
correlated in general, the strength of this
correlation relies heavily on how the ball is
struck. For a pro, launch angle and loft are
ideally correlated; for the average or
less-skilled golfer, they may not be
correlated at all.
This will need to be addressed in any future
studies. A Regression Summary of the
original model is provided (see Figure 10).
2. Model Validation In order to assess the
prediction ability and the bias of the model
extracted from the Model-Building data set,
it is cross-validated with the Model-Validation
data set. The Model-Validation set consists
of data on 30 shots with the same ten clubs
used in the original model. This data set is a
holdout sample collected at the same time
as the original sample. It is generally
preferred that the holdout sample be the
same size as the original sample; however,
due to the onset of fatigue and the failing
light of the winter day, the holdout sample is
cut short to only 30 shots.
Figure 10 Original Model Regression Summary
Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  23.0475732  17.0031205    1.355  0.180101
loft         -0.2741271   0.1418499   -1.933  0.057795 .
ballspeed     1.5551991   0.1152407   13.495   < 2e-16 ***
launch       -0.4874853   0.1921718   -2.537  0.013679 *
spinrpm      -0.0021457   0.0005989   -3.583  0.000663 ***
sidespin      0.0037499   0.0017322    2.165  0.034197 *
wind          0.9046697   0.3622216    2.498  0.015130 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.187 on 63 degrees of freedom
Multiple R-squared: 0.9535, Adjusted R-squared: 0.9491
F-statistic: 215.3 on 6 and 63 DF, p-value: < 2.2e-16
The MSPR (Mean Squared Prediction
Error) of the validation sample is obtained
by applying the betas derived from the
original sample to the validation data and
averaging the squared prediction errors. It is
then compared with the MSE (Mean Squared
Error) of the original sample. The results are
as follows:
> # xx: presumably the validation-set design matrix; betas: model-building coefficients
> yhats <- xx %*% betas
> # mean of the squared prediction errors on the validation sample
> MSPR <- sum((Y - yhats)^2) / length(Y)
> MSPR
[1] 90.64802
Figure 10 R code for extraction of MSPR
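The MSPR of the validation sample is 90.648
and the MSE of the original data is 67.027.
Taking square roots gives a root mean
squared prediction error of 9.520925 for the
validation sample and a residual standard
error of 8.187002 for the original sample. The
MSPR is about 35% larger than the MSE,
which is close enough to conclude that the
two data sets are comparable; therefore, the
prediction ability of the original model is
sufficient.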
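The whole holdout comparison can be sketched end to end in base R. The data here is synthetic and uses only two predictors, both assumptions made for brevity; the 70/30 split and the definitions of MSE (residual sum of squares over residual degrees of freedom) and MSPR (mean squared prediction error on the holdout) follow the text.

```r
set.seed(5)
n <- 100
d <- data.frame(ballspeed = runif(n, 60, 130), launch = runif(n, 10, 35))
d$dist <- 20 + 1.5 * d$ballspeed - 0.5 * d$launch + rnorm(n, sd = 8)

build <- d[1:70, ]     # model-building set: first 70 shots
valid <- d[71:100, ]   # holdout set: remaining 30 shots

fit  <- lm(dist ~ ballspeed + launch, data = build)
mse  <- sum(residuals(fit)^2) / df.residual(fit)                 # MSE of the building fit
mspr <- mean((valid$dist - predict(fit, newdata = valid))^2)     # MSPR on the holdout
c(MSE = mse, MSPR = mspr, ratio = mspr / mse)
```

A ratio near 1 suggests the model predicts new shots about as well as it fits the shots it was built on.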
Results Before the final regression model
is produced, the original data set and the
validation data set are combined. One more
multiple linear regression is performed on
all 100 observations and checked against
the original model for consistencies. The
results are as follows:
Figure 11 Cross Validation Data
The two regression models are consistent.
The betas are similar. The residual standard
errors are close as well. However, X1 (loft)
loses significance in the full model. If it is
removed, the regression summary looks like
this:
Coefficients:
       Estimate  Std. Error  t value  Pr(>|t|)
(Int)   -2.1802    11.75348   -0.185    0.8532
x2       1.7566     0.07859   22.350   < 2e-16 ***
x3      -0.4970     0.16654   -2.984   0.00362 **
x4      -0.0026    0.000531   -5.041  2.24e-06 ***
x5       0.0043    0.001546    2.806   0.00610 **
x6       0.9279    0.332917    2.787   0.00643 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.47 on 94 degrees of freedom
Multiple R-squared: 0.953, Adjusted R-squared: 0.9504
F-statistic: 380.8 on 5 and 94 DF, p-value: < 2.2e-16
Figure 12 Final Model Regression Summary
The residual standard error and R-squared
are hardly affected, and more influence is
put on ball speed. It appears that loft does
not contain a significant amount of additional
information. As discussed earlier, mis-struck
shots can produce a wide range of
ball speeds, launch angles and spin rates,
completely inconsistent with those of a
well-struck shot. Therefore, the important
information truly lies in the remaining
predictors.
The final model is:
Distance =
-2.1802 +
(1.7566 x Ball Speed) –
(.497 x Launch Angle) –
(.0026 x Spin Rpm) +
(.0043 x Side Spin) +
(.9279 x Wind Factor)
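As a worked illustration, the final model can be written as a small R function using the coefficients above. The function name `predict_distance` and the example inputs are hypothetical; the inputs are assumed to be in the same units the launch monitor and wind formula produced in the study.

```r
# Final model from the text, with the published (rounded) coefficients.
predict_distance <- function(ball_speed, launch_angle, spin_rpm,
                             side_spin, wind_factor) {
  -2.1802 +
    1.7566 * ball_speed -
    0.497  * launch_angle -
    0.0026 * spin_rpm +
    0.0043 * side_spin +
    0.9279 * wind_factor
}

# Hypothetical shot: ball speed 120, launch angle 18, 6000 RPM of spin,
# no side spin, and no wind.
predict_distance(120, 18, 6000, 0, 0)
```

For this hypothetical shot the terms are 210.792 from ball speed, -8.946 from launch angle, and -15.6 from spin, which with the intercept gives about 184.07 yards.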
Further Prediction In order to properly
test the full model, further prediction
capability is assessed. On the day of the
data collection, nine extra shots were
recorded specifically for the purpose of
using the final model to test predictive
abilities. These shots were hit by an
employee at the All Seasons Golf Learning
Center and me. Two of the prediction
shots were hit by me with a driver, although
the data used to fit the model did not include
any shots from a fairway wood or a driver.
Prediction Intervals and Actual Distance
Shot  fit       lwr       upr       Actual Distance
1     149.2101  131.6304  166.7898  135.7239846
2     156.401   138.99    173.812   153.2677396
3     197.6068  179.3682  215.8455  214.2988567
4     186.2827  168.5112  204.0542  196.5400722
5     192.876   175.466   210.2861  205.7571384
6     148.2855  130.8548  165.7163  153.4503177
7     173.9984  156.8719  191.1248  180.2747903
8     235.1113  217.0663  253.1564  253.929124
9     225.8325  207.9475  243.7175  247.3459116
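The interval coverage can be checked directly from the nine rows above. The short R sketch below uses only the values reported in the table; the variable names are illustrative.

```r
# Values copied from the prediction table (fit, lower, upper, actual), in order.
fit    <- c(149.2101, 156.401, 197.6068, 186.2827, 192.876,
            148.2855, 173.9984, 235.1113, 225.8325)
lwr    <- c(131.6304, 138.99, 179.3682, 168.5112, 175.466,
            130.8548, 156.8719, 217.0663, 207.9475)
upr    <- c(166.7898, 173.812, 215.8455, 204.0542, 210.2861,
            165.7163, 191.1248, 253.1564, 243.7175)
actual <- c(135.7239846, 153.2677396, 214.2988567, 196.5400722, 205.7571384,
            153.4503177, 180.2747903, 253.929124, 247.3459116)

inside <- actual >= lwr & actual <= upr
sum(inside)              # 7 of the 9 actual distances fall inside their interval
which(!inside)           # the misses are the two longest shots (rows 8 and 9)
round(actual - fit, 1)   # positive values mean the model under-predicts
```

Both misses are above the upper bound, and the prediction errors are positive for every shot whose actual distance exceeds about 150 yards.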
Conclusions The final model has an
adjusted R-squared of .9504. The Residual
Standard Error of 8.47 yards is much higher
than it should be for a precise model. In golf,
eight yards could mean the difference
between “putting for birdie” and “in the
creek.” The prediction data show that this
model is reasonably accurate. Seven of the
nine actual distances lie within the prediction
intervals. However, as the true distance gets
higher, the model severely underestimates
the distance. This shows that the model is
most likely biased for longer shots. This is a
symptom of only collecting data on shots
that stayed under 200 yards. One flaw that is
not
evident in the data, but that is intuitive, is
that the wind factor would likely have very
little significance in any real applications of
the model. This is a result of the wind
primarily only blowing in one direction for
the entirety of the data collection.
There are many improvements that
could be made to this study to produce
much more accurate results. If the study
were to be repeated under the conditions
of more time, warmer weather and better
funding, a much better model could be
achieved. A wider variety of wind conditions
could be analyzed to give the wind predictor
more significance and more information. A
larger sample of participants would lead to
much less biased data. A better distance
measuring technique would help decrease
the estimation error of the response
variable. The study shows promise, but
allows much room for improvement.
References
1. “690.CB Irons: Specifications,” Titleist. Available at http://www.titleist.com/golf-clubs/irons/2005-forged-690cb
Appendix I
Figure 12 Final Model Regression Summary

Weitere ähnliche Inhalte

Andere mochten auch (10)

гапченко мария денисовна
гапченко мария денисовнагапченко мария денисовна
гапченко мария денисовна
 
Historia ensayo
Historia  ensayoHistoria  ensayo
Historia ensayo
 
Организация бивуака
Организация бивуакаОрганизация бивуака
Организация бивуака
 
Henry patiño buho - catherine
Henry patiño buho - catherineHenry patiño buho - catherine
Henry patiño buho - catherine
 
[RICARDO JACINTO] [MUSEO VOSTELL MALPARTIDA]
[RICARDO JACINTO] [MUSEO VOSTELL MALPARTIDA][RICARDO JACINTO] [MUSEO VOSTELL MALPARTIDA]
[RICARDO JACINTO] [MUSEO VOSTELL MALPARTIDA]
 
ULRIKE OTTINGER
ULRIKE OTTINGERULRIKE OTTINGER
ULRIKE OTTINGER
 
BAMBIKINA by Corredores de Ideas
BAMBIKINA by Corredores de IdeasBAMBIKINA by Corredores de Ideas
BAMBIKINA by Corredores de Ideas
 
Cáceres Rap en el Foro Provincial 05
Cáceres Rap en el Foro Provincial 05Cáceres Rap en el Foro Provincial 05
Cáceres Rap en el Foro Provincial 05
 
OLOR A COLOR [By Corredores de Ideas]
OLOR A COLOR [By Corredores de Ideas]OLOR A COLOR [By Corredores de Ideas]
OLOR A COLOR [By Corredores de Ideas]
 
[GARROVILLAS] [CONVENTO DE SAN ANTONIO]
[GARROVILLAS] [CONVENTO DE SAN ANTONIO][GARROVILLAS] [CONVENTO DE SAN ANTONIO]
[GARROVILLAS] [CONVENTO DE SAN ANTONIO]
 

Ähnlich wie Golf Regression Paper1

Distance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted ScaleDistance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted Scaletheijes
 
Distance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted ScaleDistance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted Scaletheijes
 
AI Golf: Golf Swing Analysis Tool for Self-Training
AI Golf: Golf Swing Analysis Tool for Self-TrainingAI Golf: Golf Swing Analysis Tool for Self-Training
AI Golf: Golf Swing Analysis Tool for Self-TrainingIRJET Journal
 
Gait Biometrics Attack
Gait Biometrics Attack Gait Biometrics Attack
Gait Biometrics Attack FazleRabbi80
 
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTE
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTEACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTE
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTEijcsit
 
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same Route
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same RouteAccuracy of Garmin GPS Running Watches over Repetitive Trials on the Same Route
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same RouteAIRCC Publishing Corporation
 
Kinematic analysis of shot release of intercollegiate athletes
Kinematic analysis of shot release of intercollegiate athletesKinematic analysis of shot release of intercollegiate athletes
Kinematic analysis of shot release of intercollegiate athletesSports Journal
 
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEEL
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEELPREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEEL
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEELSourav Samanta
 
Golf Tee Simulator 2010
Golf Tee Simulator 2010Golf Tee Simulator 2010
Golf Tee Simulator 2010ronmajor
 
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...IRJET Journal
 
FinalProjectSurveyingfinal
FinalProjectSurveyingfinalFinalProjectSurveyingfinal
FinalProjectSurveyingfinalkai yu chen
 
May internship challenge: Estimating Distance between Two Balls App
May internship challenge: Estimating Distance between Two Balls AppMay internship challenge: Estimating Distance between Two Balls App
May internship challenge: Estimating Distance between Two Balls AppRidge-i, Inc.
 
CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERA CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERA ijaceeejournal
 
CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERACROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERAijaceeejournal
 
Weather Prediction Model using Random Forest Algorithm and Apache Spark
Weather Prediction Model using Random Forest Algorithm and Apache SparkWeather Prediction Model using Random Forest Algorithm and Apache Spark
Weather Prediction Model using Random Forest Algorithm and Apache Sparkijtsrd
 
DNN Project Report (Prathmesh)
DNN Project Report (Prathmesh)DNN Project Report (Prathmesh)
DNN Project Report (Prathmesh)Prathmesh Kumbhare
 
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear Spoiler
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear SpoilerCar’s Aerodynamic Characteristics at High Speed Influenced by Rear Spoiler
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear SpoilerIJRES Journal
 
Total Station by Denis Jangeed.pdf
Total Station by Denis Jangeed.pdfTotal Station by Denis Jangeed.pdf
Total Station by Denis Jangeed.pdfDenish Jangid
 
Flight departure delay prediction
Flight departure delay predictionFlight departure delay prediction
Flight departure delay predictionVivek Maskara
 

Ähnlich wie Golf Regression Paper1 (20)

Distance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted ScaleDistance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted Scale
 
Distance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted ScaleDistance Estimation to Image Objects Using Adapted Scale
Distance Estimation to Image Objects Using Adapted Scale
 
AI Golf: Golf Swing Analysis Tool for Self-Training
AI Golf: Golf Swing Analysis Tool for Self-TrainingAI Golf: Golf Swing Analysis Tool for Self-Training
AI Golf: Golf Swing Analysis Tool for Self-Training
 
Gait Biometrics Attack
Gait Biometrics Attack Gait Biometrics Attack
Gait Biometrics Attack
 
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTE
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTEACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTE
ACCURACY OF GARMIN GPS RUNNING WATCHES OVER REPETITIVE TRIALS ON THE SAME ROUTE
 
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same Route
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same RouteAccuracy of Garmin GPS Running Watches over Repetitive Trials on the Same Route
Accuracy of Garmin GPS Running Watches over Repetitive Trials on the Same Route
 
Kinematic analysis of shot release of intercollegiate athletes
Kinematic analysis of shot release of intercollegiate athletesKinematic analysis of shot release of intercollegiate athletes
Kinematic analysis of shot release of intercollegiate athletes
 
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEEL
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEELPREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEEL
PREDICTION OF TOOL WEAR USING ARTIFICIAL NEURAL NETWORK IN TURNING OF MILD STEEL
 
Golf Tee Simulator 2010
Golf Tee Simulator 2010Golf Tee Simulator 2010
Golf Tee Simulator 2010
 
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
 
FinalProjectSurveyingfinal
FinalProjectSurveyingfinalFinalProjectSurveyingfinal
FinalProjectSurveyingfinal
 
May internship challenge: Estimating Distance between Two Balls App
May internship challenge: Estimating Distance between Two Balls AppMay internship challenge: Estimating Distance between Two Balls App
May internship challenge: Estimating Distance between Two Balls App
 
CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERA CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERA
 
CROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERACROWD ANALYSIS WITH FISH EYE CAMERA
CROWD ANALYSIS WITH FISH EYE CAMERA
 
Weather Prediction Model using Random Forest Algorithm and Apache Spark
Weather Prediction Model using Random Forest Algorithm and Apache SparkWeather Prediction Model using Random Forest Algorithm and Apache Spark
Weather Prediction Model using Random Forest Algorithm and Apache Spark
 
DNN Project Report (Prathmesh)
DNN Project Report (Prathmesh)DNN Project Report (Prathmesh)
DNN Project Report (Prathmesh)
 
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear Spoiler
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear SpoilerCar’s Aerodynamic Characteristics at High Speed Influenced by Rear Spoiler
Car’s Aerodynamic Characteristics at High Speed Influenced by Rear Spoiler
 
Total Station by Denis Jangeed.pdf
Total Station by Denis Jangeed.pdfTotal Station by Denis Jangeed.pdf
Total Station by Denis Jangeed.pdf
 
Optimal_Control_Project
Optimal_Control_ProjectOptimal_Control_Project
Optimal_Control_Project
 
Flight departure delay prediction
Flight departure delay predictionFlight departure delay prediction
Flight departure delay prediction
 

Golf Regression Paper1

  • 1. Finding a Regression Model to Predict the Distance of a Golf Shot Scott Naleway Actuarial Science Student, Illinois State University, Normal, Illinois, email: srnalew@ilstu.edu Abstract Golf is becoming an increasingly popular sport in America and around the world. However, for many U.S states and foreign countries, golf is a sport that can only be enjoyed for just a few months out of the year. The computer technology of today has offered a solution to this problem, indoor golf simulators. This paper proposes a linear regression model that would be used in these indoor simulators to predict the distance of any golf shot. The model uses six predictor variables: club loft, ball speed, launch angle, spin RPMs, side spin, and a wind factor, to predict the forward distance of a golf shot. The results indicate that the model is reasonably accurate for shots up to around 200 yards. However, due to many limiting factors, the model would not be up to par with those of today’s golf simulators. Introduction Due to the ever-increasing level of technology and innovation, the quality and functionality of today’s golf simulators can give a golfer a near-real-life experience. This study aims to find a model that can predict total forward distance of a golf shot based on six predictor variables: club loft, ball speed, launch angle, spin RPMs, side spin, and a wind factor. Forward distance is measured with a Bushnell Tour Z6 golf rangefinder. The lofts are the Titleist 690.CB irons used are found at Titleist.com. The two Titleist-Vokey wedges are labeled 54 and 60, for their respective lofts. Ball speed, launch angle, spin RPMs, and side spin are all measured using a Foresight Sports GC2 Smart Camera System launch monitor (see Figure 1) available for public use at the All Seasons Golf Learning Center. Finally, the wind factor is estimated and derived from a simple formula discussed later in this paper. 
Figure 1 The GC2 Smart Camera Launch Monitor. From “Guide to Launch Monitors” by Lucy Locket, Feb. 3, 2015, GolfALot. Available at http://www.golfalot.com/equipment-news/guide-to-launch-monitors-3070.aspx
  • 2. All of the data on the golf shots is collected in one day. A Microsoft Excel spreadsheet is used to record and organize the data for statistical testing. All of the statistical testing and the extraction of the regression model are performed using the statistical computing software R. The data is split into two sets: the Model-Building data and a holdout sample used for Model Validation. Most of the testing is performed on the Model-Building data, from which the initial regression model is derived. That model is then tested on the Model-Validation data for consistency and prediction ability, and finally the two data sets are combined and a final model is produced.
Limitations Due to the limited availability of resources for an ISU undergraduate in Bloomington-Normal, the goal of this study is not to find a model that will exceed or improve on the models used in today’s golf simulators, but one that will simply be able to reasonably predict distance. Limiting factors and restrictions for this study are as follows:
1. All data is collected on one day. The temperature does not vary much, the wind direction does not vary much, and we are limited in the number of balls we can hit because we are deep into an Illinois winter, the five-month off-season for Midwestern golfers. In other words, my brother and I are out of shape.
2. The range balls used in the study are inconsistent. There are three different types of range balls, some brand new, some worn and without dimples. These inconsistencies in the balls account for a great deal of error in the data.
3. The distance of each shot is measured with a Bushnell Tour Z6 golf rangefinder. It uses an infrared laser to measure distance and is accurate to within half a yard from 5 to 200 yards. It can give this accurate reading of the distance to an object such as a pin, a tree, a wall, or even the ground. However, the response (Y) variable is total forward distance. For example, one golfer hits a 200-yard shot at a 30-degree angle to the right.
Another golfer hits a 200-yard shot right down the middle of the fairway. The first golfer is not as close to the pin as the second golfer. This is why forward distance is the more crucial of the two distances. The rangefinder can only accurately measure the distance from tee to ball; therefore, the total forward distance needs to be estimated. If a spotter could be out on the range, this would remove the need for estimation. However, for safety and liability reasons, the All Seasons Golf Learning Center does not permit anyone to walk out onto the range for any reason.
4. The range itself has an upward slope with negative convexity and many small hills and valleys. This causes two inconveniences. First, due to the negative convexity of the range, the accurate measurement of distance beyond 200 yards becomes nearly impossible. Because of this hindrance, the study only includes shots from the 3-iron to the 60-degree lob wedge. No woods or drivers are used. This is a serious problem because woods are constructed with different materials and for different purposes than
  • 3. irons. Generally speaking, the shot distance between clubs for any golfer rapidly increases as loft decreases. The distance between a 9-degree driver and a 13-degree 3-wood is much greater than the distance between a 48-degree pitching wedge and a 44-degree 9-iron. This missing data will have a substantial, negative impact on the overall usability of the model. Second, any ball that ends up in a valley behind a small hill is no longer visible to the rangefinder, and therefore its distance is estimated.
5. Due to the inability to locate an anemometer, there is no accurate way to measure wind speed and direction.
6. Only one set of irons is used in the study. This could lead to a biased data set.
7. Only two golfers are used in the data collection. This could lead to a biased data set.
Data Collection All of the data from 100 shots is recorded. Since all of the data has to be collected in one day, data from the first 70 shots is used for the model-building data set and the remaining 30 shots are held out for model validation. The initial plan was to hit 120 shots and have the model-building data set and the validation data set be an equal 60 each; however, encroaching fatigue and soreness from hitting 100 golf balls for the first time in months forced us to adjust the experiment. In addition to the 100 golf shots, a gentleman from the pro shop hit five shots and I hit four more shots to be used for prediction intervals after the full and final model is acquired. On the day of the data collection, the weather was brisk. It was about 50 degrees and the wind averaged about 15 MPH WNW, according to the Weather Channel App. The driving range faces due north, so from a golfer’s standpoint, the wind was blowing to the right and “a bit in our faces.” The wind remained fairly consistent with some gusts here and there, and for about the last 15 shots the wind speed decreased substantially.
The process of hitting golf balls and measuring and collecting data takes three hours to complete. In order to conserve energy, Mark and I take turns, alternating every six shots. I always measure the distance to the ball for consistency. Data from the rangefinder, launch monitor and wind estimation is collected for every shot. A total of eleven different clubs are used: eight Titleist 690.CB forged cavity back irons (3 – PW), two Titleist Vokey wedges (54 and 60 degree), and an 8-degree Callaway X-Hot titanium composite driver. The driver is only used for two shots as part of the prediction data, due to the inability to get consistent, accurate distance measurements of longer shots at the range. A very important factor in the interpretation of the final model is that two of the seven total variables are estimated. First, the response variable, forward distance, is estimated (Limitation 3 above). Since the rangefinder can only accurately measure the total distance travelled by the ball, the resulting forward distance is estimated by the Pythagorean Theorem (see Figure 2).
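The forward-distance estimate can be sketched in a few lines. This is a minimal illustration in Python rather than the R used for the study. The function name `forward_distance` and the reading of the two inputs (the rangefinder distance from tee to ball, and the ball's sideways offset from the center line of the range) are assumptions made for the example. The sample numbers reuse the 200-yard shot hit 30 degrees to the right from Limitation 3.

```python
import math

def forward_distance(measured: float, lateral: float) -> float:
    """Forward leg of a right triangle: sqrt(measured^2 - lateral^2)."""
    if lateral < 0 or lateral > measured:
        raise ValueError("lateral offset must be between 0 and the measured distance")
    return math.sqrt(measured ** 2 - lateral ** 2)

# 200-yard shot hit 30 degrees off line: offset = 200 * sin(30 deg) = 100 yards
offset = 200.0 * math.sin(math.radians(30.0))
print(round(forward_distance(200.0, offset), 1))  # about 173.2 yards forward
```

Under these assumptions the 200-yard shot hit off line gains only about 173 yards of forward distance, which is why the two golfers in the example do not end up equally close to the pin.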
  • 4. Figure 2 The Pythagorean Theorem. From “Pythagorean Theorem Calculator,” NCalculators. Available at http://ncalculators.com/number-conversion/pythagoras-theorem.htm
C is the distance measured by the rangefinder to the ball, A is the distance to the center of the range, and B is the resulting forward distance. In order to solve this equation for every shot, A needed to be estimated. I estimated this value for every shot for consistency. This estimation accounts for some of the error in the model. Second, the wind factor predictor variable is a constructed value combining the wind magnitude and direction. The final value recorded as the sixth predictor variable is:
sin(wind direction) x (wind speed)
This value has two sources of estimation and, thus, two sources of error. The use of an anemometer would have eliminated these sources of error; however, due to limited resources, one could not be acquired. For each shot, both wind speed and direction are estimated by myself for consistency. Any shot that goes out of the range or does not make it into the measurable range area is immediately discarded. This is another possible source of error in the model. In all, data on 100 measurable shots is collected. An additional 9 shots are hit by an employee of the facility and myself to be used to test the prediction ability of the final model.
Data Analysis
1. Model Building The first 70 shots comprise the Model-Building data set. The statistical computing software R is used to conduct statistical tests on the data to determine the quality of the data and the capacity to extract a useful regression model. The variables are as follows:
Y = Forward Distance
X1 = Loft
X2 = Ball Speed
X3 = Launch Angle
X4 = Spin RPM
X5 = Side Spin
X6 = Wind Factor
Model selection tests are conducted to extract the best model. Three tests are performed based on three criteria: AIC, adjusted R-squared, and Mallows’s Cp.
Step: AIC=303.01
Y = X2 + X3 + X4 + X5 + X6
         Df  Sum of Sq     RSS     AIC
+ X1      1     250.32  4222.7  300.98
<none>                  4473.0  303.01
Step: AIC=300.98
Y = X1 + X2 + X3 + X4 + X5 + X6
Figure 3 Forward Step-wise AIC Model Selection
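The AIC values in the step output above can be reproduced from the residual sums of squares. The sketch below, written in Python rather than R, assumes the form of AIC that R's step procedure reports for linear models, AIC = n ln(RSS/n) + 2p, with n = 70 model-building shots and p counting the intercept plus the predictors. The function name `step_aic` is made up for the illustration.

```python
import math

def step_aic(rss: float, n: int, n_params: int) -> float:
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * p."""
    return n * math.log(rss / n) + 2 * n_params

n = 70  # shots in the Model-Building data set

# Five predictors (X2..X6) plus the intercept, so p = 6
print(round(step_aic(4473.0, n, 6), 2))
# All six predictors plus the intercept, so p = 7
print(round(step_aic(4222.7, n, 7), 2))
```

Under that assumption the two printed values agree with the 303.01 and 300.98 in Figure 3: adding X1 lowers the RSS by enough to outweigh the penalty for one more parameter, but only by about two AIC units.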
  • 5. Figure 4 Adjusted R-squared
Figure 5 Mallows’s Cp
All three tests conclude that the best model is the original hypothesized model with all six predictors. Furthermore, all three tests indicate that the second-best model is obtained by removing X1. This suggests that club loft may not add significantly to the model and will need to be examined further. Two tests are conducted on the selected model to assess the normality of the residuals and the constancy of the error variance. A normal quantile-quantile plot of the residuals shows that they appear to be normally distributed, with a slight deviation in the tails. The Shapiro-Wilk test for normality, with a p-value of .342, does not reject normality, so the residuals are most likely normally distributed.
Shapiro-Wilk normality test
data: residuals(reg)
W = 0.98041, p-value = 0.342
Figure 6 Normal Q-Q Plot with Shapiro-Wilk normality test results
Since the test indicates that the residuals are most likely normally distributed, no log or exponential transformations are conducted. A residual plot against the fitted values (see Figure 7) shows that the residuals are relatively symmetrical about zero, indicating that the model is most likely unbiased; however, the slight megaphone shape of the residuals implies that there may be a non-constancy of variance issue.
Figure 7 Residual plot against fitted values
Breusch-Pagan test
H0: Error Variance is constant
  • 6. Ha: Error Variance increases or decreases as the Xi’s increase or decrease (non-constancy of error variance)
data: Y = X1 + X2 + X3 + X4 + X5 + X6
BP = 9.6005, df = 6, p-value = 0.1425
Figure 8 B-P test results
The Breusch-Pagan test for Constancy of Error Variance indicates that there is no severe issue here. A Scatter Plot Matrix of all of the variables, included in the Appendix, shows the correlation among all the variables. It is evident that the predictor that has the greatest influence on distance is ball speed. Launch angle and loft, intuitively, have a fairly strong negative correlation with distance. The plot of spin RPM against distance vaguely resembles a parabola. With a little more insight into this data it becomes clear that this should be the case. Weakly struck golf balls will naturally have less spin than a more powerful shot, and their distance will usually be shorter. The most typical way to achieve extremely high spin is with a medium-to-high-lofted club (say, 7-iron through 60-degree lob wedge) combined with a powerful swing, which generally produces a shorter shot as well. It follows that lower-lofted clubs struck with a powerful swing will produce the longest shots and, most likely, a mid-range spin level. The Scatter Plot Matrix also suggests that there may be multicollinearity between some of the predictor variables. There seems to be a strong, positive correlation between loft and launch angle, with the exception of about three points. These points correspond to shots with the lob wedge, the highest-lofted club used in the study, that were bladed, or thinned. In golf, to blade the ball means to hit the ball with the leading edge of the sole, producing a much lower and more powerful shot than intended. These three shots ended up travelling over 160 yards, whereas the mean of the other nine shots from the lob wedge is only 67.75 yards.
Regardless, these points are not treated as outliers because their residuals are well within the bounds of the influence measures. This is because the other data collected by the launch monitor for these shots match up very closely with those of other shots of similar distance. In order to confirm that there is no significant multicollinearity between launch angle and loft, a VIF (Variance Inflation Factor) test is performed.
   Variables       VIF
1  loft       3.058256
2  ballspeed  4.615723
3  launch     4.085240
4  spinrpm    1.755950
5  sidespin   1.363794
6  wind       1.141324
Figure 9 VIF test results
No VIF is greater than 10, indicating that there is no serious multicollinearity problem. All six predictors remain in the model. A lesson learned from this small piece of data is that, for future experiments, while loft and launch angle are most likely generally correlated, the strength of this correlation relies heavily on how the ball is struck. For a
  • 7. pro, launch angle and loft are ideally correlated; for the average or sub-par golfer, they may not be correlated at all. This will need to be addressed in any future studies. A Regression Summary of the original model is provided (see Figure 10).
2. Model Validation In order to assess the prediction ability and the bias of the model extracted from the Model-Building data set, it is cross-validated with the Model-Validation data set. The Model-Validation set consists of data on 30 shots with the same ten clubs used in the original model. This data set is a holdout sample collected at the same time as the original sample. It is generally preferred that the holdout sample be the same size as the original sample; however, due to the onset of fatigue and the failing light of the winter day, the holdout sample is cut short to only 30 shots.
Figure 10 Original Model Regression Summary
Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  23.0475732  17.0031205    1.355  0.180101
loft         -0.2741271   0.1418499   -1.933  0.057795 .
ballspeed     1.5551991   0.1152407   13.495   < 2e-16 ***
launch       -0.4874853   0.1921718   -2.537  0.013679 *
spinrpm      -0.0021457   0.0005989   -3.583  0.000663 ***
sidespin      0.0037499   0.0017322    2.165  0.034197 *
wind          0.9046697   0.3622216    2.498  0.015130 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.187 on 63 degrees of freedom
Multiple R-squared: 0.9535, Adjusted R-squared: 0.9491
F-statistic: 215.3 on 6 and 63 DF, p-value: < 2.2e-16
The MSPR (Mean Squared Prediction Error) of the validation sample is obtained by applying the betas derived from the original sample to the validation data and calculating the resulting mean squared error. This MSPR is then compared to the MSE (Mean Squared Error) of the original sample.
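As a rough illustration of that calculation, the sketch below computes an MSPR in plain Python (the study itself used R). The helper name `mspr` and the three-shot sample are hypothetical and are not data from the study. The last line only checks that an MSE of 67.027 for the model-building fit is consistent with the residual standard error of 8.187 in Figure 10.

```python
import math

def mspr(actual, predicted):
    """Mean squared prediction error: sum of squared errors divided by n."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / len(actual)

# Hypothetical holdout shots (yards), not data from the study
actual = [150.0, 175.0, 120.0]
predicted = [145.0, 180.0, 130.0]
print(mspr(actual, predicted))       # (25 + 25 + 100) / 3 = 50.0

# Taking the square root puts the value back on the scale of yards
print(round(math.sqrt(67.027), 3))   # 8.187, the residual standard error
```

Note that this MSPR divides by the number of holdout shots, whereas the MSE of a fitted regression divides by the residual degrees of freedom, so the two are comparable but not computed identically.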
The results are as follows:
> yhats <- xx %*% betas
> (MSPR <- sum((Y - yhats)^2)/length(Y))
> MSPR
[1] 90.64802
Figure 10 R code for extraction of MSPR
The MSPR of the validation sample is 90.648 and the MSE of the original data is 67.027. Taking square roots, these correspond to residual standard errors of 9.520925 and 8.187002, respectively. It can be concluded that the two data sets are comparable; therefore, the prediction ability of the original model is sufficient.
Results Before the final regression model is produced, the original data set and the validation data set are combined. One more multiple linear regression is performed on all 100 observations and checked against the original model for consistency. The results are as follows:
  • 8. Figure 11 Cross Validation Data
The two regression models are consistent. The betas are similar. The residual standard errors are close as well. However, x1 loses significance in the full model. If it is removed, the regression summary looks like this:
Coefficients:
       Estimate  Std. Error  t value  Pr(>|t|)
(Int)   -2.1802    11.75348   -0.185    0.8532
x2       1.7566     0.07859   22.350   < 2e-16 ***
x3      -0.4970     0.16654   -2.984   0.00362 **
x4      -0.0026    0.000531   -5.041  2.24e-06 ***
x5       0.0043    0.001546    2.806   0.00610 **
x6       0.9279    0.332917    2.787   0.00643 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.47 on 94 degrees of freedom
Multiple R-squared: 0.953, Adjusted R-squared: 0.9504
F-statistic: 380.8 on 5 and 94 DF, p-value: < 2.2e-16
Figure 12 Final Model Regression Summary
The residual standard error and R-squared are hardly affected, and more influence is put on ball speed. It appears that loft does not contain a significant amount of additional information. As discussed earlier, mis-struck shots can produce a wide range of ball speeds, launch angles and spin rates, completely inconsistent with those of a well-struck shot. Therefore, the important information truly lies in the remaining predictors. The final model is:
Distance = -2.1802 + (1.7566 x Ball Speed) - (.497 x Launch Angle) - (.0026 x Spin Rpm) + (.0043 x Side Spin) + (.9279 x Wind Factor)
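The final model can be written as a small function. This Python sketch (the study used R) plugs in the coefficients from the equation above. The function name, the argument names, and the sample inputs are illustrative assumptions, and the units are taken to be whatever the launch monitor and the wind estimate reported in the study.

```python
def predict_forward_distance(ball_speed, launch_angle, spin_rpm, side_spin, wind_factor):
    """Final five-predictor model; coefficients from the final regression summary."""
    return (-2.1802
            + 1.7566 * ball_speed
            - 0.497 * launch_angle
            - 0.0026 * spin_rpm
            + 0.0043 * side_spin
            + 0.9279 * wind_factor)

# Hypothetical shot, not one recorded in the study
yards = predict_forward_distance(ball_speed=120.0, launch_angle=15.0,
                                 spin_rpm=5000.0, side_spin=0.0, wind_factor=0.0)
print(round(yards, 1))
```

Because the model was fitted only to iron and wedge shots that stayed under roughly 200 yards, inputs far outside that range (driver-level ball speeds, for instance) amount to extrapolation and should be read with caution.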
  • 9. Further Prediction In order to properly test the full model, further prediction capability is assessed. On the day of the data collection, nine extra shots were recorded specifically for the purpose of testing the predictive ability of the final model. These shots were hit by an employee at the All Seasons Golf Learning Center and myself. Two of the prediction shots were hit by myself with a driver, although the data used to fit the model did not include any shots from a fairway wood or a driver.
Prediction Intervals and Actual Distance
     fit       lwr       upr   Actual Distance
149.2101  131.6304  166.7898       135.7239846
156.401   138.99    173.812        153.2677396
197.6068  179.3682  215.8455       214.2988567
186.2827  168.5112  204.0542       196.5400722
192.876   175.466   210.2861       205.7571384
148.2855  130.8548  165.7163       153.4503177
173.9984  156.8719  191.1248       180.2747903
235.1113  217.0663  253.1564       253.929124
225.8325  207.9475  243.7175       247.3459116
Conclusions The model has an overall adjusted R-squared of .9511. The residual standard error of 8.47 yards is much higher than it should be for a precise model. In golf, eight yards could mean the difference between “putting for birdie” and “in the creek.” The prediction data show that this model is reasonably accurate. Most of the actual distances (seven of nine) lie within the prediction intervals. However, as the true distance gets larger, the model severely underestimates the distance. This suggests that the model is indeed most likely biased. This is a symptom of only collecting data on shots that stayed under 200 yards. One flaw that is not evident in the data, but that is intuitive, is that the wind factor would likely have very little significance in any real application of the model. This is a result of the wind blowing primarily in one direction for the entirety of the data collection. There are many improvements that could be made to this study to produce much more accurate results.
If the study were to be repeated with more time, warmer weather and better funding, a much better model could be achieved. A greater variety of wind conditions would be analyzed to give the wind predictor more significance and more information. A larger sample of participants would lead to much less biased data. A better distance-measuring technique would help decrease the estimation error of the response variable. The study shows promise, but leaves much room for improvement.
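The coverage claim in the conclusions can be checked directly from the prediction table. The Python sketch below (variable names are illustrative) counts how many of the nine actual distances fall inside their prediction intervals, lists the ones that do not, and counts how often the actual distance exceeds the fitted value. The numbers are rounded copies of the values in the Prediction Intervals table.

```python
# (fit, lwr, upr, actual) for the nine prediction shots, in yards
shots = [
    (149.2101, 131.6304, 166.7898, 135.7240),
    (156.4010, 138.9900, 173.8120, 153.2677),
    (197.6068, 179.3682, 215.8455, 214.2989),
    (186.2827, 168.5112, 204.0542, 196.5401),
    (192.8760, 175.4660, 210.2861, 205.7571),
    (148.2855, 130.8548, 165.7163, 153.4503),
    (173.9984, 156.8719, 191.1248, 180.2748),
    (235.1113, 217.0663, 253.1564, 253.9291),
    (225.8325, 207.9475, 243.7175, 247.3459),
]

# True where the actual distance lies inside its prediction interval
covered = [lwr <= actual <= upr for _, lwr, upr, actual in shots]
# (fit, actual) pairs for the shots that fall outside their interval
misses = [(fit, actual) for (fit, _, _, actual), ok in zip(shots, covered) if not ok]
# Number of shots where the model predicted less than the actual distance
underestimates = sum(actual > fit for fit, _, _, actual in shots)

print(sum(covered), "of", len(shots), "inside the interval")
print("missed:", misses)
print(underestimates, "of", len(shots), "actual distances exceed the fitted value")
```

With these numbers, seven of the nine shots are covered and the two misses are the two longest shots, both under-predicted, which is consistent with the bias at longer distances described in the conclusions.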
  • 10. References
1. “690.CB Irons: Specifications,” Titleist. Available at http://www.titleist.com/golf-clubs/irons/2005-forged-690cb
Appendix I
Figure 12 Final Model Regression Summary