Introduction to Regression Analysis and R
1. Multiple regression in R on Automobile data to predict
Gasoline Mileage
Rachana T. Bhatia - Rutgers University
2. Basics of Regression Analysis
Addressing Model Deviation of regression models
Model selection criterion
Types of regression Model
Introduction to R
Multiple regression (including polynomial regression) on Car Data
3. First step to learn predictive modelling
Statistical technique for investigating and modeling the relationship between
variables
Equation of a straight line: y = β0 + β1 x + ε
ε is a random error term that accounts for the failure of the model to fit the data exactly
x is the explanatory variable and y is the response variable
Regression does not necessarily imply causality
5. Least squares estimation – minimize the sum of squares of the
differences between the observed responses, yi, and the straight line
Fitted values and residuals
Hypothesis testing for the values of the slope and intercept – t-tests
Significant relationship between the variables – reject the null hypothesis that the slope is zero
Alternative approach: p-value
Confidence interval – for the mean response at a given x; reflects the uncertainty in the estimated line
Prediction interval – for a new response yet to be observed at a given x; also includes the
error variance, so it is wider than the confidence interval
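The steps on this slide can be sketched in R as follows. This is a minimal illustration on simulated data (the data, the seed and the true coefficients 2 and 0.5 are assumptions made for the example, not part of the slides).

```r
# Simulated data for illustration only (not from the slides)
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

fit <- lm(y ~ x)          # least squares estimates of beta0 and beta1

coef(summary(fit))        # estimates, standard errors, t values and p-values
confint(fit)              # 95% confidence intervals for intercept and slope

head(fitted(fit))         # fitted values
head(resid(fit))          # residuals = observed - fitted

# Interval for the mean response vs. interval for a new observation at x = 5
new_x    <- data.frame(x = 5)
conf_int <- predict(fit, new_x, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_x, interval = "prediction", level = 0.95)
conf_int
pred_int
```

The prediction interval comes out wider than the confidence interval at the same x, because it also accounts for the variability of the new observation.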
6. Linearity
Homoscedasticity
Errors normally distributed (for inferential purposes)
Errors independent
Errors have constant variance
There is a probability distribution for y at each value of x
with mean: E(Y | x) = β0 + β1 x
and variance: Var(Y | x) = σ²
7. Looking at the scatter plot
Q-Q plot – quantiles of the residuals vs quantiles of the normal distribution
Residual plot – residuals vs explanatory variable
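A minimal sketch of these three diagnostic plots in base R. It uses the built-in mtcars data set as a stand-in (an assumption made for the example; the slides use a different car data set).

```r
# mtcars ships with R; used here only as a stand-in data set
fit <- lm(mpg ~ wt, data = mtcars)
res <- resid(fit)

# 1. Scatter plot with the fitted line: check that the trend looks linear
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lb)", ylab = "Miles per gallon",
     main = "Scatter plot")
abline(fit)

# 2. Q-Q plot: residual quantiles against normal quantiles
qqnorm(res)
qqline(res)

# 3. Residual plot: residuals against the explanatory variable
plot(mtcars$wt, res,
     xlab = "Weight (1000 lb)", ylab = "Residuals",
     main = "Residual plot")
abline(h = 0, lty = 2)
```

Points close to the reference line in the Q-Q plot support the normality assumption; a funnel or curve in the residual plot suggests non-constant variance or non-linearity.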
8. Correctable non-linearity (simple and monotone)
Non-correctable non-linearity
9. Define a new variable u as u = e^x
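A short sketch of this transformation of the explanatory variable, on simulated data in which y is linear in e^x rather than in x (the data and coefficients are assumptions made for the example).

```r
# Simulated data for illustration only: y is linear in exp(x), not in x
set.seed(2)
x <- seq(0, 3, length.out = 40)
y <- 1 + 2 * exp(x) + rnorm(40, sd = 1)

u <- exp(x)               # new explanatory variable u = e^x

fit_raw <- lm(y ~ x)      # straight line in x
fit_u   <- lm(y ~ u)      # straight line in u

summary(fit_raw)$r.squared
summary(fit_u)$r.squared
coef(fit_u)               # estimates of the intercept and of the slope on u
```

The model stays linear in its coefficients; only the explanatory variable is replaced.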
10. Some common transformations are:
v = ln(y)
v = y^(1/p) (the p-th root of y), where p > 1
v = 1/y^p, where p > 0
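The transformations above can be sketched in R as follows. The data are simulated with multiplicative errors so that v = ln(y) is the natural choice; the data, the coefficients and the choices p = 2 and p = 1 are assumptions made for the example.

```r
# Simulated data for illustration only: errors act multiplicatively on y
set.seed(3)
x <- runif(60, 1, 5)
y <- exp(0.5 + 0.8 * x + rnorm(60, sd = 0.2))

v_log  <- log(y)          # v = ln(y)
v_root <- y^(1 / 2)       # v = p-th root of y, here p = 2
v_inv  <- 1 / y^1         # v = 1 / y^p, here p = 1

fit_log <- lm(v_log ~ x)  # linear model on the log scale
coef(fit_log)

# Predictions on the original scale are obtained by back-transforming
head(exp(fitted(fit_log)))
```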
11. How well a statistical model fits observed data
How much of the total variation in Y is described by the variation in the
explanatory variables
In simple linear regression, equal to the square of the sample correlation of the response
variable and the explanatory variable
Lies between 0 and 1 for a least squares fit with an intercept
Adjusted R-squared – adjusted for the number of coefficients in the model relative
to the sample size in order to correct it for bias; it can be negative
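The definitions above can be checked by hand. A minimal sketch on the built-in mtcars data (a stand-in data set, assumed for the example), using the usual formulas R² = 1 − SSres/SStot and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with p explanatory variables:

```r
# mtcars ships with R; used here only as a stand-in data set
fit <- lm(mpg ~ wt, data = mtcars)
y   <- mtcars$mpg

ss_res <- sum(resid(fit)^2)         # variation left unexplained
ss_tot <- sum((y - mean(y))^2)      # total variation in y
n      <- length(y)
p      <- length(coef(fit)) - 1     # number of explanatory variables

r2     <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
r2_cor <- cor(mtcars$mpg, mtcars$wt)^2   # squared sample correlation

c(r2 = r2, adj_r2 = adj_r2, r2_cor = r2_cor)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)
```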
12. Mean Square Error
Coefficient of Determination - R2
Adjusted R2
AIC (Akaike’s Information Criterion) - smaller values are better
BIC (Bayesian Information Criterion) - smaller values are better
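As a sketch, the criteria above can be collected for two candidate models and compared side by side. The example uses the built-in mtcars data as a stand-in; `criteria()` is a hypothetical helper written for this illustration, and the MSE reported is the residual mean square from `summary()`.

```r
# mtcars ships with R; used here only as a stand-in data set
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: gathers the selection criteria for one fitted model
criteria <- function(fit) {
  s <- summary(fit)
  c(MSE   = s$sigma^2,        # residual mean square
    R2    = s$r.squared,
    adjR2 = s$adj.r.squared,
    AIC   = AIC(fit),         # smaller is better
    BIC   = BIC(fit))         # smaller is better
}

comparison <- rbind(wt_only = criteria(fit1), wt_and_hp = criteria(fit2))
comparison
```

R² never decreases when a variable is added, so adjusted R², AIC and BIC are the more useful columns when the models differ in size.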
13. Leverage – ‘standardized’ measure of the distance of the ith observation's x-values from
the mean of the explanatory variables
DFBETAS – standardized measure of how much the estimate of βj is influenced by the ith
observation
DFFITS – standardized measure of how much the ith fitted value is influenced
by the ith observation
Cook's distance – standardized measure of the distance between the fitted values
obtained using the whole sample and the fitted values obtained after removing the ith
observation
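Base R provides a function for each of these measures. A minimal sketch on the built-in mtcars data (a stand-in data set, assumed for the example):

```r
# mtcars ships with R; used here only as a stand-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(fit)        # leverage of each observation
dfb <- dfbetas(fit)          # one column per coefficient, one row per observation
dff <- dffits(fit)           # influence of observation i on its own fitted value
cd  <- cooks.distance(fit)   # overall influence of observation i on all fitted values

diag_tab <- data.frame(leverage = lev, dffits = dff, cooks_d = cd)

# The five observations with the largest Cook's distance
head(diag_tab[order(-diag_tab$cooks_d), ], 5)
```

The leverages sum to the number of estimated coefficients, which is a quick sanity check on the output.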
14. Simple linear Model
Polynomial regression – relationship is not linear
Multiple linear model – more than one explanatory variable, including categorical data
Robust regression (Least Absolute Deviations, Huber / Bisquare function) – data
contaminated with outliers
Logistic regression – binary response variable (logit and probit link functions)
Ridge regression – high multicollinearity
Stepwise regression – high dimensions (forward selection / backward elimination)
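Several of the model types listed above can be fitted with base R alone. The sketch below uses the built-in mtcars data as a stand-in (an assumption made for the example). Robust and ridge regression are left out because they need an extra package; `MASS::rlm` and `MASS::lm.ridge` are one option.

```r
# mtcars ships with R; used here only as a stand-in data set

# Simple linear model
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Polynomial regression: quadratic in weight
fit_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)

# Multiple linear model with a categorical explanatory variable
fit_multi <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Logistic regression for a binary response (am: 0 = automatic, 1 = manual)
fit_logit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
fit_probit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

# Stepwise regression by backward elimination, using AIC
fit_full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
fit_step <- step(fit_full, direction = "backward", trace = 0)
formula(fit_step)
```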
15. A power tool for statistics and data modeling
R is free
R is a language
Graphics and data visualization
A flexible statistical analysis toolkit
RStudio – an Integrated Development Environment (IDE) for the R programming
language
16. Setting the working directory
Installing packages, updating and loading the packages
Importing and Converting Data
Creating vectors, data frames
Connections to the outside world (file, gzfile, bzfile, url)
Atomic classes of vectors : integer • numeric • character • complex • logical
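A short sketch of the basics listed on this slide. The directory path and the package name in the comments are placeholders chosen for the example, so those lines are left commented out.

```r
# Working directory and packages (placeholders, not run here)
# getwd()
# setwd("path/to/project")
# install.packages("MASS"); update.packages(); library(MASS)

# Vectors of the five atomic classes
i <- 1:3                      # integer
n <- c(1.5, 2, 3.25)          # numeric
s <- c("a", "b", "c")         # character
z <- c(1+2i, 3-1i, 0+0i)      # complex
b <- c(TRUE, FALSE, TRUE)     # logical
classes <- sapply(list(i, n, s, z, b), class)
classes

# A data frame holds columns of different classes
df <- data.frame(id = i, value = n, label = s, flag = b)
str(df)

# A file connection: write a temporary file, then read its first line
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
con   <- file(tmp, open = "r")
first <- readLines(con, n = 1)
close(con)
first
```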
17. Data frames (tabular data) – store objects of different classes
(read.table / read.csv / data.frame)
Analogous functions for writing data (write.table / write.csv)
foreign package (read.xport, read.spss)
Reading larger data sets (specifying the column classes)
Inspecting objects and data frames
Missing values (NA / NaN)
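The read, write and inspect steps can be sketched as a round trip through a temporary CSV file. The three rows below are made-up toy values for the example, not real car data.

```r
# Toy data frame (made-up values) with one missing mileage
cars <- data.frame(make = c("A", "B", "C"),
                   mpg  = c(30.1, NA, 25.4),
                   hp   = c(95L, 110L, 150L))

# Write to a temporary file, then read it back
path <- tempfile(fileext = ".csv")
write.csv(cars, path, row.names = FALSE)

# Specifying the column classes avoids type guessing and speeds up large reads
cars_in <- read.csv(path, colClasses = c("character", "numeric", "integer"))

# Inspect the result
str(cars_in)
dim(cars_in)
summary(cars_in)

# Missing values: NA marks a missing entry, NaN an undefined number
is.na(cars_in$mpg)
is.nan(0 / 0)                      # TRUE: 0/0 is NaN
is.na(NaN)                         # TRUE: NaN also counts as missing
mean(cars_in$mpg, na.rm = TRUE)    # drop missing values before averaging
```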
19. Variation in gasoline mileage among makes and models of automobiles is
influenced substantially by the size of the vehicle and its engine.
Downloaded from http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html
Variable Names:
VOL: Cubic feet of cab space
HP: Engine horsepower
MPG: Average miles per gallon (Response Variable)
SP: Top speed (mph)
WT: Vehicle weight (100 lb)
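A sketch of the kind of analysis this data set supports. Because the download link above may not be reachable, the example uses the built-in mtcars data as a stand-in and renames its columns to match the slide (an assumption made for the example). mtcars has no VOL or SP columns, so only HP and WT are used, and the quadratic term in WT is one illustrative choice of polynomial term.

```r
# Stand-in for the car data: mtcars columns renamed to the slide's names.
# mtcars records weight in 1000 lb, so multiply by 10 to get units of 100 lb.
cars <- data.frame(MPG = mtcars$mpg,
                   HP  = mtcars$hp,
                   WT  = mtcars$wt * 10)

# Multiple linear regression of mileage on horsepower and weight
fit_lin <- lm(MPG ~ HP + WT, data = cars)
summary(fit_lin)

# Polynomial regression: add a quadratic term in weight
fit_poly <- lm(MPG ~ HP + WT + I(WT^2), data = cars)

# Compare the nested models with an F-test and with AIC
cmp <- anova(fit_lin, fit_poly)
cmp
AIC(fit_lin, fit_poly)
```

With the actual data set, the same calls would apply after reading the file with `read.table` or `read.csv` and adding VOL and SP to the model formula.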
20. Prof. Andrew Magyar – Stat 563 – Introduction to Linear Regression, course material
Introduction to Linear Regression Analysis, 5th edition – Montgomery, Peck & Vining
http://www.ats.ucla.edu/stat/stata/dae/rreg.htm
https://www.coursera.org/learn/r-programming/home/welcome