Digg Data

Linear Regression
without Tears
ANKIT SHARMA, DIGG DATA

www.diggdata.in
Content
 What is Regression Analysis
 When to use regression

 Intuition behind linear regression - Machine learning
 Simple Linear Regression
 Multivariate Linear Regression
 Performance Analysis
 ANOVA
 Goodness of fit
 Confidence & Prediction bands

 Assumptions

What is Regression Analysis?
In statistics, regression analysis is a statistical process
for estimating the relationships among variables.

More specifically, regression analysis helps one
understand how the typical value of the dependent
variable changes when any one of the independent
variables is varied, while the other independent
variables are held fixed.
Regression analysis is widely used for prediction and
forecasting.
Regression analysis is also used to understand which
among the independent variables are related to the
dependent variable, and to explore the forms of
these relationships.

When to use regression?
Regression analysis is used to describe the relationship between:
 A single response variable: Y ; and
 One or more predictor variables: X1, X2,…,Xp
• p = 1: Simple Regression
• p > 1: Multivariate Regression

Response Variable ‘Y’ must be a continuous variable.
Predictor Variables X1,…,Xp can be continuous, discrete or categorical variables.

The Meaning of the term “Linear”
Linearity in the Variables
The first meaning of linearity is that the conditional
expectation of Y, E(Y|Xi), is a linear function of Xi; the
regression curve in this case is a straight line. By this meaning,
E(Y|Xi) = β1 + β2Xi² is not a linear function.
Linearity in the Parameters
The second interpretation of linearity is that the
conditional expectation of Y, E(Y|Xi), is a linear function
of the parameters, the β’s; it may or may not be linear in
the variable X.
E(Y|Xi) = β1 + β2Xi² is a linear (in parameter)
regression model. All the models shown in the figure are thus linear
regression models, that is, models linear in the parameters.

The Meaning of the term “Linear” Contd...
Now consider the model:
E(Y|Xi) = β1 + β2² Xi
The preceding model is an example of a nonlinear (in the parameter) regression model.
From now on, the term “linear” regression will always mean a regression that is linear in the
parameters, the β’s (that is, the parameters are raised to the first power only).

Intuition
LINEAR REGRESSION

Hypothesis (for one variable)
[Figure: a training set of house prices is fed to a learning algorithm, which outputs a hypothesis h; given the size of a house, h returns the estimated price. The accompanying scatterplot shows Price (in 1000s of dollars, 0–500) against Size (feet², 500–3000).]
Cost function
Hypothesis: hθ(x) = θ0 + θ1x
How to choose the parameters θ0, θ1?
Cost Function: J(θ0, θ1) = (1/2m) Σi (hθ(x(i)) − y(i))², the average squared error of the hypothesis over the m training examples.
Goal: choose θ0, θ1 to minimize J(θ0, θ1).
(For fixed θ1, hθ(x) is a function of x; J is a function of the parameter θ1.)
[Figure: left, the training points (1,1), (2,2), (3,3) with the line hθ(x) = x, i.e. θ1 = 1; right, the corresponding point on the J(θ1) curve.]
With m = 3 training examples:
J(1) = (1/2m) [(1−1)² + (2−2)² + (3−3)²] = 0
(For fixed θ1, hθ(x) is a function of x; J is a function of the parameter θ1.)
[Figure: left, the same training points with the line hθ(x) = 0.5x, i.e. θ1 = 0.5; right, the corresponding point on the J(θ1) curve.]
J(0.5) = (1/2m) [(0.5−1)² + (1−2)² + (1.5−3)²] = 3.5/6 ≈ 0.58
(For fixed θ1, hθ(x) is a function of x; J is a function of the parameter θ1.)
[Figure: left, the training points with the best-fitting line; right, the minimum of the J(θ1) curve.]
The learning goal is to find the parameter value that minimizes the cost:
min over θ1 of J(θ1)
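As a minimal sketch of this example (assuming the three toy training points (1, 1), (2, 2), (3, 3) and the one-parameter hypothesis hθ(x) = θ1x read off the slides above), the cost curve J(θ1) can be computed and plotted in R:

# Toy training set used on the previous slides
x <- c(1, 2, 3)
y <- c(1, 2, 3)
m <- length(x)

# Cost of the one-parameter hypothesis h(x) = theta1 * x
J <- function(theta1) sum((theta1 * x - y)^2) / (2 * m)

J(1)    # 0: the line y = x fits the points exactly
J(0.5)  # ~0.58, as on the slide above

# Trace out the cost curve over a grid of theta1 values
theta.grid <- seq(-0.5, 2.5, by = 0.05)
plot(theta.grid, sapply(theta.grid, J), type = "l",
     xlab = expression(theta[1]), ylab = expression(J(theta[1])))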
Contour plot
[Figure: contour plot of the cost function J over the parameters θ0 and θ1.]
Linear Regression in R
SINGLE PREDICTOR

Data cleaning & preprocessing
Prior to any analysis, the data should always be inspected for:
 Data-entry errors
 Missing values
 Outliers
 Unusual distributions
 Changes in variability
 Clustering
 Non-linear bivariate relationships
 Unexpected patterns
 …
Useful tools:
 Numerical summaries: 5-number summaries, correlations, …
 Graphical summaries: boxplots, histograms, scatterplots, …
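A minimal inspection sketch in R, assuming the production data frame (with columns RunTime and RunSize) loaded in the Simple LR in R slide below:

summary(production)                           # 5-number summaries per column
sum(is.na(production))                        # count of missing values
cor(production$RunTime, production$RunSize)   # pairwise correlation
boxplot(production$RunTime)                   # flags potential outliers
hist(production$RunSize)                      # distribution shape
plot(RunTime ~ RunSize, data = production)    # bivariate relationship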
Simple Linear Regression
Objective
Describe the relationship between two variables, say X and Y as a straight line, that is, Y is
modeled as a linear function of X.
 X: explanatory variable (horizontal axis)
 Y : response variable (vertical axis)
After data collection, we have pairs of observations: (X1,Y1),…,(Xn,Yn)

[Table: the observed pairs, one row per observation — X1 with Y1, X2 with Y2, …, Xn with Yn.]
Simple LR model
The regression of variable Y on variable X is given by:
yi = β0 + β1xi + ϵi ,  i = 1,...,n
where:
Random Error: ϵi ~ N(0, σ²), independent
Linear Function: β0 + β1xi = E(Y|X = xi)
Unknown parameters:
- β0 (Intercept): the point at which the line intercepts the y-axis
- β1 (Slope): the increase in Y per unit change in X

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method: estimation of the unknown parameters
We want to find the equation of the line that “best” fits the data, that is, to find β0 and β1 such that the fitted values of yi, given by
ŷi = β0 + β1xi
are as “close” as possible to the observed values yi. The usual way of calculating β0 and β1 is based on minimization of the sum of the squared residuals, or residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi)²
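For simple linear regression the minimization has a closed form: β̂1 = SSXY/SSXX and β̂0 = ȳ − β̂1x̄. A minimal sketch computing these by hand in R (using the production data loaded on the next slide) and checking them against lm():

x <- production$RunSize
y <- production$RunTime
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)   # should match coef(lm(RunTime ~ RunSize, data = production))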
Simple LR in R
> # Download the data from a url
> production <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/production.txt", header=T, sep="")
> # analyze the data
> head(production)
  Case RunTime RunSize
1    1     195     175
2    2     215     189
3    3     243     344
4    4     162      88
5    5     185     114
6    6     231     338
> table(is.na(production))
FALSE
   60
> str(production)
'data.frame': 20 obs. of 3 variables:
 $ Case   : int 1 2 3 4 5 6 7 8 9 10 ...
 $ RunTime: int 195 215 243 162 185 231 234 166 253 196 ...
 $ RunSize: int 175 189 344 88 114 338 271 173 284 277 ...
> attach(production)
The following objects are masked from production (position 3):
    Case, RunSize, RunTime
> # Lets plot the data
> plot(RunTime~RunSize)
> # Fit the regression model using lm()
> production.lm <- lm(RunTime~RunSize, data=production)
> # Use the function summary() to get some results
> summary(production.lm)
Call:
lm(formula = RunTime ~ RunSize, data = production)
Residuals:
    Min      1Q  Median      3Q     Max
-28.597 -11.079   3.329   8.302  29.627
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
RunSize       0.25924    0.03714    6.98 1.61e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.25 on 18 degrees of freedom
Multiple R-squared: 0.7302, Adjusted R-squared: 0.7152
F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06
> # plot a line fitting the model
> abline(production.lm)
> production <- data.frame(production, fitted.value=fitted(production.lm), residual=resid(production.lm))
> head(production)
  Case RunTime RunSize fitted.value     residual
1    1     195     175     195.1152   -0.1152469
2    2     215     189     198.7447   16.2553496
3    3     243     344     238.9273    4.0726679
4    4     162      88     172.5611  -10.5610965
5    5     185     114     179.3014    5.6985827
6    6     231     338     237.3719   -6.3718734
Multivariate Linear Regression

Multivariate Linear Regression
Objective
Generalize the simple regression methodology in order to describe the relationship between a
response variable Y and a set of predictors X1,X2,…, Xp in terms of a linear function.
The variables
 X1, …, Xp: explanatory variables
 Y: response variable
After data collection, we have observations:
(X11,…,X1p,Y1),…,(Xn1,…,Xnp,Yn)

[Table: one row per observation, giving the predictor values Xi1, …, Xip and the response Yi.]
Polynomial regression
[Figure: polynomial fit of Price (y) against Size (x).]
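A minimal sketch of a polynomial fit in R; the data frame houses and its columns Price and Size are hypothetical stand-ins for the figure's variables:

# Quadratic polynomial in Size; I() protects the arithmetic inside the formula
fit.quad <- lm(Price ~ Size + I(Size^2), data = houses)
# Equivalent fit using orthogonal polynomial terms
fit.poly <- lm(Price ~ poly(Size, 2), data = houses)
summary(fit.quad)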
Multivariate LR model
The model is given by:
yi = β0 + β1xi1 + … + βpxip + ϵi ,  i = 1,...,n
where:
Random Error: ϵi ~ N(0, σ²), independent
Linear Function: β0 + β1xi1 + … + βpxip = E(Y|x1,…,xp)
Unknown parameters:
- β0: overall mean
- βk: regression coefficient of the k-th predictor

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method: estimation of the unknown parameters
We want to find the model that “best” fits the data, that is, to find β0, β1, …, βp such that the fitted values of yi, given by
ŷi = β0 + β1xi1 + … + βpxip
are as “close” as possible to the observed values yi. The usual way of calculating the coefficients is based on minimization of the sum of the squared residuals, or residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi1 − … − βpxip)²
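A minimal sketch of fitting such a model in R; the data frame df and the predictor names x1, x2, x3 are hypothetical:

fit <- lm(y ~ x1 + x2 + x3, data = df)
summary(fit)       # coefficient estimates, t-tests, R-squared, F-test
coef(fit)          # beta0, beta1, ..., betap
head(resid(fit))   # residuals e_i = y_i - yhat_i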
Performance measurement

Analysis of Variance (ANOVA)
Total sample variability (TSS) = Variability explained by the model (SSreg) + Unexplained (or error) variability (RSS)

> anova(production.lm)
Analysis of Variance Table
Response: RunTime
          Df  Sum Sq Mean Sq F value    Pr(>F)
RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
Residuals 18  4754.6   264.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The ANOVA table gives us the following information:
• Degrees of freedom
• The sum of squares
• The mean square
• The F ratio
• The p-value
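A minimal sketch verifying the decomposition TSS = SSreg + RSS for the production model, using the sums of squares shown in the table above:

y <- production$RunTime
TSS   <- sum((y - mean(y))^2)                       # total sample variability
RSS   <- sum(resid(production.lm)^2)                # 4754.6 in the table
SSreg <- sum((fitted(production.lm) - mean(y))^2)   # 12868.4 in the table
all.equal(TSS, SSreg + RSS)                         # TRUE: the variability splits exactly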
ANOVA Contd…
Select a model:
y = β0 + β1x1 + β2x2 + β3x3 + … + ε
Use sample data to estimate the unknown parameters, then evaluate how useful the model is.
If we want to test the usefulness of a particular term in our model, we perform a t-test and
look at the p-value for that term. However, if we want to test whether any of the terms in our
model are useful in predicting y, we use the F-test.
The F-test is a test of the hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one of the coefficients is non-zero
Note 1: H0 will always include all of our parameters except the y-intercept β0.
Note 2: this test has the general set-up:
H0: None of the explanatory variables is helping
H1: At least one of the explanatory variables is helping
which shares the general format seen throughout the last couple of chapters:
H0: Model not useful
H1: Model useful
Once we know the test statistic of our F-test, we will often want to determine whether it is
significant. As in all our tests, if the test statistic is more extreme (i.e. greater) than the
critical value, we reject H0. By rejecting H0 we are saying that our model is significantly
better than just estimating y with avg(y).
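A minimal sketch of carrying out this F-test decision for the production model (1 and 18 degrees of freedom, as in the earlier summary output):

Fobs  <- summary(production.lm)$fstatistic["value"]  # 48.72
Fcrit <- qf(0.95, df1 = 1, df2 = 18)                 # ~4.41 at the 5% level
Fobs > Fcrit   # TRUE: reject H0; the model beats predicting y with avg(y)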
The Coefficient Of Correlation
 The Correlation Coefficient (denoted r) is a measure of the
strength of the linear relationship between x and y. It will
always be between -1 and 1.
 If r is near -1 or 1, then there is a strong linear
relationship.
 If r is near 0, then there is little or no linear relationship.
 A positive correlation occurs when an increase in one
variable typically leads to an increase in the other
variable.

 A negative correlation occurs when an increase in one
variable typically leads to a decrease in the other variable.

r = SSXY / √(SSXX · SSYY)
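A minimal sketch computing r from the sums of squares and checking it against R's built-in cor(), using the production data:

x <- production$RunSize
y <- production$RunTime
SSxy <- sum((x - mean(x)) * (y - mean(y)))
SSxx <- sum((x - mean(x))^2)
SSyy <- sum((y - mean(y))^2)
r <- SSxy / sqrt(SSxx * SSyy)
c(r, cor(x, y))   # identical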
Measuring Goodness of Fit
Coefficient of Determination, r²
 Represents the proportion of the total sample variability explained by the regression model.
 For simple linear regression, the r² statistic corresponds to the square of the correlation between Y and X.
 Indicates how well the model fits the data.

r² = (SSyy − SSE) / SSyy = 1 − SSE / SSyy

About 100(r²)% of the sample variation in y can be explained by (or attributed to) using x to predict y in the straight-line model. Ideally this value will be close to 1.

Adjusted r²
 The adjusted r² takes into account the number of degrees of freedom and is preferable to r².

Important Note: Neither r² nor adjusted r² gives a direct indication of how well the model will perform in the prediction of a new observation.
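A minimal sketch pulling both statistics from the fitted production model and reproducing r² from the formula above:

s <- summary(production.lm)
s$r.squared        # 0.7302: proportion of sample variability explained
s$adj.r.squared    # 0.7152: penalized for degrees of freedom used
SSE  <- sum(resid(production.lm)^2)
SSyy <- sum((production$RunTime - mean(production$RunTime))^2)
1 - SSE / SSyy     # matches s$r.squared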
Confidence & Prediction band
Confidence Bands
Reflect the uncertainty about the regression line (how well the line is determined).
Prediction Bands
Also include the uncertainty about future observations.
Attention
These limits rely strongly on the assumption of normally distributed errors with constant
variance and should not be used if this assumption is violated for the data being analyzed.

[Figure: scatterplot of RunTime against RunSize with the fitted line, confidence bands (dashed), and prediction bands (dotted).]

> predict(production.lm, interval="confidence")
        fit      lwr      upr
1  195.1152 187.2000 203.0305
2  198.7447 191.0450 206.4443
…
20 167.3762 154.4448 180.3077
> predict(production.lm, interval="prediction")
        fit      lwr      upr
1  195.1152 160.0646 230.1659
2  198.7447 163.7421 233.7472
…
20 167.3762 130.8644 203.8881

# Create a new data frame containing the values of X
# at which we want the predictions to be made
pred.frame <- data.frame(RunSize=seq(55,345,by=10))
# Confidence bands
pc <- predict(production.lm, int="c", newdata=pred.frame)
# Prediction bands
pp <- predict(production.lm, int="p", newdata=pred.frame)
require(graphics)
# Standard scatterplot with extended limits
plot(RunSize, RunTime, ylim=range(RunSize,pp,na.rm=T))
pred.Size <- pred.frame$RunSize
# Add curves
matlines(pred.Size, pc, lty=c(1,2,2), lwd=1.5, col=1)
matlines(pred.Size, pp, lty=c(1,3,3), lwd=1.5, col=1)
Validity of regression model

[Figure: four different data sets whose scatterplots look very different, each shown with its fitted line.]
For all data sets, the fitted regression is the same:
ŷ = 3.0 + 0.5x
All models have r² = 0.67, σ̂ = 1.24, and the slope coefficients are significant at the < 1% level.
Residual plots
 A residual plot is a graph that shows the residuals on the vertical axis and the independent
variable on the horizontal axis.

 If the points in a residual plot are randomly dispersed around the horizontal axis, a linear
regression model is appropriate for the data;
otherwise, a non-linear model is more appropriate.

[Figure: three residual plots — one with a random scatter, one U-shaped, one inverted-U.]
The first plot shows a random pattern, indicating a good fit for a linear model. The other plot
patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.
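A minimal sketch of drawing such a plot for the production model fitted earlier:

# Residuals against the predictor; a random scatter around zero supports a linear model
plot(production$RunSize, resid(production.lm),
     xlab = "RunSize", ylab = "Residuals")
abline(h = 0, lty = 2)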
Residual plots
Residuals vs. X
[Figure: plot of residuals against the predictor X.]

Residual plots
Residuals vs. fitted values
[Figure: plot of residuals against the fitted values.]
Influential point
Outliers
Data points that diverge in a big way from the overall
pattern are called outliers. There are four ways that a
data point might be considered an outlier.
 It could have an extreme X value compared to other
data points.
 It could have an extreme Y value compared to other
data points.
 It could have extreme X and Y values.
 It might be distant from the rest of the data, even
without extreme X or Y values.

Influential Points
An influential point is an outlier that greatly affects the
slope of the regression line.

Influential point Contd…
Leverage/Influential Points
 Good leverage points have their standardized residuals within the interval [−2, 2].
 Outliers are leverage points whose standardized residuals fall outside the interval [−2, 2].

How to deal with them:
 Remove invalid data points
o if they look unusual or are different from the rest of the data
 Fit a different regression model
o if the model is not valid for the data
 higher-order terms
 transformation
Normality & constant variance of errors
The normality and constant-variance assumptions are necessary for inference:
• hypothesis testing
• confidence intervals
• prediction intervals
 Check the Normal Q-Q plot of the standardized residuals.
 Check the Standardized Residuals vs. X plot.
When these assumptions do not hold, we can try to correct the problem using data transformations.
Normality & constant variance check
> production.lm <- lm(RunTime~RunSize, data=production)
# Residual plots
> plot(production.lm)

Cook’s distance
Cook's Distance: D
 The Cook's distance statistic combines the effects of leverage and the magnitude of the residual.
 It is used to evaluate the impact of a given observation on the estimated regression coefficients.
 D > 1 indicates undue influence.
The Cook's distance plot is obtained by applying the function plot() to the linear model object.
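A minimal sketch for the production model; cooks.distance() returns one value per observation, and which = 4 selects the Cook's distance panel of plot():

d <- cooks.distance(production.lm)
which(d > 1)                      # observations with undue influence, if any
plot(production.lm, which = 4)    # Cook's distance plot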
Transformation
When to use transformation?
Transformations can be used to correct for:
 non-constant variance
 non-linearity
 non-normality
There are many ways to transform variables to achieve linearity for regression analysis.
[Table: common linearizing transformations.]
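As a minimal sketch (assuming strictly positive variables and a hypothetical data frame df with columns x and y), a log-log transformation is one common way to linearize a multiplicative relationship:

fit.log <- lm(log(y) ~ log(x), data = df)   # linear on the log scale
summary(fit.log)
yhat <- exp(fitted(fit.log))                # back-transform to the original scale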
Assumptions for Simple LR
There are four principal assumptions which justify the use of linear
regression models for purposes of prediction:
I. linearity of the relationship between the dependent and independent
variables:
Y = β0 + β1X + ϵ
II. independence of the errors (no serial correlation)
III. homoscedasticity (constant variance) of the errors
a) versus time
b) versus the predictions (or versus any independent variable)
IV. normality of the error distribution
If any of these assumptions is violated (i.e., if there is nonlinearity,
serial correlation, heteroscedasticity, and/or non-normality), then
the forecasts, confidence intervals, and economic insights yielded by
a regression model may be inefficient or seriously biased or
misleading.

What can go wrong?
Violations in the linear regression model:
• linearity (e.g. a quadratic relationship or higher-order terms)
Violations in the residual assumptions:
• non-normal distribution
• non-constant variances
• dependence
• outliers
Checks:
 Residuals vs. each predictor variable
o nonlinearity: higher-order terms in that variable
 Residuals vs. fitted values
o variance increasing with the response: transformation
 Residuals Q-Q norm plot
o deviation from a straight line: non-normality
Violations of linearity
Violations of linearity are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are
likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.
How to detect
Plot:
• observed vs. predicted values, or
• residuals vs. predicted values
Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is
making unusually large or small predictions.
How to fix
 Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if
the data are strictly positive, a log transformation may be feasible.
 Another possibility is adding another regressor which is a nonlinear function of one of the other
variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted values suggests a
parabolic curve, then it may make sense to regress Y on both X and X² (see the sketch below). The latter
transformation is possible even when X and/or Y have negative values, whereas logging may not be.
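A minimal sketch of the quadratic fix described above (hypothetical data frame df with columns y and x):

fit2 <- lm(y ~ x + I(x^2), data = df)   # I() protects the squared term
summary(fit2)
# F-test comparing the straight-line fit against the quadratic fit
anova(lm(y ~ x, data = df), fit2)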

Violations of homoscedasticity
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence
intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for
out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small
subset of the data (namely the subset where the error variance was largest) when estimating coefficients.
How to detect
Plots of:
• residuals vs. time, and
• residuals vs. predicted value
Check for residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be
really thorough, you might also want to plot residuals versus some of the independent variables.)
How to fix
 In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by
a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case.
 A simple fix would be to work with shorter intervals of data in which volatility is more nearly constant.
 Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it
may also be fixed as a byproduct of fixing those problems.
Violations of normality
 Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers.
 Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates.
 Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors.
 If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

How to detect
The best test for normally distributed errors is a normal probability plot of the residuals.
o This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
o A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction).
o An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.

How to fix
Violations of normality often arise either because
(a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or
(b) the linearity assumption is violated.
In such cases, a nonlinear transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due to one or two very large errors.
Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are they explainable, are similar events likely to occur again in the future, and how
influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors or if they can be
explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most
useful information about values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.
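A minimal sketch of the normal probability plot for the production model's residuals:

qqnorm(resid(production.lm))   # points near the line support normally distributed errors
qqline(resid(production.lm))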

Thank you!


Weitere ähnliche Inhalte

Was ist angesagt?

Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepDan Wellisch
 
Presentation On Regression
Presentation On RegressionPresentation On Regression
Presentation On Regressionalok tiwari
 
multiple regression
multiple regressionmultiple regression
multiple regressionPriya Sharma
 
Logistic regression with SPSS
Logistic regression with SPSSLogistic regression with SPSS
Logistic regression with SPSSLNIPE
 
Non Linear Equation
Non Linear EquationNon Linear Equation
Non Linear EquationMdAlAmin187
 
Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regressiondessybudiyanti
 
Regression and Co-Relation
Regression and Co-RelationRegression and Co-Relation
Regression and Co-Relationnuwan udugampala
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear RegressionAndrew Ferlitsch
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear RegressionAndrew Ferlitsch
 
Regression analysis
Regression analysisRegression analysis
Regression analysisRavi shankar
 
Regression analysis
Regression analysisRegression analysis
Regression analysissaba khan
 
Presentation on regression analysis
Presentation on regression analysisPresentation on regression analysis
Presentation on regression analysisSujeet Singh
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierAl Arizmendez
 
Basics of Regression analysis
 Basics of Regression analysis Basics of Regression analysis
Basics of Regression analysisMahak Vijayvargiya
 

Was ist angesagt? (20)

Linear regression
Linear regressionLinear regression
Linear regression
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-Step
 
Presentation On Regression
Presentation On RegressionPresentation On Regression
Presentation On Regression
 
multiple regression
multiple regressionmultiple regression
multiple regression
 
Logistic regression with SPSS
Logistic regression with SPSSLogistic regression with SPSS
Logistic regression with SPSS
 
Regression
RegressionRegression
Regression
 
Non Linear Equation
Non Linear EquationNon Linear Equation
Non Linear Equation
 
Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regression
 
Regression and Co-Relation
Regression and Co-RelationRegression and Co-Relation
Regression and Co-Relation
 
ML - Simple Linear Regression
ML - Simple Linear RegressionML - Simple Linear Regression
ML - Simple Linear Regression
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear Regression
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Presentation on regression analysis
Presentation on regression analysisPresentation on regression analysis
Presentation on regression analysis
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Chapter 14
Chapter 14 Chapter 14
Chapter 14
 
Blue property assumptions.
Blue property assumptions.Blue property assumptions.
Blue property assumptions.
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
 
Probit and logit model
Probit and logit modelProbit and logit model
Probit and logit model
 
Basics of Regression analysis
 Basics of Regression analysis Basics of Regression analysis
Basics of Regression analysis
 

Andere mochten auch

Linear regression(probabilistic interpretation)
Linear regression(probabilistic interpretation)Linear regression(probabilistic interpretation)
Linear regression(probabilistic interpretation)hitesh saini
 
Multiple Linear Regression
Multiple Linear RegressionMultiple Linear Regression
Multiple Linear RegressionIndus University
 
mathematical models for drug release studies
mathematical models for drug release studiesmathematical models for drug release studies
mathematical models for drug release studiesSR drug laboratories
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysisnadiazaheer
 
Compaction and compression PPT MANIK
Compaction and compression PPT MANIKCompaction and compression PPT MANIK
Compaction and compression PPT MANIKImran Nur Manik
 
Multiple regression presentation
Multiple regression presentationMultiple regression presentation
Multiple regression presentationCarlo Magno
 
Analytical method validation
Analytical method validationAnalytical method validation
Analytical method validationGaurav Kr
 

Andere mochten auch (12)

Linear regression(probabilistic interpretation)
Linear regression(probabilistic interpretation)Linear regression(probabilistic interpretation)
Linear regression(probabilistic interpretation)
 
Multiple Linear Regression
Multiple Linear RegressionMultiple Linear Regression
Multiple Linear Regression
 
mathematical models for drug release studies
mathematical models for drug release studiesmathematical models for drug release studies
mathematical models for drug release studies
 
Chi squared test
Chi squared testChi squared test
Chi squared test
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Compaction and compression PPT MANIK
Compaction and compression PPT MANIKCompaction and compression PPT MANIK
Compaction and compression PPT MANIK
 
Regression
RegressionRegression
Regression
 
Chi square test
Chi square testChi square test
Chi square test
 
Multiple regression presentation
Multiple regression presentationMultiple regression presentation
Multiple regression presentation
 
Analytical method validation
Analytical method validationAnalytical method validation
Analytical method validation
 
Tests of significance
Tests of significanceTests of significance
Tests of significance
 
Chi square test
Chi square testChi square test
Chi square test
 

Ähnlich wie Linear Regression Guide

Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Japheth Muthama
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JMJapheth Muthama
 
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Gabriel Peyré
 
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92ohenebabismark508
 
1. Regression_V1.pdf
1. Regression_V1.pdf1. Regression_V1.pdf
1. Regression_V1.pdfssuser4c50a9
 
Introduction to Artificial Neural Networks
Introduction to Artificial Neural NetworksIntroduction to Artificial Neural Networks
Introduction to Artificial Neural NetworksStratio
 
Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)cairo university
 
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...IJERA Editor
 
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...IJERA Editor
 
lecture13.ppt
lecture13.pptlecture13.ppt
lecture13.pptarkian3
 
SimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptSimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptAdnanAli861711
 
Linear regression.ppt
Linear regression.pptLinear regression.ppt
Linear regression.pptbranlymbunga1
 
Slideset Simple Linear Regression models.ppt
Slideset Simple Linear Regression models.pptSlideset Simple Linear Regression models.ppt
Slideset Simple Linear Regression models.pptrahulrkmgb09
 
Nonparametric approach to multiple regression
Nonparametric approach to multiple regressionNonparametric approach to multiple regression
Nonparametric approach to multiple regressionAlexander Decker
 
Formulation of model likelihood functions
Formulation of model likelihood functionsFormulation of model likelihood functions
Formulation of model likelihood functionsAndreas Scheidegger
 

Ähnlich wie Linear Regression Guide (20)

Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JM
 
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
 
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92
Ch 6 Slides.doc/9929292929292919299292@:&:&:&9/92
 
1. Regression_V1.pdf
1. Regression_V1.pdf1. Regression_V1.pdf
1. Regression_V1.pdf
 
Introduction to Artificial Neural Networks
Introduction to Artificial Neural NetworksIntroduction to Artificial Neural Networks
Introduction to Artificial Neural Networks
 
Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)Machine Learning lecture2(linear regression)
Machine Learning lecture2(linear regression)
 
Chapter5.pdf.pdf
Chapter5.pdf.pdfChapter5.pdf.pdf
Chapter5.pdf.pdf
 
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
 
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
Determination of Optimal Product Mix for Profit Maximization using Linear Pro...
 
lecture13.ppt
lecture13.pptlecture13.ppt
lecture13.ppt
 
SimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.pptSimpleLinearRegressionAnalysisWithExamples.ppt
SimpleLinearRegressionAnalysisWithExamples.ppt
 
Linear regression.ppt
Linear regression.pptLinear regression.ppt
Linear regression.ppt
 
lecture13.ppt
lecture13.pptlecture13.ppt
lecture13.ppt
 
lecture13.ppt
lecture13.pptlecture13.ppt
lecture13.ppt
 
Slideset Simple Linear Regression models.ppt
Slideset Simple Linear Regression models.pptSlideset Simple Linear Regression models.ppt
Slideset Simple Linear Regression models.ppt
 
lecture13.ppt
lecture13.pptlecture13.ppt
lecture13.ppt
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Nonparametric approach to multiple regression
Nonparametric approach to multiple regressionNonparametric approach to multiple regression
Nonparametric approach to multiple regression
 
Formulation of model likelihood functions
Formulation of model likelihood functionsFormulation of model likelihood functions
Formulation of model likelihood functions
 

Kürzlich hochgeladen

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Kürzlich hochgeladen (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Linear Regression Guide

  • 1. Digg Data Linear Regression without Tears ANKIT SHARMA, DIGG DATA www.diggdata.in
  • 2. Content  What is Regression Analysis  When to use regression  Intuition behind linear regression - Machine learning  Simple Linear Regression  Multivariate Linear Regression  Performance Analysis  ANOVA  Goodness of fit  Confidence & Prediction bands  Assumptions Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 2
  • 3. What is Regression Analysis? In statistics, regression analysis is a statistical process for estimating the relationships among variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Regression analysis is widely used for prediction and forecasting. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 3
  • 4. When to use regression? Regression analysis is used to describe the relationship between:  A single response variable: Y ; and  One or more predictor variables: X1, X2,…,Xp • p = 1: Simple Regression • p > 1: Multivariate Regression Response Variable ‘Y’ must be a continuous variable. Predictor Variables X1,…,Xp can be continuous, discrete or categorical variables. Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 4
  • 5. The Meaning of the term “Linear” Linearity in the Variables The first meaning of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of Xi, the regression curve in this case is a straight line. But E(Y|Xi) = β1 + β2X2i is not a linear function Linearity in the Parameters The second interpretation of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of the parameters, the β’s; it may or may not be linear in the variable X. E(Y|Xi) = β1 + β2X2i is a linear (in parameter) regression model. All the models shown in Figure are thus linear regression models, that is, models linear in the parameters. Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 5
  • 6. The Meaning of the term “Linear” Cond... Now consider the model: E(Y|Xi) = β1 + β22 Xi The preceding model is an example of a nonlinear (in the parameter) regression model. From now on the term “linear” regression will always mean a regression that is linear in the parameters; the β’s (that is, the parameters are raised to the first power only). Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 6
  • 7. Intuition LINEAR REGRESSION Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 7
  • 8. Hypothesis(for one variable) 500 Price (in 1000s of dollars) Training Set 400 300 200 Learning Algorithm 100 0 0 Size of house Friday, November 22, 2013 h Estimated price 500 1000 1500 2000 2500 3000 Size (feet2) WITHOUT TEARS SERIES, DIGG DATA 8
  • 9. Cost function Hypothesis: How to choose ‘s ? Cost Function: Goal: Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 9
  • 10. (for fixed , this is a function of x) (function of the parameter 3 3 2 2 1 1 0 ) 0 y 0 1 x 2 -0.5 3 1 𝐽 1 = 1−1 2𝑚 Friday, November 22, 2013 2 + 2−2 2 + 3−3 WITHOUT TEARS SERIES, DIGG DATA 0 2 0.5 1 1.5 2 2.5 =0 10
  • 11. (for fixed , this is a function of x) (function of the parameter 3 3 2 2 1 1 0 ) 0 y 0 1 x 2 -0.5 1 𝐽 0.5 = 0.5 − 1 2𝑚 Friday, November 22, 2013 0 + 1.5 − 3 2 3 2 + 1−2 2 WITHOUT TEARS SERIES, DIGG DATA 0.5 1 1.5 2 2.5 = 0.68 11
  • 12. (for fixed , this is a function of x) (function of the parameter 3 3 2 2 1 1 0 ) 0 y 0 1 x 2 -0.5 3 0 0.5 1 1.5 2 2.5 min 𝐽 𝜃1 𝜃1 Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 12
  • 13. Contour plot Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 13
  • 14. Linear Regression in R SINGLE PREDICTOR Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 14
  • 15. Data cleaning & preprocessing Prior to any analysis, the data should always be inspected for Data-entry errors Missing values Outliers Numerical summaries 5-number summaries Correlations … Graphical summaries Boxplots Histograms Scatterplots Friday, November 22, 2013 Unusual distributions Changes in variability Clustering Non-linear bivariate relationships Unexpected patterns … WITHOUT TEARS SERIES, DIGG DATA 15
  • 16. Simple Linear Regression Objective Describe the relationship between two variables, say X and Y as a straight line, that is, Y is modeled as a linear function of X. X  X: explanatory variable (horizontal axis)  Y : response variable (vertical axis) After data collection, we have pairs of observations: (X1,Y1),…,(Xn,Yn) Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA X1 Y1 X2 Y2 … The variables Y … Xn Yn 16
  • 17. Simple LR model The regression of variable Y on variable X is given by: yi = β0 + β1xi + ϵi Residuals i = 1,...,n The difference between the observed value yi and the fitted value ^yi is called residual and is given by: where: Random Error: ϵi ̴N(0, σ2), independent Linear Function: β0 + β1xi = E(Y|X = xi ) Unknown parameters ei = yi - ^yi - β0 (Intercept): point in which the line intercepts the y-axis; - β1 (Slope): increase in Y per unit change in X. Least Squares Method Estimation of unknown parameters We want to find the equation of the line that “best" fits the data. It means finding β0 and β1 such that the fitted values of yi , given by ^yi = β0 + β1 xi ; A usual way of calculating β0 and β1 is based on the minimization of the sum of the squared residuals, or residual sum of squares (RSS): are as “close" as possible to the observed values yi . 𝑒2 𝑖 = 𝑅𝑆𝑆 = 𝑖 (𝑦𝑖 − 𝑦 𝑖)2 𝑖 (𝑦𝑖 − β0 − β1xi)2 𝑅𝑆𝑆 = 𝑖 Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 17
  • 18. Simple LR in R > # Download the data from a url > production <read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/productio n.txt", header=T, sep="") > # analyze the data > head(production) Case RunTime RunSize 1 1 195 175 2 2 215 189 3 3 243 344 4 4 162 88 5 5 185 114 6 6 231 338 > table(is.na(production)) FALSE 60 > str(production) 'data.frame': 20 obs. of 3 variables: $ Case : int 1 2 3 4 5 6 7 8 9 10 ... $ RunTime: int 195 215 243 162 185 231 234 166 253 196 ... $ RunSize: int 175 189 344 88 114 338 271 173 284 277 ... > attach(production) The following object is masked from production (position 3): Case, RunSize, RunTime > # Lets plot the data > plot(RunTime~RunSize) > # Fit the regression model using the lm() > production.lm <- lm(RunTime~RunSize, data=production) > # Use the function summary() to get some results Friday, November 22, 2013 > summary(production.lm) Call: lm(formula = RunTime ~ RunSize, data = production) Residuals: Min 1Q Median 3Q Max -28.597 -11.079 3.329 8.302 29.627 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 149.74770 8.32815 17.98 6.00e-13 *** RunSize 0.25924 0.03714 6.98 1.61e-06 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 16.25 on 18 degrees of freedom Multiple R-squared: 0.7302, Adjusted R-squared: 0.7152 F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06 > # plot a line fitting the model > abline(production.lm) > production <data.frame(production,fitted.value=fitted(production.lm),residual=resid(productio n.lm)) > head(production) Case RunTime RunSize fitted.value residual 1 1 195 175 195.1152 -0.1152469 2 2 215 189 198.7447 16.2553496 3 3 243 344 238.9273 4.0726679 4 4 162 88 172.5611 -10.5610965 5 5 185 114 179.3014 5.6985827 6 6 231 338 237.3719 -6.3718734 WITHOUT TEARS SERIES, DIGG DATA 18
  • 19. Multivariate Linear Regression Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 19
  • 20. Multivariate Linear Regression Objective Generalize the simple regression methodology in order to describe the relationship between a response variable Y and a set of predictors X1,X2,…, Xp in terms of a linear function. The variables  Y : response variable (vertical axis) After data collection, we have pairs of observations: (X11,…,X1p,Y1),…,(Xn1,…,Xnp,Yn) Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA … Xp Y X1 … X1p Y1 X2  X: explanatory variable (horizontal axis) X1 … X2p Y2 … … … … Xn … Xnp Yn 20
  • 21. Polynomial regression Price (y) Size (x) Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 21
  • 22. Multivariate LR model The model is given by: yi = β0 + β1xi +…+ βpxp + ϵi i = 1,...,n Residuals The difference between the observed value yi and the fitted value ^yi is called residual and is given by: where: Random Error: ϵi ̴N(0, σ2), independent ei = yi - ^yi Linear Function: β0 + β1xi + βpxp = E(y|x1 ,…, xp) Unknown parameters - β0 : overall mean - βk : regression coefficient Least Squares Method Estimation of unknown parameters We want to find the equation of the line that “best" fits the data. It means finding β0 and βk such that the fitted values of yi , given by ^yi = β0 + β1 xi ; A usual way of calculating β0, β1, …, βp is based on the minimization of the sum of the squared residuals, or residual sum of squares (RSS): : are as “close" as possible to the observed values yi . 𝑒2 𝑖 = 𝑅𝑆𝑆 = 𝑖 (𝑦𝑖 − 𝑦 𝑖)2 𝑖 (𝑦𝑖 − β0 − β1xi − ⋯ )2 𝑅𝑆𝑆 = 𝑖 Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 22
  • 23. Performance measurement Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 23
  • 24. Analysis of Variance (ANOVA) Total sample variability TSS Unexplained (or error) variability RSS Variability explained by the model SSreg > anova(production.lm) Analysis of Variance Table Response: RunTime Df Sum Sq Mean Sq F value Pr(>F) RunSize 1 12868.4 12868.4 48.717 1.615e-06 *** Residuals 18 4754.6 264.1 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 The ANOVA Table gives us the following information: • Degrees Of Freedom • The Sum Of The Squares • The Mean Square • The F ratio • The p-value Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 24
• 25. ANOVA Cond…
Select a model: y = β0 + β1x1 + β2x2 + β3x3 + … + ε
Use sample data to estimate the unknown parameters, then evaluate how useful the model is.
If we want to test the usefulness of a particular term in our model, we perform a t-test and look at the p-value for that term. However, if we want to test whether any of the terms in our model are useful in predicting y, we use the F-test. The F-test is a test of the hypotheses:
H0: β1 = β2 = … = βk = 0
H1: At least one of the coefficients is non-zero
Note 1: H0 always includes all of the parameters except the y-intercept β0.
Note 2: this test has the general set-up:
H0: None of the explanatory variables is helping
H1: At least one of the explanatory variables is helping
which shares the general format seen throughout the last couple of chapters:
H0: Model not useful
H1: Model useful
Once we know the test statistic of our F-test, we will often want to determine whether it is significant. As in all our tests, if our test statistic is more extreme (i.e. greater) than our critical value, we reject H0. By rejecting H0 we are saying that our model is significantly better than simply estimating y with its mean ȳ.
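The F statistic and its p-value can be pulled out of the fitted production model directly:

# Extract the F statistic (value, numerator df, denominator df)
f <- summary(production.lm)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)   # 1.615e-06, as reported above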
• 26. The Coefficient of Correlation
 The correlation coefficient (denoted r) is a measure of the strength of the linear relationship between x and y. It will always be between −1 and 1.
 If r is near −1 or 1, then there is a strong linear relationship.
 If r is near 0, then there is little or no linear relationship.
 A positive correlation occurs when an increase in one variable typically leads to an increase in the other variable.
 A negative correlation occurs when an increase in one variable typically leads to a decrease in the other variable.
r = SSxy / √(SSxx · SSyy)
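For the production data, r is one call away, and its square recovers the r² reported by summary() on slide 18:

# Correlation between predictor and response
r <- with(production, cor(RunSize, RunTime))
r      # about 0.85: a fairly strong positive linear relationship
r^2    # 0.7302, the Multiple R-squared of the simple model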
• 27. Measuring Goodness of Fit
Coefficient of determination, r²
 Represents the proportion of the total sample variability explained by the regression model.
 For simple linear regression, the r² statistic corresponds to the square of the correlation between Y and X.
 Indicates how well the model fits the data.
r² = (SSyy − SSE) / SSyy = 1 − SSE / SSyy
About 100(r²)% of the sample variation in y can be explained by (or attributed to) using x to predict y in the straight-line model. Ideally this value will be close to 1.
Adjusted r², r²adj
 Takes into account the number of degrees of freedom and is preferable to r².
Important note: neither r² nor r²adj gives a direct indication of how well the model will perform in the prediction of a new observation.
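Both statistics are easy to verify from the production model:

# r^2 reconstructed from the ANOVA sums of squares
ss <- anova(production.lm)[["Sum Sq"]]   # SSreg = 12868.4, RSS = 4754.6
1 - ss[2] / sum(ss)                      # 0.7302, matches summary()
summary(production.lm)$adj.r.squared     # 0.7152, adjusted for df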
• 28. Confidence & Prediction bands
Confidence bands
Reflect the uncertainty about the regression line (how well the line is determined).
Prediction bands
Also include the uncertainty about future observations.
Attention: these limits rely strongly on the assumption of normally distributed errors with constant variance and should not be used if this assumption is violated for the data being analyzed.
> predict(production.lm, interval="confidence")
       fit      lwr      upr
1 195.1152 187.2000 203.0305
2 198.7447 191.0450 206.4443
…
20 167.3762 154.4448 180.3077
> predict(production.lm, interval="prediction")
       fit      lwr      upr
1 195.1152 160.0646 230.1659
2 198.7447 163.7421 233.7472
…
20 167.3762 130.8644 203.8881
# Create a new data frame containing the values of X
# at which we want the predictions to be made
pred.frame <- data.frame(RunSize=seq(55, 345, by=10))
# Confidence bands
pc <- predict(production.lm, int="c", newdata=pred.frame)
# Prediction bands
pp <- predict(production.lm, int="p", newdata=pred.frame)
require(graphics)
# Standard scatterplot with extended limits
plot(RunSize, RunTime, ylim=range(RunSize, pp, na.rm=T))
pred.Size <- pred.frame$RunSize
# Add curves
matlines(pred.Size, pc, lty=c(1,2,2), lwd=1.5, col=1)
matlines(pred.Size, pp, lty=c(1,3,3), lwd=1.5, col=1)
[Figure: RunTime vs. RunSize scatterplot with the fitted line, confidence bands (dashed) and prediction bands (dotted)]
• 29. Validity of regression model
[Figure: four scatterplots of very different data sets that share identical regression summaries]
For all four data sets, the fitted regression is the same: ŷ = 3.0 + 0.5x. All models have r² = 0.67, σ̂ = 1.24, and the slope coefficients are significant at the < 1% level. Identical summary statistics can therefore hide very different (and sometimes invalid) models; always plot the data.
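These summaries match those of R's built-in anscombe data set (an assumption about the slide's source, but the numbers agree), so the demonstration is easy to reproduce:

# Fit the same simple regression to each of the four data sets
fits <- lapply(1:4, function(i)
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
sapply(fits, function(f) round(c(coef(f), r2 = summary(f)$r.squared), 2))
# Every column: intercept ~3.0, slope ~0.5, r2 ~0.67; plotting each pair
# reveals how differently the four data sets actually behave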
• 30. Residual plots
 A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
 If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
[Figure: three residual plots. The first shows a random pattern, indicating a good fit for a linear model. The other two patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.]
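For the production model, the residual plot takes two lines:

# Residuals vs. the predictor; points should scatter randomly around zero
plot(production$RunSize, resid(production.lm),
     xlab = "RunSize", ylab = "Residuals")
abline(h = 0, lty = 2)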
• 31. Residual plots
[Figure: residuals vs. X]
• 32. Residual plots
[Figure: residuals vs. fitted values]
• 33. Influential point
Outliers
Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier:
 It could have an extreme X value compared to other data points.
 It could have an extreme Y value compared to other data points.
 It could have extreme X and Y values.
 It might be distant from the rest of the data, even without extreme X or Y values.
Influential points
An influential point is an outlier that greatly affects the slope of the regression line.
• 34. Influential point Cond…
Leverage/Influential points
 Good leverage points have their standardized residuals within the interval [−2, 2].
 Outliers are leverage points whose standardized residuals fall outside the interval [−2, 2].
How to deal with them (see the sketch below):
 Remove invalid data points
  o if they look unusual or are different from the rest of the data
 Fit a different regression model, if the model is not valid for the data
  o higher-order terms
  o transformation
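Standardized residuals and leverage are both built into R; a quick screen for the production model:

# Standardized residuals: flag cases outside [-2, 2]
sr <- rstandard(production.lm)
which(abs(sr) > 2)
# Leverage (hat) values: 2 * mean(h) is a common rule-of-thumb cutoff
h <- hatvalues(production.lm)
which(h > 2 * mean(h))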
• 35. Normality & constant variance of errors
The normality and constant variance assumptions are necessary for inference:
• hypothesis testing
• confidence intervals
• prediction intervals
Checks:
 Check the normal Q-Q plot of the standardized residuals.
 Check the standardized residuals vs. X plot.
When these assumptions do not hold, we can try to correct the problem using data transformations.
• 36. Normality & constant variance check
> production.lm <- lm(RunTime~RunSize, data=production)
> # Residual diagnostic plots
> plot(production.lm)
[Figure: the four standard lm diagnostic plots: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage]
  • 37. Cook’s distance Cook's Distance: D the Cook's distance statistic combines the effects of leverage and the magnitude of the residual. it is used to evaluate the impact of a given observation on the estimated regression coefficients. D > 1: undue influence The Cook's distance plot is obtained by applying the function plot() to the linear model object. Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA 37
• 38. Transformation
When to use transformation?
Transformations can be used to correct for:
 non-constant variance
 non-linearity
 non-normality
There are many ways to transform variables to achieve linearity for regression analysis, e.g. taking logs, square roots, or reciprocals of the variables.
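As a sketch (the transformation choice here is illustrative, not prescribed by the slides), a log transform of the response in R:

# Log-transform the response to stabilize variance that grows with the mean
production.loglm <- lm(log(RunTime) ~ RunSize, data = production)
plot(production.loglm)   # re-check the residual diagnostics afterwards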
• 39. Assumptions for Simple LR
There are four principal assumptions which justify the use of linear regression models for purposes of prediction:
I. Linearity of the relationship between the dependent & independent variables: Y = β0 + β1X + ϵ
II. Independence of the errors (no serial correlation)
III. Homoscedasticity (constant variance) of the errors
 a) versus time
 b) versus the predictions (or versus any independent variable)
IV. Normality of the error distribution
If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be inefficient or seriously biased or misleading.
What can go wrong? Violations:
• In the linear regression model:
 o linearity (e.g. a quadratic relationship or higher-order terms)
• In the residual assumptions:
 o non-normal distribution
 o non-constant variance
 o dependence
 o outliers
Checks:
 Residuals vs. each predictor variable
  o nonlinearity: add higher-order terms in that variable
 Residuals vs. fitted values
  o variance increasing with the response: transformation
 Residuals Q-Q normal plot
  o deviation from a straight line: non-normality
• 40. Violations of linearity
These are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.
How to detect
Plot
• observed vs. predicted values, or
• residuals vs. predicted values
Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.
How to fix
 Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, a log transformation may be feasible.
 Another possibility is to add another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted values suggests a parabolic curve, then it may make sense to regress Y on both X and X² (see the sketch below). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.
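A sketch with made-up data showing the X² fix:

# Simulated curved data; a straight line misses the curvature
set.seed(2)
x <- seq(-3, 3, length.out = 60)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(60)
fit.lin  <- lm(y ~ x)
fit.quad <- lm(y ~ x + I(x^2))   # I() protects the arithmetic in a formula
anova(fit.lin, fit.quad)         # the squared term improves the fit markedly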
• 41. Violations of homoscedasticity
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.
How to detect
Plots of
• residuals vs. time, and
• residuals vs. predicted values
Check for residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)
How to fix
 In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case.
 A simple fix would be to work with shorter intervals of data in which volatility is more nearly constant.
 Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
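R's built-in Scale-Location diagnostic is a convenient check; for the production model:

# Scale-Location plot: a rising trend in the smoothed line suggests
# variance increasing with the fitted values
plot(production.lm, which = 3)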
• 42. Violations of normality
 Non-normality compromises the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers.
 Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates.
 Calculation of confidence intervals and the various significance tests for coefficients are all based on the assumption of normally distributed errors.
 If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
How to detect
The best test for normally distributed errors is a normal probability plot of the residuals.
 o This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
 o A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction).
 o An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.
How to fix
Violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of the variables might cure both problems.
In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors, or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about the values of some of the coefficients and/or the most realistic guide to the magnitudes of forecast errors.
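In R the normal probability plot of the residuals takes two calls; for the production model:

# Normal Q-Q plot of the standardized residuals
qqnorm(rstandard(production.lm))
qqline(rstandard(production.lm))   # points should fall close to this line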
• 43. Thank you!