1. Statistical Analysis Software
Bivariate and Multivariate Regression Analysis
Academic Department of Marketing
Caucasus School of Business
Caucasus University
2011
2. Problems of Test 1
• Formulating null and alternative hypotheses incorrectly
• Ignoring the question “why”
• Ignoring the need to comment on the scale used
• Mixing up the Wilcoxon and paired-samples t tests
• Widespread failure to check the equality of variances (Levene’s test)
• Kolmogorov-Smirnov test
3. Homework 1
• Three or four homework assignments will be
given throughout the course. You will be
informed about the number of points you can
get from each assignment.
• The first homework assignment will include two problems. The first one is the ANOVA problem from Test 1 – each of you will have an individual database. The second problem will be about using Pearson’s chi-square statistic in cross-tabulations; however, you will have to come up with your own example and your own fictional database.
• The assignment is worth 2 points and is due
4. Important Note (Homework)
• EVEN IF ALL THE INTERPRETATION IS CORRECT, YOU WILL GET ZERO POINTS IF YOU SUBMIT THE WRONG OUTPUT, WHETHER BECAUSE YOU RAN THE WRONG TEST OR BECAUSE YOU USED SOMEBODY ELSE’S DATASET.
5. Warming Up – Linear Equations
• What does a linear relationship imply?
• What does a linear relationship look like (mathematically)?
• What are the variables in this equation
and what are the parameters?
• How are the parameters interpreted?
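As a warm-up answer to these questions, here is a minimal numerical sketch; the parameter values b0 and b1 below are hypothetical, chosen purely for illustration:

```python
# A linear relationship: y = b0 + b1 * x
# b0 (intercept) is the value of y when x = 0;
# b1 (slope) is the change in y for a one-unit change in x.
b0, b1 = 2.0, 0.5  # hypothetical parameter values

for x in [0, 1, 2, 3]:
    print(f"x = {x}, y = {b0 + b1 * x}")
# y rises by exactly b1 = 0.5 for each 1-unit increase in x.
```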
6. Scatterplot (1)
• Scatterplot – a collection of points (x, y) on the coordinate system. Each point on a scatterplot depicts a single case that has a specific X value and a specific Y value, which can be read off the X and Y axes. (A code sketch follows below.)
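A minimal sketch of drawing such a scatterplot in Python with matplotlib; the income and saving values here are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical data: each (x, y) pair is one case (one household).
income = [500, 800, 1000, 1200, 1500, 2000]   # X values, in Lari
saving = [150, 230, 270, 300, 340, 420]       # Y values, in Lari

plt.scatter(income, saving)
plt.xlabel("Income (Lari)")
plt.ylabel("Saving (Lari)")
plt.title("Saving vs. income")
plt.show()
```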
7. Scatterplot (2)
• As we see, there is a certain relationship
between income and saving – the higher
the income, the higher the saving.
• But are we interested only in the
direction? Not really. It is important to
measure by how much saving increases as
income increases by, say, 1 Lari.
• By saying this we imply that there is a
linear relationship between income and
saving (which is not necessarily true, but
let’s ignore this for now).
8. Scatterplot (3)
• Going back to our scatterplot, we need
to find a line (i.e. determine the
intercept and the slope) which best
describes the relationship between two
variables (in this case saving and
income).
• This is exactly where regression comes
into play – it helps to identify such a line
by using the sample information.
9. Bivariate Regression Model
• In theory, the relationship between saving
and income already exists and is
somewhere out there – we can’t really
determine it in practice. Why? Because we
would need to collect information about
everybody’s income and everybody’s
saving (i.e. we would need information
about the whole population).
• If we could, the bivariate regression model
would look like this:
Y = β0 + β1*X, where Y is saving and X is income.
10. Error Term
• Note that even in the ideal case, where we
have information about the population, we are
still unable to exactly predict the level of saving
by the level of income. Why? Because income
is not the only factor that determines saving.
There are other factors that aren’t accounted
for in our bivariate regression model.
• All the other factors not explicitly accounted for in the regression model fall into the so-called error term, denoted by ε.
• Therefore, the population regression model looks like this:
Y = β0 + β1*X + ε
11. Linear Regression Analysis
(Bivariate)
• Identifying the line that depicts the
relationship between X and Y boils down to
estimating β0 and β1.
• What a regression basically does is provide us with estimates (regression coefficients) of β0 and β1, which are denoted by b0 and b1.
• The estimated regression model looks like this (see the sketch below):
Ŷ = b0 + b1*X
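As a language-neutral illustration alongside the course software, here is a minimal sketch of the same estimation in Python with statsmodels, using made-up income and saving data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical sample (in Lari); a real analysis would use survey data.
income = np.array([500, 800, 1000, 1200, 1500, 2000])
saving = np.array([150, 230, 270, 300, 340, 420])

X = sm.add_constant(income)        # adds the column of 1s for b0
model = sm.OLS(saving, X).fit()    # ordinary least squares
b0, b1 = model.params
print(f"b0 (intercept) = {b0:.3f}, b1 (slope) = {b1:.3f}")
```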
12. Interpreting Regression
Coefficients
• Ŷ = b0 + b1*X
• Ŷ – the predicted value: the value of Y the model predicts when X takes a specific value.
• b0 – intercept: the predicted value of Y when X = 0.
• b1 – slope estimate: shows by how much the predicted value of Y changes as X changes by 1 unit.
13. Residual
• The residual is the difference between the actual value of Y and the predicted value of Y, and is denoted by e.
• e = Y – Ŷ
• Do not confuse the residual with the error term. They are NOT the same. We never observe the error term; the residual, however, is easy to compute, and it serves as an estimate of the error term. (See the sketch below.)
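Continuing the same hypothetical example, the residuals can be computed directly from the definition:

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical data as in the earlier sketch
income = np.array([500, 800, 1000, 1200, 1500, 2000])
saving = np.array([150, 230, 270, 300, 340, 420])
X = sm.add_constant(income)
model = sm.OLS(saving, X).fit()

# Residual: e = Y - Y_hat, one per observation
residuals = saving - model.predict(X)
print(residuals)      # computed by hand from the definition
print(model.resid)    # the same values, stored by statsmodels
```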
15. Linear Regression - Output
• Thus, if income is 0, the predicted saving equals 124.842. And if income increases by 1 Lari, predicted saving increases by 0.147 Lari.
• Is this model appropriate for predicting the level of saving? Not really. Saving is also determined by other factors, like family size and the education level, age, and gender of the household head. (Of course, there may be other determinants as well, but let's focus on these for now.)
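Plugging the estimates from the output into the fitted equation shows where these interpretations come from (a simple arithmetic check):

```python
# Fitted model from the output: predicted_saving = 124.842 + 0.147 * income
b0, b1 = 124.842, 0.147

print(b0 + b1 * 0)       # income = 0    -> predicted saving = 124.842
print(b0 + b1 * 1000)    # income = 1000 -> 124.842 + 147.0 = 271.842
print(b0 + b1 * 1001)    # one more Lari of income adds exactly b1 = 0.147
```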
16. Multiple Regression Analysis
• Multiple regression implies including more than one independent variable in the regression model. Basically, it looks like this:
Y = β0 + β1*X1 + β2*X2 + β3*X3 + … + βk*Xk + ε
• In this case we need to estimate (k+1) parameters: b0, b1, b2, …, bk.
• Interpretation of the slope coefficients: b1 shows by how much predicted Y changes as X1 changes by 1 unit, holding all other X-s constant.
• Interpretation of the intercept: the predicted value of Y when all the X-s are equal to zero. (A fitted example follows below.)
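A minimal sketch of fitting such a model in Python with statsmodels; the data and the column names (income, fam_size, educ) are assumptions made up for illustration:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical household data; column names are assumptions.
df = pd.DataFrame({
    "saving":   [150, 230, 270, 300, 340, 420, 180, 260],
    "income":   [500, 800, 1000, 1200, 1500, 2000, 600, 900],
    "fam_size": [3, 4, 2, 5, 4, 3, 2, 5],
    "educ":     [10, 12, 16, 11, 15, 18, 9, 13],
})

X = sm.add_constant(df[["income", "fam_size", "educ"]])  # k = 3 here
model = sm.OLS(df["saving"], X).fit()
print(model.params)  # b0 plus one slope per X; each slope is a partial
                     # effect, holding the other X-s constant
```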
18. Major Goals of Conducting
Regression Analysis
• Goal 1. Measuring partial effects – by how
much does Y change when X1 changes by 1
unit, holding all other X-s constant?
• Goal 2. Forecasting the values of the dependent variable – what is the predicted saving level (in Laris) of a family with an income of 1000 Laris and 5 members, whose household head studied for 15 years and is 47 years old?
• Regression provides answers to these
questions.
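For Goal 2, the forecast is just the fitted equation evaluated at that family's values. A sketch with purely hypothetical coefficient values (not taken from the slides):

```python
import numpy as np

# Hypothetical estimates, for illustration only:
# b0, then slopes for income, family size, education, age
b = np.array([50.0, 0.12, -8.0, 3.5, 0.9])
x = np.array([1.0, 1000, 5, 15, 47])  # 1 for the intercept, then the X-s

print(b @ x)  # predicted saving in Laris = 50 + 120 - 40 + 52.5 + 42.3
```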
19. Predictive Power of a Model
• In order to know how good our model
is for forecasting, we need to measure
the predictive power of the model. In
other words, we want to know how
well the independent variables explain
the dependent variable.
• Coefficient of determination (R-
squared) is widely used for this
purpose.
20. Coefficient of Determination –
R-Squared (1)
• Coefficient of determination (R-squared)
measures the portion of the variation in Y
explained by the variation in X-s, in other
words, how much of the variation in the
dependent variable is explained by the
independent variables.
• This is also called goodness-of-fit.
• R-squared ranges from 0 to 1 and shows how well
the regression line describes the data cloud that
you see on the scatterplot.
• The closer the data are clustered around the regression line, the closer the R-squared is to 1. R2 = 1 is a perfect fit (which virtually never happens in practice). The closer the R-squared is to 0, the worse the fit.
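R-squared can be computed directly from its definition; a minimal sketch on the same hypothetical bivariate data as before:

```python
import numpy as np
import statsmodels.api as sm

income = np.array([500, 800, 1000, 1200, 1500, 2000])
saving = np.array([150, 230, 270, 300, 340, 420])
model = sm.OLS(saving, sm.add_constant(income)).fit()

# R-squared = 1 - SS_residual / SS_total
ss_res = np.sum(model.resid ** 2)
ss_tot = np.sum((saving - saving.mean()) ** 2)
print(1 - ss_res / ss_tot)   # computed from the definition
print(model.rsquared)        # the same value from statsmodels
```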
21. Coefficient of Determination –
R-Squared (2)
• For example, if R-squared is equal to 0.045, it means that the independent variables explain only 4.5% of the variation in the dependent variable.
• This is an example of low predictive power.
• The higher the R-squared, the better the
predictive power of your model.
22. Testing Significance of Regression
Coefficients (1)
• As we already mentioned, the other goal
of regression analysis is to determine
partial effects.
• Basically, a partial effect measures the pure effect of the respective independent variable on the dependent variable.
• What we want to know is whether these
pure effects are important. How can we
find this out?
• This is done by testing the significance of
the regression coefficients.
23. Testing Significance of Regression
Coefficients (2)
• Suppose we want to test whether age
of household head (X4) has an
important effect on saving once all the
other factors (household
size, income, education of household
head) are controlled for.
• Null hypothesis is that β4 = 0. (i.e., as X4
changes by 1 unit, nothing happens to
Y, no effect on Y)
• Alternative hypothesis is that β4 is
different from 0 (two-tailed test).
24. Testing Significance of Regression
Coefficients (3)
• It can be shown that if we divide the estimate of β4 (that is, b4) by the standard error of b4 (the estimated standard deviation of b4), the resulting statistic follows a t distribution under the null hypothesis.
• Thus, we can either calculate the t statistic and compare it to the critical t value at the 5% significance level, or we can simply look at the p-value (Sig.) of the regression coefficient. If the latter is less than 0.05, we conclude that the regression coefficient is significantly different from zero (or, for short, significant). In other words, the partial effect of this variable is statistically important. (A sketch follows below.)
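The same computation can be sketched outside the course software, on the hypothetical data used earlier:

```python
import numpy as np
import statsmodels.api as sm

income = np.array([500, 800, 1000, 1200, 1500, 2000])
saving = np.array([150, 230, 270, 300, 340, 420])
model = sm.OLS(saving, sm.add_constant(income)).fit()

# t statistic = coefficient / standard error of the coefficient
print(model.params / model.bse)  # matches model.tvalues
print(model.pvalues)             # the p-values (the "Sig." column)
# A coefficient is significant at the 5% level if its p-value < 0.05.
```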
25. Testing Significance of Regression
Coefficients - Example
• Going back to our multivariate regression example, no single independent variable appears to be statistically significant – all the p-values are above 0.05.
• However, even though these variables are separately insignificant, there is a chance that they are collectively significant.
• This hypothesis is tested with the joint F test.
26. Joint F Test
• Null Hypothesis: β1 = β2 = β3 = β4 = 0
• Alternative Hypothesis: at least one of them is different
from zero.
• This is equivalent to testing whether the population R2 = 0.
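A minimal sketch of reading the joint F test from a fitted model; the data here are random noise (an assumption for illustration), so we would expect not to reject the null:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # four independent variables
y = rng.normal(size=50)        # dependent variable, unrelated to the X-s

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue)    # joint F statistic: H0 is that all slopes are 0
print(model.f_pvalue)  # reject H0 (joint significance) if < 0.05
```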
27. Important Note
• It can happen that all the coefficients are separately insignificant but jointly significant, even though in our example they are also jointly insignificant at the 5% significance level.
• It can also happen that regression coefficients are separately significant but jointly insignificant. WHEN?