2. Introduction
Types of regression
Regression line and equation
Logistic regression
Relation between probability, odds ratio and logit
Purpose
Uses
Assumptions
Logistic regression equation
Interpretation of log odd and odds ratio
Example
CONTENTS
3. REGRESSION is the measure of the average
relationship between two or more variables in terms
of the original units of the data.
There are different types of regression.
Among many types of regression, the most common
in medical research is LOGISTIC REGRESSION.
Introduction
4. SIMPLE LINEAR REGRESSION uses one independent
variable to explain and/or predict the outcome of Y:
Y = α + βX + e
MULTIPLE LINEAR REGRESSION uses two or more
independent variables to predict the outcome:
Y = α + β1X1 + β2X2 + ... + βkXk + e
Introduction
5. The equation of the straight line
is given by the regression equation.
Population regression equation:
Y = α + βX + e
Sample regression equation:
Y = a + bX
where ‘α’ or ‘a’ is the intercept,
‘β’ or ‘b’ is the slope of the line,
which measures the amount of
change in Y for a unit change in X.
‘e’ is the regression residual/error
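The intercept a and slope b of the sample regression line can be estimated by least squares. A minimal Python sketch, using made-up data points purely for illustration:

```python
# Least-squares estimates of the sample regression line Y = a + bX.
# The (x, y) data points are invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
# Intercept: the fitted line passes through the point of means
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))  # 0.11 1.97
```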
8. Used to analyze relationships between a CATEGORICAL
dependent variable and metric or categorical independent
variables.
Often chosen if the predictor/independent variables are a
mix of continuous and categorical variables
ln[p/(1-p)] = α + β1X1 + β2X2 + β3X3 + ... + βtXt + e
The estimated probability is:
p = 1/[1 + exp(-(α + β1X1 + β2X2 + β3X3 + ... + βtXt))]
• p is the probability that the event Y occurs, p(Y=1)
• p/(1-p) is the odds of the event
• ln[p/(1-p)] is the log odds, or "logit"
Logistic Regression
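The probability-to-logit relation above is easy to check numerically. A minimal Python sketch (not part of the original slides):

```python
import math

def logit(p):
    """Log odds of a probability p: ln[p/(1-p)]."""
    return math.log(p / (1.0 - p))

def prob(logit_value):
    """Invert the logit: p = 1 / (1 + exp(-logit))."""
    return 1.0 / (1.0 + math.exp(-logit_value))

# A probability of 0.4 corresponds to odds 0.4/0.6 and logit ln(0.4/0.6)
print(round(logit(0.4), 3))        # -0.405
print(round(prob(logit(0.4)), 3))  # 0.4 (round trip recovers p)
```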
9. Each predictor (IV) is given a coefficient ‘b’
which measures its independent contribution
to variations in the DV. The DV can only take
on one of two values: 0 or 1.
What we want to predict from a knowledge of
relevant IVs and coefficients is therefore not a
numerical value of a DV as in linear
regression, but rather the probability (p) that it
is 1 rather than 0 (belonging to one group
rather than the other).
Logistic regression equation
10. When And Why
Used because having a categorical outcome
variable violates the assumption of linearity in
normal regression.
Does not assume a linear relationship between
DV and IV
Predictors do not have to be normally
distributed
Logistic regression does not make any
assumptions of normality, linearity, or
homogeneity of variance for the independent
variables.
13. Binary logistic regression model:
Used to model a binary response—e.g. yes or no.
Ordinal (ordered) logistic regression model (ordinal
multinomial logistic model.)
Used to model an ordered response—e.g. low,
medium, or high.
Nominal (unordered) logistic regression model
(polytomous, polychotomous, or multinomial)
Used to model a multilevel response with no
ordering—e.g. eye color with levels brown,
green, and blue.
Types Of Logistic Regression
15. Example: 100 participants are randomized to a new or
standard treatment (50 subjects to each treatment
group)
Are chances of success equal for each treatment
group?
Groups New Standard Total
Success 20 10 30
Failure 30 40 70
Total 50 50 100
16. The probability of success:
Pnew = Pr(success | new treatment) = 20/50 = 40%
Pst = Pr(success | std. treatment) = 10/50 = 20%
The odds of success:
Onew = Pnew/(1-Pnew) = 20/30 = 0.67
Ost = Pst/(1-Pst) = 10/40 = 0.25
The natural logarithm of odds of success (= LOGIT)
LOGITnew = log (20/30) = -0.41 (new treatment)
LOGITst = log (10/40) = log(0.25) = -1.39 (std.
treatment)
How to measure the chances of success?
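The probabilities, odds, and logits in this slide follow directly from the 2×2 table. A minimal Python sketch:

```python
import math

# 2x2 table from the example: 50 subjects per treatment arm
successes = {"new": 20, "standard": 10}
totals = {"new": 50, "standard": 50}

for group in ("new", "standard"):
    p = successes[group] / totals[group]   # probability of success
    odds = p / (1 - p)                     # odds of success
    lg = math.log(odds)                    # logit = natural log of odds
    print(group, round(p, 2), round(odds, 2), round(lg, 2))
# new 0.4 0.67 -0.41
# standard 0.2 0.25 -1.39
```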
17. OR = Onew/Ost =(20/30)/(10/40)= 0.67/0.25 = 2.67
If OR = 1 then the success chances are the
same in each group which means
Pnew = Pst or Onew = Ost
The null hypothesis is H0: OR = 1 vs the
alternative Ha: OR ≠ 1
In this case, the odds of success are 2.67
times higher for the new treatment
compared to the standard one
The odds ratio is thus one way to capture
inequality in the chances of success
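The odds ratio computation on this slide, as a one-line Python check:

```python
# Odds ratio: odds of success under the new treatment
# divided by odds under the standard treatment
o_new = (20 / 50) / (1 - 20 / 50)   # = 2/3
o_std = (10 / 50) / (1 - 10 / 50)   # = 1/4
odds_ratio = o_new / o_std
print(round(odds_ratio, 2))  # 2.67
```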
18. The probability of success can be represented via
odds or LOGITs of success
From above example
LOGITnew = -0.41 (new treatment)
LOGITst = -1.39 (standard treatment)
So the difference between the log odds = .98
We can combine these two log odds for different
groups into one formula
Log(odds) = -1.39 +0.98*(treatment is new)
(example of simple logistic regression)
Simple logistic regression
19. In this logistic regression -1.39 and 0.98 are
regression coefficients
-1.39 is called the model intercept
0.98 is the treatment effect or the difference
between LOGITs
Simple logistic regression
20. LOGIT = -1.39 + 0.98 (treatment is new)
If treatment is ‘standard” then
LOGIT = -1.39 +0.98*0 = -1.39 and
odds = Ost = exp(-1.39) = 0.25 and
Pst = 20%
If treatment is ‘new” then
LOGIT = -1.39 +0.98*1 = -0.41 and
odds = Onew = exp(-0.41) = 0.67 and
Pnew = 40%
Simple logistic regression
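Plugging the two treatment codes into the fitted equation recovers the group odds and probabilities. Note the slightly different third column (0.66 vs the slides' 0.67 for the new-treatment odds): this comes from using the rounded coefficients -1.39 and 0.98 rather than the exact logits.

```python
import math

# Fitted simple logistic model from the slides (coefficients rounded)
def logit_hat(is_new):
    return -1.39 + 0.98 * is_new   # is_new = 1 for new treatment, else 0

for is_new in (0, 1):
    lg = logit_hat(is_new)
    odds = math.exp(lg)            # odds = exp(logit)
    p = odds / (1 + odds)          # probability from odds
    print(is_new, round(lg, 2), round(odds, 2), round(p, 2))
# 0 -1.39 0.25 0.2
# 1 -0.41 0.66 0.4
```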
21. If we apply the antilog to 0.98 then exp(0.98) = 2.67,
the odds ratio!!!
This 2.67 is significantly different from 1 (the
chi-square p-value was < 5%), which means we
have a significant increase in the odds of
treatment success.
Simple logistic regression
22. The crucial limitation of linear regression is that it
cannot deal with DV’s that are dichotomous and
categorical
Logistic regression employs binomial probability
theory, in which there are only two values to predict:
the probability (p) that the outcome is 1 rather than 0,
i.e. that the event/person belongs to one group rather
than the other.
Logistic regression forms a best fitting equation or
function using the maximum likelihood method, which
maximizes the probability of classifying the observed
data into the appropriate category given the
regression coefficients.
Purpose of logistic regression
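In general, maximum likelihood requires iterative fitting, but for a single binary predictor the estimates have a closed form: the intercept is the reference group's logit and the slope is the difference of logits. A sketch using the earlier treatment example (the function name is mine, not from the slides):

```python
import math

def mle_simple_logistic(p_ref, p_trt):
    """Closed-form ML estimates for one binary predictor:
    intercept = logit of the reference group,
    slope = logit difference between the two groups."""
    logit = lambda p: math.log(p / (1 - p))
    return logit(p_ref), logit(p_trt) - logit(p_ref)

# Treatment example: 10/50 successes (standard), 20/50 (new)
a, b = mle_simple_logistic(10 / 50, 20 / 50)
print(round(a, 2), round(b, 2))  # -1.39 0.98
```

These are exactly the intercept and treatment effect quoted in the simple logistic regression slides above.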
23. Like ordinary regression, logistic regression
provides a coefficient ‘b’, which measures each
IV’s partial contribution to variations in the DV.
To accomplish this goal, a model (i.e. an equation)
is created that includes all predictor variables that
are useful in predicting the response variable.
Variables can, if necessary, be entered into the
model in the order specified by the researcher in a
stepwise fashion like regression.
Purpose of logistic regression
24. The first use is the prediction of group membership.
Since logistic regression calculates the
probability of success over the probability of
failure, the results of the analysis are in the
form of an ODDS RATIO.
It also provides knowledge of the relationships
and strengths among the variables (e.g.
marrying the boss’s daughter puts you at a
higher probability for job promotion than
undertaking five hours unpaid overtime each
week).
Uses of logistic regression
25. Methods
Simultaneous method: in which all independents
are included at the same time
Hierarchical method: Variables entered in blocks.
Blocks should be based on past research, or theory
being tested. Good Method.
Stepwise method: (forward conditional in SPSS) in
which variables are selected in the order in which
they maximize the statistically significant
contribution to the model.
Binary Logistic Regression
26. The minimum number of cases per independent
variable is 10.
For preferred case-to-variable ratios, we will
use 20 to 1 for simultaneous and hierarchical
logistic regression and 50 to 1 for stepwise
logistic regression.
Sample size requirements
27. 1. Assumes a linear relationship between the LOGIT of the
DV and the IVs
However, it does not assume a linear relationship
between the actual dependent and independent
variables
2. The sample is ‘large’: reliability of estimation declines
when there are only a few cases. A minimum of 50
cases per predictor is recommended.
3. IVs are not linear functions of each other
4. Normal distribution is not necessary or assumed for
the dependent variable.
5. Homoscedasticity is not necessary for each level of the
independent variables.
Assumptions
29. In SPSS the b coefficients are located in column ‘B’ in
the ‘Variables in the Equation’ table.
Logistic regression calculates changes in the log
odds of the dependent, not changes in the
dependent value.
Odds value can range from 0 to infinity and tell you
how much more likely it is that an observation is a
member of the target group rather than a member
of the other group.
SPSS calculates the exponentiated coefficient exp(b),
which is the odds ratio, and presents it as EXP(B) in
the results printout in the ‘Variables in the
Equation’ table.
Interpreting log odds and the odds ratio
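The relation between a coefficient b and its EXP(B) is a single exponentiation. Using the treatment effect from the earlier example (the value 2.66 differs slightly from the slides' 2.67 because the coefficient 0.98 is rounded):

```python
import math

# EXP(B) is the exponentiated coefficient: the factor by which the
# odds change per one-unit increase in the predictor.
b = 0.98                      # treatment coefficient from the example
print(round(math.exp(b), 2))  # 2.66 (2.67 with the unrounded logit difference)
```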
30. -2 Loglikelihood: compares the fit of two models
(how well one model fits as compared to the other).
The lower the value, the better the fit of the
alternative model.
Chi Square Test: the difference between the base
model and the proposed model. A larger difference
is better: if p < 0.05 the alternative (proposed)
model is better; otherwise the models are the same
or the base model is better.
Classification Table: a table showing how many
observations have been predicted correctly. The
higher the correct prediction, the better.
Diagnosis of LR
31. Likelihood Ratio Test
What is it? It checks whether the fuller model is
better than the base model.
Based on: the loglikelihood function,
-2loglikelihood, which measures the discrepancy
between the observed and predicted values.
Interpretation: the lower the value, the better.
32. Wald Test
What is it? It gives the “importance” of the
contribution of each variable in the model.
Based on: the Chi Square distribution at 1 df.
Interpretation: the higher the value, the more
“important” the variable is.
33. Measure of the Proportion of Variance
What is it? A measure of the proportion of
variation explained.
Based on: a comparison of the log-likelihood of
the base and proposed models.
Measures: Cox & Snell’s R2 and Nagelkerke’s R2.
Interpretation: the higher the better (value is
between 0 and 1). Cox & Snell’s R2 does not attain
1 for the perfect model; Nagelkerke’s R2 does.
34. The Hosmer-Lemeshow Goodness-of-Fit Test
What is it? It tests how well your model fits the data.
Based on: a test statistic that produces a p-value.
Interpretation: if the p-value is low (< .05), you reject
the model. If it is high, then your model passes the test.
35. Interpreting the Logistic Model
Logit: with a one-unit increase in x, the log(odds)
of success will increase by 1.3 units on average.
Odds ratio: with a one-unit increase in x, the odds
of success are multiplied by exp(1.3) ≈ 3.67.
Probability: the model gives the probability of
success for a particular value of x.
36. Data from a survey of home owners conducted by an electricity
company about an offer of roof solar panels with a 50% subsidy
from the state government as part of the state’s environmental
policy.
The variables involve household income measured in units of a
thousand dollars, age, monthly mortgage, size of family
household, and whether the householder would take or decline
the offer.
1. Click Analyze >> Regression >> Binary Logistic
2. Select the grouping variable (the variable to be predicted)
which must be a dichotomous measure and place it into the
Dependent box.
3. Enter your predictors (IV’s) into the Covariates box. These are
‘family size’ and ‘mortgage’.
SPSS Example
37. In SPSS, the model is always constructed to predict the
group with higher numeric code.
• If responses are coded 1 for Yes and 2 for No, SPSS will predict
membership in the No category.
• If responses are coded 1 for No and 2 for Yes, SPSS will predict
membership in the Yes category.
We will refer to the predicted event for a
particular analysis as the modeled event.
39. 4. If there are any categorical predictor
variables, click the “Categorical” button and enter
them (there are none in this example).
40. 5. Click on the Options button and select Classification
Plots, Hosmer-Lemeshow Goodness of Fit, Casewise
Listing of Residuals, and select Outliers Outside 2 sd.
Retain the default entries for probability of
stepwise, classification cutoff and maximum
iterations.
6. Continue then OK.
42. The first one to take note of is the Classification table in
Block 0 Beginning Block.
Block 0: Beginning Block. Block 0 presents the results
with only the constant included before any coefficients
(i.e. those relating to family size and mortgage) are
entered into the equation.
The table suggests that if we knew nothing about our
variables and guessed that a person would take the
offer we would be correct 53.3% of the time.
Interpretation of printout tables
43.
44. The ‘Variables not in the Equation’ table tells us
whether each IV would improve the model.
The answer is yes for both variables, with family size
slightly better than mortgage size, as both are
significant and if included would add to the predictive
power of the model.
If they had not been significant and able to contribute
to the prediction, then termination of the analysis would
obviously occur at this point.
Variables not in the equation
45.
46. The overall significance is tested using what SPSS calls
the Model Chi square, which is derived from the
likelihood of observing the actual data under the
assumption that the model that has been fitted is
accurate.
In our case model chi square has 2 degrees of freedom,
a value of 24.096 and a probability of p < 0.001.
Thus, the indication is that the model containing only
the constant has a poor fit, and that the predictors
do have a significant effect and create essentially a
different, better-fitting model.
So we need to look closely at the predictors and from
later tables determine if one or both are significant
predictors.
Model chi-square
47.
48. Cox and Snell’s R-Square attempts to imitate multiple
R-Square based on ‘likelihood’, but its maximum can
be (and usually is) less than 1.0
The Nagelkerke modification that does range from 0
to 1 is a more reliable measure of the relationship.
Nagelkerke’s R2 is part of SPSS output in the ‘Model
Summary’ table and is the most-reported of the
R-squared estimates.
In this case it is 0.737, indicating a moderately strong
relationship of 73.7% between the predictors and the
prediction.
Model Summary
49.
50. Examples of Approximate R2 Values
[Scatter plots] R2 = 1: perfect linear relationship
between x and y; 100% of the variation in y is
explained by variation in x.
51. [Scatter plots] 0 < R2 < 1: weaker linear
relationship between x and y; some but not all of
the variation in y is explained by variation in x.
52. [Scatter plot] R2 = 0: no linear relationship
between x and y; the value of y does not depend on
x (none of the variation in y is explained by
variation in x).
53. If the p-value for the H-L goodness-of-fit test is greater
than .05, as we want for well-fitting models, we fail to reject the
null hypothesis that there is no difference between
observed and model-predicted values, implying that
the model’s estimates fit the data at an acceptable
level.
That is, well-fitting models show non-significance on the
H-L goodness-of-fit test.
Hosmer and Lemeshow statistic
54.
55. In the Classification table, the columns are the two
predicted values of the dependent, while the rows are the
two observed (actual) values of the dependent.
In this study, 87.5% were correctly classified for the take
offer group and 92.9% for the decline offer group.
Overall 90% were correctly classified.
This is a considerable improvement on the 53.3% correct
classification with the constant model, so we know that
the model with predictors is a significantly better model.
The benchmark that we will use to characterize a logistic
regression model as useful is a 25% improvement over the
rate of accuracy achievable by chance alone.
Classification table
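The 25%-over-chance benchmark can be checked numerically. Here "chance" accuracy is taken as the proportional by-chance accuracy criterion (an assumption on my part; the slides do not define how chance accuracy is computed):

```python
# Benchmark: a useful model should beat chance accuracy by 25%.
# "Chance" here = proportional by-chance accuracy criterion (assumed).
p_take = 0.533                                # largest-group proportion, Block 0
by_chance = p_take ** 2 + (1 - p_take) ** 2   # expected hit rate by chance
benchmark = 1.25 * by_chance                  # 25% above chance
print(round(benchmark, 3))                    # 0.628
print(0.90 > benchmark)                       # True: the 90% hit rate clears it
```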
56.
57. In this case, we note that family size contributed
significantly to the prediction (p = .013) but
mortgage did not (p = .075).
The EXP(B) value associated with family size is
11.007.
Hence when family size is raised by one unit (one
person) the odds are 11 times as large, and
householders are therefore 11 times more likely to
belong to the take-offer group.
Variables in the Equation
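EXP(B) acts multiplicatively on the odds, so the effect compounds across units of the predictor. A small check using the reported family-size value:

```python
# EXP(B) multiplies the odds per one-unit increase in the predictor:
# each extra household member multiplies the odds of taking the offer
# by about 11, so two extra members multiply them by 11.007 squared.
exp_b = 11.007               # reported EXP(B) for family size
print(round(exp_b ** 2, 1))  # 121.2: odds multiplier for +2 members
```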
58.
59. The odds ratio is a measure of effect size.
The ratio of odds ratios of the
independents is the ratio of relative
importance of the independent variables in
terms of effect on the dependent variable’s
odds.
In this example family size is 11 times as
important as monthly mortgage in
determining the decision.
Effect size
DISCRIMINANT FUNCTION ANALYSIS
is usually employed with a categorical dependent variable, & all of the predictors are continuous and nicely distributed;
LOGIT ANALYSIS
is usually employed if all of the predictors are categorical;
Homoscedasticity
This assumption means that the variance around the regression line is the same for all values of the predictor variable (X)
The likelihood-ratio test uses the ratio of the maximized value of the likelihood function for the full model
(L1) over the maximized value of the likelihood function for the simpler model (L0). The log
transformation of this likelihood ratio yields a chi-squared statistic.
A Wald test is used to test the statistical significance of each coefficient (b) in the model. A Wald test
calculates a z statistic. This z value is then squared, yielding a Wald statistic with a chi-square
distribution.
Wald estimates give the “importance”
of the contribution of each variable in the model.
The higher the value, the more “important” it is.
R2 is a measure of predictive power, that is, how well you can predict the dependent variable based on the independent variables.
Hosmer DW, Lemeshow S. Applied logistic regression. Wiley & Sons, New York, 1989.