2. What is binary logistic regression
• It predicts a dichotomous (two-category) nominal outcome
variable from one or more predictor variables.
• Coding: in SPSS, the outcome variable of interest
should be coded as 1/0, with 1 for the level of
interest (for example, outcome present).
• Both nominal and interval-scale variables can be
entered as predictor (independent) variables, but
recoding interval-scale data as dichotomous can make
interpretation easier.
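The coding advice above can also be sketched outside SPSS, for example in Python. This is only a minimal illustration: the records, the field names and the age cut-off of 50 are hypothetical and are not taken from the worksheet.

```python
# Hypothetical records; field names and values are invented for illustration.
patients = [
    {"paraesthesia": "Yes", "age": 62},
    {"paraesthesia": "No", "age": 41},
    {"paraesthesia": "Yes", "age": 55},
]

AGE_CUTOFF = 50  # assumed cut-off, not from the source data


def code_outcome(label):
    """Code the outcome as 1/0, with 1 for the level of interest ("Yes")."""
    return 1 if label == "Yes" else 0


def dichotomise(value, cutoff):
    """Recode an interval-scale value as 1 (at or above the cutoff) or 0 (below)."""
    return 1 if value >= cutoff else 0


coded = [
    {
        "paraesthesia": code_outcome(p["paraesthesia"]),
        "age_50_plus": dichotomise(p["age"], AGE_CUTOFF),
    }
    for p in patients
]
print(coded)
```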
3. How to do it in SPSS
• All variables should be in separate columns and the outcome variable
should be binary.
• Suppose I wanted to find independent predictors of
Paraesthesia from the data in our “JMI Log reg” worksheet.
• First, check which of the putative predictors are
actually associated with the outcome (Paraesthesia).
• Do univariate analyses, such as a t-test for interval-scale
predictors (BMI/Weight/Height/Age/Duration of DM) and a
chi-square test for dichotomous predictors (DM/Sex/HO Neuropathy),
to see whether they actually differ between those with and
without Paraesthesia.
• Enter those which are significant in the model – here we enter
Age, Duration of DM, DM, HO Neuropathy.
• Go to Analyze → Regression → Binary Logistic.
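The univariate screening step for a dichotomous predictor can be sketched in Python as a Pearson chi-square test on a 2x2 table. This is a rough illustration, not the SPSS procedure itself: the counts are hypothetical, the function name is made up for this sketch, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]]. Returns (statistic, p_value) for 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = HO Neuropathy yes/no, columns = Paraesthesia yes/no.
stat, p_value = chi_square_2x2(30, 10, 15, 25)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")

# Predictors with p < 0.05 are the ones carried forward into the model.
keep_in_model = p_value < 0.05
```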
4. • Enter Paraesthesia as “Dependent” and the predictors as
“Covariates”.
• Click “Categorical”, transfer the nominal covariates to
“Categorical Covariates”, and change the contrast reference
category to First → Continue.
• For categorical data, no contrast type other than Indicator
makes sense.
5. • In the “Options” tab, check the options as shown
below and change the outlier definition to 3 SD.
• Continue → OK.
6. Output
The Case Processing Summary reports any missing cases. Any case
with missing data on any of the variables used is excluded
from the analysis.
The Dependent Variable Encoding table shows how the outcome is
coded. Check the coding: the “Yes” category should be 1.
7. Block 0: Beginning Block
Skip the Iteration History.
The Classification Table shows the
percentage of correct predictions
with no variables in the model.
Skip the Variables in the Equation box.
Variables not in the Equation shows the putative
variables with their univariate probabilities of
association with the outcome variable. These correspond
to the chi-square and t-tests done earlier.
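With no predictors in the model, Block 0 assigns every case to the more frequent outcome category, so its percentage correct is simply the share of that category. A minimal Python sketch of that calculation, using a hypothetical outcome column with equal numbers of 1s and 0s:

```python
def baseline_accuracy(outcomes):
    """Percentage correct when every case is assigned to the more common category."""
    n_yes = sum(outcomes)
    n_no = len(outcomes) - n_yes
    return 100 * max(n_yes, n_no) / len(outcomes)


# Hypothetical outcome column coded 1/0, with equal numbers of each.
outcomes = [1] * 20 + [0] * 20
print(baseline_accuracy(outcomes))
```

A baseline near 50% leaves the most room for the predictors to improve classification; a very unbalanced outcome gives a high baseline even with no predictors.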
8. Block 1: Method = Enter
The Omnibus Tests of Model Coefficients check
whether the new model (with explanatory
variables included) is an improvement over the
baseline model. P < 0.05 indicates a significant
improvement.
The Model Summary tells us the predictive ability
of the model. See the Nagelkerke R Square, a
pseudo-R² interpreted in a similar way to
adjusted R² in linear regression. Here it
suggests that 89.3% of the variability in the
presence of Paraesthesia is explained by
the 4 variables entered.
The Hosmer and Lemeshow test is a goodness-of-fit
test. If Sig. > 0.05, the model is a good fit.
9. This is the classification table with
the variables included. Notice that it has
improved over the previous 50%.
Variables in the Equation is the most important table. B is the regression
coefficient (on the log-odds scale, so not directly interpretable in logistic
regression), and S.E. is the standard error of B. Wald indicates the importance
of each variable in predicting the dependent variable; the higher, the better.
Sig. is the p-value for the independent predictive ability of each variable (it is
based on the Wald statistic; < 0.05 is good). Exp(B) gives the odds ratio: for
dichotomous predictors, the odds of the outcome being present relative to the
reference category; for continuous predictors, the change in the odds of the
outcome per unit change in the predictor (here, for each unit increase in age,
the odds of Paraesthesia being present increase by 13.4%).
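The link between B and Exp(B) can be checked with a short Python sketch. It assumes B = 0.125 for age, the rounded value used in the worked example later in these slides, so the result (about 13.3%) differs slightly from the 13.4% read off the SPSS output, presumably because of rounding.

```python
import math

B_AGE = 0.125  # rounded coefficient for age, as quoted in the worked example

odds_ratio = math.exp(B_AGE)                # Exp(B): odds multiplier per unit increase
percent_change = (odds_ratio - 1) * 100     # percentage change in odds per unit increase
odds_ratio_10_units = math.exp(10 * B_AGE)  # multiplier for a 10-unit increase

print(f"Exp(B) = {odds_ratio:.3f}, change in odds = {percent_change:.1f}%")
print(f"Odds multiplier for a 10-unit increase = {odds_ratio_10_units:.2f}")
```

Because the effect is multiplicative on the odds scale, a 10-unit increase multiplies the odds by Exp(10 × B), not by 10 × Exp(B).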
10. But did you notice that the odds ratio for DM was less than 1? Does that mean the
presence of DM actually prevents Paraesthesia? Isn’t this against what we know
clinically? To explain this, look at the next table, the Correlation Matrix.
See the high correlation between DM and Duration of DM. This is probably why
the odd association is being seen. It is called multicollinearity between
variables, and it can distort the overall picture. In this situation it is better to
remove the categorical DM variable, as Duration of DM captures both the
presence/absence of DM and its duration, so no information is lost. But
remember that whenever you omit or add a variable, the B values will change. So
it becomes an iterative process until you arrive at a final model which includes
all important variables of interest and provides good classification power for
the whole model.
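Why DM and Duration of DM overlap so heavily can be illustrated with a small Python sketch that computes the Pearson correlation between the two raw predictor columns. The data are hypothetical, and this is the correlation between the predictors themselves, which is related to but not the same quantity as the SPSS Correlation Matrix of the estimates.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: duration is 0 whenever DM is absent (coded 0),
# so the two columns carry largely the same information.
dm = [0, 0, 0, 0, 1, 1, 1, 1]
duration_dm = [0, 0, 0, 0, 5, 8, 12, 15]

r = pearson_r(dm, duration_dm)
print(f"r = {r:.2f}")
```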
11. • If there are several multicollinear variables and you
don’t know which to remove, start with those with the
highest SE for B. But remember that your model’s
predictive ability changes with the removal of each variable.
• So how do you use the model? The B coefficients can be
used in the same way as in linear regression to give
a number z (constant + B1*x1 + B2*x2 + ...). Then put z
into the formula p = 1/(1 + e^(-z)) to give the
probability of the outcome in that individual.
• Suppose in our case we have a patient aged 55 years, with DM
for 10 years. Then z = 55*0.125 + 10*1.604 = 22.915 (the
constant from the output should also be added; it is left
out here). So the probability of having Paraesthesia =
1/(1 + e^(-22.915)) ≈ 1 (meaning about 100%).
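The calculation can be sketched in Python using the standard logistic function, p = 1/(1 + e^(-z)). The coefficients are the ones quoted in the slide. The constant is not given there, so it is left out here as well, which is an assumption carried over from the source rather than a recommendation.

```python
import math


def predicted_probability(z):
    """Logistic function: convert the linear predictor z into a probability."""
    return 1 / (1 + math.exp(-z))


# Coefficients as quoted in the slide. The constant (intercept) is not given
# there, so it is omitted here; with real output it should be added to z.
B_AGE = 0.125
B_DURATION_DM = 1.604

z = 55 * B_AGE + 10 * B_DURATION_DM
p = predicted_probability(z)
print(f"z = {z:.3f}, probability = {p:.6f}")
```

A value of z = 0 corresponds to a probability of 0.5, and large positive values of z push the probability towards 1, which is why this patient comes out at essentially 100%.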