GLMs and extensions in R

Generalized linear models, and extensions, in R

Ben Bolker

Departments of Mathematics & Statistics and Biology, McMaster University

7 January 2011


1 Introduction

2 Example

3 Challenges, tricks, extensions

4 (Extended examples)


What are generalized linear models?

Modeling framework to solve two common statistical problems:
Non-normal data
Non-linearity (continuous predictors)
. . . superset of, and often confused with, “general” linear models (i.e. ANOVA/ANCOVA/regression: SAS PROC GLM)


GLMs: technical details

Constraints:
Distributions from the exponential family (Normal, Poisson, binomial, Gamma, inverse Gaussian)
Invertible nonlinearities, i.e. there exists a link function that makes the relationship linear (log, logit, probit, inverse, square root, “cauchit” . . . )

Advantages:
Efficient, stable algorithm: iteratively re-weighted least squares (IRLS) / Fisher scoring
Standard methods (methods(class="glm")): coef, summary, plot, predict, residuals, vcov, profile, update, confint, simulate, anova, add1/drop1, logLik, AIC, . . .
Logistic and Poisson regression probably make up 99% of GLMs . . .
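
The IRLS / Fisher scoring idea can be sketched in a few lines for the logistic case. This is an illustrative sketch on simulated data, not the code that glm() actually runs; the function name irls_logit, the tolerance, and the simulated coefficients are all assumptions made for the example.

```r
## Minimal IRLS sketch for logistic regression (illustrative only)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))  # simulated 0/1 data
X <- cbind(1, x)                       # design matrix with intercept column

irls_logit <- function(X, y, tol = 1e-10, maxit = 25) {
  beta <- rep(0, ncol(X))              # start at eta = 0, i.e. mu = 0.5
  for (i in seq_len(maxit)) {
    eta <- drop(X %*% beta)            # linear predictor
    mu  <- plogis(eta)                 # inverse logit link
    w   <- mu * (1 - mu)               # working weights for the logit link
    z   <- eta + (y - mu) / w          # working response
    ## weighted least-squares step: solve (X'WX) beta = X'Wz
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

beta_irls <- irls_logit(X, y)
beta_glm  <- unname(coef(glm(y ~ x, family = binomial)))
```

Under these assumptions the two sets of estimates should agree to within the convergence tolerance of glm(); comparing them with all.equal() is a quick check.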


Google Scholar scraping

Google Scholar hits (Ghits) by search term:

Search term                 Ghits
logistic+regression        580000
Poisson+regression          39300
generalized+linear+model    28700
binomial+regression         13500


Example: reed frog predation data

Vonesh and Bolker (2005):

> library(emdbook)
> data(ReedfrogFuncresp)
> glm1 <- glm(Killed/Initial~Initial,
              weights=Initial,
              family=binomial,
              data=ReedfrogFuncresp)

[Figure: fraction killed (0 to 1) vs. initial density (20 to 100)]


Summary
> summary(glm1)
Call:
glm(formula = Killed/Initial ~ Initial, family = binomial, data = ReedfrogFuncresp,
weights = Initial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.4132  -0.7275   0.4347   1.0120   1.8172

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.094563   0.188952   -0.50  0.61675
Initial     -0.008416   0.002697   -3.12  0.00181 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 47.518 on 15 degrees of freedom
Residual deviance: 37.717 on 14 degrees of freedom
AIC: 98.639

Number of Fisher Scoring iterations: 4

Diagnostics

[Figure: plot(glm1) diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours)]

diagnostics inherit from plot.lm
overdispersion: without it, residual deviance ≈ χ²_{n−p} (Venables and Ripley, 2002, p. 209):
sum(residuals(glm1,type="pearson")^2) = 34.3: p < 0.05
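
The Pearson check above can be written out step by step. The sketch below uses simulated overdispersed data rather than the reed frog data, so the numbers will not match the slide; the data frame dd and the variable names are hypothetical.

```r
## Pearson chi-squared check for overdispersion (simulated data, not ReedfrogFuncresp)
set.seed(42)
dd <- data.frame(size = rep(c(5, 10, 20, 50, 100), each = 4))
## extra row-to-row variation in the success probability creates overdispersion
p_i <- plogis(rnorm(nrow(dd), mean = -0.5, sd = 0.8))
dd$k <- rbinom(nrow(dd), size = dd$size, prob = p_i)

fit <- glm(cbind(k, size - k) ~ size, family = binomial, data = dd)

pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
rdf   <- df.residual(fit)                    # n - p
p_val <- pchisq(pearson_chisq, df = rdf, lower.tail = FALSE)
c(chisq = pearson_chisq, df = rdf, p = p_val)
```

A small p-value here suggests more variation than the binomial model allows, which is the situation described on the slide.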


Inference

Coefficients: may be hard to communicate (they reflect differences on the scale of the linear predictor, e.g. logit/log-odds differences)
Wald statistics: beware the Hauck-Donner effect (Venables and Ripley, 2002, p. 198). Wald CI of slope (t quantiles on the residual df):
stats:::confint.lm(glm1) (-0.0142,-0.0026)
Likelihood ratio test, via anova:
> anova(glm1,test="Chisq") ## OR
> glm0 <- update(glm1, . ~ . - Initial) ## drop Initial, keep the intercept
> anova(glm1,glm0,test="Chisq")
Likelihood profiles (via MASS::profile.glm), profile confidence intervals:
MASS:::confint.glm(glm1) (-0.0137,-0.0031)
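
To make the Wald interval and the likelihood ratio test concrete, here is a by-hand sketch on simulated binomial data. The data frame dd, the model names fit0 and fit1, and the use of normal rather than t quantiles are assumptions for this example.

```r
## Wald interval and likelihood ratio test by hand (simulated data)
set.seed(11)
dd <- data.frame(size = rep(c(5, 10, 20, 40, 80), each = 4))
dd$k <- rbinom(nrow(dd), dd$size, plogis(0.2 - 0.02 * dd$size))

fit1 <- glm(cbind(k, size - k) ~ size, family = binomial, data = dd)
fit0 <- update(fit1, . ~ . - size)          # intercept-only model

## Wald interval: estimate +/- normal quantile * standard error
est <- coef(fit1)["size"]
se  <- sqrt(diag(vcov(fit1)))["size"]
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se

## Likelihood ratio test: the drop in deviance is approximately chi-squared
lr_stat <- deviance(fit0) - deviance(fit1)
lr_df   <- df.residual(fit0) - df.residual(fit1)
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
```

The interval computed this way should match confint.default(fit1), and lr_p should match the p-value from anova(fit0, fit1, test="Chisq").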


Estimation issues

Convergence difficulties, especially with non-standard links: set starting values, center/scale variables (?)
Complete separation: brglm, logistf, arm (bayesglm)
Big data: biglm (bigglm)
Many predictors (penalized regression):
glmnet, glmpath, penalized (Machine learning task view)


Tricks (within GLM framework)

non-standard link functions:
fitting hyperbolic models of predator attack rates (Michaelis-Menten)
via binomial/inverse link
(http://emdbolker.wikidot.com/voneshglm)
exponential survivorship models via binomial/log link (Strong et al.,
1999; Tiwari et al., 2006)
Gaussian family with log link: fit exponential growth models with
constant variance
subtleties with Gamma GLMs and dispersion parameter:
V&R MASS online complements,
Paul Johnson’s notes
offsets: variation in sampling area/intensity
(e.g. strict proportionality)


Overdispersion

Quasilikelihood models:
> glmQ <- update(glm1,family="quasibinomial")
> anova(glmQ,test="F")
(φ̂ = 2.45). No likelihood: qAIC requires some contortions
extended GLMs:
negative binomial: MASS (glm.nb)
beta-binomial: aod (betabin), gnlm (gnlr), VGAM (vglm), bbmle (mle2)
GLMMs: lognormal-Poisson, logit-normal-binomial
robust estimation (lmtest, sandwich):
> coeftest(glm1,vcov=sandwich)
See also the vignette for the pscl package.
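
The quasi-likelihood adjustment and the qAIC "contortions" can be sketched as follows, again on simulated data. The qAIC helper is hypothetical and follows one common convention (log-likelihood from the binomial fit, scaled by the estimated dispersion, with one extra parameter counted for the dispersion); other conventions exist.

```r
## Quasi-binomial scaling and a hand-rolled qAIC (simulated data)
set.seed(7)
dd <- data.frame(size = rep(c(10, 20, 40, 80), each = 5))
dd$x <- dd$size / 10
p_i <- plogis(-0.3 - 0.1 * dd$x + rnorm(nrow(dd), sd = 0.7))
dd$k <- rbinom(nrow(dd), dd$size, p_i)

fitB <- glm(cbind(k, size - k) ~ x, family = binomial, data = dd)
fitQ <- update(fitB, family = quasibinomial)

phi_hat <- summary(fitQ)$dispersion   # Pearson chi-squared / residual df

## quasi-likelihood leaves the estimates alone and inflates SEs by sqrt(phi_hat)
se_ratio <- sqrt(diag(vcov(fitQ)) / diag(vcov(fitB)))

## hypothetical helper: qAIC = -2 logLik / phi + 2 (k + 1)
qAIC <- function(fit, phi) {
  k <- length(coef(fit)) + 1          # +1 for the estimated dispersion
  as.numeric(-2 * logLik(fit) / phi + 2 * k)
}
qaic_B <- qAIC(fitB, phi_hat)
```

When comparing several models this way, the same dispersion estimate (usually from the most complex model) is typically used for all of them.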

Extensions

Generalized additive models (Wood, 2006): mgcv, gamlss
Zero-inflated/altered/hurdle models: pscl, VGAM
Beta regression: betareg
Generalized regression models: bbmle, VGAM, gnlm
Random effects (generalized linear mixed models): lme4 and other packages (http://glmm.wikidot.com/faq)


References

Strong, D.R., Whipple, A.V., et al., 1999. Ecology, 80:2750–2761.
Tiwari, M., Bjorndal, K.A., et al., 2006. Marine Ecology Progress Series,
326:283–293.
Venables, W. and Ripley, B.D., 2002. Modern Applied Statistics with S.
Springer, New York, 4th edition.
Vonesh, J.R. and Bolker, B.M., 2005. Ecology, 86(6):1580–1591.
Wood, S.N., 2006. Generalized Additive Models: An Introduction with R.
Chapman & Hall/CRC.


Basic ggplot code

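> library(ggplot2)
> ggplot(ReedfrogFuncresp,aes(Initial,Killed/Initial))+
      geom_point()+
      ## current ggplot2 passes model arguments via method.args
      geom_smooth(method="glm",method.args=list(family=binomial),
                  aes(weight=Initial))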


Confidence intervals on # killed, by hand

> pframe <- data.frame(Initial=1:100)
> pp <- predict(glm1,newdata=pframe,se.fit=TRUE)
> pmat <- with(pp,plogis(cbind(fit,
fit-1.96*se.fit,
fit+1.96*se.fit)))
> par(bty="l",las=1)
> with(ReedfrogFuncresp,plot(Initial,Killed/Initial,
xlim=c(0,100),ylim=c(0,1),
pch=16))
> matlines(pframe$Initial,pmat,lty=c(1,2,2),col=1,type="l")


Prediction intervals

> simhack <- function(params) {
      glmnew <- glm1
      glmnew$coefficients <- params
      ## simulate() draws from the stored fitted values, so update them too
      glmnew$fitted.values <- drop(plogis(model.matrix(glm1) %*% params))
      ## simulates on PROBABILITY scale
      simulate(glmnew)[[1]]
  }
> set.seed(101)
> params <- MASS::mvrnorm(1000,mu=coef(glm1),
                          Sigma=vcov(glm1))
> sims <- apply(params,1,simhack)
> qmat <- t(apply(sims,1,quantile,
                  c(0.5,0.025,0.975)))

[Figure: Killed/Initial (0 to 1) vs. Initial (0 to 100)]

(Constructing the simulated values at Initial densities from 1 to 100 is a bit more work; ideally all simulate methods would have newdata and newparam arguments . . . )


Alternative display (display, coefplot from arm package)

[Figure: coefplot of glm1: estimate for Initial on an axis from −0.015 to 0.000]

> display(glm1)
glm(formula = Killed/Initial ~ Initial, family = binomial, data = Re
weights = Initial)
coef.est coef.se
(Intercept) -0.09 0.19
Initial -0.01 0.00
---
n = 16, k = 2
residual deviance = 37.7, null deviance = 47.5 (difference = 9.8)

Beta-binomial with aod

> library(aod)
> glmBB1 <- betabin(cbind(Killed, Initial-Killed)~Initial,
                    random=~1,
                    data=ReedfrogFuncresp)


Beta-binomial with bbmle

> library(bbmle)
> glmBB3 <- mle2(Killed~dbetabinom(prob=plogis(logitp),
theta=exp(logtheta),size=Initial),
parameters=list(logitp~Initial),
data=ReedfrogFuncresp,
start=list(logitp=0,logtheta=0))


Beta-binomial with VGAM

> library(VGAM)
> glmBB4 <- vglm(cbind(Killed,Initial-Killed)~Initial,
                 betabinomial,
                 data=ReedfrogFuncresp)
> coef(glmBB4,matrix=TRUE)


Beta-binomial with gnlm

> library(gnlm)
> attach(ReedfrogFuncresp) ## no data= argument!
> glmBB2 <- gnlr(cbind(Killed,Initial-Killed),
dist="beta binomial",
pmu=c(0,0),pshape=0,
mu=function(p,linear) plogis(linear),
linear=~Initial)
> detach(ReedfrogFuncresp)
> detach("package:gnlm")
> detach("package:rmutil")


Logit-normal-binomial with lme4

> library(lme4)
> ReedfrogFuncresp$ID <- 1:nrow(ReedfrogFuncresp)
> glmLNP <- glmer(cbind(Killed,Initial-Killed)~Initial+(1|ID),
                  family=binomial,
                  data=ReedfrogFuncresp)
> summary(glmLNP)


Alternate link functions for reed frog data

[Figure: fraction killed (0 to 1) vs. initial density (20 to 100)]


Comparing overdispersion estimates

[Figure: estimated initial density effect (x-axis, −0.015 to 0.000) by model: LN-binomial, beta-binomial, sandwich, q-binom Wald, binomial profile, binomial Wald]

