Survival Data Analysis for Sekolah Tinggi Ilmu Statistik Jakarta

Survival Data Analysis
Setia Pramana

Educational Background
• Hasselt Universiteit, Belgium, MSc in Applied Statistics, 2005-2006.
• Hasselt Universiteit, Belgium, MSc in Biostatistics, 2006-2007.
• Hasselt Universiteit, Belgium, PhD in Statistical Bioinformatics, 2007-2011.
• Medical Epidemiology and Biostatistics Dept., Karolinska Institutet, Sweden, Postdoctoral, 2011-2014.

Course Outline
• Introduction
o Overview of Survival data analysis
o Type of censoring
• Kaplan-Meier Survival Model
o Kaplan-Meier curve
o Comparison of survival curves
o Logrank test & Wilcoxon (Gehan) test
o Application in R.

Course Outline
• Cox Proportional Hazard:
o Parameter Estimation
o Partial likelihood
o Model diagnostics
o Hazard Ratio
o Application in R.

Course Outline
• Parametric Survival Functions:
o Weibull dist
o Exponential
• Competing risk
• Frailty Model

Course Workload
• 40% Theory, 60% practice
• Group Project (5 students)
• Presentation every week
• Software: mainly R; other software is allowed
• R code will be provided
• Slides are available at:
http://www.slideshare.net/hafidztio/

Survival Analysis
• Statistical procedures that focus on time-to-event
data. Outcome: “time until an event occurs”
• Events:
o time to death
o time to onset (or relapse) of a disease
o length of stay in a hospital
o duration of a strike
o money paid by health insurance
o viral load measurements
o time to finish our study

Survival Studies
• Clinical trials
• Prospective cohort studies
• Retrospective cohort studies
• Typically, survival data are not fully
observed, but rather are censored.

Goals
• To Estimate and interpret Survivor and
Hazard functions
• To compare Survivor and Hazard functions
• To assess the relationship of explanatory
variables to Survival time

Example
• Survival times of cancer patients
• Patients with advanced cancer of the
stomach, bronchus, colon, ovary, or breast
were treated (in addition to standard
treatment) with ascorbate.
• Research questions:
o What is the prognosis for a patient with specific
type of cancer ?
o Do survival times differ with organ affected ?

Example

Gene Signature for Prostate Cancer

The survival time response
• Usually continuous
• May be incompletely determined for some subjects
o i.e., for some subjects we know only that their survival
time was at least equal to some time t, whereas for
other subjects we know the exact time of the event.
• Incompletely observed responses are censored
• Is always ≥ 0

Censoring
• We have some information about a
subject’s event time, but we don’t know the
exact event time.
• Censoring mechanism must be
independent of the survival mechanism.
• Three reasons:
o Study ends (no event)
o Lost to follow-up
o Withdraws

Censoring
• Right Censoring: The
survival time is
incomplete at the right
side.
• Left Censoring: True
survival time ≤
observed survival time
• Most studies involve right
censoring

No Censoring

Right Censoring due to End of Study

Right Censoring due to Drop out

Left Censoring (due to Late Study Onset)

Interval Censoring

Terminology & Notation

• Survival function: S(t) = P(T > t), the probability of surviving beyond time t
• Decreases (never increases) as t increases
• At time t = 0, S(0) = 1
• As t → ∞, S(t) → 0
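
The properties listed above can be checked on a small, fully observed sample. The following is only a sketch in R: the survival times are made up for illustration and are not course data.

```r
# Hypothetical, fully observed survival times (no censoring); illustration only
times <- c(2, 3, 3, 5, 8, 13)

# Empirical survival function: proportion of subjects still event-free after t
S <- function(t) mean(times > t)

grid  <- c(0, 2, 3, 5, 8, 13, 20)
S_hat <- sapply(grid, S)

# S starts at 1, never increases, and reaches 0 after the last event
print(data.frame(t = grid, S = S_hat))
```

With censored observations this simple proportion is no longer valid, which is what motivates the Kaplan-Meier estimator.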

Survival Curve
• Make a survival curve for the stomach cancer patients

Data Layout
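
As a rough illustration of a typical layout for right-censored data, the sketch below uses one row per subject with a follow-up time and a status indicator. The values and the column names (id, time, status) are hypothetical; Surv() comes from the survival package.

```r
library(survival)

# Hypothetical layout: one row per subject (values are made up)
d <- data.frame(
  id     = 1:5,
  time   = c(6, 13, 21, 30, 31),  # follow-up time, e.g. in weeks
  status = c(1, 0, 1, 0, 1)       # 1 = event observed, 0 = right-censored
)

# Surv() pairs each time with its status; censored times print with a "+"
y <- Surv(d$time, d$status)
print(y)
```

Most functions in the survival package take such a Surv object as the response on the left-hand side of a model formula.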

R
• http://www.r-project.org
• http://cran.r-project.org/doc/manuals/R-intro.html
• Reasons to use R:
o The continued rapid growth in add-on packages
o The near monopoly R has on the latest analytic methods
o It is free

R Sources for Survival
Analysis
• http://cran.r-project.org/web/views/Survival.html
• http://anson.ucdavis.edu/~hiwang/teaching/10fall/R_tutorial%201.pdf

Next Class
• Kaplan-Meier Survival Model

Review Prev. Class

Another example

Kaplan Meier Curve

Example in R
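
A minimal sketch of the Kaplan-Meier estimate in R, assuming a small made-up data set (not the data used in class). It computes the product-limit estimate by hand and compares it with survfit() from the survival package.

```r
library(survival)

# Hypothetical data (made up): times in weeks, status 1 = event, 0 = censored
time   <- c(6, 6, 7, 9, 10, 13, 16, 22, 23)
status <- c(1, 0, 1, 0, 1, 1, 1, 1, 0)

# Kaplan-Meier by hand: at each distinct event time t_j,
# multiply the running product by (1 - d_j / n_j)
event_times <- sort(unique(time[status == 1]))
n_at_risk   <- sapply(event_times, function(t) sum(time >= t))
n_events    <- sapply(event_times, function(t) sum(time == t & status == 1))
km_manual   <- cumprod(1 - n_events / n_at_risk)

# Same estimate from survfit()
fit        <- survfit(Surv(time, status) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

print(data.frame(event_times, n_at_risk, n_events, km_manual, km_survfit))
# plot(fit) draws the step curve with confidence limits
```

Censored subjects leave the risk set without causing a drop in the curve, which is why the estimate changes only at event times.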

Comparing Survival
curves

Leukemia Data

• Df =1
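
A sketch of how two survival curves might be compared in R. It assumes the aml data set bundled with the survival package as a stand-in; this may not be the leukemia data shown in class. survdiff() with rho = 0 gives the logrank test, and rho = 1 gives the Peto and Peto modification of the Gehan-Wilcoxon test.

```r
library(survival)

# 'aml' ships with the survival package; it is a stand-in here and may not be
# the leukemia data set shown on the slides
lr <- survdiff(Surv(time, status) ~ x, data = aml)           # logrank test
pw <- survdiff(Surv(time, status) ~ x, data = aml, rho = 1)  # Peto-Peto Wilcoxon-type test

# Two groups give 2 - 1 = 1 degree of freedom
df        <- length(lr$n) - 1
p_logrank <- pchisq(lr$chisq, df = df, lower.tail = FALSE)

print(lr)
print(pw)
print(p_logrank)
```

The logrank test weights all event times equally, while the Wilcoxon-type test gives more weight to early event times, so the two can disagree when the curves differ mainly early or mainly late.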

Review
• Hazard Function
o The instantaneous risk of failure just after
time t, given that the subject has
survived to time t
o denoted as: h(t)
• Survival Function
o The probability that a person/patient will
have a survival time > t
o denoted as: S(t)

Hazard Function

Survival Function

Survival Application
• Telco – customer lifetime
• Insurance – time to lapsing on policy
• Mortgages – time to mortgage redemption
• Mail Order Catalogue – time to next purchase
• Retail – time till food customer starts purchasing
non-food
• Manufacturing - lifetime of a machine component
• Public Sector – time intervals to critical events

Compare Survival Curves

• The hazard rate is defined for non repairable
populations as the (instantaneous) rate of failure for
the survivors to time t during the next instant of time.

Regression for Survival
Data
• The relation with factors can be studied using
group-specific Kaplan-Meier estimates, together
with Logrank and/or Wilcoxon tests
• Investigating the relation with covariates requires a
regression-type model
• Relating the outcome to several factors and/or
covariates simultaneously requires multiple
regression, ANOVA, or ANCOVA models
• The most frequently used model is the Cox
(proportional hazards) model

Cox PH Regression

Characteristics of Cox Regression, continued
• Cox models the effect of covariates on the hazard rate
but leaves the baseline hazard rate unspecified.
• Does NOT assume knowledge of absolute risk.
• Estimates relative rather than absolute risk.
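
A minimal sketch of fitting a Cox model in R, assuming the veteran lung cancer data bundled with the survival package as a stand-in for the course data. The choice of covariates (trt, karno, age) is only for illustration.

```r
library(survival)

# 'veteran' (lung cancer trial) ships with the survival package; it is used
# here only as a stand-in, not as the course data
fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)

# exp(coef) is the hazard ratio per one-unit increase in each covariate;
# the baseline hazard h0(t) is never estimated in this step
hr <- exp(coef(fit))
ci <- exp(confint(fit))
print(cbind(HR = hr, ci))
```

The output contains only relative effects (hazard ratios), which matches the point above that the Cox model estimates relative rather than absolute risk.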

Properties

PH Assumption
• The PH assumption requires that the HR is constant
over time
• If the hazards of the groups are not proportional,
a Cox PH model is not appropriate.
• In that case, use an extended Cox model

Independent Variables

ML Estimation of Cox PH
Model

Model

Example

Likelihood ratio tests
• Likelihood ratio tests (LRTs) are used to compare
two nested models.
• The form: LRT = -2 ln(L_s / L_g) = 2 (ln L_g - ln L_s)
• L_s / L_g is the ratio of two likelihood functions; the simpler model (s)
has fewer parameters than the general (g) model.
• Under the simpler model, LRT approximately follows a chi-squared
distribution with DF = the difference
in the number of parameters between the two models.
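
A sketch of a likelihood ratio test for two nested Cox models, computed as minus twice the log of the likelihood ratio of the simpler to the general model. The veteran data from the survival package and the chosen covariates are assumptions for illustration only.

```r
library(survival)

# Nested Cox models on the bundled 'veteran' data (a stand-in example)
fit_s <- coxph(Surv(time, status) ~ trt, data = veteran)          # simpler model
fit_g <- coxph(Surv(time, status) ~ trt + karno, data = veteran)  # general model

# fit$loglik holds the log partial likelihood of the null model [1]
# and of the fitted model [2]
lrt <- -2 * (fit_s$loglik[2] - fit_g$loglik[2])
df  <- length(coef(fit_g)) - length(coef(fit_s))
p   <- pchisq(lrt, df = df, lower.tail = FALSE)
print(c(LRT = lrt, df = df, p = p))
```

Both models must be fitted to exactly the same observations; with missing covariate values the two fits can silently use different rows, which invalidates the comparison.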

Characteristics of Cox Regression
• Does not require that you choose some particular
probability model to represent survival times, and is
therefore more robust than parametric methods
discussed last week.
• Semi-parametric
(recall: Kaplan-Meier is non-parametric; exponential and
Weibull are parametric)
• Can accommodate both discrete and continuous
measures of event times
• Easy to incorporate time-dependent covariates, i.e.
covariates that may change in value over the course of
the observation period

Assumptions of Cox Regression
• Proportional hazards assumption: the hazard for any
individual is a fixed proportion of the hazard for any
other individual
• Multiplicative risk

Hazard Ratio

HR Model 1

HR Model 2

HR Model 3

Hazard Ratio

Cox regression vs. logistic regression
Distinction between rate and proportion:
• Incidence (hazard) rate: number of new cases of
disease per population at-risk per unit time (or
mortality rate, if outcome is death)
• Cumulative incidence: proportion of new cases
that develop in a given time period

Cox regression vs. logistic regression
Distinction between hazard/rate ratio and odds
ratio/risk ratio:
• Hazard/rate ratio: ratio of incidence rates
• Odds/risk ratio: ratio of odds or of proportions
Taking time into account uses more information than a
binary yes/no outcome, which gains power and precision.
Logistic regression aims to estimate the odds ratio; Cox
regression aims to estimate the hazard ratio.

HR Ex. Data Model 1

Pneumonia data

Single variable Cox

Multiple Cox

Adjusted Survival Curves
• No Model: Kaplan-Meier method (Prev.
chapter)
• Cox model: adjusted survival curves
o Adjust for explanatory variables used as
predictors
o Like KM curves plotted as step functions

• Converting Hazard Functions to Survival Functions
Xi must be specified before
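
A sketch of adjusted survival curves from a fitted Cox model, assuming the veteran data from the survival package and two hypothetical covariate profiles (karno = 40 and karno = 80). Under the Cox model, S(t | x) = S0(t)^exp(x * beta), so the cumulative hazards of two profiles differ by a constant factor; the code checks this numerically.

```r
library(survival)

# Stand-in example on the bundled 'veteran' data
fit <- coxph(Surv(time, status) ~ karno, data = veteran)

# Covariate values must be chosen before a curve can be drawn
profiles <- data.frame(karno = c(40, 80))
sf <- survfit(fit, newdata = profiles)   # one step curve per row of 'profiles'

# Cumulative hazards -log S(t | x) of the two profiles, compared where positive
H        <- -log(sf$surv)
keep     <- H[, 2] > 0
ratio    <- H[keep, 1] / H[keep, 2]
expected <- unname(exp(coef(fit) * (40 - 80)))

print(range(ratio))
print(expected)
# plot(sf, col = 1:2) draws both adjusted curves as step functions
```

The constant ratio is the hazard ratio between the two profiles, which is the proportional hazards assumption expressed on the survival scale.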

Case: Telco
Survival Analysis:
• To understand length of time before an event
occurs
• To predict time till next event
• To analyze duration of time in a particular state
• “Event” can be:
o Customer churn (the tendency of the subscribers to
switch providers)
o Take-up new product
o Default on credit
o Make next purchase

Case: Telco
• Compute the survival curve for your customer base
– Understand ‘natural patterns’ in customer survival
– Identify key points where survival rates fall
• Compare survival curves between
– Demographic groups
– Customer segments
– Sales channels
– Product plans, etc
• Identifies key factors influencing ‘time till churn’
• Enables you to predict monthly numbers of churners
– but does not identify which customers will churn

Example

Evaluating PH
Assumption

Log-log Plots
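
A sketch of a log-log plot in R, assuming the veteran data from the survival package with treatment arm (trt) as the grouping variable. plot() for survfit objects accepts fun = "cloglog", which draws log(-log(S(t))) against log time.

```r
library(survival)

# Stand-in example: KM curves by treatment arm in the bundled 'veteran' data
km <- survfit(Surv(time, status) ~ trt, data = veteran)

# fun = "cloglog" plots log(-log(S(t))) against log(t);
# roughly parallel curves are consistent with proportional hazards
plot(km, fun = "cloglog", col = 1:2,
     xlab = "time (log scale)", ylab = "log(-log(S(t)))")
legend("topleft", legend = names(km$strata), col = 1:2, lty = 1)
```

Curves that cross or clearly converge or diverge suggest that the hazards are not proportional for that variable.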

Example of non PH

Observed Versus
Expected Plots
• One-at-a-time: uses KM curves to
obtain observed plots
o stratify data by categories of
predictor
o obtain KM curves for each
category
• Adjusting for other variables: uses
stratified Cox PH model to obtain
observed plot.

• Continuous Var.

GOF testing

Schoenfeld Residuals Test

> library(survival)
> # uis is assumed to be a data frame already loaded in the session
> time.dep <- coxph(Surv(time, censor) ~ age + race + treat + site + age:site,
+                   data = uis, method = "breslow", na.action = na.exclude)
> time.dep.zph <- cox.zph(time.dep, transform = "log")
> time.dep.zph
             rho chisq     p
age       0.0245 0.283 0.595
race      0.0601 1.851 0.174
treat     0.0346 0.597 0.440
site      0.0355 0.587 0.444
age:site -0.0289 0.385 0.535
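
The uis data are not bundled with R, so the session above cannot be rerun as is. The sketch below repeats the same kind of test on the veteran data from the survival package, which is an assumption made purely so that the example runs. Recent versions of the survival package report chisq, df and p in this table rather than rho.

```r
library(survival)

# 'uis' is not bundled with R, so the bundled 'veteran' data stand in here
fit <- coxph(Surv(time, status) ~ trt + karno + age, data = veteran)
zph <- cox.zph(fit, transform = "log")
print(zph)      # one row per covariate plus a GLOBAL row

# Small p-values point to a violation of proportional hazards for that term
p_values <- zph$table[, "p"]
# plot(zph) shows the smoothed scaled Schoenfeld residuals against time
```

A term with a small p-value is a candidate for stratification or for a time-dependent effect in an extended Cox model.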

Stratified Cox Regression

General Stratified Cox
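
A sketch of a stratified Cox model in R, assuming the veteran data from the survival package with cell type as the stratification variable. The choice of stratifier is for illustration only.

```r
library(survival)

# Stand-in example on 'veteran': stratify on cell type instead of modelling it
fit_strat <- coxph(Surv(time, status) ~ trt + karno + strata(celltype),
                   data = veteran)
print(fit_strat)

# Coefficients are shared across strata; each stratum gets its own baseline hazard
n_strata <- length(unique(veteran$celltype))
print(n_strata)
```

No hazard ratio is reported for the stratification variable itself, which is the price paid for not assuming proportional hazards for it.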

Interaction model Cox
Regression

Example

Test for Significance of
Interaction Model

Graphical Comparison

Parametric Survival
Models

Survival Analysis so far
• The methods most often employed to analyze time-
to-event data are
o Kaplan-Meier + Log-Rank/Wilcoxon Test.
• Produces an empirical estimate of the time-to-event
distribution and compares it between groups
o Cox (proportional hazard) regression.
• Measures the effect of multiple predictors without
modeling the underlying distribution
• Assumes proportional hazards between levels of
predictors
• Neither of these methods produces an estimate of the
functional form of the underlying distribution

Parametric Survival
Analysis
• The survival time follows a distribution.
• Explicitly models the functional form of the event times using
various statistical distributions
• Exact distribution is unknown if parameters are unknown
• Data is used to estimate parameters
• Examples of parametric models:
o Linear regression
o Logistic regression
o Poisson regression

Parametric Survival
Analysis
• Most commonly used
o Exponential
o Weibull
o Gompertz
o Log-Logistic
o Log-Normal
o Gamma
• Generally involve two parameters:
Scale (λ) and Shape (p)
• Shape generally assumed constant across individuals
• Scale related to determinants via regression
o Can quantify the effect of predictors, particularly treatment

Parametric vs Cox PH

• Parametric Survival Model
+ Completely specified h(t) and S(t)
+ More consistent with theoretical S(t)
+ time-quantile prediction possible
– Assumption on underlying distribution
• Cox PH Model
– distribution of survival time unknown
– Less consistent with theoretical S(t) (typically step
function)
+ Does not rely on distributional assumptions
+ Baseline hazard not necessary for estimation of
hazard ratio

Parametric Survival
Analysis
• Conceptually the same as the linear regression case, but the Normal
distribution is replaced by an appropriate survival distribution
• It is implemented in a regression framework,
estimated by maximizing the likelihood of the data:
o For patients observed to have the event at time t:
• Likelihood contribution: f(t) (density
function)
o For patients censored at time t:
• Likelihood contribution: P(T > t) = S(t)
(survival function)
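
The two likelihood contributions above can be sketched for the simplest case, the exponential distribution. The data below are made up for illustration. Events contribute log f(t) and censored observations contribute log S(t); the numerical maximum is compared with the closed-form estimate, the number of events divided by the total follow-up time.

```r
# Hypothetical data (made up): status 1 = event at 'time', 0 = censored at 'time'
time   <- c(5, 8, 12, 20, 25, 30)
status <- c(1, 1, 0, 1, 0, 1)

# Exponential model with rate lambda:
#   f(t) = lambda * exp(-lambda * t),  S(t) = exp(-lambda * t)
loglik <- function(lambda) {
  sum(ifelse(status == 1,
             dexp(time, rate = lambda, log = TRUE),                         # event: log f(t)
             pexp(time, rate = lambda, lower.tail = FALSE, log.p = TRUE)))  # censored: log S(t)
}

# Numerical maximum versus the closed form: events / total follow-up time
lambda_num    <- optimize(loglik, interval = c(1e-4, 1), maximum = TRUE)$maximum
lambda_closed <- sum(status) / sum(time)
print(c(numerical = lambda_num, closed_form = lambda_closed))
```

Censored subjects still add their follow-up time to the denominator, so ignoring them would overestimate the event rate.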

Functions Characterizing
Parametric Distributions
• The survival time T is assumed to follow a distribution
with density function f (t)
• Cumulative Incidence: F(t) = P[T≤ t]
• Survival Distribution: S(t) = P[T > t ]
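
The relations between these functions, F(t) = 1 - S(t) and h(t) = f(t) / S(t), can be verified numerically. The sketch below uses a Weibull distribution with arbitrary parameter values, chosen only as an example.

```r
# Weibull chosen only as an example distribution; the parameter values are arbitrary
t     <- c(0.5, 1, 2, 4)
shape <- 1.5
scale <- 3

f   <- dweibull(t, shape, scale)                      # density f(t)
cdf <- pweibull(t, shape, scale)                      # cumulative incidence F(t) = P(T <= t)
S   <- pweibull(t, shape, scale, lower.tail = FALSE)  # survival S(t) = P(T > t)
h   <- f / S                                          # hazard h(t) = f(t) / S(t)

# Closed-form Weibull hazard in R's (shape, scale) parameterization
h_closed <- (shape / scale) * (t / scale)^(shape - 1)
print(data.frame(t, f, cdf, S, h, h_closed))
```

Any one of f, F, S or h determines the other three, so a parametric model can be specified through whichever is most convenient.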

Commonly Used
Distributions and Parameters
• λ is reparameterized in terms of predictor variables
and regression parameters.
• Typically for parametric models, the shape
parameter p is held fixed

Ex: Exponential Dist

Weibull Distribution

• p is Shape Parameter
o p > 1: Hazard increases over time
o p = 1: Hazard is constant (Exponential Distribution)
o p < 1: Hazard decreases over time
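
The three cases can be illustrated numerically. The sketch assumes the parameterization h(t) = λ p t^(p - 1), which is common in textbooks but is not stated in the text above; the value of λ is arbitrary.

```r
# Assumed parameterization (not stated on the slide): h(t) = lambda * p * t^(p - 1)
weibull_hazard <- function(t, lambda, p) lambda * p * t^(p - 1)

t      <- c(0.5, 1, 2, 4, 8)
lambda <- 0.1   # arbitrary value for illustration

h_inc   <- weibull_hazard(t, lambda, p = 1.5)  # p > 1: increasing hazard
h_const <- weibull_hazard(t, lambda, p = 1)    # p = 1: constant hazard (exponential)
h_dec   <- weibull_hazard(t, lambda, p = 0.5)  # p < 1: decreasing hazard
print(data.frame(t, h_inc, h_const, h_dec))
```

With p = 1 the hazard reduces to the constant λ, which is why the exponential distribution is a special case of the Weibull.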

Gompertz Distribution
