Application of survival data analysis introduction and discussion

Application of Survival Data Analysis: Introduction and Discussion
Shaoang Zhang, Ph.D. (张少昂博士)
©2012 ASQ & Presentation Xing
Presented live on Dec 15th, 2012

http://reliabilitycalendar.org/The_Reliability_Calendar/Webinars_-_Chinese/Webinars_-_Chinese.html

ASQ Reliability Division
Chinese Webinar Series
One of the monthly webinars on topics of interest to reliability engineers.
To view recorded webinars (available to ASQ Reliability Division members only), visit asq.org/reliability

To sign up for the live webinars, which are free and open to anyone,
visit reliabilitycalendar.org and select English Webinars to
find links to register for upcoming events


Survival Analysis: Introduction and Discussion

Shaoang Zhang, Ph.D.
Biostatistics, OptumRX

December 16, 2012

Outline
 Introduction
 Measurements of ARR and reliability
 Survival data – a glance
 Special Features in survival data
 Overview of Statistical Methods
 Parametric approach
 Distribution based approach
 Semi-parametric approach – Cox PH model
 Accelerated Failure Time Model
 Frailty Model
 Non-parametric approach
 Kaplan–Meier curve
 Log-Rank Test
 Examples
 Discussion
 Summary

Measurements of Field Failure and Reliability
 ARR (Annual Return Rate) – based on field returns in one year.
 How should one year in the field be defined? Shipments go out at different times, so one year
in the field can start on different calendar dates. Is it one year from the first
shipment, one year for every shipment, or one year of continuous operation for every
unit included in the shipments considered?
 Many different ARR calculations result from applying different adjustments
 Linear extrapolation
 Prediction based on survival curve
 Reliability Prediction
 MTTF – MTTF is estimated from reliability tests. For example, the MTTF of a hard
disk drive can be millions of hours, whereas the reliability data may cover only thousands
of hours (in the field). How accurate is the estimate?
 Multiple distributions for failure time
 Multiple failure modes may govern failures at different stages of life.
 Example – bathtub hazard curve

ARR Calculation
[Figure: annual returns for Shipments 1, 2, and 3, showing the actual and the estimated portions of each shipment's returns]

How to estimate or predict survival at a future time point?

Survival Data – A Glance
 What is survival data?
 Data measuring the time to event
 Events: death, failure, receiving a complication, etc.
 Incomplete data in terms of event time

An example

Year since   Number alive and under   Number dying   Number        Mortality   1 - Mortality   Survival
entry into   observation at the       during         censored or   rate        rate            function
study        beginning of interval    interval       withdrawn
[0,1)        146                      27             3             0.18        0.82            0.82
[1,2)        116                      18             10            0.16        0.84            0.69
[2,3)        88                       21             10            0.24        0.76            0.52
[3,4)        57                       9              3             0.16        0.84            0.44
[4,5)        45                       1              3             0.02        0.98            0.43
[5,6)        41                       2              11            0.05        0.95            0.41
[6,7)        28                       3              5             0.11        0.89            0.37
[7,8)        20                       1              8             0.05        0.95            0.35
[8,9)        11                       2              1             0.18        0.82            0.28
[9,10)       8                        2              6             0.25        0.75            0.21

[Figure: survival function (0 to 1) versus years after entry into the study (0 to 10)]

Data cited from a clinical trial on myocardial infarction (MI) (Svetlana, S., 2002)
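
The mortality rate and survival function columns of the MI life table can be reproduced in a few lines. This is a minimal sketch, assuming the slide's convention that the interval mortality rate is the number dying divided by the number at risk at the start of the interval (no actuarial adjustment for censoring). The names `rows` and `life_table` are illustrative and not part of the original slides.

```python
# Life-table estimate of the survival function from the grouped MI data.
# Each row: (interval, at risk at start of interval, dying, censored or withdrawn)
rows = [
    ("[0,1)", 146, 27, 3),
    ("[1,2)", 116, 18, 10),
    ("[2,3)", 88, 21, 10),
    ("[3,4)", 57, 9, 3),
    ("[4,5)", 45, 1, 3),
    ("[5,6)", 41, 2, 11),
    ("[6,7)", 28, 3, 5),
    ("[7,8)", 20, 1, 8),
    ("[8,9)", 11, 2, 1),
    ("[9,10)", 8, 2, 6),
]


def life_table(rows):
    """Return (interval, mortality rate, 1 - mortality rate, cumulative survival)."""
    out = []
    cumulative = 1.0
    for interval, at_risk, dying, _censored in rows:
        mortality = dying / at_risk      # assumed convention: no censoring adjustment
        cumulative *= 1.0 - mortality    # survival is the running product
        out.append((interval, mortality, 1.0 - mortality, cumulative))
    return out


table = life_table(rows)
for interval, q, p, s in table:
    print(f"{interval:>7}  mortality={q:.2f}  1-mortality={p:.2f}  survival={s:.2f}")
```

The survival function is the running product of the (1 - mortality rate) column, which is why it can only decrease from one interval to the next.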

Special Features of Survival Data?
 Time-to-event - The primary interest of survival analysis is the
time to event.
 Time to event can be modeled by a distribution function
 Random variable
 The time to event of every unit would be observed if time went to infinity (or
approached a limit)
 The time to event is usually not normally distributed
 Censored - observations with incomplete information about the time to
event.
 General issues in survival data analysis
 The non-normality of survival data violates the normality
assumption of the most commonly used statistical models, such as regression
or ANOVA.
 Incompleteness may cause issues such as:
 Estimation bias.
 Difficulty in validating the assumptions

Censoring
 A censored observation is defined as an
observation with incomplete information
about the time to event.
 Different types of censoring, such as
right censoring, left censoring, and interval
censoring, etc.
 Right censoring --- The information about
time to event is incomplete because the
subject did not have an event during the
time when the subject was studied.
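
A right-censored data set is usually stored as a pair (observed time, status) per subject. The sketch below is illustrative only: the failure times and the study end are made-up numbers, and `right_censor` is a hypothetical helper. It uses the coding seen later in the SAS example, where status 0 marks a censored observation.

```python
def right_censor(event_times, study_end):
    """Return (observed_time, status) pairs; status 1 = event seen, 0 = right censored."""
    observed = []
    for t in event_times:
        if t <= study_end:
            observed.append((t, 1))          # event happened while under study
        else:
            observed.append((study_end, 0))  # only known: time to event > study_end
    return observed


true_times = [120.0, 450.0, 800.0, 1500.0]   # hypothetical hours to failure
data = right_censor(true_times, study_end=1000.0)
print(data)
```

For the censored unit, the recorded time is a lower bound on the true time to event, not the time to event itself.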

Overview of Statistical Methods
 Objectives:
 Characterize and estimate the distribution of the failure time;
 Compare failure times among different groups, e.g. generations of products (old vs.
new), treatment vs. control, etc.
 Assess the relationship of covariates to time-to-event, e.g. which factors
significantly affect the distribution of time-to-event?
 Approaches:
 To estimate the survival (hazard) function:
 parametric approach: specify a parametric model, i.e. a specific distribution
(exponential, Weibull, etc.)
 empirical approach: use nonparametric or semi-parametric estimation (more
popular in biomedical sciences), such as Kaplan–Meier estimator
 To compare two survival functions:
 Log-rank test
 To model the relationship between failure time and covariates:
 Cox proportional hazards model
 Accelerated failure time model
 Frailty model

Parametric Survival Model
 Parametric Survival Model
 Assumption on underlying distribution
 Hazard function, h(t), and survival function, S(t), are completely
specified
 Continuous process
 Prediction possible
 Main Assumption
 The survival time t is assumed to follow a distribution with density
function f (t). Specifying one of the three functions f(t), S(t), or h(t)
means to specify the other two functions.

$$S(t) = P(T > t) = \int_t^{\infty} f(u)\,du$$
$$h(t) = \frac{f(t)}{S(t)} = -\frac{\frac{d}{dt}S(t)}{S(t)}, \qquad S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$
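
The statement that any one of f(t), S(t), or h(t) determines the other two can be checked numerically. The following is a small sketch, assuming a hypothetical hazard h(t) = 2t chosen only for illustration (its closed-form survival function is exp(-t^2)). The function names are not from the slides.

```python
import math


def hazard(t):
    # Hypothetical hazard chosen for illustration: h(t) = 2t
    return 2.0 * t


def survival(t, steps=10_000):
    # S(t) = exp(-integral_0^t h(u) du), with the integral from the trapezoid rule
    du = t / steps
    integral = sum(0.5 * (hazard(i * du) + hazard((i + 1) * du)) * du
                   for i in range(steps))
    return math.exp(-integral)


def density(t, eps=1e-5):
    # f(t) = -dS/dt, approximated by a central difference
    return -(survival(t + eps) - survival(t - eps)) / (2.0 * eps)


t = 1.5
print("S(t) from the hazard :", survival(t))
print("S(t) closed form     :", math.exp(-t * t))
print("f(t)/S(t)            :", density(t) / survival(t), "vs h(t) =", hazard(t))
```

Starting from the hazard alone, the code recovers the survival function and the density, and the ratio f(t)/S(t) returns the hazard it started from.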

Weibull Model
 Assumption:
 Time to event, t, follows Weibull($\lambda$, $\gamma$) with probability density
function:
$$f(t) = \lambda\gamma t^{\gamma-1}\exp(-\lambda t^{\gamma}), \quad \text{where } \lambda, \gamma > 0$$
 The hazard function is given by:
$$h(t) = \lambda\gamma t^{\gamma-1}$$
 The survival function
$$S(t) = \exp(-\lambda t^{\gamma}) \;\Rightarrow\; -\log(S(t)) = \lambda t^{\gamma} \;\Rightarrow\; \log(-\log(S(t))) = \log(\lambda) + \gamma\log(t)$$
 Exponential Distribution – nice properties
 Flexible
 Graphical evaluation
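
The graphical evaluation rests on the fact that, for Weibull data, log(-log(S(t))) is a straight line in log(t). The sketch below assumes the parametrization S(t) = exp(-lam * t^gamma), with `lam` and `gamma_shape` as assumed symbol names and made-up parameter values. It fits the line by ordinary least squares and reads the shape off the slope and the scale off the intercept.

```python
import math

lam, gamma_shape = 0.002, 1.5            # hypothetical Weibull parameters
times = [10.0, 50.0, 100.0, 200.0, 400.0]
surv = [math.exp(-lam * t ** gamma_shape) for t in times]

# Weibull plot coordinates: x = log(t), y = log(-log(S(t)))
x = [math.log(t) for t in times]
y = [math.log(-math.log(s)) for s in surv]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(x, y)
print("shape estimate (slope)          :", slope)
print("scale estimate (exp(intercept)) :", math.exp(intercept))
```

With real data, S(t) would be replaced by an empirical estimate such as the Kaplan-Meier curve, and a clearly curved plot would suggest that a single Weibull distribution does not fit.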

Likelihood and Censored Survival Data
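 Likelihood estimate (right censored data):
 The likelihood function of parameter(s) $\theta$:
$$L(\theta; t) = \prod_{i=1}^{n} [f(t_i, \theta)]^{\delta_i}\,[S(t_i, \theta)]^{1-\delta_i}$$
where $\delta_i = 1$ if the event is observed for unit i and $\delta_i = 0$ if the observation is right censored.

 MLE $\hat{\theta}$ of $\theta$:
$$U(\theta; t) = \frac{\partial \ell(\theta; t)}{\partial \theta} = 0, \quad \text{where } \ell(\theta; t) \text{ is the log likelihood function}$$
$$\hat{\theta} \sim N(\theta, V) \text{ approximately, where } V = J^{-1} \text{ and } J \text{ denotes the Fisher information matrix}$$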

 Hypothesis Tests
 Score test
 Likelihood ratio test
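
For the exponential distribution, the censored likelihood has a closed-form maximum, which makes it a convenient worked case. This is a sketch under stated assumptions: the data are invented, `status` plays the role of the event indicator (1 = event, 0 = right censored), and the variance uses the observed information rather than the expected Fisher information.

```python
import math

times = [5.0, 8.0, 12.0, 3.0, 15.0, 15.0]   # hypothetical observed times
status = [1, 1, 1, 1, 0, 0]                 # 1 = event, 0 = right censored


def log_likelihood(rate):
    # Events contribute log f(t) = log(rate) - rate * t,
    # censored observations contribute log S(t) = -rate * t.
    return sum(d * math.log(rate) - rate * t for t, d in zip(times, status))


events = sum(status)
exposure = sum(times)
rate_hat = events / exposure        # solves the score equation events/rate - exposure = 0
var_hat = rate_hat ** 2 / events    # inverse of the observed information events/rate^2

print("MLE of the rate      :", rate_hat)
print("approx. standard err :", math.sqrt(var_hat))
print("log likelihood at MLE:", log_likelihood(rate_hat))
```

Censored units add exposure time to the denominator but no event to the numerator, so ignoring censoring (treating censored times as failures) would overestimate the failure rate.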

Semi-Parametric Model
 Cox PH Model - a very popular model in Biostatistics
 The distribution of time to event is unspecified, but proportional hazards are assumed.
 The baseline hazard is not needed to estimate the hazard ratio
 Semi-parametric - The baseline hazard can take any form, while the covariates enter
the model linearly
 Proportional hazard assumption
$$h(t \mid X) = h_0(t)\exp(X\beta)$$
$$\frac{h(t \mid X_1)}{h(t \mid X_0)} = \frac{h_0(t)\exp(X_1\beta)}{h_0(t)\exp(X_0\beta)} = \exp((X_1 - X_0)\beta)$$

 Parameter estimation – based on partial likelihood function
$$L(\beta) = \prod_{j=1}^{k} \frac{\exp(X_{[j]}\beta)}{\sum_{l \in R_j} \exp(X_l\beta)}$$

where $X_{[j]}$ denotes the covariate vector of the observation that actually experienced
the event at $t_j$, $R_j$ denotes the risk set at time $t_j$, and $k$ denotes the number of distinct event times.
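
The partial likelihood can be evaluated directly for a small data set, which shows that the baseline hazard never enters the calculation. The sketch below assumes a single binary covariate, no tied event times, and invented data; the crude grid search stands in for the Newton-type iterations that statistical packages use.

```python
import math

# (time, status, x): hypothetical data, status 1 = event, 0 = censored, no tied event times
data = [(2.0, 1, 1.0), (3.0, 0, 0.0), (5.0, 1, 0.0), (7.0, 1, 1.0), (9.0, 0, 1.0)]


def partial_log_likelihood(beta, data):
    total = 0.0
    for t_j, status_j, x_j in data:
        if status_j != 1:
            continue                      # only event times contribute a factor
        # Risk set R_j: everyone still under observation just before t_j
        risk_set = [x for (t, _, x) in data if t >= t_j]
        total += beta * x_j - math.log(sum(math.exp(beta * x) for x in risk_set))
    return total


# Crude maximization over a grid of beta values (illustration only)
grid = [i / 1000.0 for i in range(-3000, 3001)]
beta_hat = max(grid, key=lambda b: partial_log_likelihood(b, data))

print("partial log likelihood at beta = 0:", partial_log_likelihood(0.0, data))
print("grid estimate of beta             :", beta_hat)
print("estimated hazard ratio            :", math.exp(beta_hat))
```

Censored subjects appear only in the risk sets, which is how they contribute information without having an event time of their own.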

Cox PH Model
 Effect of treatment vs. control (X=1 vs. X=0)
$$HR = \exp(\hat{\beta})$$
$\exp(\hat{\beta})$ is the hazard of observations from the treatment group
relative to that of observations from the control group. It is an intuitive way of
understanding the influence of covariates on the hazard.
 Weibull model and proportional hazard
 If the shape parameter does not change but the scale parameter is influenced by
the covariates, Weibull model implies the assumption of proportional hazard
holds.

Let $\lambda = \exp(X\beta)$ in the Weibull model; then
$$\frac{h(t \mid X_1)}{h(t \mid X_0)} = \frac{\exp(X_1\beta)\,\gamma t^{\gamma-1}}{\exp(X_0\beta)\,\gamma t^{\gamma-1}} = \exp((X_1 - X_0)\beta)$$
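
A quick numerical check of this property: with a common shape parameter and a scale parameter of the form exp(X * beta), the hazard ratio between two covariate values is the same at every time point. The parameter values and names (`gamma_shape`, `beta`) are assumptions made for this sketch.

```python
import math

gamma_shape = 1.8   # hypothetical common shape parameter
beta = 0.7          # hypothetical regression coefficient


def weibull_hazard(t, x):
    lam = math.exp(x * beta)                       # scale depends on the covariate
    return lam * gamma_shape * t ** (gamma_shape - 1.0)


ratios = [weibull_hazard(t, 1.0) / weibull_hazard(t, 0.0)
          for t in (0.5, 1.0, 10.0, 100.0)]
print("hazard ratios over time:", ratios)
print("exp(beta)              :", math.exp(beta))
```

If the shape parameter also depended on the covariate, the ratio would change with t and the proportional hazard assumption would no longer hold.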

Accelerated Failure Time Model
 Accelerated failure time model (AFT)
 A parametric model that describes covariate effects in terms of
survival time instead of relative hazard, as in the Cox PH model. The
distribution has a scale parameter.
 Log-logistic distribution
 Other distributions, such as the Weibull distribution, Gamma distribution, etc.
 Assumption:
 The influence of a covariate is to multiply the predicted time to event (not the
hazard) by some constant. Therefore, it can be expressed as a linear
model for the logarithm of the survival time.
 Model:
$$S(t \mid X_1) = S(\phi t \mid X_2), \quad \text{where } \phi \text{ is the acceleration factor}$$
$$\log(t) = X\beta + \varepsilon$$

 Weibull distribution and AFT
$$\text{Assume } \frac{1}{\lambda^{1/\gamma}} = \exp(X\beta); \text{ then } \log(t) = X\beta + \frac{1}{\gamma}\varepsilon$$


Frailty Model
 Model Assumption:

$$h_j(t \mid X_j, \alpha_j) = h_0(t)\,\alpha_j \exp(X_j\beta)$$

 It is assumed that the frailty factor $\alpha_j$ follows a distribution (such as Gamma
or inverse Gaussian) with a mean of 1 and an unknown variance that can be
specified by a parameter.
 A frailty model is usually applied to a population that is likely to have a
mixture of hazards (heterogeneity). Some subjects are more
failure-prone and therefore more 'frail'.
 A random effect model - to account for unmeasured or unobserved
'frailties'.
 Weibull Model:
For the Weibull model with a simple gamma frailty assumption, $\alpha \sim g(1/\theta, \theta)$, we have:

$$h(t) = \lambda\gamma t^{\gamma-1}\,[S(t)]^{\theta}, \quad \text{where } S(t) = \left[1 + \theta\lambda t^{\gamma}\right]^{-1/\theta}$$

Non-Parametric Approach
 Kaplan-Meier survival curve
 The approach was published in 1958 by Edward L. Kaplan and Paul Meier in
their paper, “Non-parametric estimation from incomplete observations”. J. Am.
Stat. Assoc. 53:457-481. Kaplan and Meier were interested in the lifetime of
vacuum tubes and the duration of cancer, respectively.
 Also called the product limit method, since
$$\hat{S}(t) = \prod_{t_i \le t}\left(1 - \frac{d_i}{n_i}\right)$$
where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number of subjects at risk
just prior to time $t_i$.

 Confidence interval: Kalbfleisch and Prentice (2002) suggested using:
$$\hat{V}\left[\log(-\log(\hat{S}(t)))\right] = \frac{1}{(\log(\hat{S}(t)))^{2}} \sum_{t_i \le t} \frac{d_i}{n_i(n_i - d_i)}$$
to get a confidence interval for $\log(-\log(S(t)))$.
The confidence interval for $S(t)$ can be derived accordingly.
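
A from-scratch version of the product limit estimate with the log(-log) confidence interval can be written as follows. It is a sketch on invented data (status 1 = event, 0 = censored). The back-transformation S^exp(±z·se) is one common way to derive the interval for S(t) and is an assumption here; the code also assumes at least one subject remains at risk after each event time.

```python
import math


def kaplan_meier(times, status, z=1.96):
    """Return rows (t_i, n_i, d_i, S_hat, lower, upper) at each distinct event time."""
    rows, surv, var_sum = [], 1.0, 0.0
    for t_i in sorted({t for t, d in zip(times, status) if d == 1}):
        n_i = sum(1 for t in times if t >= t_i)                       # at risk just before t_i
        d_i = sum(1 for t, d in zip(times, status) if t == t_i and d == 1)
        surv *= 1.0 - d_i / n_i
        var_sum += d_i / (n_i * (n_i - d_i))                          # requires n_i > d_i
        se = math.sqrt(var_sum) / abs(math.log(surv))                 # s.e. of log(-log(S_hat))
        lower = surv ** math.exp(z * se)                              # back-transformed limits
        upper = surv ** math.exp(-z * se)
        rows.append((t_i, n_i, d_i, surv, lower, upper))
    return rows


times = [2, 3, 3, 5, 6, 8, 9, 12]     # hypothetical observed times
status = [1, 1, 0, 1, 0, 1, 1, 0]     # 1 = event, 0 = censored
km = kaplan_meier(times, status)
for t_i, n_i, d_i, s, lo, hi in km:
    print(f"t={t_i:>2}  n={n_i}  d={d_i}  S={s:.3f}  CI=({lo:.3f}, {hi:.3f})")
```

Because the interval is built on the log(-log) scale, its limits always stay between 0 and 1, which a symmetric interval on the S(t) scale does not guarantee.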

Non-Parametric Approach
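 Log-Rank test is used to test the equality of two survival
functions. For comparing two survival curves, we have:

$$Z = \frac{\sum_j (o_{1j} - e_{1j})}{\sqrt{\sum_j v_{1j}}}, \qquad Z^2 \sim \chi^2_1 \text{ under the null hypothesis}$$

where $o_{1j}$ and $e_{1j}$ are the observed and expected numbers of events in group 1 at the j-th event time, and
$v_{1j}$ is estimated based on a hypergeometric distribution.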
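
The statistic can be computed by walking through the pooled event times. This is a sketch on invented data for two groups; the function name `log_rank` and the data are illustrative, and the p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math


def log_rank(times1, status1, times2, status2):
    """Return (Z, chi-square, p-value) for the two-sample log-rank test."""
    pooled = ([(t, d, 1) for t, d in zip(times1, status1)]
              + [(t, d, 2) for t, d in zip(times2, status2)])
    o_minus_e, variance = 0.0, 0.0
    for t_j in sorted({t for t, d, _ in pooled if d == 1}):
        n = sum(1 for t, _, _ in pooled if t >= t_j)                 # at risk, both groups
        n1 = sum(1 for t, _, g in pooled if t >= t_j and g == 1)     # at risk, group 1
        d = sum(1 for t, s, _ in pooled if t == t_j and s == 1)      # events, both groups
        d1 = sum(1 for t, s, g in pooled if t == t_j and s == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n                                 # observed minus expected
        if n > 1:
            # Hypergeometric variance of the group 1 event count
            variance += d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1)
    z = o_minus_e / math.sqrt(variance)
    return z, z * z, math.erfc(abs(z) / math.sqrt(2.0))


group1 = ([2, 4, 6, 8], [1, 1, 1, 0])            # hypothetical (times, status)
group2 = ([3, 5, 9, 12, 15], [1, 0, 1, 1, 0])
z, chi_square, p_value = log_rank(*group1, *group2)
print("Z =", z, " chi-square =", chi_square, " p =", p_value)
```

Swapping the two groups changes the sign of Z but not the chi-square statistic, so the test does not depend on which group is labeled group 1.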

Example 1
 Example 1: Field survival data can be used to
further evaluate product quality and may indicate
possible quality related issues. The hazard
function for hard disk drive field returns (or
failures) shows a significant peak in early life.

 Commonly used parametric distribution models
such as the Weibull, Lognormal, or Logistic model
fit such a hazard function poorly. Therefore, the
Kaplan-Meier estimator and the Log-Rank test are used to
describe the survival functions and to evaluate the
effects of two factors of interest on drive field
survival, respectively.

[Figure: hazard function of field returns in three panels: Weibull fit, Lognormal fit, Logistic fit]

Example 1
 In addition, field survival data is observational. Propensity score matching is
applied to balance out possible effects of other factors (covariates). Results
both before and after matching are presented here.

Original Data
Test       Chi-Square   DF   ProbChiSq
Log-Rank   138.5724     1    <.0001

Description         HazardRatio   WaldLower   WaldUpper
GROUP1 vs. GROUP2   2.287         1.971       2.653

Matched Sample
Test       Chi-Square   DF   ProbChiSq
Log-Rank   1.2565       1    0.2613

Description         HazardRatio   WaldLower   WaldUpper
GROUP1 vs. GROUP2   1.151         0.643       2.060

Example 2
This is an example to demonstrate an application of the Cox PH model.
The time to event is the disease-free
time of an Acute Myelocytic Leukemia (AML) patient
after a special treatment. It is of interest to evaluate
whether the disease-free time after the treatment varies
by gender and by age.

Obs   Group              gender   age   Time   Status
1     AML-Low Risk       M        24    3395   0
2     AML-Low Risk       F        26    3471   0
5     AML-Mediate Risk   F        29    3034   0
6     AML-Mediate Risk   F        31    3676   0
9     AML-High Risk      F        32    4123   0
16    AML-High Risk      F        34    3328   0
17    AML-High Risk      F        35    2640   0
…     …                  …        …     …      …
273   AML-High Risk      M        74    16     1

Test of Equality over Strata
Test       Chi-Square   DF   Pr > Chi-Square
Log-Rank   26.9998      5    <.0001

Part of the data used in this example is from an
example published by SAS

Example 2
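• SAS code
/* Cox PH model with stepwise selection; Status = 0 marks a censored observation */
proc phreg data=Example2;
   class gender group;                      /* categorical covariates */
   model Time*Status(0) = age group gender
         / selection=stepwise;
run;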

Analysis of Maximum Likelihood Estimates

Parameter             DF   Parameter Estimate   Standard Error   Chi-Square   Pr > ChiSq   Hazard Ratio
age                   1     0.15180             0.01229          152.5961     <.0001       1.164
Group AML-High Risk   1     0.46243             0.19063            5.8844     0.0153       1.588
Group AML-Low Risk    1    -0.18436             0.20569            0.8034     0.3701       0.832

Summary of Stepwise Selection

Step   Effect Entered   Effect Removed   DF   Number In   Score Chi-Square   Wald Chi-Square   Pr > ChiSq
1      age                               1    1           169.3010                             <.0001
2      Group                             2    2            13.1022                             0.0014

Test of Equality over Strata
Test       Chi-Square   DF   Pr > Chi-Square
Log-Rank   17.1657      2    <.0002

The modeling result suggests that the effect of gender on
the survival function after the transplant is not statistically significant,
but the effects of age and severity group are significant.
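
The hazard ratio and chi-square columns of the estimates table follow directly from the parameter estimates and standard errors, which gives a quick sanity check when reading such output. This sketch assumes the chi-square column is the Wald statistic (estimate / standard error)^2. The 95% Wald limits are computed here for illustration and do not appear on the slide.

```python
import math

# (parameter estimate, standard error) as reported in the PHREG output
estimates = {
    "age": (0.15180, 0.01229),
    "Group AML-High Risk": (0.46243, 0.19063),
    "Group AML-Low Risk": (-0.18436, 0.20569),
}

results = {}
for name, (b, se) in estimates.items():
    hazard_ratio = math.exp(b)
    wald_chi_square = (b / se) ** 2           # assumed Wald statistic
    lower = math.exp(b - 1.96 * se)           # illustrative 95% Wald limits
    upper = math.exp(b + 1.96 * se)
    results[name] = (hazard_ratio, wald_chi_square, lower, upper)
    print(f"{name:<20} HR={hazard_ratio:.3f}  chi-square={wald_chi_square:.2f}  "
          f"95% CI=({lower:.3f}, {upper:.3f})")
```

The hazard ratio for age applies per one-year increase in age; under the model, the ratio for a k-year difference is exp(k × estimate).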

Discussion – Parametric Models
 Nice properties
 Efficient data reduction – a function with a few parameters completely
describes a survival pattern.
 Enable Standardized comparison – evaluation and comparison based on
statistics such as MTTF
 Prediction into future possible
 Possible issues
 Assumptions
 Non-informative censoring
 Parametric distribution
 Exponential family – flexible enough?
 One vs. multiple distributions – three Weibull distributions for describing a bathtub hazard?
 How confident are we about the future survival path?
 Estimation
 Distribution – usually non symmetric
 Sample size and time period covered by observations
 Censoring

Discussion – Cox PH Model
 Nice properties:
 Parametric distribution assumption is not needed.
 Easy to evaluate or test the hypotheses about the effect of a covariate on survival
 Very popular in clinical trial analysis and outcome studies
 Possible issues:
 Proportional hazard – a strong assumption
 When violated, stratified or extended Cox models may be used.
 Tests of the assumption
 log(-log(S(t))) plot
 Including interactions with time in the model
 Scaled Schoenfeld residuals plot
 Estimation
 Censored observation – not informative
 Similar issues as seen in a multivariate regression model

Discussion – Non-Parametric Approach
 Nice properties:
 Distribution free
 Graphical and intuitive
 Describes observed survival well
 Possible issues
 Not continuous
 Estimates can be biased when improperly stratified. For example,
survival function estimates in the tail can be poor.
 Smoothing is usually needed when estimating hazard function
 Not informative in terms of future survival function
 When survival or hazard curves cross, the Log-Rank test is not
appropriate.

Discussion – Estimation Improvement
 Bayesian based survival analysis approaches
 Introducing prior knowledge to improve parameter
estimation
 Application of multiple imputation to survival
analysis
 May reduce the effect of censored observations.
 The availability of a large body of historical observations may be
informative for the imputation.

Summary
 Survival analysis has found applications in many fields. It can be powerful in
providing insight for evaluating product reliability, monitoring field
quality, assisting in setting warranty policy, validating new drug efficacy, etc.
 The parametric, distribution-based approach is probably the most popular survival
analysis approach in reliability engineering, while the Cox PH model and the
non-parametric approach are usually favored in biostatistical survival analysis.
 Each approach comes with its own assumptions and is designed to meet a
specified purpose. Validation of these assumptions should always be conducted to
ensure the appropriate applications of an approach.
 Censored data – a major characteristic for survival data that contributes to the
uniqueness of survival data analysis and possible issues in model estimation. It
should always be kept in mind when designing related experiments and analyzing
survival data.

Questions?

Thanks!
Contact Email: shao_zhang100@yahoo.com
