
Survival Models

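Survival models are closely related to generalized linear models (GLMs) and are widely used for modeling censored data, for example in user churn prediction. This post introduces the most basic survival model and maximum likelihood estimation (MLE) on censored data.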

https://my-imgshare.oss-cn-shenzhen.aliyuncs.com/58319465_p0.jpg

Assume $T$ is a continuous random variable indicating the time at which death occurs, with density $f(t)$. Its cumulative distribution function is:

$$ F(t) = P\lbrace T < t\rbrace = \int_0^t f(x) dx \tag{1.1} $$

The survival function is then:

$$ S(t) = P\lbrace T > t\rbrace = 1 - F(t) = \int_t^\infty f(x) dx \tag{1.2} $$

An alternative way to characterize the distribution is the hazard function, or instantaneous rate of occurrence of the event:

$$ \begin{aligned} \lambda(t) &= \lim_{dt \to 0} \frac{P\lbrace t \le T < t + dt \mid T \ge t\rbrace}{dt} \\ &= \lim_{dt \to 0} \frac{P\lbrace t \le T < t + dt \rbrace}{P \lbrace T \ge t\rbrace dt} \\ &= \lim_{dt \to 0} \frac{f(t)dt}{S(t) dt} \\ &= \frac{f(t)}{S(t)} \end{aligned} \tag{2.1} $$

Given (1.2), we have $\frac{d}{dt} S(t) = -f(t)$, so (2.1) can be rewritten as:

$$ \lambda(t) = -\frac{d}{dt} \log S(t) \tag{2.2} $$

We can also derive the survival function from the hazard function. Integrating (2.2) from $0$ to $t$ and using $S(0) = 1$ gives:

$$ S(t) = \exp\left\lbrace - \int_0^t \lambda(x)dx \right\rbrace = \exp\lbrace -\Lambda(t) \rbrace \tag{2.3} $$

where $\Lambda(t) = \int_0^t \lambda(x)dx$ is called the cumulative hazard.
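As a quick numerical check of (2.3), the sketch below integrates a hazard with the trapezoidal rule and exponentiates the result. The linear hazard $\lambda(t) = 2t$ and the helper names `cumulative_hazard` and `survival` are assumptions chosen for illustration; they are not part of the original derivation.

```python
import math

def cumulative_hazard(hazard, t, steps=10_000):
    """Approximate Lambda(t), the integral of hazard(x) over [0, t], with the trapezoidal rule."""
    if t == 0:
        return 0.0
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        total += hazard(k * h)
    return total * h

def survival(hazard, t):
    """S(t) = exp(-Lambda(t)), as in equation (2.3)."""
    return math.exp(-cumulative_hazard(hazard, t))

# Hypothetical hazard that grows linearly with time: lambda(t) = 2t,
# so Lambda(t) = t^2 and S(t) = exp(-t^2) in closed form.
def linear_hazard(t):
    return 2.0 * t

s_numeric = survival(linear_hazard, 1.5)
s_exact = math.exp(-1.5 ** 2)
print(s_numeric, s_exact)
```

The two printed values should agree closely, because the trapezoidal rule has no discretization error on a linear integrand.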


Here we model a constant risk over time:

$$ \lambda(t) = \lambda $$

From (2.3), we can solve for the corresponding survival function, and (2.1) then gives the pdf as $f(t) = \lambda(t)S(t)$:

$$ \begin{aligned} S(t) &= \exp\left\lbrace - \int_0^t \lambda(x)dx \right\rbrace = e^{-\lambda t} \\ f(t) &= \lambda e^{-\lambda t} \end{aligned} $$

That is exactly an exponential distribution.


Given $S(t)$ or $\lambda(t)$, the expected value of $T$ can be written as:

$$ \mu = \int_0^\infty tf(t)dt = \int_0^\infty S(t)dt $$

The second equality follows from integration by parts, assuming $tS(t) \to 0$ as $t \to \infty$.
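The two expressions for $\mu$ can be compared numerically. The sketch below assumes an exponential lifetime with rate $\lambda = 0.5$ (so the mean should be $1/\lambda = 2$) and truncates the infinite integrals at an arbitrary upper bound of 60; both the rate and the bound are illustrative choices.

```python
import math

def trapezoid(g, a, b, steps=100_000):
    """Integrate g over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for k in range(1, steps):
        total += g(a + k * h)
    return total * h

lam = 0.5     # assumed constant hazard
upper = 60.0  # truncation point; the tail beyond it is negligible for this rate

# mu as the integral of t * f(t), with f(t) = lam * exp(-lam * t)
mean_from_pdf = trapezoid(lambda t: t * lam * math.exp(-lam * t), 0.0, upper)

# mu as the integral of S(t), with S(t) = exp(-lam * t)
mean_from_survival = trapezoid(lambda t: math.exp(-lam * t), 0.0, upper)

print(mean_from_pdf, mean_from_survival, 1.0 / lam)
```

All three printed numbers should be close to 2, up to integration error.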

Typically there are 2 types of censored observation:

  1. Type I
    • A sample of $n$ units is followed for a fixed time $\tau$
    • Generalization, fixed censoring: each unit has a fixed time $\tau_i$

In the cases above, the number of deaths is a random variable.

  2. Type II
    • A sample of $n$ units is followed as long as necessary until $d$ units have experienced the event
    • Generalization, random censoring: each unit has:
      • Censoring time $C_i$
      • Potential lifetime $T_i$
      • Observed time $Y_i = \min\lbrace C_i, T_i\rbrace$
      • Indicator $d_i$ (also written $\delta_i$) that tells us whether the observation was terminated by death or by censoring
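To make random censoring concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that both the potential lifetime $T_i$ and the censoring time $C_i$ are exponentially distributed; the function name `simulate_censored` and the rates are hypothetical.

```python
import random

def simulate_censored(n, death_rate, censor_rate, seed=0):
    """Return a list of (y_i, d_i) pairs under random censoring."""
    rng = random.Random(seed)
    observations = []
    for _ in range(n):
        t = rng.expovariate(death_rate)   # potential lifetime T_i
        c = rng.expovariate(censor_rate)  # censoring time C_i
        y = min(c, t)                     # observed time Y_i = min{C_i, T_i}
        d = 1 if t <= c else 0            # d_i = 1: death observed, 0: censored
        observations.append((y, d))
    return observations

data = simulate_censored(n=5, death_rate=0.5, censor_rate=0.2)
for y, d in data:
    print(f"y={y:.3f}  d={d}")
```

Only the pairs $(Y_i, d_i)$ are available to the analyst; for a censored unit the true lifetime $T_i$ is never seen.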
Each observed unit contributes to the likelihood in one of two ways:

  1. Unit died at $t_i$. We know that it survived until $t_i$ and then died, so its contribution is the density at $t_i$:

$$ L_i = f(t_i) = S(t_i)\lambda(t_i) \tag{3.1} $$

  2. Unit still alive at $t_i$. We only know that it survived until $t_i$, so its contribution is the survival function:

$$ L_i = S(t_i) \tag{3.2} $$

Combining the 2 cases above, with $d_i = 1$ for a death and $d_i = 0$ for a censored unit, we have:

$$ L = \prod\limits_{i=1}^{n}L_i = \prod\limits_{i=1}^{n} \lambda(t_i)^{d_i}S(t_i) \tag{3.3} $$

Taking logs and using (2.3), i.e. $\log S(t_i) = -\Lambda(t_i)$, we have:

$$ \log L = \sum\limits_{i=1}^{n} \lbrace d_i\log\lambda(t_i) - \Lambda(t_i) \rbrace \tag{3.4} $$
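Equation (3.4) translates almost directly into code. The sketch below takes the hazard and the cumulative hazard as plain Python callables; the function name `log_likelihood` and the toy data are assumptions made for illustration.

```python
import math

def log_likelihood(times, deaths, hazard, cum_hazard):
    """Censored-data log-likelihood (3.4): sum of d_i * log(lambda(t_i)) - Lambda(t_i)."""
    total = 0.0
    for t, d in zip(times, deaths):
        if d:  # only observed deaths contribute the log-hazard term
            total += math.log(hazard(t))
        total -= cum_hazard(t)  # every unit contributes -Lambda(t_i)
    return total

# Hypothetical toy data: observed times and death indicators
times = [2.0, 3.5, 1.2, 4.0]
deaths = [1, 0, 1, 0]

# Constant hazard lambda(t) = 0.3, so Lambda(t) = 0.3 * t
lam = 0.3
ll = log_likelihood(times, deaths, hazard=lambda t: lam, cum_hazard=lambda t: lam * t)
print(ll)
```

Passing the hazard as a callable keeps the function independent of any particular distribution, so the same code can be reused with a non-constant hazard.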


For the exponential distribution, $\lambda(t) = \lambda$ and $\Lambda(t) = \lambda t$, so from (3.4) we have:

$$ \log L = \sum\limits_{i=1}^{n} \lbrace d_i\log\lambda - \lambda t_i \rbrace $$

We can estimate $\lambda$ using MLE.

Let $D = \sum d_i$ denote the total number of deaths and $T = \sum t_i$ denote the total observation time (here $T$ is a sum over units, not the random lifetime defined earlier):

$$ \begin{aligned} \log L &= D\log\lambda - T\lambda \\ \frac{\partial}{\partial \lambda} \log L &= \frac{D}{\lambda} - T \end{aligned} $$

Setting $\frac{\partial}{\partial \lambda} \log L = 0$, we get the estimate of $\lambda$:

$$ \hat \lambda = \frac{D}{T} $$
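The estimator $\hat\lambda = D/T$ can be tried on simulated data. The sketch below assumes Type I censoring at a fixed time $\tau = 3$ and a true rate of 0.5; these values, the seed, and the name `exponential_mle` are illustrative assumptions. For contrast it also computes a naive estimate that discards censored units.

```python
import random

def exponential_mle(times, deaths):
    """MLE of a constant hazard under censoring: total deaths D over total observed time T."""
    total_deaths = sum(deaths)
    total_time = sum(times)
    return total_deaths / total_time

rng = random.Random(42)
true_rate = 0.5  # assumed true hazard
tau = 3.0        # assumed fixed follow-up time (Type I censoring)

times, deaths = [], []
for _ in range(20_000):
    lifetime = rng.expovariate(true_rate)
    times.append(min(lifetime, tau))            # observed time
    deaths.append(1 if lifetime <= tau else 0)  # 1 if death was observed

rate_hat = exponential_mle(times, deaths)

# Naive estimate that drops censored units: it sees only short lifetimes
naive_rate = sum(deaths) / sum(t for t, d in zip(times, deaths) if d)

print(rate_hat, naive_rate)
```

The MLE should land near the true rate, while the naive estimate is biased upward because the time survived by censored units is thrown away.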

