Survival Models

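Survival models, usually presented as part of the generalized linear model (GLM) family, are widely used to model censored data, for example in user churn prediction. This post introduces the most basic survival models and maximum likelihood estimation (MLE) on censored data.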

https://my-imgshare.oss-cn-shenzhen.aliyuncs.com/58319465_p0.jpg

Survival Function

Let $T$ be a continuous random variable denoting the time at which death occurs, with probability density function $f(t)$. Its cumulative distribution function is:

$$ F(t) = P\lbrace T < t\rbrace = \int_0^t f(x) dx \tag{1.1} $$

The survival function is then the probability of still being alive at time $t$:

$$ S(t) = P\lbrace T > t\rbrace = 1 - F(t) = \int_t^\infty f(x) dx \tag{1.2} $$

Hazard Function

An alternative way to characterize the distribution is the hazard function, the instantaneous rate of occurrence of the event:

$$ \begin{align} \lambda(t) &= \lim_{dt \to 0} \frac{P\lbrace t \le T < t + dt | T \ge t\rbrace}{dt} \\ &= \lim_{dt \to 0} \frac{P\lbrace t \le T < t + dt \rbrace}{P \lbrace T \ge t\rbrace dt} \\ &= \lim_{dt \to 0} \frac{f(t)dt}{S(t) dt} \\ &= \frac{f(t)}{S(t)} \end{align} \tag{2.1} $$

Given $(1.2)$ we have $\frac{d}{dt} S(t) = -f(t)$, so $(2.1)$ has another form

$$ \lambda(t) = -\frac{d}{dt} \log S(t) \tag{2.2} $$

Integrating $(2.2)$ from $0$ to $t$ with the boundary condition $S(0) = 1$ recovers the survival function from the hazard function:

$$ S(t) = \exp\lbrace - \int_0^t \lambda(x)dx \rbrace = \exp\lbrace -\Lambda(t) \rbrace \tag{2.3} $$

where $\Lambda(t) = \int_0^t \lambda(x)dx$ is called the cumulative hazard.


Example 2.1

Here we model a constant risk over time:

$$ \lambda(t) = \lambda $$

From $(2.3)$ we can solve for the corresponding survival function, and $f(t) = \lambda(t)S(t)$ from $(2.1)$ gives the pdf:

$$ \begin{align} S(t) &= \exp\lbrace - \int_0^t \lambda dx \rbrace = e^{-\lambda t} \\ f(t) &= \lambda e^{-\lambda t} \end{align} $$

This is exactly the exponential distribution with rate $\lambda$.
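As a quick numerical check of $(2.3)$ on the constant-hazard example, the sketch below integrates a hazard function with the trapezoid rule and compares the result with the closed form $e^{-\lambda t}$. The helper names (`cumulative_hazard`, `survival`) and the rate value `0.3` are illustrative assumptions, not part of the original text.

```python
import math

def cumulative_hazard(hazard, t, steps=1000):
    """Trapezoid-rule approximation of Lambda(t), the integral of hazard over [0, t]."""
    width = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(k * width) for k in range(1, steps))
    return total * width

def survival(hazard, t, steps=1000):
    """S(t) = exp(-Lambda(t)), equation (2.3)."""
    return math.exp(-cumulative_hazard(hazard, t, steps))

rate = 0.3  # illustrative value of lambda, chosen only for this check

def constant_hazard(x):
    """Constant risk over time, as in Example 2.1."""
    return rate

t = 2.0
s_numeric = survival(constant_hazard, t)   # numerical S(t)
s_closed = math.exp(-rate * t)             # closed form e^{-lambda t}
f_closed = constant_hazard(t) * s_closed   # f(t) = lambda(t) * S(t), equation (2.1)
print(s_numeric, s_closed, f_closed)
```

Any other non-negative hazard can be passed as the `hazard` callable; only the accuracy of the trapezoid approximation changes.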


Expectation of Life

Given $S(t)$ or $\lambda(t)$, the expected value of $T$ is easy to compute:

$$ \mu = \int_0^\infty tf(t)dt =\int_0^\infty S(t)dt $$
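The second equality is not spelled out in the text. One way to see it, assuming $tS(t) \to 0$ as $t \to \infty$ (which holds whenever the mean is finite), is integration by parts with $f(t) = -\frac{d}{dt}S(t)$:

$$ \begin{align} \mu &= \int_0^\infty t f(t) dt = -\int_0^\infty t \frac{d}{dt}S(t) dt \\ &= -\Big[ tS(t) \Big]_0^\infty + \int_0^\infty S(t) dt \\ &= \int_0^\infty S(t) dt \end{align} $$

For the exponential distribution of Example 2.1 this gives $\mu = \int_0^\infty e^{-\lambda t} dt = \frac{1}{\lambda}$.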

Censoring and the likelihood function

Censoring Type

  1. Type I. Two typical observation schemes:
    • A sample of $n$ units is followed for a fixed time $\tau$
    • Generalization, fixed censoring: each unit has its own fixed time $\tau_i$

In both cases the number of deaths is a random variable.

  2. Type II
    • A sample of $n$ units is followed as long as necessary until $d$ units have experienced the event. Here the number of deaths is fixed and the total observation time is random
    • Generalization, random censoring: each unit has:
      • Censoring time $C_i$
      • Potential lifetime $T_i$
      • Observed time $Y_i = \min\lbrace C_i, T_i\rbrace$
      • Indicator $d_i$ (also written $\delta_i$), equal to $1$ if the observation ended in death and $0$ if it was censored
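The random censoring scheme can be sketched as a small simulation. This is a minimal illustration that assumes both the potential lifetime and the censoring time are exponentially distributed; the function name `simulate_censored` and the rates are hypothetical choices, not taken from the text.

```python
import random

def simulate_censored(n, death_rate, censor_rate, seed=0):
    """Return a list of (y_i, d_i) pairs under random censoring.

    Assumption: T_i ~ Exp(death_rate) and C_i ~ Exp(censor_rate), independent.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        lifetime = rng.expovariate(death_rate)    # potential lifetime T_i
        censor = rng.expovariate(censor_rate)     # censoring time C_i
        observed = min(lifetime, censor)          # Y_i = min{C_i, T_i}
        died = 1 if lifetime <= censor else 0     # d_i = 1 for death, 0 for censoring
        data.append((observed, died))
    return data

sample = simulate_censored(n=5, death_rate=0.5, censor_rate=0.2)
for observed, died in sample:
    print(f"y = {observed:.3f}, d = {died}")
```

Only `observed` and `died` would be available in real data; the underlying `lifetime` and `censor` values are never both seen.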

Likelihood of censoring model

  1. Unit died at $t_i$. We know it survived until $t_i$ and then died, so it contributes the density at $t_i$: $$ L_i = f(t_i) = S(t_i)\lambda(t_i) \tag{3.1} $$

  2. Unit still alive at $t_i$. We only know that its lifetime exceeds $t_i$, so it contributes the survival probability: $$ L_i = P\lbrace T > t_i \rbrace = S(t_i) \tag{3.2} $$

Combining the two cases, where the hazard term appears only for units that died, we have:

$$ L = \prod\limits_{i=1}^{n}L_i = \prod\limits_{i=1}^{n} \lambda(t_i)^{d_i}S(t_i) \tag{3.3} $$

Taking logs and using $\log S(t) = -\Lambda(t)$ from $(2.3)$, we have:

$$ \log L = \sum\limits_{i=1}^{n} \lbrace d_i\log\lambda(t_i) - \Lambda(t_i) \rbrace \tag{3.4} $$
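Equation $(3.4)$ translates almost directly into code. The sketch below is one possible implementation, assuming the hazard and cumulative hazard are supplied as plain Python callables; the function name and the toy data are illustrative only.

```python
import math

def censored_log_likelihood(times, deaths, hazard, cum_hazard):
    """Equation (3.4): sum over units of d_i * log(lambda(t_i)) - Lambda(t_i)."""
    total = 0.0
    for t, d in zip(times, deaths):
        if d:  # the log-hazard term only applies to observed deaths
            total += math.log(hazard(t))
        total -= cum_hazard(t)  # every unit contributes -Lambda(t_i)
    return total

# Toy data (made up for illustration): observed times and death indicators.
times = [2.0, 3.0, 1.0]
deaths = [1, 0, 1]

rate = 0.5  # illustrative constant hazard
log_lik = censored_log_likelihood(
    times,
    deaths,
    hazard=lambda t: rate,
    cum_hazard=lambda t: rate * t,
)
print(log_lik)
```

Skipping the log-hazard term for censored units avoids evaluating $\log \lambda(t_i)$ where it contributes nothing.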


Example 3.1

Consider the exponential distribution, $\lambda(t) = \lambda$ and $\Lambda(t) = \lambda t$. From $(3.4)$, we have $$ \log L = \sum\limits_{i=1}^{n} \lbrace d_i\log\lambda - \lambda t_i \rbrace $$

We could estimate $\lambda$ using MLE:

Let $D=\sum d_i$ denote the total number of deaths and $T = \sum t_i$ the total observation time (here $T$ is a sum of observed times, not the random lifetime used earlier):

$$ \begin{align} \log L &= D\log\lambda - T\lambda \\ \frac{\partial}{\partial \lambda} \log L &= \frac{D}{\lambda} - T \end{align} $$

Setting $\frac{\partial}{\partial \lambda} \log L = 0$ gives the estimate of $\lambda$:

$$ \hat \lambda = \frac{D}{T} $$
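The estimator is simply deaths per unit of observed time. A minimal sketch on made-up data, with a sanity check that the log-likelihood is lower at nearby values of $\lambda$ (function names and numbers are illustrative assumptions):

```python
import math

def exponential_mle(times, deaths):
    """Closed-form MLE for a constant hazard: total deaths D over total time T."""
    total_deaths = sum(deaths)   # D
    total_time = sum(times)      # T
    return total_deaths / total_time

def exponential_log_likelihood(lam, times, deaths):
    """log L = D * log(lambda) - T * lambda for the exponential model."""
    return sum(deaths) * math.log(lam) - lam * sum(times)

# Toy data (made up): four units, one of them censored at t = 3.0.
times = [2.0, 3.0, 1.0, 4.0]
deaths = [1, 0, 1, 1]

lam_hat = exponential_mle(times, deaths)
ll_at_hat = exponential_log_likelihood(lam_hat, times, deaths)
ll_below = exponential_log_likelihood(0.9 * lam_hat, times, deaths)
ll_above = exponential_log_likelihood(1.1 * lam_hat, times, deaths)
print(lam_hat, ll_at_hat, ll_below, ll_above)
```

Note that the censored unit still adds its observed time to the denominator, even though it contributes no death to the numerator.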


Reference