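Survival models belong to the generalized linear model (GLM) family and are widely used to model censored data, for example in user churn prediction. This post introduces the most basic survival model and maximum likelihood estimation (MLE) on censored data.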

Assume $T$ is a continuous random variable that denotes the time at which death occurs, with density $f(t)$. Its cumulative distribution function is:
$$
F(t) = P\{T < t\} = \int_0^t f(x)\,dx \tag{1.1}
$$
Then the survival function is:
$$
S(t) = P\{T \ge t\} = 1 - F(t) = \int_t^\infty f(x)\,dx \tag{1.2}
$$
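An alternative way to characterize the distribution is the hazard function, or instantaneous rate of occurrence of the event:
$$
\lambda(t) = \lim_{dt \to 0} \frac{P\{t \le T < t + dt \mid T \ge t\}}{dt} = \lim_{dt \to 0} \frac{P\{t \le T < t + dt\}}{P\{T \ge t\}\,dt} = \lim_{dt \to 0} \frac{f(t)\,dt}{S(t)\,dt} = \frac{f(t)}{S(t)} \tag{2.1}
$$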
Given (1.2) we have $\frac{d}{dt}S(t) = -f(t)$, so (2.1) has another form:
$$
\lambda(t) = -\frac{d}{dt}\log S(t) \tag{2.2}
$$
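We can also derive the survival function from the hazard function by integrating (2.2) with the boundary condition $S(0) = 1$:
$$
S(t) = \exp\left\{-\int_0^t \lambda(x)\,dx\right\} = \exp\{-\Lambda(t)\} \tag{2.3}
$$
where $\Lambda(t) = \int_0^t \lambda(x)\,dx$ is called the cumulative hazard.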
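To make the relationship between the hazard, the cumulative hazard, and the survival function concrete, here is a minimal numerical sketch. The Weibull-type hazard with unit scale and the value of `SHAPE` are assumptions chosen only because the closed form $S(t) = \exp\{-t^k\}$ is known; they do not come from the derivation above.

```python
import math

SHAPE = 1.5  # hypothetical shape parameter k, for illustration only

def hazard(t):
    # Assumed Weibull-type hazard with unit scale: lambda(t) = k * t**(k - 1)
    return SHAPE * t ** (SHAPE - 1)

def cumulative_hazard(t, n_steps=10_000):
    # Trapezoidal approximation of Lambda(t) = integral_0^t lambda(x) dx
    h = t / n_steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, n_steps):
        total += hazard(i * h)
    return total * h

def survival(t):
    # S(t) = exp(-Lambda(t)), the relation between survival and cumulative hazard
    return math.exp(-cumulative_hazard(t))

t = 2.0
numeric = survival(t)
closed_form = math.exp(-t ** SHAPE)  # known closed form for this hazard
print(numeric, closed_form)
```

The two printed values should agree up to the error of the numerical integration.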
As the simplest example, here we model a constant risk over time:
$$
\lambda(t) = \lambda
$$
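Solving (2.2) for the survival function, and using $f(t) = \lambda(t)S(t)$ from (2.1) for the pdf, we get:
$$
\begin{aligned}
S(t) &= \exp\left\{-\int_0^t \lambda(x)\,dx\right\} = e^{-\lambda t} \\
f(t) &= \lambda e^{-\lambda t}
\end{aligned}
$$
That is exactly an exponential distribution.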
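Given $S(t)$ or $\lambda(t)$, it is easy to express the expected value of $T$. Integrating by parts (and assuming $tS(t) \to 0$ as $t \to \infty$):
$$
\mu = \int_0^\infty t f(t)\,dt = \int_0^\infty S(t)\,dt
$$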
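As a quick sanity check of the identity for the expected value, the sketch below integrates the exponential survival function numerically and compares the result with $1/\lambda$, the known mean of an exponential distribution. The rate `0.5` and the truncation point `upper=60.0` are assumptions made for illustration.

```python
import math

def mean_from_survival(survival, upper, n_steps=100_000):
    # Trapezoidal approximation of integral_0^upper S(t) dt.
    # The infinite upper limit is truncated at `upper`, which is safe
    # only when S(t) is negligible beyond that point.
    h = upper / n_steps
    total = 0.5 * (survival(0.0) + survival(upper))
    for i in range(1, n_steps):
        total += survival(i * h)
    return total * h

rate = 0.5  # hypothetical constant hazard

def exp_survival(t):
    # Survival function of the exponential distribution: S(t) = exp(-rate * t)
    return math.exp(-rate * t)

mu = mean_from_survival(exp_survival, upper=60.0)
print(mu, 1 / rate)
```

The numerical integral should be very close to `1 / rate`.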
Censoring and the likelihood function
Typically there are two types of observation schemes:
- Type I
  - A sample of $n$ units is followed for a fixed time $\tau$
  - Generalization, fixed censoring: each unit has its own fixed time $\tau_i$
  - In the cases above, the number of deaths is a random variable
- Type II
  - A sample of $n$ units is followed as long as necessary until $d$ units have experienced the event
  - Generalization, random censoring: each unit has:
    - Censoring time $C_i$
    - Potential lifetime $T_i$
    - Observed time $Y_i = \min\{C_i, T_i\}$
    - Indicator $d_i$ (also written $\delta_i$) that tells us whether the observation is terminated by death ($d_i = 1$) or by censoring ($d_i = 0$)
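The random censoring scheme can be sketched with a small simulation. Drawing both the lifetime and the censoring time from exponential distributions is an assumption made only for illustration, and `simulate_censored`, `death_rate`, and `censor_rate` are hypothetical names.

```python
import random

def simulate_censored(n, death_rate, censor_rate, seed=0):
    # Returns observed times Y_i and death indicators d_i for n units.
    rng = random.Random(seed)
    times, died = [], []
    for _ in range(n):
        lifetime = rng.expovariate(death_rate)   # potential lifetime T_i (assumed exponential)
        censor = rng.expovariate(censor_rate)    # censoring time C_i (assumed exponential)
        times.append(min(lifetime, censor))      # observed time Y_i = min(C_i, T_i)
        died.append(1 if lifetime <= censor else 0)  # d_i = 1 for death, 0 for censoring
    return times, died

times, died = simulate_censored(n=5, death_rate=0.2, censor_rate=0.1)
for y, d in zip(times, died):
    print(f"{y:.2f}", "death" if d else "censored")
```

Only the pairs of observed time and indicator are available to the analyst; the lifetime of a censored unit is never seen.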
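Each unit contributes to the likelihood in one of two ways:
- Unit died at $t_i$. We know it survived until $t_i$ and then died, so its contribution is the density at $t_i$:
$$
L_i = f(t_i) = S(t_i)\lambda(t_i) \tag{3.1}
$$
- Unit still alive at $t_i$. We only know it survived until $t_i$, so its contribution is the survival probability:
$$
L_i = S(t_i) \tag{3.2}
$$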
Combining the two cases above with the death indicator $d_i$, we have:
$$
L = \prod_{i=1}^n L_i = \prod_{i=1}^n \lambda(t_i)^{d_i} S(t_i) \tag{3.3}
$$
Taking logs and using (2.3), so that $\log S(t_i) = -\Lambda(t_i)$, we have:
$$
\log L = \sum_{i=1}^n \left\{ d_i \log \lambda(t_i) - \Lambda(t_i) \right\} \tag{3.4}
$$
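The log-likelihood (3.4) translates almost directly into code. The sketch below takes the hazard and the cumulative hazard as callables and evaluates it on a tiny made-up data set; the four observations and the constant rate `0.2` are invented for illustration and are not real data.

```python
import math

def log_likelihood(times, died, hazard, cumulative_hazard):
    # (3.4): sum over units of d_i * log(lambda(t_i)) - Lambda(t_i)
    return sum(
        d * math.log(hazard(t)) - cumulative_hazard(t)
        for t, d in zip(times, died)
    )

# Made-up observations: observed times and death indicators (1 = death, 0 = censored)
times = [2.0, 5.0, 1.5, 7.0]
died = [1, 0, 1, 0]

rate = 0.2  # hypothetical constant hazard, so Lambda(t) = rate * t
ll = log_likelihood(times, died, lambda t: rate, lambda t: rate * t)
print(ll)
```

Censored units contribute only through the cumulative hazard term, since their `d` is zero.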
For the exponential distribution, $\lambda(t) = \lambda$ and $\Lambda(t) = \lambda t$, so (3.4) becomes:
$$
\log L = \sum_{i=1}^n \left\{ d_i \log \lambda - \lambda t_i \right\}
$$
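We can estimate $\lambda$ using MLE.
Let $D = \sum d_i$ denote the total number of deaths and $T = \sum t_i$ the total observation time (here $T$ is a sum of observed times, not the random variable defined at the beginning):
$$
\begin{aligned}
\log L &= D \log \lambda - T\lambda \\
\frac{\partial \log L}{\partial \lambda} &= \frac{D}{\lambda} - T
\end{aligned}
$$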
Setting $\frac{\partial \log L}{\partial \lambda} = 0$ gives the estimate of $\lambda$:
$$
\hat{\lambda} = \frac{D}{T}
$$
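The estimator is simply the number of deaths divided by the total observed time, which the following sketch checks on simulated data. The exponential censoring mechanism, the rates, the sample size, and the name `exponential_mle` are all assumptions made for illustration.

```python
import random

def exponential_mle(times, died):
    # lambda_hat = D / T: total number of deaths over total observed time
    return sum(died) / sum(times)

rng = random.Random(42)
true_rate, censor_rate, n = 0.2, 0.1, 20_000  # hypothetical settings

times, died = [], []
for _ in range(n):
    lifetime = rng.expovariate(true_rate)   # potential lifetime (assumed exponential)
    censor = rng.expovariate(censor_rate)   # censoring time (assumed exponential)
    times.append(min(lifetime, censor))     # observed time
    died.append(1 if lifetime <= censor else 0)

rate_hat = exponential_mle(times, died)
naive = len(times) / sum(times)  # treats every unit as a death, ignoring censoring
print(rate_hat, naive, true_rate)
```

With this many units the MLE should land close to `true_rate`, while the naive estimate that ignores censoring is biased upward.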