
Log-Linear Models, MEMMs, and CRFs

Possibly the most formula-heavy post I have ever written. Typed until I coughed blood.

https://my-imgshare.oss-cn-shenzhen.aliyuncs.com/13010631_p0_master1200.jpg
Will your PGM's world lines converge like the ones in Steins Gate?

Let the input be $x$ and the output be $y$. An HMM models the joint distribution $P(x, y | \lambda)$, so it is a generative model and can be used both for inference $P(y|x, \lambda)$ and for decoding $P(x|y, \lambda)$. A CRF models the conditional distribution $P(y|x, \lambda)$, so it is a discriminative model.

| Symbol | Meaning |
| --- | --- |
| $x \in \mathcal X$ | input/feature |
| $y \in \mathcal Y$ | output/label |
| $\vec{\phi}(x, y): \mathcal X \times \mathcal Y \rightarrow R^d$ | feature function, maps $(x, y)$ to a $d$-dimensional feature vector in $R^d$ |
| $\vec{w} \in R^d$ | weights for $\vec \phi$ |

Log Linear Models take the following form:

$$
\begin{align}
P(y|x; \vec w) &= \frac{1}{Z} \exp(\vec w \cdot \vec \phi(x, y)) \\
Z &= \sum\limits_{y_i \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x, y_i))
\end{align} \tag{1.1}
$$
Here $Z$ is a normalization constant that makes the outputs form a probability distribution. Because this is a conditional probability, the model is discriminative. $\vec w \cdot \vec \phi$ represents the "plausibility" of $y$ given $x$, and the core of the model is the feature function $\vec \phi$.

  1. $\vec \phi$ is really a collection of different feature functions; each one matches a different pattern and scores it, which is what gives the model its discriminative power. For example, in logistic regression: $\vec \phi_k(\vec x, y) = \begin{cases} x_k, & y=1 \\ 0, & y=0 \end{cases}$ The feature functions give the model its two-class discrimination ability (no matter what $\vec x$ is, once different $y$ appear the model has to tell them apart).
  2. $\vec w$ lets the model give each feature in $\vec \phi$ a different weight, so the model can judge how important each feature is.
  3. The exponential keeps the scores non-negative.
  4. Normalizing by $Z$ turns the scores into a probability distribution.
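To make (1.1) concrete, here is a minimal NumPy sketch (the block feature map `phi` and the three-label setup are made up for illustration) that turns the scores $\vec w \cdot \vec \phi(x, y)$ into a conditional distribution:

import numpy as np

def log_linear_prob(w, phi, x, labels):
    """P(y|x; w) for every y in labels, following (1.1)."""
    scores = np.array([w @ phi(x, y) for y in labels])  # w . phi(x, y)
    scores -= scores.max()                              # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()                            # divide by Z

# Toy setup: 3 labels, phi copies x into the block belonging to the chosen label.
def phi(x, y):
    v = np.zeros(3 * len(x))
    v[y * len(x):(y + 1) * len(x)] = x
    return v

w = np.random.randn(9)
x = np.array([1.0, -0.5, 2.0])
print(log_linear_prob(w, phi, x, labels=[0, 1, 2]))  # sums to 1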

Given a sample set $S = \lbrace (x_i, y_i) \rbrace^{n}_{i=1}$ of size $n$, the log-likelihood under the log-linear model is
$$
\begin{align}
\log P(S | \vec w) = \log L(\vec w) &= \log \prod\limits_{i=1}^n P(y_i|x_i; \vec w) \\
&= \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w)
\end{align} \tag{1.2}
$$

The parameters $\vec w$ are estimated by maximum likelihood:
$$
\vec w^{*} = \arg \max\limits_{\vec w \in R^d} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w)
$$
A common way to optimize this is gradient descent (on the negative log-likelihood), where the gradient is computed as follows:

$$
\begin{align}
\frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) \\
&= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\sum\limits_{y_j \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x_i, y_j))} \\
&= \sum\limits_{i=1}^n \vec \phi(x_i, y_i) - \sum\limits_{i=1}^n \sum\limits_{y_j \in \mathcal Y} P(y_j|x_i; \vec w) \vec \phi(x_i, y_j)
\end{align} \tag{1.3}
$$

Regularization is an important way to improve model performance. With an L2 penalty $\frac{1}{2} ||\vec w||^2$, the loss simply becomes the regularized negative log-likelihood $-\log L(\vec w) + \frac{1}{2} ||\vec w||^2$.
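Correspondingly, here is a sketch of gradient ascent on the (L2-regularized) log-likelihood using (1.3); it reuses the hypothetical `log_linear_prob` and `phi` from the snippet above:

def log_likelihood_grad(w, data, phi, labels, l2=0.0):
    """Gradient of log L(w) per (1.3), minus the gradient of the L2 penalty."""
    grad = -l2 * w
    for x, y in data:
        probs = log_linear_prob(w, phi, x, labels)
        grad += phi(x, y)                                             # observed features
        grad -= sum(p * phi(x, yj) for p, yj in zip(probs, labels))   # expected features
    return grad

def fit(data, phi, labels, dim, lr=0.1, steps=200, l2=0.01):
    """Plain gradient ascent; any first-order optimizer would do."""
    w = np.zeros(dim)
    for _ in range(steps):
        w = w + lr * log_likelihood_grad(w, data, phi, labels, l2)
    return w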

Take logistic regression as an example. Define:
$$
\begin{align}
\vec x &\in R^n \\
y &\in \lbrace 0, 1 \rbrace \\
\vec \phi_k(\vec x, y) &= \begin{cases} x_k, & y=1 \\ 0, & y=0 \end{cases}
\end{align} \tag{e.0}
$$
From definition $(1.1)$, the conditional probability is
$$
P(y|\vec x; \vec w) = \frac{\exp(\vec w \cdot \vec \phi(\vec x, y))}{\exp(\vec w \cdot \vec \phi(\vec x, 1)) + 1} \tag{e.1}
$$

(1.2)(1.2)可知对数似然 logL(w)=i=1nlogp(yixi;w)=i=1nlogexp(wϕ(xi,yi))exp(wϕ(xi,1))+exp(wϕ(xi,0))=i=1nlogexp(wϕ(xi,yi))exp(wxi)+1=i=1n[wϕ(xi,yi)log(exp(wxi)+1)](e.2) \begin{align} \log L(\vec w) &= \sum\limits_{i=1}^n \log p(y_i|x_i; \vec w) \\ &= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec \phi(x_i, 1)) + \exp(\vec w \cdot \vec \phi(x_i, 0))} \\ &= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec x_i) + 1} \\ &= \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)] \end{align} \tag{e.2} (1.3)(1.3)求出梯度 logL(w)w=wi=1n[wϕ(xi,yi)log(exp(wxi)+1)]=i=1n[ϕ(xi,yi)exp(wxi)xiexp(wxi)+1]=i=1n[ϕ(xi,yi)P(1xi;w)xi](e.3) \begin{align} \frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)] \\ &= \sum\limits_{i=1}^n [\vec \phi(x_i, y_i) - \frac{\exp(\vec w \cdot \vec x_i) \vec x_i}{\exp(\vec w \cdot \vec x_i) + 1}] \\ &= \sum\limits_{i=1}^n [\vec \phi(x_i, y_i) - P(1|\vec x_i; \vec w) \vec x_i] \end{align} \tag{e.3} 事实上, 当y{0,1}y \in \lbrace 0, 1 \rbrace时, (e.0)(e.0)中的特征函数可以简化为ϕ(x,y)=yx\vec \phi(\vec x, y) = y \vec x, (e.2)(e.2), (e.3)(e.3)可以进一步简化, 但当yy为其他取值时不通用, 因此没有扔进去算

Generalizing log-linear models to sequences gives the MEMM.

| Symbol | Meaning |
| --- | --- |
| $x \in \mathcal X$ | input, for example the $j$'th word in a sentence |
| $\vec x = (x_1,x_2, \cdots, x_T)$ | input sequence |
| $s^i \in \mathcal S, 1 \leq i \leq \lvert \mathcal S \rvert$ | state or output, e.g. the tag of a word |
| $\vec s = (s_1,s_2, \cdots, s_T)$ | state sequence |
| $\vec \phi(x, y): \mathcal X \times \mathcal Y \rightarrow R^d$ | feature function, maps $(x, y)$ to a $d$-dimensional feature vector in $R^d$ |
| ${\vec w} \in R^d$ | weights for $\vec \phi$ |

In the MEMM, the state chain is a Markov chain satisfying the local Markov property, while the transition and emission matrices are described by feature functions. The whole model describes the conditional distribution:
$$
\begin{align}
P((s_1,s_2, \cdots, s_T)|\vec x) &= \prod\limits_{i=1}^{T} P(s_i | (s_1,s_2, \cdots, s_{i-1}), \vec x) \\
&= \prod\limits_{i=1}^{T} P(s_i | s_{i-1}, \vec x)
\end{align} \tag{2.1}
$$
where

  1. The first step uses the chain rule of conditional probabilities, which splits a joint probability into conditionals: $$\begin{align} P(A_n, \cdots, A_1) &= P(A_n | A_{n-1}, \cdots, A_1) P(A_{n-1}, \cdots, A_1) \\ P(A_4, A_3, A_2, A_1) &= P(A_4|A_3, A_2, A_1) P(A_3|A_2, A_1) P(A_2|A_1) P(A_1) \end{align}$$
  2. The second step uses the independence (Markov) assumption: $P(s_i|s_{i-1}, s_{i-2}, \cdots, s_1) = P(s_i|s_{i-1})$

To compute $(2.1)$, we use a log-linear model for the conditional probability of each output given the input and the previous output:
$$
P(s_i | s_{i-1}, \vec x; \vec w) = \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )} \tag{2.2}
$$

In $\vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr )$:

  1. $\vec x = (x_1, \cdots, x_T)$ is the entire input sequence
  2. $i$ is the position of the output
  3. $s_{i-1}$ is the previous output
  4. $s_i$ is the current output

Once pairs of $(x_1,x_2, \cdots, x_T)$ and $(s_1,s_2, \cdots, s_T)$ are available, learning can be done by maximum likelihood, as in $(1.2)$ and $(1.3)$.
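A minimal sketch of the local factor $(2.2)$: given the whole input $\vec x$, the position $i$, and the previous state, it returns a distribution over the current state (the feature map `phi_memm` is a hypothetical stand-in):

import numpy as np

def memm_local_prob(w, phi_memm, x, i, s_prev, states):
    """P(s_i | s_{i-1}, x; w) from (2.2), for every candidate s_i in states."""
    scores = np.array([w @ phi_memm(x, i, s_prev, s) for s in states])
    scores -= scores.max()              # numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()            # normalize over s' in S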

Prediction differs from the plain log-linear model. There, given the input we only need the distribution over a single output; an MEMM, given the input, describes $|\mathcal S|^T$ possible output sequences, so finding the output sequence with the highest conditional probability becomes a task in itself:
$$
\arg \max\limits_{(s_1\cdots s_T)} P\bigl ((s_1,s_2, \cdots, s_T)|\vec x \bigr )
$$
This is exactly the same situation as the Viterbi algorithm in HMMs, and it is solved with dynamic programming:

$$
\begin{align}
\pi[i, s] &= \max\limits_{s_1 \cdots s_{i-1}} P \bigl ( (s_1, s_2, \cdots, s_{i}=s) | \vec x \bigr ) \\
&= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x )\, P \bigl ( (s_1, s_2, \cdots, s_{i-1}) |\vec x \bigr ) \\
&= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x ) \prod\limits_{j=1}^{i-1} P ( s_j|s_{j-1}, \vec x )
\end{align} \tag{2.3}
$$

The recurrence is
$$
\pi[i, s] = \max\limits_{s' \in \mathcal S} \pi[i-1, s'] \cdot P(s_i=s | s_{i-1}=s', \vec x) \tag{2.4}
$$
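The recurrences $(2.3)$/$(2.4)$ in code; a small sketch that reuses the hypothetical `memm_local_prob` above and keeps backpointers so the best sequence can be recovered (probabilities are multiplied directly here, so for long sequences log scores would be preferable):

def memm_viterbi(w, phi_memm, x, states, start_state):
    """argmax over state sequences of (2.1), via the recurrence (2.4)."""
    T = len(x)
    pi = [dict() for _ in range(T)]   # pi[i][s]: best probability of a prefix ending in s
    bp = [dict() for _ in range(T)]   # backpointers
    first = memm_local_prob(w, phi_memm, x, 0, start_state, states)
    for k, s in enumerate(states):
        pi[0][s], bp[0][s] = first[k], None
    for i in range(1, T):
        # one local distribution per possible previous state
        local = {sp: memm_local_prob(w, phi_memm, x, i, sp, states) for sp in states}
        for k, s in enumerate(states):
            best_prev = max(states, key=lambda sp: pi[i - 1][sp] * local[sp][k])
            pi[i][s] = pi[i - 1][best_prev] * local[best_prev][k]
            bp[i][s] = best_prev
    # backtrack from the best final state
    s_last = max(pi[T - 1], key=pi[T - 1].get)
    seq = [s_last]
    for i in range(T - 1, 0, -1):
        seq.append(bp[i][seq[-1]])
    return list(reversed(seq))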

When computing the transition & emission probabilities, the HMM and the MEMM use the following models respectively:
$$
\begin{align}
P(s_i|s_{i-1})P(x_i|s_i) &= A_{s_{i-1}s_i}B_{s_i}(o_i)\\
P(s_i | s_{i-1}, \vec x) &= \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )}
\end{align}
$$
The key advantage of the MEMM is that the feature function $\vec\phi$ can capture much more information.

  • In the HMM, $A \in R^{|\mathcal S| \times |\mathcal S|}$ and $B \in R^{|\mathcal S| \times |\mathcal O|}$ are both two-dimensional matrices used to describe transitions and emissions.
  • In the MEMM, the feature function $\vec\phi \in R^{|\mathcal X| \times T \times |\mathcal S| \times |\mathcal S|}$, so its capacity to capture features is far greater than the HMM's.

For example, the transition probability can be sensitive to any word in the input sequence x1xTx_1 \cdots x_T. In addition, it is very easy to introduce features that are sensitive to spelling features (e.g., prefixes or suffixes) of the current word xix_i, or of the surrounding words. These features are useful in many NLP applications, and are difficult to incorporate within HMMs in a clean way.

Moreover, the HMM describes a joint probability while the MEMM describes a conditional probability; the latter usually achieves higher precision.

The job of a conditional random field is still to describe the conditional probability of the sequence, $P(\vec s | \vec x)$:
$$
P(\vec s | \vec x; \vec w) = \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)} \tag{3.1}
$$
This time, however, the feature function is defined directly on the whole input sequence paired with the whole output sequence.

The feature vector maps an entire input sequence $\vec x$ paired with an entire state sequence $\vec s$ to some $d$-dimensional feature vector:
$$
\vec \Phi(\vec x, \vec s): \mathcal X^T \times \mathcal S^T \rightarrow R^d
$$

But thanks to the local Markov property, the feature function decomposes in a simple way:
$$
\begin{align}
\vec \Phi(\vec x, \vec s) &= \sum\limits_{i=1}^T \vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr )
\end{align} \tag{3.2}
$$
The $\vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr )$ here is the same as in the MEMM.
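The decomposition $(3.2)$ in code; `phi_local` stands in for the same kind of local feature function as in the MEMM section, and the sequence `s` is assumed to start with the start state $s_0$:

import numpy as np

def global_features(phi_local, x, s, d):
    """Phi(x, s) = sum_i phi(s_{i-1}, s_i, x, i), as in (3.2)."""
    Phi = np.zeros(d)
    for i in range(1, len(s)):
        Phi += phi_local(s[i - 1], s[i], x, i)
    return Phi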

Because the CRF maps a sequence to a sequence, computing probabilities is somewhat more involved.

The unnormalized probability is written as
$$
\begin{align}
\hat P(\vec s | \vec x; \vec w) &= \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr) \\
&= \exp \Bigl ( \vec w \cdot \sum\limits_{i=1}^T [ \vec t(s_{i-1}, s_i, \vec x, i) + \vec e(s_i, \vec x, i) ] \Bigr ) \\
&= \exp \Bigl ( \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i) \Bigr )
\end{align} \tag{3.3}
$$
Here $\vec t, \vec e$ are the transition and emission features: transition features are defined on edges and depend on the current and previous positions, while emission features are defined on nodes and depend only on the current position. $\lambda, \mu$ are their weights, and the CRF is completely defined by $\lambda, t, \mu, e$.

Assuming the transition matrix does not depend on the position or the input, transition and emission can be written as:
$$
\begin{align}
t[i][j] &= \lambda t(s^i, s^j) & s^i, s^j \in \mathcal S \\
e[i][j] &= \mu e(x_i, s^j) & x_i \in \vec x, s^j \in \mathcal S
\end{align}
$$

To handle boundary conditions, define a start state $s_0$ and an end state $s_{-1}$, i.e.
$$
\vec s = (s_0, s_1, \cdots, s_T, s_{-1})
$$

The denominator of $(3.1)$ is a global normalization factor; it requires summing the unnormalized probabilities of $|\mathcal S|^T$ paths. This is the same situation as computing $P(O|\lambda)$ in an HMM, where we also sum over all state sequences, and both can be handled with the forward/backward algorithm.

As an example, suppose $\mathcal S = \lbrace a, b \rbrace$:
$$
\begin{align}
\alpha_1(a) = \hat P \bigl ( (s_0),s_1=a |(x_1) \bigr ) &= \exp \bigl ( t(0, a) + e(a, x_1) \bigr ) \\
\alpha_1(b) = \hat P \bigl ( (s_0),s_1=b |(x_1) \bigr ) &= \exp \bigl ( t(0, b) + e(b, x_1) \bigr )
\end{align} \tag{3.4}
$$
$$
\begin{align}
\alpha_2(a) &= \hat P \bigl ( (s_0, s_1), s_2=a |(x_1, x_2) \bigr ) \\
&= \exp \bigl ( t(0, a) + e(a, x_1) + t(a, a) + e(a, x_2) \bigr ) \\
&+ \exp \bigl ( t(0, b) + e(b, x_1) + t(b, a) + e(a, x_2) \bigr ) \\
&= \alpha_1(a) \cdot \exp \bigl ( t(a, a) + e(a, x_2) \bigr ) \\
&+ \alpha_1(b) \cdot \exp \bigl ( t(b, a) + e(a, x_2) \bigr )
\end{align} \tag{3.5}
$$
From $(3.4)$ and $(3.5)$ we can read off the recurrence
$$
\alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \alpha_{t}(s^j) \cdot \exp\bigl (t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.6}
$$
Pulling $\alpha_t$ inside the exponential gives
$$
\alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \log \alpha_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.7}
$$
and working in log space,
$$
\beta_{t+1}(s^i) = \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \beta_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.8}
$$
where $\beta_t(s^i) = \log \alpha_t(s^i) = \log \hat P \bigl ( s_t=s^i |(x_1 \cdots x_t) \bigr )$
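A NumPy sketch of the log-space recurrence $(3.8)$. The matrix layout is an assumption for this example: `trans[j, i]` holds $t(s^j, s^i)$, `emit[i, t]` holds $e(s^i, x_{t+1})$, and `start` is the index of the start state $s_0$; for simplicity the sketch omits the transition into the end state:

import numpy as np

def log_sum_exp_vec(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def crf_log_Z(trans, emit, start):
    """log of the denominator of (3.1), computed with the forward recurrence (3.8)."""
    S, T = emit.shape
    beta = trans[start] + emit[:, 0]          # beta_1(s^i), cf. (3.4)
    for t in range(1, T):
        beta = np.array([log_sum_exp_vec(beta + trans[:, i] + emit[i, t])
                         for i in range(S)])  # (3.8)
    return log_sum_exp_vec(beta)              # sum the final betas over all states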

Given $\vec x$, we want $\arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w)$:

$$
\begin{align}
\arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w) &= \arg \max\limits_{\vec s \in \mathcal S^T} \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)} \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \vec \Phi(\vec x, \vec s) \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \sum\limits_{i=1}^T \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i)
\end{align}
$$

This is again a maximum-weight path problem over the transition and emission scores, solved with the Viterbi algorithm:

  1. Initialization: $$\delta_1(s^i) = t(s_0, s_1=s^i) + e(s^i, x_1) \\ \phi_1(s_1) = 0 \tag{3.9.1}$$
  2. Recursion. For $t=[2, T]$: $$\delta_t(s^i) = \max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)] + e(s^i, x_t) \\ \phi_t(s^i) = \arg\max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)] \tag{3.9.2}$$
  3. Termination: $$P = \max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i) \\ I_T = \arg\max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i) \tag{3.9.3}$$
  4. Backtrack the optimal state sequence. For $t=T-1, T-2, \cdots, 1$: $$I_t = \phi_{t+1}(I_{t+1}) \tag{3.9.4}$$

where
$$
\delta_t(s^i)=\max\limits_{(s_1 \cdots s_{t-1})} P \left ((s_1 \cdots s_{t-1}), s_t=s^i | \vec x; \vec w \right ) \tag{3.9.5}
$$

We can speed this up with matrix operations (a.k.a. the warp drive), making full use of broadcasting to simplify the code (and lower its readability).

Note that $\vec s = (s_0, s_1, \cdots, s_T, s_{-1})$, and that $|\mathcal S|$ includes the start and end states.

Data structures:
$$
\begin{align}
\vec \beta_t(s) &= \begin{bmatrix} \beta_t(s^1) \\ \vdots \\ \beta_t(s^{|\mathcal S|}) \end{bmatrix} \in R^{|\mathcal S|}\\
T &= \begin{bmatrix} t(s^1, s^1) & \cdots & t(s^1, s^{|\mathcal S|}) \\ \vdots & \ddots & \vdots \\ t(s^{|\mathcal S|}, s^1) & \cdots & t(s^{|\mathcal S|}, s^{|\mathcal S|}) \end{bmatrix} \in R^{|\mathcal S| \times |\mathcal S|} \\
E &= \begin{bmatrix} e(s^1, x_{1}) & \cdots & e(s^1, x_{T}) \\ \vdots & \ddots & \vdots \\ e(s^{|\mathcal S|}, x_{1}) & \cdots & e(s^{|\mathcal S|}, x_{T}) \end{bmatrix} \in R^{|\mathcal S| \times T}
\end{align} \tag{a.1}
$$
Then, by $(3.8)$,
$$
\begin{align}
\beta_{t+1}(s^i) &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left ( \begin{bmatrix} \beta_t(s^1) \\ \vdots \\ \beta_t(s^{|\mathcal S|}) \end{bmatrix} + \begin{bmatrix} t(s^1, s^i) \\ \vdots \\ t(s^{|\mathcal S|}, s^i) \end{bmatrix} + \begin{bmatrix} e(s^i, x_{t+1}) \\ \vdots \\ e(s^i, x_{t+1}) \end{bmatrix} \right ) \\
&= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T[:, i] + E[i, t+1]\right )
\end{align} \tag{e.1}
$$
and with broadcasting it is not hard to see that
$$
\begin{align}
\vec \beta_{t+1}^T(s) &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T + E[:, t+1]^T \right )
\end{align} \tag{e.2}
$$

(e.1)(e.1)中, E[i,t+1]E[i, t+1]被广播到βt(s)+T[:,i]\vec \beta_t(s) + T[:, i]的每一行 在(e.2)(e.2)中, βt(s)\vec \beta_t(s)被广播到TT的每一列, ET[:,t+1]E^T[:, t+1]被广播到T的每一行, 最后逐列求log_sum_exp

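The two methods below come from a PyTorch sequence-labeling module (in the spirit of the PyTorch CRF tutorial listed in the references). Judging from the code itself, `self.trans[i][j]` holds the transition score from tag `j` to tag `i`, tag `0` is the start tag, tag `self.num_label - 1` is the end tag, and `self._lstm_forward` produces the emission scores.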
def neg_log_likelihood(self, x, s):
    """
    Compute the unnormalized score of the given tag sequence, call _forward_algo
    for the normalization factor, and return the negative log probability.
    :param x: input, words encoded in a sentence
    :param s: output, tag sequence to calculate probability
    :return: negative log probability of given tag sequence
    """
    score = autograd.Variable(torch.FloatTensor([0]))
    # add the start and end tags
    s = torch.cat([torch.LongTensor([0]),
                   s,
                   torch.LongTensor([self.num_label-1])])
    # emission scores come straight from the LSTM
    emits = self._lstm_forward(x)
    for i in range(len(emits)):
        score = score + self.trans[s[i+1], s[i]] + emits[i][s[i+1]]
    score = score + self.trans[-1, s[-2]]
    # compute the normalization factor
    forward_score = self._forward_algo(emits)
    return forward_score - score

def _forward_algo(self, emits):
    """
    Forward algorithm: compute the normalization factor, i.e. the sum of the
    unnormalized scores of all |S|^T paths.
    :param emits: emit scores for each word
                  Tensor(n, m), n = words in the sentence; m = num_label
    :return: Tensor(1, 1), Z(s) used to normalize probability
    """
    beta_init = torch.FloatTensor(1, self.num_label).fill_(-10000.)
    beta_init[0][0] = 0.                     # init start tag
    beta_t = autograd.Variable(beta_init)    # track grad
    # apply equation (e.2)
    for emit_t in emits:
        sum_score = beta_t.view(1, -1) + self.trans + emit_t.view(-1, 1)
        beta_t = log_sum_exp(sum_score).view(1, -1)

    terminal_var = beta_t + self.trans[-1].view(1, -1)
    Z = log_sum_exp(terminal_var)
    return Z
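`_forward_algo` relies on a `log_sum_exp` helper that is not shown in the excerpt. A minimal, numerically stable version consistent with how it is used above (reducing along `dim=1`) might look like this; in recent PyTorch, `torch.logsumexp(scores, dim=1)` does the same job:

import torch

def log_sum_exp(scores):
    # log(sum(exp(scores))) along dim=1, using the max trick for numerical stability
    max_score, _ = torch.max(scores, dim=1, keepdim=True)
    return max_score.squeeze(1) + torch.log(torch.sum(torch.exp(scores - max_score), dim=1))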

Following $(a.1)$, define $$\vec \delta_t = \begin{bmatrix} \delta_t(s^1) \\ \vdots \\ \delta_t(s^{|\mathcal S|}) \end{bmatrix} \tag{e.3}$$

  1. For $t \in [0, T-1]$: $$\vec \delta_{t+1} = \max(\vec \delta_t + T)^T + E[:, t+1] \tag{e.4}$$
  2. For $t=T$, there is no $E$ term in this step: $$\vec \delta_{t+1} = \max(\vec \delta_t + T)^T \tag{e.5}$$

where $\max$ takes the maximum of each column.

def _viterbi_decode(self, emits):
    """
    Decode the tag sequence with the highest un-normalized log-probability (max_score)
    :param emits: emit scores for each word
    :return: max_score: un-normalized log-probability (max_score)
             tag_seq: tag sequence with highest max_score
    """
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.FloatTensor(1, self.num_label).fill_(-10000.)
    init_vvars[0][0] = 0.

    # forward_var at step i holds the viterbi variables for step i-1
    # actually no need to calculate grad for decoding
    forward_var = autograd.Variable(init_vvars)
    for emit_t in emits:
        # bptrs_t holds the backpointers for this step
        # viterbivars_t holds the viterbi variables for this step
        next_step_score = self.trans + forward_var
        viterbivars_t, bptrs_t = torch.max(next_step_score, dim=1)
        forward_var = viterbivars_t.view(1, -1) + emit_t.view(1, -1)
        backpointers.append(bptrs_t.data.numpy())

    # Transition to the end tag. We won't add terminal_var into backpointers
    terminal_var = forward_var + self.trans[-1].view(1, -1)
    max_score, best_tag = torch.max(terminal_var, dim=1)
    best_tag = best_tag.data[0]

    # Follow the back pointers to decode the best path.
    tag_seq = [best_tag]
    for bptrs_t in reversed(backpointers):
        best_tag = bptrs_t[best_tag]
        tag_seq.append(best_tag)

    # Pop off the start tag (we don't want to return that to the caller)
    start = tag_seq.pop()
    assert start == 0  # Sanity check
    tag_seq.reverse()
    return max_score, tag_seq
  1. 《统计学习方法》(Statistical Learning Methods), 李航 (Hang Li)
  2. http://www.cs.columbia.edu/~mcollins/notes-spring2013.html (Michael Collins, Columbia NLP course notes)
  3. http://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html (PyTorch advanced NLP tutorial)