Possibly the most formula-dense post ever written here. I typed these until I coughed blood.
Will your PGM's world lines converge like they do in Steins;Gate?
Let the input be $x$ and the output be $y$. An HMM models the joint distribution $P(x, y \mid \lambda)$, so it is a generative model and can be used both for inference $P(y \mid x, \lambda)$ and for decoding $P(x \mid y, \lambda)$. A CRF models the conditional distribution $P(y \mid x, \lambda)$, so it is a discriminative model.
| Symbol | Meaning |
| --- | --- |
| $x \in \mathcal X$ | input/feature |
| $y \in \mathcal Y$ | output/label |
| $\vec{\phi}(x, y): \mathcal X \times \mathcal Y \rightarrow R^d$ | feature function, maps $(x, y)$ to a $d$-dimensional feature vector in $R^d$ |
| $\vec{w} \in R^d$ | weights for $\vec \phi$ |
Log Linear Models take the following form:
$$
\begin{align}
P(y|x; \vec w) &= \frac{1}{Z} \exp(\vec w \cdot \vec \phi(x, y)) \\
Z &= \sum\limits_{y_i \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x, y_i))
\end{align}
\tag{1.1}
$$
Here $Z$ is a normalization constant that makes the outputs form a probability distribution. Since this is a conditional probability, the model is discriminative.

$\vec w \cdot \vec \phi$ represents the "plausibility" of $y$ given $x$; the core of the model is the feature function $\vec \phi$.

$\vec \phi$ is really a collection of feature functions: each one matches a different pattern and scores it, which is what gives the model its discriminative power. For example, in logistic regression:
$$
\vec \phi_k(\vec x, y) =
\begin{cases}
x_k, & y=1 \\
0, & y=0
\end{cases}
$$
- The feature function gives the model its ability to classify (whatever $\vec x$ is, different $y$ must be told apart).
- $\vec w$ lets the model weigh each feature of $\vec \phi$ differently, giving it the ability to judge feature importance.
- The exponential keeps the scores non-negative.
- Normalizing by $Z$ turns them into a probability distribution.
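As a concrete illustration of $(1.1)$, here is a minimal NumPy sketch: a hypothetical feature function `phi` maps $(x, y)$ to a feature vector, and $P(y \mid x; \vec w)$ is a softmax of $\vec w \cdot \vec \phi(x, y)$ over all labels. All names are made up for illustration.

```python
import numpy as np

def phi(x, y, num_labels):
    """Hypothetical feature function: the block belonging to label y is a copy of x,
    all other blocks are zero, so phi(x, y) lives in R^(num_labels * len(x))."""
    vec = np.zeros(num_labels * len(x))
    vec[y * len(x):(y + 1) * len(x)] = x
    return vec

def log_linear_prob(x, y, w, num_labels):
    """P(y | x; w) from eq. (1.1): softmax of w . phi(x, y') over all labels y'."""
    scores = np.array([w @ phi(x, y_i, num_labels) for y_i in range(num_labels)])
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[y]

x = np.array([1.0, -2.0, 0.5])
w = np.random.randn(3 * len(x)) * 0.1            # 3 labels
print(log_linear_prob(x, y=2, w=w, num_labels=3))
```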
Given a sample set of size $n$, $S = \lbrace (x_i, y_i) \rbrace^{n}_{i=1}$, the log-likelihood under the Log Linear Model is
$$
\begin{align}
\log P(S | \vec w) = \log L(\vec w) &= \log \prod\limits_{i=1}^n P(y_i|x_i; \vec w) \\
&= \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w)
\end{align}
\tag{1.2}
$$
The parameters $\vec w$ are estimated by maximum likelihood:
$$
\vec w^{*} = \arg \max\limits_{\vec w \in R^d} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w)
$$
A common optimization method is gradient descent (on the negative log-likelihood), where the gradient is computed as follows:
$$
\begin{align}
\frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) \\
&= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\sum\limits_{y_j \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x_i, y_j))} \\
&= \sum\limits_{i=1}^n \vec \phi(x_i, y_i) - \sum\limits_{i=1}^n \sum\limits_{y_j \in \mathcal Y} P(y_j|x_i; \vec w) \vec \phi(x_i, y_j)
\end{align}
\tag{1.3}
$$
Regularization is an important way to improve model performance. With an L2 penalty $\frac{1}{2} \Vert \vec w \Vert^2$, the objective simply becomes maximizing $\log L(\vec w) - \frac{1}{2} \Vert \vec w \Vert^2$ (equivalently, the penalty is added to the negative log-likelihood loss).
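To make $(1.3)$ concrete, here is a minimal NumPy sketch of the gradient as "observed features minus expected features", with an optional L2 term. The feature function `phi` is the same hypothetical block-one-hot construction as in the previous sketch; everything here is illustrative, not from any library.

```python
import numpy as np

def phi(x, y, num_labels):
    """Hypothetical feature function: block-one-hot copy of x for label y."""
    vec = np.zeros(num_labels * len(x))
    vec[y * len(x):(y + 1) * len(x)] = x
    return vec

def grad_log_likelihood(samples, w, num_labels, l2=0.0):
    """Gradient of eq. (1.3): sum_i [ phi(x_i, y_i) - E_{y ~ P(.|x_i;w)} phi(x_i, y) ],
    minus l2 * w (the gradient of the penalty 0.5 * l2 * ||w||^2)."""
    grad = np.zeros_like(w)
    for x, y in samples:
        feats = np.stack([phi(x, y_j, num_labels) for y_j in range(num_labels)])
        scores = feats @ w
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                 # P(y_j | x; w) for every label
        grad += feats[y] - probs @ feats     # observed minus expected features
    return grad - l2 * w

samples = [(np.array([1.0, 0.5]), 0), (np.array([-0.3, 2.0]), 1)]
w = np.zeros(2 * 2)
print(grad_log_likelihood(samples, w, num_labels=2, l2=0.1))
```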
Definitions:
$$
\begin{align}
\vec x &\in R^n \\
y &\in \lbrace 0, 1 \rbrace \\
\vec \phi_k(\vec x, y) &=
\begin{cases}
x_k, & y=1 \\
0, & y=0
\end{cases}
\end{align}
\tag{e.0}
$$
By definition $(1.1)$, the conditional probability is
$$
P(y|\vec x; \vec w) = \frac{\exp(\vec w \cdot \vec \phi(\vec x, y))}{\exp(\vec w \cdot \vec \phi(\vec x, 1)) + 1}
\tag{e.1}
$$
By $(1.2)$, the log-likelihood is
$$
\begin{align}
\log L(\vec w) &= \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) \\
&= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec \phi(x_i, 1)) + \exp(\vec w \cdot \vec \phi(x_i, 0))} \\
&= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec x_i) + 1} \\
&= \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)]
\end{align}
\tag{e.2}
$$
By $(1.3)$, the gradient is
$$
\begin{align}
\frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)] \\
&= \sum\limits_{i=1}^n \Bigl [\vec \phi(x_i, y_i) - \frac{\exp(\vec w \cdot \vec x_i) \vec x_i}{\exp(\vec w \cdot \vec x_i) + 1} \Bigr ] \\
&= \sum\limits_{i=1}^n [\vec \phi(x_i, y_i) - P(1|\vec x_i; \vec w) \vec x_i]
\end{align}
\tag{e.3}
$$
In fact, when $y \in \lbrace 0, 1 \rbrace$, the feature function in $(e.0)$ simplifies to $\vec \phi(\vec x, y) = y \vec x$, and $(e.2)$, $(e.3)$ can be simplified further; but that form does not generalize to other values of $y$, so it was not substituted into the derivation.
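To check $(e.3)$ numerically, here is a small sketch of binary logistic regression using the simplification $\vec \phi(\vec x, y) = y \vec x$ mentioned above; the code and names are illustrative, not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_grad(X, y, w):
    """Gradient of eq. (e.3) with phi(x, y) = y * x:
    sum_i [ y_i * x_i - P(1 | x_i; w) * x_i ] = X^T (y - p1)."""
    p1 = sigmoid(X @ w)            # P(y=1 | x_i; w) for every sample
    return X.T @ (y - p1)

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.2]])
y = np.array([1, 0, 1])
w = np.zeros(2)
for _ in range(100):               # plain gradient ascent on the log-likelihood
    w += 0.1 * logreg_grad(X, y, w)
print(w, sigmoid(X @ w))
```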
Extending log linear models to sequences gives the MEMM (Maximum Entropy Markov Model).
| Symbol | Meaning |
| --- | --- |
| $x \in \mathcal X$ | input, for example the $j$'th word in a sentence |
| $\vec x = (x_1,x_2, \cdots, x_T)$ | input sequence |
| $s^i \in \mathcal S, 1 \leq i \leq \lvert \mathcal S \rvert$ | state or output, like the tag of a word |
| $\vec s = (s_1,s_2, \cdots, s_T)$ | state sequence |
| $\vec \phi(x, y): \mathcal X \times \mathcal Y \rightarrow R^d$ | feature function, maps $(x, y)$ to a $d$-dimensional feature vector in $R^d$ |
| ${\vec w} \in R^d$ | weights for $\vec \phi$ |
In an MEMM, the state chain is a Markov chain satisfying the local Markov property, while the transition and emission matrices are described by feature functions. The whole model describes the conditional distribution:
$$
\begin{align}
P((s_1,s_2, \cdots, s_T)|\vec x)
&= \prod\limits_{i=1}^{T} P(s_i | (s_1,s_2, \cdots, s_{i-1}), \vec x) \\
&= \prod\limits_{i=1}^{T} P(s_i | s_{i-1}, \vec x)
\end{align}
\tag{2.1}
$$
Here,
the first step uses the chain rule of conditional probabilities, which factors a joint probability into conditionals:
$$
\begin{align}
P(A_n, \cdots, A_1) &= P(A_n | A_{n-1}, \cdots, A_1) P(A_{n-1}, \cdots, A_1) \\
P(A_4, A_3, A_2, A_1) &= P(A_4|A_3, A_2, A_1) P(A_3|A_2, A_1) P(A_2|A_1) P(A_1)
\end{align}
$$
and the second step uses the independence (Markov) assumption:
$$
P(s_i|s_{i-1}, s_{i-2}, \cdots, s_1) = P(s_i|s_{i-1})
$$
To compute $(2.1)$, we use a Log Linear Model for the conditional probability of the current output given the input and the previous output:
$$
P(s_i | s_{i-1}, \vec x; \vec w) = \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )}
\tag{2.2}
$$
In $\vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr )$:

- $\vec x = (x_1, \cdots, x_T)$ is the entire input sequence
- $i$ is the position of the output
- $s_{i-1}$ is the previous output
- $s_i$ is the current output
Given $(x_1,x_2, \cdots, x_T)$ and $(s_1,s_2, \cdots, s_T)$, learning can again use maximum likelihood, as in $(1.2)$ and $(1.3)$.
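As a sketch of how $(2.2)$ can be computed: given a hypothetical feature function over (whole input, position, previous state, current state), the local probability is a softmax over candidate states. Everything below is illustrative, not from any library.

```python
import numpy as np

def phi(x_seq, i, s_prev, s, num_states, vocab_size):
    """Hypothetical MEMM feature vector: concatenation of a transition indicator
    (s_prev, s) and an emission indicator (current word, s)."""
    trans = np.zeros(num_states * num_states)
    trans[s_prev * num_states + s] = 1.0
    emit = np.zeros(vocab_size * num_states)
    emit[x_seq[i] * num_states + s] = 1.0
    return np.concatenate([trans, emit])

def local_prob(x_seq, i, s_prev, s, w, num_states, vocab_size):
    """P(s_i = s | s_{i-1} = s_prev, x; w) from eq. (2.2): softmax over candidate states."""
    scores = np.array([w @ phi(x_seq, i, s_prev, s_cand, num_states, vocab_size)
                       for s_cand in range(num_states)])
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[s]

x_seq = [0, 3, 1]                            # word ids of a 3-word sentence
w = np.random.randn(2 * 2 + 4 * 2) * 0.1     # 2 states, vocabulary of 4 words
print(local_prob(x_seq, i=1, s_prev=0, s=1, w=w, num_states=2, vocab_size=4))
```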
Prediction differs from the plain Log Linear Model case: there, given an input we only need the distribution over outputs; an MEMM, given an input, describes $|\mathcal S|^T$ possible output sequences, so finding the sequence with the highest conditional probability becomes a task in itself:
$$
\arg \max\limits_{(s_1\cdots s_T)} P\bigl ((s_1,s_2, \cdots, s_T)|\vec x \bigr )
$$
This is exactly the Viterbi algorithm from HMMs; dynamic programming solves it:
$$
\begin{align}
\pi[i, s] &= \max\limits_{s_1 \cdots s_{i-1}} P \bigl ( (s_1, s_2, \cdots, s_{i}=s) | \vec x \bigr ) \\
&= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x ) \, P \bigl ( (s_1, s_2, \cdots, s_{i-1}) |\vec x \bigr ) \\
&= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x ) \prod\limits_{j=1}^{i-1} P ( s_j|s_{j-1}, \vec x )
\end{align}
\tag{2.3}
$$
The recurrence is
$$
\pi[i, s] = \max\limits_{s' \in \mathcal S} \pi[i-1, s'] \cdot P(s_i=s | s_{i-1}=s', \vec x)
\tag{2.4}
$$
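Equation $(2.4)$ maps directly onto a small dynamic program. Below is a minimal NumPy sketch, assuming the local log-probabilities $\log P(s_i \mid s_{i-1}, \vec x)$ have already been collected into a hypothetical table `log_p[i, s_prev, s]`, with state 0 acting as the start state at position 0.

```python
import numpy as np

def memm_viterbi(log_p):
    """log_p: array of shape (T, S, S) with log_p[i, s_prev, s] = log P(s_i=s | s_{i-1}=s_prev, x).
    Returns the highest-scoring state sequence, following eq. (2.4)."""
    T, S, _ = log_p.shape
    pi = log_p[0, 0, :].copy()              # state 0 plays the role of the start state
    backptr = np.zeros((T, S), dtype=int)
    for i in range(1, T):
        scores = pi[:, None] + log_p[i]     # scores[s_prev, s]
        backptr[i] = scores.argmax(axis=0)
        pi = scores.max(axis=0)
    best = [int(pi.argmax())]
    for i in range(T - 1, 0, -1):           # follow the back pointers
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
log_p = np.log(rng.dirichlet(np.ones(3), size=(4, 3)))   # T=4 positions, S=3 states
print(memm_viterbi(log_p))
```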
For the transition & emission probabilities, HMM and MEMM use the following models respectively:
$$
\begin{align}
P(s_i|s_{i-1})P(x_i|s_i) &= A_{s_{i-1}s_i}B_{s_i}(x_i)\\
P(s_i | s_{i-1}, \vec x) &= \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )}
\end{align}
$$
The key advantage of the MEMM is that the feature function $\vec\phi$ can capture far more information.

In an HMM, $A \in R^{|\mathcal S| \times |\mathcal S|}$ and $B \in R^{|\mathcal S| \times |\mathcal O|}$ are two-dimensional matrices describing transitions and emissions.

In an MEMM, the feature function $\vec\phi$ sees the whole input sequence, the position, the previous state, and the current state, so its capacity for capturing features is far greater than the HMM's.
For example, the transition probability can be sensitive to any word in the input sequence x 1 ⋯ x T x_1 \cdots x_T x 1 ⋯ x T . In addition, it is very easy to introduce features that are sensitive to spelling features (e.g., prefixes or suffixes) of the current word x i x_i x i , or of the surrounding words. These features are useful in many NLP applications, and are difficult to incorporate within HMMs in a clean way.
Moreover, an HMM models the joint probability while an MEMM models the conditional probability; the latter often achieves higher precision.
A conditional random field still models the conditional probability of the whole sequence, $P(\vec s \mid \vec x)$:
$$
P(\vec s | \vec x; \vec w) = \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)}
\tag{3.1}
$$
Here, however, the feature function is defined directly on the entire input sequence paired with the entire output sequence.
feature vector maps an entire input sequence x ⃗ \vec x x paired with an entire state sequence s ⃗ \vec s s to some d d d -dimensional feature vector.
$$
\vec \Phi(\vec x, \vec s): \mathcal X^T \times \mathcal S^T \rightarrow R^d
$$
Thanks to the local Markov property, though, the feature function decomposes in a simple way:
$$
\vec \Phi(\vec x, \vec s) = \sum\limits_{i=1}^T \vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr )
\tag{3.2}
$$
The $\phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr )$ here is the same as in the MEMM.
Because a CRF maps a whole sequence to a whole sequence, computing probabilities takes a bit more work.

The unnormalized probability can be written as
$$
\begin{align}
\hat P(\vec s | \vec x; \vec w) &= \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr) \\
&= \exp \Bigl ( \vec w \cdot \sum\limits_{i=1}^T [ \vec t(s_{i-1}, s_i, \vec x, i) + \vec e(s_i, \vec x, i) ] \Bigr ) \\
&= \exp \Bigl ( \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i) \Bigr )
\end{align}
\tag{3.3}
$$
Here $\vec t$ and $\vec e$ are the transition and emission features: transition features live on edges and depend on the current and previous positions, while emission features live on nodes and depend only on the current position. $\lambda$ and $\mu$ are their weights; the CRF is completely determined by $\lambda, t, \mu, e$.

Assuming the transition matrix does not depend on the position or the input, transition and emission can be written as:
$$
\begin{align}
t[i][j] &= \lambda t(s^i, s^j) & s^i, s^j \in \mathcal S \\
e[i][j] &= \mu e(x_i, s^j) & x_i \in \vec x, s^j \in \mathcal S
\end{align}
$$
To handle the boundary conditions, define a start state $s_0$ and an end state $s_{-1}$, i.e.
$$
\vec s = (s_0, s_1, \cdots, s_T, s_{-1})
$$
The denominator of $(3.1)$ is a global normalization factor; computing it sums the unnormalized scores of $|\mathcal S|^T$ paths. This is the same situation as computing $P(O \mid \lambda)$ in an HMM, where we also sum over all state sequences, and it can likewise be handled with the forward/backward algorithm.

For example, suppose $\mathcal S = \lbrace a, b \rbrace$:
$$
\begin{align}
\alpha_1(a) = \hat P \bigl ( (s_0),s_1=a |(x_1) \bigr ) &=
\exp \bigl ( t(0, a) + e(a, x_1) \bigr ) \\
\alpha_1(b) = \hat P \bigl ( (s_0),s_1=b |(x_1) \bigr ) &=
\exp \bigl ( t(0, b) + e(b, x_1) \bigr )
\end{align}
\tag{3.4}
$$
and
$$
\begin{align}
\alpha_2(a) &= \hat P \bigl ( (s_0, s_1), s_2=a |(x_1, x_2) \bigr ) \\
&= \exp \bigl ( t(0, a) + e(a, x_1) + t(a, a) + e(a, x_2) \bigr ) \\
&\quad + \exp \bigl ( t(0, b) + e(b, x_1) + t(b, a) + e(a, x_2) \bigr ) \\
&= \alpha_1(a) \cdot \exp \bigl ( t(a, a) + e(a, x_2) \bigr ) \\
&\quad + \alpha_1(b) \cdot \exp \bigl ( t(b, a) + e(a, x_2) \bigr )
\end{align}
\tag{3.5}
$$
From $(3.4)$ and $(3.5)$ we can induce the recurrence
$$
\alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \alpha_{t}(s^j) \cdot \exp\bigl (t(s^j, s^i) + e(s^i, x_{t+1}) \bigr )
\tag{3.6}
$$
or
$$
\alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \log \alpha_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr )
\tag{3.7}
$$
or
$$
\beta_{t+1}(s^i) = \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \beta_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr )
\tag{3.8}
$$
where $\beta_t(s^i) = \log \alpha_t(s^i) = \log \hat P \bigl ( s_t=s^i \mid (x_1 \cdots x_t) \bigr )$.
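To make the recursion concrete before the PyTorch version further below, here is a plain NumPy sketch of $(3.6)$/$(3.8)$, assuming precomputed score tables `trans[j, i] = t(s^j, s^i)` and `emit[i, t] = e(s^i, x_t)`, with state 0 playing the role of $s_0$; the final transition into the end state is omitted for brevity. All names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(trans, emit):
    """Log of the normalization factor Z in eq. (3.1), via the recursion (3.6)/(3.8)."""
    S, T = emit.shape
    beta = trans[0, :] + emit[:, 0]          # log alpha_1(s^i), cf. (3.4)
    for t in range(1, T):
        # beta_new[i] = logsumexp_j( beta[j] + t(s^j, s^i) + e(s^i, x_t) ), cf. (3.8)
        beta = logsumexp(beta[:, None] + trans, axis=0) + emit[:, t]
    return logsumexp(beta)

rng = np.random.default_rng(0)
S, T = 3, 5
trans, emit = rng.normal(size=(S, S)), rng.normal(size=(S, T))
print(crf_log_partition(trans, emit))
```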
Given $\vec x$, we want $\arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w)$:
$$
\begin{align}
\arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w) &= \arg \max\limits_{\vec s \in \mathcal S^T} \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)} \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \vec \Phi(\vec x, \vec s) \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \sum\limits_{i=1}^T \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \\
&= \arg \max\limits_{\vec s \in \mathcal S^T} \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i)
\end{align}
$$
This is again a maximum-weight path problem over the $t + e$ scores, solved with the Viterbi algorithm:

Initialization:
$$
\delta_1(s^i) = t(s_0, s_1=s^i) + e(s^i, x_1) \\
\phi_1(s_1) = 0
\tag{3.9.1}
$$
Recursion. For $t \in [2, T]$:
$$
\delta_t(s^i) = \max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)] + e(s^i, x_t) \\
\phi_t(s^i) = \arg\max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)]
\tag{3.9.2}
$$
Termination:
$$
P = \max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i) \\
I_T = \arg\max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i)
\tag{3.9.3}
$$
Backtrack the optimal state sequence. For $t = T-1, T-2, \cdots, 1$:
$$
I_t = \phi_{t+1}(I_{t+1})
\tag{3.9.4}
$$
where
$$
\delta_t(s^i)=\max\limits_{(s_1 \cdots s_{t-1})} P \left ((s_1 \cdots s_{t-1}), s_t=s^i | \vec x; \vec w \right )
\tag{3.9.5}
$$
Speed this up with matrix operations (a.k.a. the warp engine), making full use of broadcasting to simplify the code (and reduce its readability).

Note that $\vec s = (s_0, s_1, \cdots, s_T, s_{-1})$, so $|\mathcal S|$ includes the start and end states.

Data structures:
$$
\begin{align}
\vec \beta_t(s) &=
\begin{bmatrix}
\beta_t(s^1) \\
\vdots \\
\beta_t(s^{|\mathcal S|})
\end{bmatrix} \in R^{|\mathcal S|}\\
T &=
\begin{bmatrix}
t(s^1, s^1) & \cdots & t(s^1, s^{|\mathcal S|}) \\
\vdots & \ddots & \vdots \\
t(s^{|\mathcal S|}, s^1) & \cdots & t(s^{|\mathcal S|}, s^{|\mathcal S|})
\end{bmatrix} \in R^{|\mathcal S| \times |\mathcal S|} \\
E &=
\begin{bmatrix}
e(s^1, x_{1}) & \cdots & e(s^1, x_{T}) \\
\vdots & \ddots & \vdots \\
e(s^{|\mathcal S|}, x_{1}) & \cdots & e(s^{|\mathcal S|}, x_{T})
\end{bmatrix} \in R^{|\mathcal S| \times T}
\end{align}
\tag{a.1}
$$
Then, by $(3.8)$,
$$
\begin{align}
\beta_{t+1}(s^i) &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (
\begin{bmatrix}
\beta_t(s^1) \\
\vdots \\
\beta_t(s^{|\mathcal S|})
\end{bmatrix} +
\begin{bmatrix}
t(s^1, s^i) \\
\vdots \\
t(s^{|\mathcal S|}, s^i)
\end{bmatrix} +
\begin{bmatrix}
e(s^i, x_{t+1}) \\
\vdots \\
e(s^i, x_{t+1})
\end{bmatrix} \right ) \\
&= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T[:, i] + E[i, t+1]\right )
\end{align}
\tag{e.1}
$$
Using broadcasting, it is not hard to obtain
$$
\vec \beta_{t+1}^T(s) =
\log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T + E[:, t+1]^T \right )
\tag{e.2}
$$
In $(e.1)$, $E[i, t+1]$ is broadcast to every row of $\vec \beta_t(s) + T[:, i]$.

In $(e.2)$, $\vec \beta_t(s)$ is broadcast to every column of $T$, $E[:, t+1]^T$ is broadcast to every row of $T$, and finally a log_sum_exp is taken over each column.
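The PyTorch snippets below call a `log_sum_exp` helper that is not shown in the post; a minimal version consistent with how it is used there (reducing a 2-D score tensor over dim 1) might look like this:

```python
import torch

def log_sum_exp(scores):
    """Numerically stable log-sum-exp over dim 1 of a 2-D score tensor.
    For the (num_label, num_label) matrix in _forward_algo this reduces over
    the previous-state axis and returns one value per current state."""
    max_score, _ = torch.max(scores, dim=1, keepdim=True)
    return (max_score +
            torch.log(torch.sum(torch.exp(scores - max_score), dim=1, keepdim=True))).view(-1)
```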
```python
def neg_log_likelihood(self, x, s):
    """
    Compute the un-normalized score of the given tag sequence, then call
    _forward_algo to compute the normalization factor and return the
    negative log probability.
    :param x: input, words encoded in a sentence
    :param s: output, tag sequence to calculate probability
    :return: negative log probability of given tag sequence
    """
    score = autograd.Variable(torch.FloatTensor([0]))
    # add the start and end tags
    s = torch.cat([torch.LongTensor([0]),
                   s,
                   torch.LongTensor([self.num_label - 1])])
    # emission scores come directly from the LSTM
    emits = self._lstm_forward(x)
    for i in range(len(emits)):
        score = score + self.trans[s[i + 1], s[i]] + emits[i][s[i + 1]]
    score = score + self.trans[-1, s[-2]]
    # compute the normalization factor
    forward_score = self._forward_algo(emits)
    return forward_score - score


def _forward_algo(self, emits):
    """
    Forward algorithm: compute the normalization factor, i.e. the sum of the
    un-normalized scores of all $|\mathcal S|^T$ paths.
    :param emits: emit scores for each word
                  Tensor(n, m), n = words in the sentence; m = num_label
    :return: Tensor(1, 1), Z(s) used to normalize probability
    """
    beta_init = torch.FloatTensor(1, self.num_label).fill_(-10000.)
    beta_init[0][0] = 0.  # init start tag
    beta_t = autograd.Variable(beta_init)  # track grad
    # apply eq. (e.2)
    for emit_t in emits:
        sum_score = beta_t.view(1, -1) + self.trans + emit_t.view(-1, 1)
        beta_t = log_sum_exp(sum_score).view(1, -1)
    terminal_var = beta_t + self.trans[-1].view(1, -1)
    Z = log_sum_exp(terminal_var)
    return Z
```
Define
$$
\vec \delta_t =
\begin{bmatrix}
\delta_t(s^1) \\
\vdots \\
\delta_t(s^{|\mathcal S|})
\end{bmatrix}
\tag{e.3}
$$
Referring to $(a.1)$:

For $t \in [0, T-1]$:
$$
\vec \delta_{t+1} = \max(\vec \delta_t + T)^T + E[:, t+1]
\tag{e.4}
$$
For $t = T$, this step has no $E$ term:
$$
\vec \delta_{t+1} = \max(\vec \delta_t + T)^T
\tag{e.5}
$$
where $\max$ takes the maximum of each column of its matrix argument.
```python
def _viterbi_decode(self, emits):
    """
    Decode the tag sequence with the highest un-normalized log-probability (max_score)
    :param emits: emit scores for each word
    :return: max_score: un-normalized log-probability (max_score)
             tag_seq: tag sequence with the highest max_score
    """
    backpointers = []
    # Initialize the viterbi variables in log space
    init_vvars = torch.FloatTensor(1, self.num_label).fill_(-10000.)
    init_vvars[0][0] = 0.
    # forward_var at step i holds the viterbi variables for step i-1
    # actually no need to calculate grad for decoding
    forward_var = autograd.Variable(init_vvars)
    for emit_t in emits:
        # bptrs_t holds the backpointers for this step
        # viterbivars_t holds the viterbi variables for this step
        next_step_score = self.trans + forward_var
        viterbivars_t, bptrs_t = torch.max(next_step_score, dim=1)
        forward_var = viterbivars_t.view(1, -1) + emit_t.view(1, -1)
        backpointers.append(bptrs_t.data.numpy())
    # Transition to the end tag. We won't add terminal_var into backpointers
    terminal_var = forward_var + self.trans[-1].view(1, -1)
    max_score, best_tag = torch.max(terminal_var, dim=1)
    best_tag = best_tag.data[0]
    # Follow the back pointers to decode the best path
    tag_seq = [best_tag]
    for bptrs_t in reversed(backpointers):
        best_tag = bptrs_t[best_tag]
        tag_seq.append(best_tag)
    # Pop off the start tag (we don't want to return that to the caller)
    start = tag_seq.pop()
    assert start == 0  # Sanity check
    tag_seq.reverse()
    return max_score, tag_seq
```
- 《统计学习方法》(Statistical Learning Methods), 李航 (Li Hang)
- http://www.cs.columbia.edu/~mcollins/notes-spring2013.html
- http://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html