Log-Linear Models, MEMMs and CRFs
Quite possibly the most formula-dense post ever written here. Typing it all out nearly made me cough blood.
Preface
Let the input be $x$ and the output be $y$. An HMM describes the joint distribution $P(x, y | \lambda)$, so it is a generative model: it can be used both to infer $P(y|x, \lambda)$ and to compute $P(x|y, \lambda)$. A CRF, in contrast, describes the conditional distribution $P(y|x, \lambda)$ and is therefore a discriminative model.
Log-Linear Models
Notation
Symbol | Meaning |
---|---|
$x \in \mathcal X$ | input/feature |
$y \in \mathcal Y$ | output/label |
$\vec{\phi}(x, y): \mathcal X \times \mathcal Y \rightarrow R^d$ | feature function, mapping $(x, y)$ to a $d$-dimensional feature vector |
$\vec{w} \in R^d$ | weights for $\vec \phi$ |
Definition
Log Linear Models take the following form:
$$ \begin{align} P(y|x; \vec w) &= \frac{1}{Z} \exp(\vec w \cdot \vec \phi(x, y)) \\ Z &= \sum\limits_{y_i \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x, y_i)) \end{align} \tag{1.1} $$ where $Z$ is a normalization constant that makes the output a proper probability distribution. Since this is a conditional probability, the model is discriminative. $\vec w \cdot \vec \phi$ measures how "plausible" $y$ is given $x$, and the heart of the model is the feature function $\vec \phi$ (a small code sketch follows the list below).
- $\vec \phi$ is really a collection of feature functions; each one matches a different pattern and scores it, which gives the model its discriminative power. For example, in logistic regression: $$\vec \phi_k(\vec x, y) = \begin{cases} x_k, & \text{y=1} \\ 0, & \text{y=0} \end{cases} $$ This feature function gives the model its binary-classification ability (whatever $\vec x$ is, different values of $y$ must be told apart).
- $\vec w$ assigns a different weight to each feature in $\vec \phi$, letting the model learn how important each feature is.
- The exponential keeps the scores non-negative.
- Normalizing by $Z$ then turns the scores into a probability distribution.
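To make $(1.1)$ concrete, here is a minimal NumPy sketch. The function name `log_linear_prob`, the label set and the feature function `phi` are all made up for illustration; `phi` simply places a copy of $\vec x$ in the block that corresponds to the label.

```python
import numpy as np

def log_linear_prob(x, y, w, phi, labels):
    """P(y | x; w) as in (1.1): softmax of w . phi(x, y') over all candidate labels y'."""
    scores = np.array([w @ phi(x, y_prime) for y_prime in labels])
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # the 1/Z normalization
    return probs[labels.index(y)]

labels = [0, 1, 2]

def phi(x, y):
    """Toy feature function: one block of features per label."""
    v = np.zeros(3 * len(x))
    v[y * len(x):(y + 1) * len(x)] = x
    return v

w = np.random.randn(6)          # d = 3 * len(x) = 6
x = np.array([1.0, -0.5])
print(log_linear_prob(x, 1, w, phi, labels))
```

Any feature function with the signature `phi(x, y) -> np.ndarray` of fixed dimension $d$ would work here; the model only ever sees the dot product $\vec w \cdot \vec \phi(x, y)$.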
Likelihood
Given a sample of size $n$, $S = \lbrace (x_i, y_i) \rbrace^{n}_{i=1}$, the log-likelihood under the log-linear model is $$ \begin{align} \log P(S | \vec w) = \log L(\vec w) &= \log \prod\limits_{i=1}^n P(y_i|x_i; \vec w) \\ &= \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) \end{align} \tag{1.2} $$
Learning
The parameters $\vec w$ are estimated by maximum likelihood: $$ \vec w^{*} = \arg \max\limits_{\vec w \in R^d} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) $$ A common optimization method is gradient descent (on the negative log-likelihood), where the gradient is computed as follows:
$$ \begin{align} \frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log P(y_i|x_i; \vec w) \\ &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\sum\limits_{y_j \in \mathcal Y} \exp(\vec w \cdot \vec \phi(x_i, y_j))} \\ &= \sum\limits_{i=1}^n \vec \phi(x_i, y_i) - \sum\limits_{i=1}^n \sum\limits_{y_j \in \mathcal Y} P(y_j|x_i; \vec w) \vec \phi(x_i, y_j) \end{align} \tag{1.3} $$
Regularization is an important way to improve generalization. With an L2 penalty $\frac{\lambda}{2} \lVert \vec w \rVert^2$, the objective becomes $\log L(\vec w) - \frac{\lambda}{2} \lVert \vec w \rVert^2$ (equivalently, add the penalty to the negative log-likelihood loss).
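A hedged sketch of the gradient $(1.3)$, using the same kind of toy `phi(x, y)` as in the sketch above; the function name and the `l2` parameter are illustrative, not a reference implementation.

```python
import numpy as np

def grad_log_likelihood(data, w, phi, labels, l2=0.0):
    """Gradient of log L(w), cf. (1.3): empirical features minus expected features,
    i.e. the gradient of log L(w) - l2/2 * ||w||^2."""
    g = np.zeros_like(w)
    for x, y in data:
        feats = np.array([phi(x, y_prime) for y_prime in labels])  # |Y| x d matrix
        scores = feats @ w
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                                        # P(y' | x; w)
        g += phi(x, y) - probs @ feats                              # phi(x_i, y_i) - E[phi]
    return g - l2 * w
```

Gradient ascent then updates `w += lr * grad_log_likelihood(data, w, phi, labels, l2)`; descending on the negative log-likelihood is equivalent.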
Example: Logistic Regression
Definitions: $$ \begin{align} \vec x &\in R^d \\ y &\in \lbrace 0, 1 \rbrace \\ \vec \phi_k(\vec x, y) &= \begin{cases} x_k, & \text{y=1} \\ 0, & \text{y=0} \end{cases} \end{align} \tag{e.0} $$ From definition $(1.1)$, the conditional probability is $$ P(y|\vec x; \vec w) = \frac{\exp(\vec w \cdot \vec \phi(\vec x, y))}{\exp(\vec w \cdot \vec \phi(\vec x, 1)) + 1} \tag{e.1} $$ (the $y=0$ term in the denominator is $\exp(\vec w \cdot \vec 0) = 1$).
By $(1.2)$, the log-likelihood is $$ \begin{align} \log L(\vec w) &= \sum\limits_{i=1}^n \log p(y_i|x_i; \vec w) \\ &= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec \phi(x_i, 1)) + \exp(\vec w \cdot \vec \phi(x_i, 0))} \\ &= \sum\limits_{i=1}^n \log \frac{\exp(\vec w \cdot \vec \phi(x_i, y_i))}{\exp(\vec w \cdot \vec x_i) + 1} \\ &= \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)] \end{align} \tag{e.2} $$ By $(1.3)$, the gradient is $$ \begin{align} \frac{\partial \log L(\vec w)}{\partial \vec w} &= \frac{\partial}{\partial \vec w} \sum\limits_{i=1}^n [\vec w \cdot \vec \phi(x_i, y_i) - \log (\exp(\vec w \cdot \vec x_i) + 1)] \\ &= \sum\limits_{i=1}^n [\vec \phi(x_i, y_i) - \frac{\exp(\vec w \cdot \vec x_i) \vec x_i}{\exp(\vec w \cdot \vec x_i) + 1}] \\ &= \sum\limits_{i=1}^n [\vec \phi(x_i, y_i) - P(1|\vec x_i; \vec w) \vec x_i] \end{align} \tag{e.3} $$ In fact, when $y \in \lbrace 0, 1 \rbrace$ the feature function in $(e.0)$ simplifies to $\vec \phi(\vec x, y) = y \vec x$, and $(e.2)$, $(e.3)$ can be simplified further; but this trick does not generalize to other label sets, so it was not substituted into the derivation.
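For reference, a small sketch of logistic regression written directly from $(e.1)$ and $(e.3)$, using the simplification $\vec \phi(\vec x, y) = y \vec x$ mentioned above; `X` is assumed to be an $(n, d)$ design matrix and `y` an array of 0/1 labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_prob_and_grad(X, y, w):
    """P(y=1 | x; w) = sigmoid(w . x), cf. (e.1), and the gradient of log L, cf. (e.3)."""
    p1 = sigmoid(X @ w)          # P(1 | x_i; w) for every sample
    grad = X.T @ (y - p1)        # sum_i [ y_i * x_i - P(1 | x_i; w) * x_i ], since phi(x, y) = y * x
    return p1, grad
```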
MEMM
Generalizing log-linear models to sequences gives the MEMM (Maximum-Entropy Markov Model).
Symbol | Meaning |
---|---|
$x \in \mathcal X$ | input, for example the $j$’th word in a sentence |
$\vec x = (x_1,x_2, \cdots, x_T)$ | input sequence |
$s^i \in \mathcal S, 1 \leq i \leq \lvert \mathcal S \rvert$ | state or output, e.g. the tag of a word |
$\vec s = (s_1,s_2, \cdots, s_T)$ | state sequence |
$\vec \phi(\vec x, i, s_{i-1}, s_i) \rightarrow R^d$ | feature function, mapping the input sequence, position, previous state and current state to a $d$-dimensional feature vector |
${\vec w} \in R^d$ | weights for $\vec \phi$ |
Definition
In an MEMM, the state chain is a Markov chain (it satisfies the local Markov property), while the transition and emission scores are described by feature functions. The whole model describes the conditional distribution: $$ \begin{align} P((s_1,s_2, \cdots, s_T)|\vec x) &= \prod\limits_{i=1}^{T} P(s_i | (s_1,s_2, \cdots, s_{i-1}), \vec x) \\ &= \prod\limits_{i=1}^{T} P(s_i | s_{i-1}, \vec x) \end{align} \tag{2.1} $$ where:
- The first step uses the chain rule of conditional probabilities, which factorizes a joint probability into conditionals: $$ \begin{align} P(A_n, \cdots, A_1) &= P(A_n | A_{n-1}, \cdots, A_1) P(A_{n-1}, \cdots, A_1) \\ P(A_4, A_3, A_2, A_1) &= P(A_4|A_3, A_2, A_1) P(A_3|A_2, A_1) P(A_2|A_1) \end{align} $$
- The second step uses the (first-order) independence assumption: $$P(s_i|s_{i-1}, s_{i-2}, \cdots, s_1, \vec x) = P(s_i|s_{i-1}, \vec x)$$
To compute $(2.1)$, we use a log-linear model for the conditional probability of the current state given the previous state and the input: $$ P(s_i | s_{i-1}, \vec x; \vec w) = \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )} \tag{2.2} $$
For $\vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr )$ (a small sketch follows this list):
- $\vec x = (x_1, \cdots, x_T)$ is the entire input sequence
- $i$ is the position being tagged
- $s_{i-1}$ is the previous state
- $s_i$ is the current state
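A minimal sketch of $(2.2)$: the local model is just a log-linear model whose candidate outputs are the states. The signature `phi(x_seq, i, s_prev, s)` and the function name are assumptions made for illustration.

```python
import numpy as np

def memm_local_prob(x_seq, i, s_prev, states, w, phi):
    """P(s_i | s_{i-1}, x; w), cf. (2.2): a log-linear model over the candidate states."""
    scores = np.array([w @ phi(x_seq, i, s_prev, s) for s in states])
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(states, probs))
```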
Learning
Given training sequences $(x_1,x_2, \cdots, x_T)$ and $(s_1,s_2, \cdots, s_T)$, the parameters can be learned by maximum likelihood; see $(1.2)$ and $(1.3)$.
Prediction (Decoding)
Unlike prediction with a plain log-linear model, which only needs the distribution over outputs for a given input, an MEMM describes $|\mathcal S|^T$ possible output sequences for a given input, so finding the output sequence with the highest conditional probability becomes a task of its own: $$ \arg \max\limits_{(s_1\cdots s_T)} P\bigl ((s_1,s_2, \cdots, s_T)|\vec x \bigr ) $$ This is exactly analogous to the Viterbi algorithm for HMMs, and can be solved with dynamic programming:
$$ \begin{align} \pi[i, s] &= \max\limits_{s_1 \cdots s_{i-1}} P \bigl ( (s_1, s_2, \cdots, s_{i}=s) | \vec x \bigr ) \\ &= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x )\, P \bigl ( (s_1, s_2, \cdots, s_{i-1}) |\vec x \bigr ) \\ &= \max\limits_{s_1 \cdots s_{i-1}} P ( s_i=s|s_{i-1}, \vec x ) \prod\limits_{j=1}^{i-1} P ( s_j|s_{j-1}, \vec x ) \end{align} \tag{2.3} $$
The recurrence is $$ \pi[i, s] = \max\limits_{s' \in \mathcal S} \pi[i-1, s'] \cdot P(s_i=s | s_{i-1}=s', \vec x) \tag{2.4} $$
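A loop-based sketch of the recursion $(2.4)$, working in log space; it reuses `memm_local_prob` from the sketch above, uses 0-based positions, and assumes a made-up `start` symbol for $s_0$.

```python
import numpy as np

def memm_viterbi(x_seq, states, w, phi, start="<s>"):
    """argmax over state sequences of prod_i P(s_i | s_{i-1}, x; w), via (2.4)."""
    n = len(x_seq)
    pi = [{s: -np.inf for s in states} for _ in range(n)]   # pi[i][s] = best log score ending in s
    bp = [{s: None for s in states} for _ in range(n)]      # backpointers
    first = memm_local_prob(x_seq, 0, start, states, w, phi)
    for s in states:
        pi[0][s] = np.log(first[s])
    for i in range(1, n):
        for s_prev in states:
            trans = memm_local_prob(x_seq, i, s_prev, states, w, phi)
            for s in states:
                score = pi[i - 1][s_prev] + np.log(trans[s])
                if score > pi[i][s]:
                    pi[i][s], bp[i][s] = score, s_prev
    best = max(states, key=lambda s: pi[n - 1][s])
    path = [best]
    for i in range(n - 1, 0, -1):                           # follow the backpointers
        path.append(bp[i][path[-1]])
    return path[::-1]
```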
MEMM vs. HMM
For the transition and emission probabilities, the HMM and the MEMM use the following models respectively: $$ \begin{align} P(s_i|s_{i-1})P(x_i|s_i) &= A_{s_{i-1}s_i}B_{s_i}(x_i)\\ P(s_i | s_{i-1}, \vec x) &= \frac{\exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s_i \bigr ) \Bigr )}{\sum\limits_{s' \in \mathcal S} \exp \Bigl ( \vec w \cdot \vec \phi \bigl ( \vec x, i, s_{i-1}, s' \bigr ) \Bigr )} \end{align} $$ The key advantage of the MEMM is that the feature function $\vec\phi$ can capture much more information.
- In an HMM, $A \in R^{|\mathcal S| \times |\mathcal S|}$ and $B \in R^{|\mathcal S| \times |\mathcal O|}$ are both two-dimensional matrices describing the transitions and emissions.
- In an MEMM, the feature function is defined on $\mathcal X^T \times \lbrace 1, \cdots, T \rbrace \times \mathcal S \times \mathcal S$, i.e. it can look at the entire input sequence, the position and the state pair, so its capacity to capture features is far greater than the HMM's.
For example, the transition probability can be sensitive to any word in the input sequence $x_1 \cdots x_T$. In addition, it is very easy to introduce features that are sensitive to spelling features (e.g., prefixes or suffixes) of the current word $x_i$, or of the surrounding words. These features are useful in many NLP applications, and are difficult to incorporate within HMMs in a clean way.
Moreover, an HMM models a joint probability distribution, whereas an MEMM models a conditional probability distribution; conditional (discriminative) models often achieve higher accuracy.
CRF
Definition
A conditional random field still models the conditional probability of the state sequence, $P(\vec s | \vec x)$: $$ P(\vec s | \vec x; \vec w) = \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)} \tag{3.1} $$ Here, however, the feature vector is defined directly on the entire input sequence paired with the entire state sequence.
The feature vector maps an entire input sequence $\vec x$ paired with an entire state sequence $\vec s$ to a $d$-dimensional feature vector: $$ \vec \Phi: \mathcal X^T \times \mathcal S^T \rightarrow R^d $$
Thanks to the local Markov property, however, the global feature vector decomposes in a simple way: $$ \begin{align} \vec \Phi(\vec x, \vec s) &= \sum\limits_{i=1}^T \vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr ) \end{align} \tag{3.2} $$ where $\vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr )$ is the same local feature function as in the MEMM.
Probability Computation
Because a CRF maps an entire input sequence to an entire output sequence, computing probabilities takes a bit more work.
Computing the Unnormalized Probability
The unnormalized probability can be written as $$ \begin{align} \hat P(\vec s | \vec x; \vec w) &= \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr) \\ &= \exp \Bigl ( \vec w \cdot \sum\limits_{i=1}^T [ \vec t(s_{i-1}, s_i, \vec x, i) + \vec e(s_i, \vec x, i) ] \Bigr ) \\ &= \exp \Bigl ( \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i) \Bigr ) \end{align} \tag{3.3} $$ where $\vec t$ and $\vec e$ are the transition and emission features respectively: transition features are defined on edges and depend on the current and previous positions, while emission features are defined on nodes and depend only on the current position. $\lambda$ and $\mu$ are their weights; a CRF is completely specified by $\lambda, t, \mu, e$.
Assuming the transition scores do not depend on the position or on the input, the transition and emission scores can be written as: $$ \begin{align} t[i][j] &= \lambda t(s^i, s^j) & s^i, s^j \in \mathcal S \\ e[i][j] &= \mu e(x_i, s^j) & x_i \in \vec x, s^j \in \mathcal S \end{align} $$
To handle the boundary conditions, define a start state $s_0$ and a stop state $s_{-1}$, i.e. $$ \vec s = (s_0, s_1, \cdots, s_T, s_{-1}) $$
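A small sketch of the unnormalized (log) score in $(3.3)$, under the simplification above (position-independent transition scores). Instead of folding $s_0$ and $s_{-1}$ into the state set, this sketch keeps them as separate score vectors `t_start` and `t_stop`; all names and shapes are illustrative assumptions.

```python
import numpy as np

def crf_unnorm_logscore(tags, T_mat, E_mat, t_start, t_stop):
    """log of the unnormalized probability (3.3) for one tag sequence.
    tags:    list of 0-based state indices (s_1 .. s_T)
    T_mat:   (|S|, |S|) transition scores, T_mat[a, b] = t(s^a, s^b)
    E_mat:   (|S|, T)   emission scores,   E_mat[a, i] = e(s^a, x_{i+1})
    t_start: (|S|,)     scores t(s_0, s^a) for leaving the start state
    t_stop:  (|S|,)     scores t(s^a, s_{-1}) for entering the stop state"""
    score = t_start[tags[0]] + E_mat[tags[0], 0]
    for i in range(1, len(tags)):
        score += T_mat[tags[i - 1], tags[i]] + E_mat[tags[i], i]
    return score + t_stop[tags[-1]]
```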
Computing the Normalization Factor: the Forward Algorithm
The denominator of $(3.1)$ is a global normalization factor; it sums the unnormalized probabilities of $|\mathcal S|^T$ paths. This is the same situation as computing $P(O|\lambda)$ in an HMM, where we must sum over all state sequences, and it can likewise be handled with the forward/backward algorithm.
For example, suppose $\mathcal S = \lbrace a, b \rbrace$. Then $$ \begin{align} \alpha_1(a) = \hat P \bigl ( (s_0),s_1=a |(x_1) \bigr ) &= \exp \bigl ( t(0, a) + e(a, x_1) \bigr ) \\ \alpha_1(b) = \hat P \bigl ( (s_0),s_1=b |(x_1) \bigr ) &= \exp \bigl ( t(0, b) + e(b, x_1) \bigr ) \\ \end{align} \tag{3.4} $$ and $$ \begin{align} \alpha_2(a) &= \hat P \bigl ( (s_0, s_1), s_2=a |(x_1, x_2) \bigr ) \\ &= \exp \bigl ( t(0, a) + e(a, x_1) + t(a, a) + e(a, x_2) \bigr ) \\ &+ \exp \bigl ( t(0, b) + e(b, x_1) + t(b, a) + e(a, x_2) \bigr ) \\ &= \alpha_1(a) \cdot \exp \bigl ( t(a, a) + e(a, x_2) \bigr ) \\ &+ \alpha_1(b) \cdot \exp \bigl ( t(b, a) + e(a, x_2) \bigr ) \end{align} \tag{3.5} $$ From $(3.4)$ and $(3.5)$ we can read off the recurrence $$ \alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \alpha_{t}(s^j) \cdot \exp\bigl (t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.6} $$ or $$ \alpha_{t+1}(s^i) = \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \log \alpha_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.7} $$ or $$ \beta_{t+1}(s^i) = \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \bigl ( \beta_{t}(s^j) + t(s^j, s^i) + e(s^i, x_{t+1}) \bigr ) \tag{3.8} $$ where $\beta_t(s^i) = \log \alpha_t(s^i) = \log \hat P \bigl ( s_t=s^i |(x_1 \cdots x_t) \bigr )$.
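The recursion $(3.8)$ as a straightforward loop; `log_sum_exp` keeps the computation numerically stable, and the start/stop boundary is handled with the same illustrative `t_start` / `t_stop` vectors as in the previous sketch.

```python
import numpy as np

def log_sum_exp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def crf_log_Z(T_mat, E_mat, t_start, t_stop):
    """log of the normalization factor via the forward recursion (3.8).
    beta[i] = log of the summed unnormalized scores of all prefixes ending in state s^i."""
    n_states, seq_len = E_mat.shape
    beta = t_start + E_mat[:, 0]                              # position 1, cf. (3.4)
    for t in range(1, seq_len):
        beta = np.array([log_sum_exp(beta + T_mat[:, i] + E_mat[i, t])
                         for i in range(n_states)])           # cf. (3.8)
    return log_sum_exp(beta + t_stop)                          # fold in the stop transition
```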
Decoding: the Viterbi Algorithm
Given $\vec x$, find $\arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w)$:
$$ \begin{align} \arg \max\limits_{\vec s \in \mathcal S^T} P(\vec s | \vec x; \vec w) &= \arg \max\limits_{\vec s \in \mathcal S^T} \frac{\exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s) \Bigr)}{\sum\limits_{\vec s' \in \mathcal S^T} \exp \Bigl (\vec w \cdot \vec \Phi(\vec x, \vec s') \Bigr)} \\ &= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \vec \Phi(\vec x, \vec s) \\ &= \arg \max\limits_{\vec s \in \mathcal S^T} \vec w \cdot \sum\limits_{i=1}^T \vec \phi \bigl ( s_{i-1}, s_i, \vec x, i \bigr ) \\ &= \arg \max\limits_{\vec s \in \mathcal S^T} \sum\limits_{k=1}^{|\vec t|} \sum\limits_{i=1}^T \lambda_k t_k(s_{i-1}, s_i, \vec x, i) + \sum\limits_{k=1}^{|\vec e|} \sum\limits_{i=1}^T \mu_k e_k(s_i, \vec x, i) \end{align} $$
This is again a maximum-weight path problem over the scores $t + e$; it is solved with the Viterbi algorithm (a loop-based sketch follows the steps below):
- Initialization $$ \delta_1(s^i) = t(s_0, s_1=s^i) + e(s^i, x_1) \\ \psi_1(s^i) = 0 \tag{3.9.1} $$
- Recursion. For $t = 2, \cdots, T$: $$ \delta_t(s^i) = \max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)] + e(s^i, x_t) \\ \psi_t(s^i) = \arg\max\limits_{1 \leq j \leq |\mathcal S|} [\delta_{t-1}(s^j) + t(s^j, s^i)] \tag{3.9.2} $$
- Termination $$ P^{*} = \max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i) \\ I_T = \arg\max\limits_{1 \leq i \leq |\mathcal S|} \delta_T(s^i) \tag{3.9.3} $$
- Backtrack the optimal state sequence. For $t=T-1, T-2, \cdots, 1$: $$ I_t = \psi_{t+1}(I_{t+1}) \tag{3.9.4} $$
where $$ \delta_t(s^i)=\max\limits_{(s_1 \cdots s_{t-1})} P \left ((s_1 \cdots s_{t-1}), s_t=s^i | \vec x; \vec w \right ) \tag{3.9.5} $$
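A loop-based sketch of $(3.9.1)$ to $(3.9.4)$ under the same illustrative conventions (`T_mat`, `E_mat`, `t_start`, `t_stop`) as above; the final transition into the stop state is added before termination.

```python
import numpy as np

def crf_viterbi(T_mat, E_mat, t_start, t_stop):
    """Decode argmax_s P(s | x; w) following (3.9.1)-(3.9.4), written with explicit loops."""
    n_states, seq_len = E_mat.shape
    delta = [t_start[i] + E_mat[i, 0] for i in range(n_states)]       # (3.9.1)
    psi = [[0] * n_states for _ in range(seq_len)]                     # backpointers
    for t in range(1, seq_len):
        new_delta = []
        for i in range(n_states):
            cand = [delta[j] + T_mat[j, i] for j in range(n_states)]   # delta_{t-1}(s^j) + t(s^j, s^i)
            j_best = int(np.argmax(cand))
            psi[t][i] = j_best
            new_delta.append(cand[j_best] + E_mat[i, t])               # (3.9.2)
        delta = new_delta
    final = [delta[i] + t_stop[i] for i in range(n_states)]            # transition into the stop state
    best = [int(np.argmax(final))]                                     # (3.9.3)
    for t in range(seq_len - 1, 0, -1):                                # (3.9.4)
        best.append(psi[t][best[-1]])
    return best[::-1]
```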
Appendix: CRF Computation
Speed everything up with matrix operations (a.k.a. the warp engine), and make heavy use of broadcasting to simplify the code (and to reduce its readability).
Probability Computation
Note that $\vec s = (s_0, s_1, \cdots, s_T, s_{-1})$, and that $|\mathcal S|$ here includes the start and stop states.
Data structures: $$ \begin{align} \vec \beta_t(s) &= \begin{bmatrix} \beta_t(s^1) \\ \vdots \\ \beta_t(s^{|\mathcal S|}) \end{bmatrix} \in R^{|\mathcal S|}\\ T &= \begin{bmatrix} t(s^1, s^1) & \cdots & t(s^1, s^{|\mathcal S|}) \\ \vdots & \ddots & \vdots \\ t(s^{|\mathcal S|}, s^1) & \cdots & t(s^{|\mathcal S|}, s^{|\mathcal S|}) \end{bmatrix} \in R^{|\mathcal S| \times |\mathcal S|} \\ E &= \begin{bmatrix} e(s^1, x_{1}) & \cdots & e(s^1, x_{T}) \\ \vdots & \ddots & \vdots \\ e(s^{|\mathcal S|}, x_{1}) & \cdots & e(s^{|\mathcal S|}, x_{T}) \end{bmatrix} \in R^{|\mathcal S| \times T} \end{align} \tag{a.1} $$ Then, by $(3.8)$, $$ \begin{align} \beta_{t+1}(s^i) &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left ( \begin{bmatrix} \beta_t(s^1) \\ \vdots \\ \beta_t(s^{|\mathcal S|}) \end{bmatrix} + \begin{bmatrix} t(s^1, s^i) \\ \vdots \\ t(s^{|\mathcal S|}, s^i) \end{bmatrix} + \begin{bmatrix} e(s^i, x_{t+1}) \\ \vdots \\ e(s^i, x_{t+1}) \end{bmatrix} \right ) \\ &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T[:, i] + E[i, t+1]\right ) \\ \end{align} \tag{a.2} $$ Using broadcasting, it then follows that $$ \begin{align} \vec \beta_{t+1}^T(s) &= \log \sum\limits_{j = 1}^{|\mathcal S|} \exp \left (\vec \beta_t(s) + T + E[:, t+1]^T \right ) \end{align} \tag{a.3} $$
In $(a.2)$, $E[i, t+1]$ is broadcast to every row of $\vec \beta_t(s) + T[:, i]$.
In $(a.3)$, $\vec \beta_t(s)$ is broadcast across the columns of $T$ and $E[:, t+1]^T$ is broadcast across the rows of $T$; finally a column-wise log_sum_exp is taken.
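The same forward pass written with broadcasting, mirroring $(a.3)$: $\vec \beta_t$ is broadcast over the columns of the transition matrix, the emission row over its rows, then a column-wise log-sum-exp. The start/stop boundary is again kept in separate `t_start` / `t_stop` vectors rather than inside $|\mathcal S|$, so this is a sketch of the idea rather than a line-by-line transcription.

```python
import numpy as np

def crf_log_Z_vectorized(T_mat, E_mat, t_start, t_stop):
    """Vectorized forward pass, cf. (a.3)."""
    n_states, seq_len = E_mat.shape
    beta = t_start + E_mat[:, 0]                               # shape (|S|,)
    for t in range(1, seq_len):
        M = beta[:, None] + T_mat + E_mat[:, t][None, :]       # M[j, i] = beta_j + t(s^j, s^i) + e(s^i, x_{t+1})
        m = M.max(axis=0)
        beta = m + np.log(np.exp(M - m).sum(axis=0))           # column-wise log_sum_exp
    M = beta + t_stop
    return M.max() + np.log(np.exp(M - M.max()).sum())
```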
Decoding
Let $$ \vec \delta_t = \begin{bmatrix} \delta_t(s^1) \\ \vdots \\ \delta_t(s^{|\mathcal S|}) \end{bmatrix} \tag{a.4} $$ with $T$ and $E$ as in $(a.1)$.
- For $t \in [0, T-1]$ (with $\vec \delta_0$ initialized so that only the start state $s_0$ is reachable, e.g. $0$ for $s_0$ and $-\infty$ elsewhere): $$\vec \delta_{t+1} = \max(\vec \delta_t + T)^T + E[:, t+1] \tag{a.5}$$
- For $t=T$, there is no emission term $E$ in this step: $$\vec \delta_{t+1} = \max(\vec \delta_t + T)^T \tag{a.6}$$
where $\max$ takes the maximum of each column.
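And the vectorized decode, mirroring $(a.5)$ and $(a.6)$: the same shapes as the forward pass, with the column-wise log-sum-exp replaced by a column-wise max plus backpointers; the names and the boundary handling are the same illustrative assumptions as before.

```python
import numpy as np

def crf_viterbi_vectorized(T_mat, E_mat, t_start, t_stop):
    """Vectorized decode, cf. (a.5)/(a.6)."""
    n_states, seq_len = E_mat.shape
    delta = t_start + E_mat[:, 0]
    backptr = np.zeros((seq_len, n_states), dtype=int)
    for t in range(1, seq_len):
        M = delta[:, None] + T_mat                 # M[j, i] = delta_j + t(s^j, s^i)
        backptr[t] = M.argmax(axis=0)
        delta = M.max(axis=0) + E_mat[:, t]        # cf. (a.5)
    delta = delta + t_stop                         # cf. (a.6): final step, no emission term
    best = [int(delta.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```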