Boosting(2) - Adaboost and Forward Stagewise

Theoretical foundation of Boosting: equivalence with the forward stagewise algorithm

Forward stagewise

Consider an additive model like adaboost: $$ f(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m) $$ in which $b(x; \gamma_m)$ is the base model, $\gamma_m$ is its parameters, and $\beta_m$ is its coefficient.

To minimize the loss function, we use the forward stagewise algorithm, which fits one base model and its coefficient at a time while treating the models already trained as constants; that is the core idea of the algorithm:

Input: training data $T = \lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(N)}, y^{(N)}) \rbrace$, loss function $L(y, f(x))$ and base model $b(x; \gamma)$. Output: additive model $f(x)$.

  1. Initialize $f_0(x) = 0$
  2. For $m = 1,2,\cdots,M$: (1) Minimize the loss function: $$ (\beta_m, \gamma_m) = arg\min\limits_{\beta, \gamma} \sum\limits_{i=1}^N L(y^{(i)}, f_{m-1}(x^{(i)}) + \beta b(x^{(i)};\gamma)) $$ (2) Update $f(x)$: $$ f_m(x) = f_{m-1}(x) + \beta_m b(x;\gamma_m)$$
  3. Get additive model: $$ f(x) = \sum\limits_{m=1}^M \beta_m b(x; \gamma_m) $$
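Below is a minimal sketch of this loop, not the original algorithm statement above but one concrete instance of it: squared-error loss with depth-1 regression trees (stumps) as the base model. The function names and the choice of loss are my own assumptions for illustration; with the exponential loss and classification stumps the same loop becomes adaboost, as the next section shows.

```python
# Forward stagewise additive modeling: fit one (beta_m, gamma_m) at a time,
# keeping all previously fitted terms fixed. Squared-error loss, stump base model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=50):
    """Fit f(x) = sum_m beta_m * b(x; gamma_m) one term at a time."""
    f = np.zeros(len(y))          # f_0(x) = 0
    terms = []                    # list of (beta_m, fitted stump)
    for m in range(M):
        # Step (1): with squared error, minimizing over gamma amounts to
        # fitting the stump to the current residuals y - f_{m-1}(x) ...
        residual = y - f
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        b = stump.predict(X)
        # ... and the optimal coefficient beta_m then has a closed form.
        beta = (residual @ b) / (b @ b + 1e-12)
        # Step (2): f_m(x) = f_{m-1}(x) + beta_m * b(x; gamma_m)
        f = f + beta * b
        terms.append((beta, stump))
    return terms

def predict(terms, X):
    return sum(beta * stump.predict(X) for beta, stump in terms)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
    terms = forward_stagewise(X, y, M=100)
    print("train MSE:", np.mean((y - predict(terms, X)) ** 2))
```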

Equivalence of adaboost and Forward stagewise

When using an additive model with the exponential loss function, the forward stagewise algorithm is equivalent to adaboost. Here is the proof:

The exponential loss function is: $$ L(y, f(x)) = exp(-yf(x)) $$ In the $m$th iteration we have: $$ f_m(x) = f_{m-1}(x) + \alpha_m G_m(x), \\ \text{in which } f_{m-1}(x) = \sum\limits_{j=1}^{m-1} \alpha_j G_j(x) $$ The objective to minimize is $$\begin{align} (\alpha_m, G_m) & = arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N exp \left\lbrace -y^{(i)}(f_{m-1}(x^{(i)}) + \alpha G(x^{(i)}))\right\rbrace \\ & = arg\min\limits_{\alpha, G} \sum\limits_{i=1}^N w_{mi} \, exp \left\lbrace -y^{(i)} \alpha G(x^{(i)}) \right\rbrace \end{align} $$ in which $w_{mi} = exp(-y^{(i)} f_{m-1}(x^{(i)}))$, which depends on neither $\alpha$ nor $G$.
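The rewriting above is just $e^{a+b} = e^a e^b$ applied to each term; here is a quick numerical check (my own illustration, not part of the proof) that the two forms of the objective agree on random values:

```python
# Check: exp(-y (f_{m-1}(x) + alpha*G(x))) == w_mi * exp(-y * alpha * G(x)).
import numpy as np

rng = np.random.default_rng(0)
N = 5
y = rng.choice([-1, 1], size=N)      # labels in {-1, +1}
f_prev = rng.normal(size=N)          # f_{m-1}(x^{(i)})
G = rng.choice([-1, 1], size=N)      # candidate classifier outputs
alpha = 0.7

w = np.exp(-y * f_prev)              # w_mi, independent of alpha and G
lhs = np.sum(np.exp(-y * (f_prev + alpha * G)))
rhs = np.sum(w * np.exp(-y * alpha * G))
print(np.isclose(lhs, rhs))          # True
```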

We now show that the $\alpha_m^*, G_m^*$ obtained here are exactly the $\alpha_m, G_m$ used in adaboost:

  1. For any $\alpha > 0$, the $G_m^*(x)$ that minimizes the loss function is the one with the minimum weighted error rate: $$ G_m^*(x) = arg\min\limits_{G} \sum\limits_{i=1}^N w_{mi} I(y^{(i)} \neq G(x^{(i)})) $$ (for fixed $\alpha > 0$ the objective is increasing in the weighted error, as the decomposition in step 2 shows). That's exactly what we train the base model to do, so $G_m^*(x) = G_m(x)$
  2. Then solve for $\alpha_m$: $$\begin{align} L(\alpha, G_m) & = \sum\limits_{i=1}^N w_{mi} exp \left\lbrace -y^{(i)} \alpha G_m(x^{(i)}) \right\rbrace \\ & = e^{-\alpha} \sum\limits_{=} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\ & = e^{-\alpha} \sum\limits w_{mi} - e^{-\alpha} \sum\limits_{\neq} w_{mi} + e^{\alpha} \sum\limits_{\neq} w_{mi} \\ & = (e^{\alpha} - e^{-\alpha}) \sum\limits_{\neq} w_{mi} + e^{-\alpha} \sum\limits w_{mi} \end{align} $$ in which $\sum\limits_{=}$ and $\sum\limits_{\neq}$ are abbreviations for $\sum\limits_{y^{(i)} = G_m(x^{(i)})}$ and $\sum\limits_{y^{(i)} \neq G_m(x^{(i)})}$, and $\sum$ is an abbreviation for $\sum\limits_{i=1}^N$. Setting the partial derivative to 0, we have: $$\frac {\partial L(\alpha, G_m)} {\partial \alpha} = (e^{\alpha} + e^{-\alpha}) \sum\limits_{\neq} w_{mi} - e^{-\alpha} \sum\limits w_{mi} = 0$$ Dividing by $\sum\limits w_{mi}$ gives $$(e^{\alpha} + e^{-\alpha}) \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}} - e^{-\alpha} = 0$$ i.e. $$(e^{\alpha} + e^{-\alpha}) \epsilon_m - e^{-\alpha} = 0$$ so $e^{2\alpha} = \frac{1-\epsilon_m}{\epsilon_m}$ and $$ \alpha_m^* = \frac{1}{2}ln \frac{1-\epsilon_m}{\epsilon_m} $$ in which $\epsilon_m = \frac{\sum\limits_{\neq} w_{mi}}{\sum\limits w_{mi}}$ is the weighted error rate of $G_m^*(x)$

Finally, note that $w_{m+1,i} = exp(-y^{(i)} f_m(x^{(i)})) = w_{mi} \, exp(-y^{(i)} \alpha_m G_m(x^{(i)}))$, which is exactly adaboost's weight update up to normalization. So the $(\alpha_m, G_m)$ we get here are exactly the same as in adaboost.
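To make the correspondence concrete, here is a sketch of one adaboost round written directly from the formulas above. It is illustrative only; the weak-learner choice (a depth-1 `DecisionTreeClassifier`) and the function names are my own assumptions.

```python
# One adaboost round: pick G_m by weighted error, set alpha_m in closed form,
# then update the example weights w_mi. Labels y must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, w):
    """Returns (alpha_m, G_m, updated weights)."""
    # G_m^* minimizes the weighted error rate -> train the base model on w.
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = G.predict(X)
    # epsilon_m = sum_{misclassified} w_mi / sum_i w_mi
    eps = w[pred != y].sum() / w.sum()
    eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against log(0) / div by 0
    # alpha_m = 1/2 * ln((1 - eps_m) / eps_m)
    alpha = 0.5 * np.log((1 - eps) / eps)
    # w_{m+1,i} = w_mi * exp(-y_i * alpha_m * G_m(x_i)), then normalize
    w_next = w * np.exp(-y * alpha * pred)
    w_next /= w_next.sum()
    return alpha, G, w_next
```

Running $M$ such rounds and summing $\alpha_m G_m(x)$ reproduces the additive model $f(x)$ of the forward stagewise algorithm.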