Andrew Ng's ML Open Course Notes (2) - A Linear Regression Example
These are my notes for the open course [Machine Learning](https://www.coursera.org/learn/machine-learning/) by Andrew Ng.
Model and Cost Function - A Linear Regression Example
- Hypothesis: $$h_\theta(x) = \theta_0 + {\theta_1}x$$
- Cost function: $$J(\theta_0, \theta_1) = \frac {1}{2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)} )^2$$
- Basically this cost function can be derived from maximum likelihood estimation, under the assumption that the errors $y^{(i)} - h_\theta(x^{(i)})$ are i.i.d. $\sim N(0, \sigma^2)$
- If $(\theta_a, \theta_b)$ satisfies $J(\theta_a, \theta_b) = \min \lbrace J(\theta_0, \theta_1) \rbrace$, then $(\theta_a, \theta_b)$ gives the best-fit hypothesis (a NumPy sketch of $J$ follows this list)
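To make the cost function concrete, here is a minimal NumPy sketch; the function name, the array names, and the toy numbers below are made up for illustration:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    predictions = theta0 + theta1 * x               # h_theta(x^(i)) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data (made up): y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
print(compute_cost(1.0, 2.0, x, y))                 # ~0.0075, small because theta is close to the truth
```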
Parameter Learning - Minimize Cost Function
- Gradient descent algorithm
- repeat until convergence: $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
- Notes
- $\alpha$: the learning rate. The larger $\alpha$ is, the faster gradient descent moves (but the steps get rougher, and it may even fail to converge)
- $\theta_0, \theta_1$ should be updated simultaneously (using temporary variables works; see the sketch after this list)
- Gradient descent can converge to a local minimum even with the learning rate fixed, because the steps automatically become smaller as the gradient approaches zero
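A minimal sketch of the update loop with the simultaneous update spelled out; `grad_j`, assumed to return both partial derivatives at the current point, is a hypothetical helper supplied by the caller:

```python
def gradient_descent(theta0, theta1, grad_j, alpha, num_iters):
    """Generic gradient descent on J(theta0, theta1); grad_j returns both partials."""
    for _ in range(num_iters):
        d0, d1 = grad_j(theta0, theta1)   # evaluate both partials at the *current* point
        # Simultaneous update via temporaries: neither line sees a half-updated theta
        temp0 = theta0 - alpha * d0
        temp1 = theta1 - alpha * d1
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Tiny check on J = theta0^2 + theta1^2 (gradient (2*theta0, 2*theta1), minimum at (0, 0))
print(gradient_descent(3.0, -4.0, lambda t0, t1: (2 * t0, 2 * t1), alpha=0.1, num_iters=100))
```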
Gradient Descent For Linear Regression
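Plugging the linear-regression cost into the general update rule, the partial derivatives (by the chain rule) are:

$$\begin{aligned} \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)} \end{aligned}$$

which gives the concrete update rules: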
$$\begin{aligned} \text{repeat until convergence: } \lbrace & \\ \theta_0 := {} & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ \theta_1 := {} & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}\right) \\ \rbrace & \end{aligned}$$
- $J(\theta_0, \theta_1)$ is a convex function, so it has a single global minimum; with a suitable learning rate, gradient descent will always reach the best fit
- "Batch" Gradient Descent: "batch" means every step of the algorithm uses all $m$ training samples (a complete NumPy sketch follows below)
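Putting everything together, a hedged sketch of batch gradient descent for the univariate model; the function name, the default `alpha`, and the toy data are my own choices:

```python
import numpy as np

def fit_linear_regression(x, y, alpha=0.05, num_iters=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y             # uses ALL m samples, hence "batch"
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1                 # simultaneous update
    return theta0, theta1

# Toy data (made up): y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])
print(fit_linear_regression(x, y))                    # approaches roughly (1.1, 2.0)
```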