Andrew Ng's ML Open Course Notes (2) - A Linear Regression Example
These are my notes for the open course [Machine Learning](https://www.coursera.org/learn/machine-learning/) by Andrew Ng.
Model and Cost Function - A Linear Regression Example
- Hypothesis: $$h_\theta(x) = \theta_0 + {\theta_1}x$$
- Cost function: $$J(\theta_0, \theta_1) = \frac {1}{2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)} )^2$$
- Basically this cost function can be derived from maximum likelihood estimation, under the assumption that the errors $y^{(i)} - h_\theta(x^{(i)})$ are i.i.d. $\sim N(0, \sigma^2)$
- If $(\theta_a, \theta_b)$ satisfies $J(\theta_a, \theta_b) = \min \lbrace J(\theta_0, \theta_1) \rbrace$, then $(\theta_a, \theta_b)$ gives the best-fit hypothesis (a NumPy sketch of $J$ follows this list)
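To make the cost function concrete, here is a minimal NumPy sketch; the function name, the array names, and the toy numbers below are made up for illustration:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(y)
    predictions = theta0 + theta1 * x               # h_theta(x^(i)) for every sample
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data (made up): y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])
print(compute_cost(1.0, 2.0, x, y))                 # ~0.0075, small because theta is close to the truth
```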
Parameter Learning - Minimize Cost Function
- Gradient descent algorithm
- repeat until convergence: $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
- Notes
- $\alpha$: the learning rate. The larger $\alpha$ is, the faster gradient descent moves (but the steps get rougher, and it may even fail to converge)
- $\theta_0, \theta_1$ should be updated simultaneously (using temporary variables works; see the sketch after this list)
- Gradient descent can converge to a local minimum even with the learning rate fixed, because the steps automatically become smaller as the gradient approaches zero
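A minimal sketch of the update loop with the simultaneous update spelled out; `grad_j`, assumed to return both partial derivatives at the current point, is a hypothetical helper supplied by the caller:

```python
def gradient_descent(theta0, theta1, grad_j, alpha, num_iters):
    """Generic gradient descent on J(theta0, theta1); grad_j returns both partials."""
    for _ in range(num_iters):
        d0, d1 = grad_j(theta0, theta1)   # evaluate both partials at the *current* point
        # Simultaneous update via temporaries: neither line sees a half-updated theta
        temp0 = theta0 - alpha * d0
        temp1 = theta1 - alpha * d1
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Tiny check on J = theta0^2 + theta1^2 (gradient (2*theta0, 2*theta1), minimum at (0, 0))
print(gradient_descent(3.0, -4.0, lambda t0, t1: (2 * t0, 2 * t1), alpha=0.1, num_iters=100))
```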
Gradient Descent For Linear Regression
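Plugging the linear-regression cost into the general update rule, the partial derivatives (by the chain rule) are:

$$\begin{aligned} \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) &= \frac{1}{m} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)} \end{aligned}$$

which gives the concrete update rules: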
$$\begin{aligned} \text{repeat until convergence: } \lbrace & \\ \theta_0 := {} & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ \theta_1 := {} & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}\right) \\ \rbrace & \end{aligned}$$
- $J(\theta_0, \theta_1)$ is a convex function, so it has a single global minimum; with a suitable learning rate, gradient descent will always reach the best fit
- "Batch" Gradient Descent: "batch" means every step of the algorithm uses all $m$ training samples (a complete NumPy sketch follows below)
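Putting everything together, a hedged sketch of batch gradient descent for the univariate model; the function name, the default `alpha`, and the toy data are my own choices:

```python
import numpy as np

def fit_linear_regression(x, y, alpha=0.05, num_iters=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y             # uses ALL m samples, hence "batch"
        temp0 = theta0 - alpha * np.sum(error) / m
        temp1 = theta1 - alpha * np.sum(error * x) / m
        theta0, theta1 = temp0, temp1                 # simultaneous update
    return theta0, theta1

# Toy data (made up): y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])
print(fit_linear_regression(x, y))                    # approaches roughly (1.1, 2.0)
```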