Andrew Ng ML Open Course Notes (2) - A Linear Regression Example

These are my notes for the open course [Machine Learning](https://www.coursera.org/learn/machine-learning/) by Andrew Ng on Coursera.

Model and Cost Function - A Linear Regression Example

  1. Hypothesis: $$h_\theta(x) = \theta_0 + {\theta_1}x$$
  2. Cost function: $$J(\theta_0, \theta_1) = \frac {1}{2m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)} )^2$$
  • Basically this cost function can be derived from the maximum likelihood estimate of $\theta_0, \theta_1$ under the assumption that the errors $y^{(i)} - h_\theta(x^{(i)})$ are i.i.d. $\sim N(0, \sigma^2)$
  • If there exist $\theta_a, \theta_b$ such that $J(\theta_a, \theta_b) = \min \lbrace J(\theta_0, \theta_1) \rbrace$, then $(\theta_a, \theta_b)$ gives the best-fit hypothesis
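
As a quick sanity check, here is a minimal NumPy sketch of the hypothesis and cost function above (the function and variable names are my own, not from the course):

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta_0 + theta_1 * x (element-wise over the samples)."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """J(theta_0, theta_1) = (1 / 2m) * sum of squared prediction errors."""
    m = len(y)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy data lying exactly on y = 1 + 2x, so the cost at (1, 2) should be 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(cost(1.0, 2.0, x, y))  # 0.0   (perfect fit)
print(cost(0.0, 0.0, x, y))  # 13.83 (much worse fit)
```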

Parameter Learning - Minimize Cost Function

  1. Gradient descent algorithm
  • repeat until convergence: $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$ (for $j = 0, 1$)
  2. Notes
  • $\alpha$: the learning rate. The larger $\alpha$ is, the faster the algorithm converges (but the steps are rougher, and it may even fail to converge)
  • $\theta_0, \theta_1$ should be updated simultaneously (using temporary variables works; see the sketch after this list)
  • Gradient descent can converge to a local minimum even with the learning rate fixed, because the gradient shrinks near the minimum, so the steps automatically become smaller
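
A minimal sketch of the simultaneous update with temporary variables, using a toy cost $J(\theta_0, \theta_1) = \theta_0^2 + \theta_1^2$ of my own choosing (the partial derivatives are passed in as functions):

```python
def step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step; both partials are evaluated at the OLD values."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1  # assign both at once, i.e. simultaneously

# Toy cost J(t0, t1) = t0^2 + t1^2, so dJ/dt0 = 2*t0 and dJ/dt1 = 2*t1.
theta0, theta1 = 3.0, -2.0
for _ in range(100):
    theta0, theta1 = step(theta0, theta1, 0.1,
                          lambda t0, t1: 2 * t0,
                          lambda t0, t1: 2 * t1)
print(theta0, theta1)  # both approach 0, the minimum of this toy J
```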

Gradient Descent For Linear Regression

$$\begin{aligned} \text{repeat until convergence: } \lbrace & \\ \theta_0 := {} & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \\ \theta_1 := {} & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}\right) \\ \rbrace & \end{aligned}$$

  • For linear regression, $J(\theta_0, \theta_1)$ is a convex function, so it has a single global minimum and no other local minima; gradient descent therefore always converges to the best fit (given a suitable learning rate)
  • “Batch” Gradient Descent: “batch” means that every step of the algorithm uses all the training samples
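
Putting the pieces together, below is a hedged sketch of batch gradient descent for one-variable linear regression; the synthetic data, learning rate, and iteration count are illustrative choices of my own:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent: every update uses all m training samples."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y                # h_theta(x^(i)) - y^(i)
        temp0 = theta0 - alpha * np.sum(errors) / m       # update rule for theta_0
        temp1 = theta1 - alpha * np.sum(errors * x) / m   # update rule for theta_1
        theta0, theta1 = temp0, temp1                     # simultaneous assignment
    return theta0, theta1

# Noisy samples around y = 4 + 3x; the fit should recover roughly (4, 3).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=100)
y = 4.0 + 3.0 * x + rng.normal(0.0, 0.1, size=100)
print(batch_gradient_descent(x, y))  # approximately (4, 3)
```

With a much larger $\alpha$ the same loop can overshoot and diverge, which matches the learning-rate note above.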