
Andrew Ng's Machine Learning Course - Notes 03

1. Classification and Representation

1.1 Classification

y\in\{0, 1\}

Linear Regression is not suitable for classification problems

Logistic Regression: 0\le h_\theta(x)\le1

Although it is called Regression, it is actually a classification algorithm

1.2 Hypothesis Representation

h_\theta(x)=g(\theta^Tx)

g(z)=\frac{1}{1+e^{-z}} (Sigmoid function/Logistic function)

h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}. Since the sigmoid satisfies g(z)+g(-z)=1, we have h_\theta(x)-0.5=0.5-h_\theta(-x); as a function of z=\theta^Tx, the curve is point-symmetric about (0, 0.5)

Interpretation: h_\theta(x)=P(y=1\vert x;\theta), the estimated probability that y=1 given input x, parameterized by \theta

As with Linear Regression, once the above h_\theta(x) is in place, what remains is to estimate the value of \theta
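
A minimal NumPy sketch of the hypothesis above (the function names and the example values of \theta and x are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); returns the estimated P(y=1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

# Illustrative values: x includes the intercept term x_0 = 1.
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.8])
print(hypothesis(theta, x))  # ~0.646, read as P(y=1 | x; theta)
```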

1.3 Decision Boundary

\theta^Tx=0: Decision Boundary (predict y=1 when \theta^Tx\ge0, i.e. h_\theta(x)\ge0.5; predict y=0 otherwise)

That is, the boundary separating the two classes of points. Taking two dimensions as an example, plot x_2 against x_1; then \theta^Tx=0 is a straight line, and that line is the Decision Boundary

For a non-linear decision boundary, polynomial terms can be introduced just as in Linear Regression, e.g. h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)

Training Set -> \theta -> Decision Boundary
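
A concrete sketch of such a non-linear boundary (the \theta values are illustrative): with \theta=(-1, 0, 0, 1, 1), the boundary \theta^Tx=0 is the circle x_1^2+x_2^2=1, and a point is classified by which side of the circle it falls on:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x1, x2):
    """Predict y using polynomial features [1, x1, x2, x1^2, x2^2]."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(features @ theta) >= 0.5 else 0  # same as theta^T x >= 0

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # boundary: x1^2 + x2^2 = 1
print(predict(theta, 0.2, 0.3))  # 0: inside the circle
print(predict(theta, 1.5, 1.0))  # 1: outside the circle
```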


2. Logistic Regression Model

2.1 Cost Function

Likewise, given m training examples, determining \theta requires defining a Cost Function and then minimizing it

If we used the same squared-error cost as Linear Regression, the resulting Cost Function would be non-convex (because h_\theta is the sigmoid), so gradient descent could not be guaranteed to reach the global minimum

J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^i), y^i)
Cost(h_\theta(x), y)=\begin{cases} -log(h_\theta(x)) & y=1 \\ -log(1-h_\theta(x)) & y=0 \end{cases}

When y=1:

  • h_\theta(x)\to0: Cost(h_\theta(x), y)\to\infty
  • h_\theta(x)\to1: Cost(h_\theta(x), y)\to0

When y=0, the opposite holds:

  • h_\theta(x)\to0: Cost(h_\theta(x), y)\to0
  • h_\theta(x)\to1: Cost(h_\theta(x), y)\to\infty
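
A small sketch of this per-example cost and its limiting behaviour (the helper name is illustrative):

```python
import numpy as np

def cost_per_example(h, y):
    """Piecewise cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# When y = 1: cost -> 0 as h -> 1, cost -> infinity as h -> 0.
print(cost_per_example(0.99, 1))   # ~0.01
print(cost_per_example(0.01, 1))   # ~4.6, grows without bound as h -> 0
# When y = 0 the behaviour is mirrored.
print(cost_per_example(0.01, 0))   # ~0.01
```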

2.2 Simplified Cost Function and Gradient Descent

h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}
Cost(h_\theta(x), y)=-ylog(h)-(1-y)log(1-h)
J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^i), y^i)

J: Convex Function

To minimize J(\theta), use gradient descent, i.e. repeat

\theta_j:=\theta_j-\alpha\frac{d}{d\theta_j}J(\theta)

Derivation:

\frac{d}{d\theta_j}h=x_jh(1-h)

\frac{d}{d\theta_j}Cost=\frac{-y}{h}x_jh(1-h)+\frac{1-y}{1-h}x_jh(1-h)=(h-y)x_j

\theta_j:=\theta_j-\frac{\alpha}{m}\sum_{i=1}^mx_j(h-y)

The update rule has the same form as gradient descent for Linear Regression (only the definition of h_\theta(x) differs)

Feature Scaling can likewise make gradient descent for Logistic Regression converge faster
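
A minimal vectorized sketch of this gradient-descent loop (the function and variable names, learning rate, and fixed iteration count are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # h_theta(x^i) for all m examples at once
        grad = (X.T @ (h - y)) / m      # d/d(theta_j) J(theta) for every j
        theta -= alpha * grad           # simultaneous update of all theta_j
    return theta
```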

2.3 Advanced Optimization

Optimization Algorithm: Minimize J(\theta)

Compared with gradient descent, the following algorithms do not require choosing \alpha manually and usually converge faster

They require a function that computes J(\theta) and \frac{d}{d\theta_j}J(\theta)

1. Conjugate Gradient

2. BFGS

3. L-BFGS
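
As an illustration, SciPy exposes such optimizers; a small sketch (the tiny dataset and helper names are made up for the example, and the cost/gradient match the formulas in 2.2):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Tiny illustrative dataset: 1-D feature plus an intercept column of ones.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=grad, method='L-BFGS-B')
print(result.x)  # fitted theta; no learning rate alpha had to be chosen
```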


3. Multiclass Classification

3.1 Multiclass Classification: One-vs-all

one-vs-all

For multi-class classification, treat the problem as several binary classification problems (class k vs. all other classes) and apply Logistic Regression to each, obtaining one hypothesis h_\theta^{(k)}(x) per class

At prediction time, evaluate all k hypothesis functions and pick the class with the largest value, as in the sketch below
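
A minimal one-vs-all sketch (names are illustrative; train_logistic stands for any binary Logistic Regression trainer, such as the gradient-descent loop from 2.2):

```python
import numpy as np

def one_vs_all(X, y, num_classes, train_logistic):
    """Train one binary classifier per class: class k vs. everything else."""
    all_theta = []
    for k in range(num_classes):
        y_k = (y == k).astype(float)       # relabel: 1 for class k, 0 for the rest
        all_theta.append(train_logistic(X, y_k))
    return np.array(all_theta)             # shape (num_classes, n+1)

def predict_one_vs_all(all_theta, X):
    """Pick the class whose hypothesis h^{(k)}(x) is largest."""
    scores = X @ all_theta.T               # sigmoid is monotone, so argmax over theta^T x suffices
    return np.argmax(scores, axis=1)
```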


4. Solving the Problem of Overfitting

4.1 The Problem of Overfitting

Underfit: high bias

Overfit: high variance, e.g. a high-order polynomial fits the training set well but fails to generalize to new examples

Addressing Overfitting

  1. Reduce number of features
  2. Regularization: reduce magnitude of \theta_j

4.2 Cost Function

For example, for a 4th-order polynomial, penalize \theta_3 and \theta_4 so they become small, e.g. by adding 1000\theta_3^2+1000\theta_4^2 to J(\theta)

Regularization: keep the parameter values \theta_j small, which gives a "simpler" hypothesis that is less prone to overfitting

4.3 Regularized Linear Regression

1. Gradient Descent

J(\theta)=\frac{1}{2m}[\sum_{i=1}^m(h-y)^2+\lambda\sum_{j=1}^n\theta_j^2]

Note the regularization sum starts at j=1, i.e. the constant (intercept) term \theta_0 is not penalized

For \theta_j with j\ge1:

\theta_j:=\theta_j(1-\frac{\alpha\lambda}{m})-\frac{\alpha}{m}\sum_{i=1}^m(h(x^i)-y^i)x_j^i
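
A vectorized sketch of this regularized update (illustrative names; note that \theta_0 is left unregularized):

```python
import numpy as np

def ridge_gradient_step(theta, X, y, alpha, lam):
    """One gradient-descent step for regularized linear regression."""
    m = len(y)
    error = X @ theta - y                  # h(x^i) - y^i for all examples
    grad = (X.T @ error) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not penalize the intercept theta_0
    # For j >= 1 this equals theta_j*(1 - alpha*lam/m) - (alpha/m)*sum((h - y)*x_j).
    return theta - alpha * (grad + reg)
```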

2. Normal Equation

Previously: \theta=(X^TX)^{-1}X^Ty

With Regularization, a matrix is added: \theta=(X^TX+\lambda \left[\begin{matrix} 0\\ &1\\ &&1\\ &&&\ddots\\ &&&&1 \end{matrix}\right] )^{-1}X^Ty, where the matrix is (n+1)\times(n+1) diagonal with a 0 in the top-left entry because \theta_0 is not regularized

Moreover, once this matrix is added (with \lambda>0), the matrix inside the parentheses is guaranteed to be invertible
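
The regularized Normal Equation in NumPy (a sketch; np.linalg.solve is used rather than forming the inverse explicitly, which is the usual numerical practice):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L = diag(0, 1, ..., 1)."""
    n = X.shape[1]                # number of columns, including the intercept column
    L = np.eye(n)
    L[0, 0] = 0.0                 # the intercept term theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```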

4.4 Regularized Logistic Regression

1. Gradient Descent

Adding the \theta penalty term: J(\theta)=\frac{1}{m}[\sum(-ylogh-(1-y)log(1-h))]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

So in gradient descent, for \theta_j with j\ge1, the update was previously

\theta_j=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^mx_j(h-y)

and is now \theta_j=\theta_j-\alpha[\frac{1}{m}\sum_{i=1}^mx_j(h-y)+\frac{\lambda}{m}\theta_j]

2. Advanced Optimization Techniques

When supplying the functions for J(\theta) and \frac{d}{d\theta_j}J(\theta), modify them as follows:

J(\theta)=\frac{1}{m}[\sum_{i=1}^m(-y^ilog(h_\theta(x^i))-(1-y^i)log(1-h_\theta(x^i)))]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
\frac{d}{d\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^mx_j^i(h_\theta(x^i)-y^i)+\frac{\lambda}{m}\theta_j (for j\ge1; the \frac{\lambda}{m}\theta_j term is omitted when j=0)
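
A sketch of the modified J(\theta) and gradient one would hand to the optimizers from 2.3 (illustrative names, same X, y conventions as the earlier sketches):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    unreg = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    return unreg + (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 excluded

def grad_reg(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]    # no (lambda/m)*theta_0 term for j = 0
    return grad
```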