
Andrew Ng's Machine Learning Course - Notes 03

1. Classification and Representation

1.1 Classification

y\in\{0, 1\}

Linear Regression is not suitable for classification problems

Logistic Regression: 0\le h_\theta(x)\le1

Although it is called Regression, it is actually a classification algorithm

1.2 Hypothesis Representation

h_\theta(x)=g(\theta^Tx)

g(z)=\frac{1}{1+e^{-z}} (Sigmoid function/Logistic function)

h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}. Since the sigmoid satisfies g(z)+g(-z)=1, we have h_\theta(x)-0.5=0.5-h_\theta(-x); as a function of z=\theta^Tx, the curve is point-symmetric about (0, 0.5)

Interpretation: h_\theta(x)=P(y=1\vert x;\theta), the estimated probability that y=1 given input x, parameterized by \theta

As with Linear Regression, once the above h_\theta(x) is in place, what remains is to estimate the value of \theta
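
A minimal NumPy sketch of the hypothesis above (the function names and the example values of \theta and x are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); returns the estimated P(y=1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

# Illustrative values: x includes the intercept term x_0 = 1.
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.8])
print(hypothesis(theta, x))  # ~0.646, read as P(y=1 | x; theta)
```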

1.3 Decision Boundary

\theta^Tx=0: Decision Boundary (predict y=1 when \theta^Tx\ge0, i.e. h_\theta(x)\ge0.5; predict y=0 otherwise)

That is, the boundary separating the two classes of points. Taking two dimensions as an example, plot x_2 against x_1; then \theta^Tx=0 is a straight line, and that line is the Decision Boundary

For a non-linear decision boundary, polynomial terms can be introduced just as in Linear Regression, e.g. h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)

Training Set -> \theta -> Decision Boundary
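
A concrete sketch of such a non-linear boundary (the \theta values are illustrative): with \theta=(-1, 0, 0, 1, 1), the boundary \theta^Tx=0 is the circle x_1^2+x_2^2=1, and a point is classified by which side of the circle it falls on:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x1, x2):
    """Predict y using polynomial features [1, x1, x2, x1^2, x2^2]."""
    features = np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])
    return 1 if sigmoid(features @ theta) >= 0.5 else 0  # same as theta^T x >= 0

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # boundary: x1^2 + x2^2 = 1
print(predict(theta, 0.2, 0.3))  # 0: inside the circle
print(predict(theta, 1.5, 1.0))  # 1: outside the circle
```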


2. Logistic Regression Model

2.1 Cost Function

Likewise, given m training examples, determining \theta requires defining a Cost Function and then minimizing it

If we used the same squared-error cost as Linear Regression, the resulting Cost Function would be non-convex (because h_\theta is the sigmoid), so gradient descent could not be guaranteed to reach the global minimum

J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^i), y^i)
Cost(h_\theta(x), y)=\begin{cases} -log(h_\theta(x)) & y=1 \\ -log(1-h_\theta(x)) & y=0 \end{cases}

When y=1:

  • h_\theta(x)\to0: Cost(h_\theta(x), y)\to\infty
  • h_\theta(x)\to1: Cost(h_\theta(x), y)\to0

When y=0, the opposite holds:

  • h_\theta(x)\to0: Cost(h_\theta(x), y)\to0
  • h_\theta(x)\to1: Cost(h_\theta(x), y)\to\infty
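
A small sketch of this per-example cost and its limiting behaviour (the helper name is illustrative):

```python
import numpy as np

def cost_per_example(h, y):
    """Piecewise cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# When y = 1: cost -> 0 as h -> 1, cost -> infinity as h -> 0.
print(cost_per_example(0.99, 1))   # ~0.01
print(cost_per_example(0.01, 1))   # ~4.6, grows without bound as h -> 0
# When y = 0 the behaviour is mirrored.
print(cost_per_example(0.01, 0))   # ~0.01
```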

2.2 Simplified Cost Function and Gradient Descent

h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}
Cost(h_\theta(x), y)=-ylog(h)-(1-y)log(1-h)
J(\theta)=\frac{1}{m}\sum_{i=1}^mCost(h_\theta(x^i), y^i)

J: Convex Function

To minimize J(\theta), use gradient descent, i.e. repeat

\theta_j:=\theta_j-\alpha\frac{d}{d\theta_j}J(\theta)

Derivation:

\frac{d}{d\theta_j}h=x_jh(1-h)

\frac{d}{d\theta_j}Cost=\frac{-y}{h}x_jh(1-h)+\frac{1-y}{1-h}x_jh(1-h)=(h-y)x_j

\theta_j:=\theta_j-\frac{\alpha}{m}\sum_{i=1}^mx_j(h-y)

The update rule has the same form as gradient descent for Linear Regression (only the definition of h_\theta(x) differs)

Feature Scaling can likewise make gradient descent for Logistic Regression converge faster
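
A minimal vectorized sketch of this gradient-descent loop (the function and variable names, learning rate, and fixed iteration count are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # h_theta(x^i) for all m examples at once
        grad = (X.T @ (h - y)) / m      # d/d(theta_j) J(theta) for every j
        theta -= alpha * grad           # simultaneous update of all theta_j
    return theta
```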

2.3 Advanced Optimization

Optimization Algorithm: Minimize J(\theta)

Compared with gradient descent, the following algorithms do not require choosing \alpha manually and usually converge faster

They require a function that computes J(\theta) and \frac{d}{d\theta_j}J(\theta)

1. Conjugate Gradient

2. BFGS

3. L-BFGS
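
As an illustration, SciPy exposes such optimizers; a small sketch (the tiny dataset and helper names are made up for the example, and the cost/gradient match the formulas in 2.2):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Tiny illustrative dataset: 1-D feature plus an intercept column of ones.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 1.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=grad, method='L-BFGS-B')
print(result.x)  # fitted theta; no learning rate alpha had to be chosen
```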


3. Multiclass Classification

3.1 Multiclass Classification: One-vs-all

one-vs-all

For multi-class classification, treat the problem as several binary classification problems (class k vs. all other classes) and apply Logistic Regression to each, obtaining one hypothesis h_\theta^{(k)}(x) per class

At prediction time, evaluate all k hypothesis functions and pick the class with the largest value, as in the sketch below
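
A minimal one-vs-all sketch (names are illustrative; train_logistic stands for any binary Logistic Regression trainer, such as the gradient-descent loop from 2.2):

```python
import numpy as np

def one_vs_all(X, y, num_classes, train_logistic):
    """Train one binary classifier per class: class k vs. everything else."""
    all_theta = []
    for k in range(num_classes):
        y_k = (y == k).astype(float)       # relabel: 1 for class k, 0 for the rest
        all_theta.append(train_logistic(X, y_k))
    return np.array(all_theta)             # shape (num_classes, n+1)

def predict_one_vs_all(all_theta, X):
    """Pick the class whose hypothesis h^{(k)}(x) is largest."""
    scores = X @ all_theta.T               # sigmoid is monotone, so argmax over theta^T x suffices
    return np.argmax(scores, axis=1)
```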


4. Solving the Problem of Overfitting

4.1 The Problem of Overfitting

Underfit: high bias

Overfit: high variance, e.g. a high-order polynomial fits the training set well but fails to generalize to new examples

Addressing Overfitting

  1. Reduce number of features
  2. Regularization: reduce magnitude of \theta_j

4.2 Cost Function

For example, for a 4th-order polynomial, penalize \theta_3 and \theta_4 so they become small, e.g. by adding 1000\theta_3^2+1000\theta_4^2 to J(\theta)

Regularization: keep the parameter values \theta_j small, which gives a "simpler" hypothesis that is less prone to overfitting

4.3 Regularized Linear Regression

1. Gradient Descent

J(\theta)=\frac{1}{2m}[\sum_{i=1}^m(h-y)^2+\lambda\sum_{j=1}^n\theta_j^2]

Note the regularization sum starts at j=1, i.e. the constant (intercept) term \theta_0 is not penalized

For \theta_j with j\ge1:

\theta_j:=\theta_j(1-\frac{\alpha\lambda}{m})-\frac{\alpha}{m}\sum_{i=1}^m(h(x^i)-y^i)x_j^i
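
A vectorized sketch of this regularized update (illustrative names; note that \theta_0 is left unregularized):

```python
import numpy as np

def ridge_gradient_step(theta, X, y, alpha, lam):
    """One gradient-descent step for regularized linear regression."""
    m = len(y)
    error = X @ theta - y                  # h(x^i) - y^i for all examples
    grad = (X.T @ error) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                           # do not penalize the intercept theta_0
    # For j >= 1 this equals theta_j*(1 - alpha*lam/m) - (alpha/m)*sum((h - y)*x_j).
    return theta - alpha * (grad + reg)
```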

2. Normal Equation

Previously: \theta=(X^TX)^{-1}X^Ty

With Regularization, a matrix is added: \theta=(X^TX+\lambda \left[\begin{matrix} 0\\ &1\\ &&1\\ &&&\ddots\\ &&&&1 \end{matrix}\right] )^{-1}X^Ty, where the matrix is (n+1)\times(n+1) diagonal with a 0 in the top-left entry because \theta_0 is not regularized

Moreover, once this matrix is added (with \lambda>0), the matrix inside the parentheses is guaranteed to be invertible
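
The regularized Normal Equation in NumPy (a sketch; np.linalg.solve is used rather than forming the inverse explicitly, which is the usual numerical practice):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X^T X + lambda * L)^{-1} X^T y, with L = diag(0, 1, ..., 1)."""
    n = X.shape[1]                # number of columns, including the intercept column
    L = np.eye(n)
    L[0, 0] = 0.0                 # the intercept term theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```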

4.4 Regularized Logistic Regression

1. Gradient Descent

Adding the \theta penalty term: J(\theta)=\frac{1}{m}[\sum(-ylogh-(1-y)log(1-h))]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

So in gradient descent, for \theta_j with j\ge1, the update was previously

\theta_j=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^mx_j(h-y)

and is now \theta_j=\theta_j-\alpha[\frac{1}{m}\sum_{i=1}^mx_j(h-y)+\frac{\lambda}{m}\theta_j]

2. Advanced Optimization Techniques

When supplying the functions for J(\theta) and \frac{d}{d\theta_j}J(\theta), modify them as follows:

J(\theta)=\frac{1}{m}[\sum_{i=1}^m(-y^ilog(h_\theta(x^i))-(1-y^i)log(1-h_\theta(x^i)))]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
\frac{d}{d\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^mx_j^i(h_\theta(x^i)-y^i)+\frac{\lambda}{m}\theta_j (for j\ge1; the \frac{\lambda}{m}\theta_j term is omitted when j=0)
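
A sketch of the modified J(\theta) and gradient one would hand to the optimizers from 2.3 (illustrative names, same X, y conventions as the earlier sketches):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    unreg = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    return unreg + (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 excluded

def grad_reg(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]    # no (lambda/m)*theta_0 term for j = 0
    return grad
```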