Machine Learning (CPSC 540): Regularization and regression

Frequentist learning assumes that there exists a true model, say $\theta_0$. Maximum likelihood estimation (MLE) estimates the model as $\hat{\theta} = \mathrm{argmax}_{\theta} \ p(x_1, x_2, \cdots, x_n \mid \theta)$, where $x_1, x_2, \cdots, x_n$ are the sample data. We can also apply this method to neural networks.
Why is this method reasonable? First, MLE is consistent: $\lim_{N \to \infty} P(\| \hat{\theta} - \theta_0 \| > \alpha) = 0$ for any $\alpha > 0$. Second, MLE is asymptotically optimal: $\sqrt{N}(\hat{\theta} - \theta_0) \to N(0, I^{-1})$ as $N$ grows, where $I$ is the Fisher information matrix of a single observation.
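Consistency can be checked numerically. Below is a minimal sketch (the Gaussian-mean setting is an assumed example, not from the lecture): the MLE of a Gaussian mean is the sample mean, and its error shrinks as $N$ grows.

```python
# Numerical sketch of MLE consistency (assumed toy example: Gaussian mean).
# The MLE of the mean of N(theta_0, 1) is the sample mean.
import numpy as np

rng = np.random.default_rng(0)
theta_0 = 2.0  # true parameter

errors = []
for n in [10, 1_000, 100_000]:
    x = rng.normal(theta_0, 1.0, size=n)
    theta_hat = x.mean()               # MLE for a Gaussian mean
    errors.append(abs(theta_hat - theta_0))

print(errors)  # the error at N = 100,000 is far smaller than at N = 10
```

The standard error of the sample mean is $1/\sqrt{N}$, so at $N = 10^5$ the error is on the order of $0.003$, matching the consistency claim.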

MLE always faces a bias-variance trade-off: $\mathrm{MSE} = \mathrm{bias}^2 + \mathrm{variance}$. Here, $\mathrm{bias} = \bar{\theta} - \theta_0$ with $\bar{\theta} = E[\hat{\theta}]$, and $\mathrm{variance} = E[(\hat{\theta} - \bar{\theta})^2]$.

Typically, an estimator with large variance has small bias, and vice versa; reducing one tends to increase the other.
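The trade-off can be seen in a simulation. This is a sketch under an assumed toy setting (not from the lecture): shrinking the sample mean toward 0 by a factor $1/(1+\lambda)$ adds bias but cuts variance, and the MSE decomposes as bias² + variance.

```python
# Sketch of the bias-variance trade-off (assumed toy shrinkage estimator).
# Estimator: theta_hat = sample_mean / (1 + lam). Larger lam -> more bias,
# less variance.
import numpy as np

rng = np.random.default_rng(1)
theta_0, n, trials = 1.0, 20, 5_000

biases2, variances = [], []
for lam in [0.0, 0.5, 2.0]:
    est = np.array([rng.normal(theta_0, 1.0, n).mean() / (1 + lam)
                    for _ in range(trials)])
    bias = est.mean() - theta_0
    biases2.append(bias ** 2)
    variances.append(est.var())
    mse = ((est - theta_0) ** 2).mean()
    print(f"lam={lam}: bias^2={bias**2:.4f} var={est.var():.4f} mse={mse:.4f}")
```

As $\lambda$ increases, bias² rises while variance falls, and bias² + variance tracks the empirical MSE.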

Note that once a basis function is fixed, a nonlinear model becomes linear in its parameters. Thus, the results below also apply to nonlinear models built from basis functions.
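As a quick illustration (the polynomial basis and quadratic target are assumed examples), fitting a nonlinear function with a fixed basis reduces to ordinary linear least squares in the coefficients:

```python
# Sketch: with a fixed polynomial basis, a nonlinear fit is still linear
# least squares in the coefficients (assumed example target).
import numpy as np

x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2          # noiseless quadratic target

Phi = np.vstack([x**0, x**1, x**2]).T   # basis expansion: [1, x, x^2]
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # plain linear LS on Phi

print(theta)  # recovers approximately [1, 2, 3]
```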
We usually estimate a linear model by solving the normal equations $X^TX\theta = X^Ty$. However, if the system is underdetermined or $X$ does not have full column rank, $X^TX$ is not invertible. To deal with this, we add $\delta^2I$ to $X^TX$ and solve $(X^TX + \delta^2I)\theta = X^Ty$. This is the ridge regression estimate, and it can be derived by minimizing the regularized quadratic cost function $J(\theta) = (y - X\theta)^T(y-X\theta) + \delta^2\theta^T\theta$.
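A minimal sketch of the ridge computation (the design matrix and $\delta$ value are assumed for illustration): even when $X^TX$ is singular because of a collinear column, $X^TX + \delta^2 I$ is positive definite and the system is solvable.

```python
# Sketch of the ridge estimate: solve (X^T X + delta^2 I) theta = X^T y.
# The collinear third column makes X^T X singular on purpose.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
X[:, 2] = X[:, 0] + X[:, 1]   # collinear column -> X^T X is rank-deficient
y = rng.normal(size=10)

delta = 0.1                    # assumed regularization strength
A = X.T @ X + delta**2 * np.eye(X.shape[1])
theta_ridge = np.linalg.solve(A, X.T @ y)  # well-posed for any delta > 0
print(theta_ridge)
```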
Ridge regression also lets us control model complexity: as $\delta$ grows, the components of $\theta$ shrink toward 0, but not uniformly. Components along directions of $X$ with small singular values (the less important ones) shrink faster, while important components shrink more slowly.
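This non-uniform shrinkage can be made explicit with the SVD $X = USV^T$: in the rotated coordinates, ridge scales the $i$-th component by $s_i^2 / (s_i^2 + \delta^2)$. The sketch below (assumed toy design with one strong and one weak direction) prints these factors:

```python
# Sketch of ridge shrinkage via the SVD X = U S V^T: each component is
# scaled by s_i^2 / (s_i^2 + delta^2), so small-singular-value directions
# are shrunk toward 0 first. Toy design: one strong, one weak direction.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2)) @ np.diag([5.0, 0.5])

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s sorted descending
shrinks = {delta: s**2 / (s**2 + delta**2) for delta in [0.0, 1.0, 5.0]}
for delta, factor in shrinks.items():
    print(f"delta={delta}: shrinkage factors {factor}")
```

At $\delta = 0$ both factors are 1 (no shrinkage); as $\delta$ grows, the factor for the weak direction drops toward 0 well before the strong one does.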
