Machine Learning (CPSC 540): regularization, cross-validation and data size

We have the regularized quadratic (ridge) cost function, $J(\theta) = (y-X\theta)^T(y-X\theta) + \delta^2\theta^T\theta$. How do we choose the regularization coefficient $\delta$? This is really asking "how do we control the model complexity?": if $\delta$ is large, the weights are shrunk toward zero and the model complexity is small, and vice versa.
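Minimizing this cost function has a closed-form solution, $\theta = (X^TX + \delta^2 I)^{-1}X^Ty$. A minimal sketch in NumPy (the function name `ridge_fit` is our own, not from the notes):

```python
import numpy as np

def ridge_fit(X, y, delta):
    """Minimize J(theta) = ||y - X theta||^2 + delta^2 ||theta||^2.

    Closed form: theta = (X^T X + delta^2 I)^{-1} X^T y.
    """
    d = X.shape[1]
    # Solve the regularized normal equations instead of inverting explicitly.
    return np.linalg.solve(X.T @ X + delta**2 * np.eye(d), X.T @ y)
```

With $\delta = 0$ this reduces to ordinary least squares; as $\delta$ grows, the norm of $\theta$ shrinks.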
One answer is cross-validation. In the simplest (holdout) version, we divide the data set into two pieces, using one as the training set and the other as the test set. We then vary $\delta$, and for each value compute the training set error and the test set error (mean squared error) to choose the right value of $\delta$.
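This holdout procedure can be sketched as follows (a minimal illustration, assuming the ridge closed form $\theta = (X^TX + \delta^2 I)^{-1}X^Ty$; the function name is hypothetical):

```python
import numpy as np

def holdout_select_delta(X, y, deltas, train_frac=0.5, seed=0):
    """Split the data once into train/test halves, fit ridge regression
    on the training half for each delta, and return the delta with the
    lowest test-set mean squared error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    tr, te = idx[:n_train], idx[n_train:]
    d = X.shape[1]
    best_delta, best_mse = None, np.inf
    for delta in deltas:
        theta = np.linalg.solve(X[tr].T @ X[tr] + delta**2 * np.eye(d),
                                X[tr].T @ y[tr])
        mse = np.mean((y[te] - X[te] @ theta) ** 2)
        if mse < best_mse:
            best_delta, best_mse = delta, mse
    return best_delta, best_mse
```

Note that with a single split, the chosen $\delta$ depends on which points happened to land in the test set; this is the motivation for the K-fold variant below.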
As the error-versus-$\delta$ curves typically show, the larger $\delta$ is, the larger the training set error, since larger $\delta$ means lower model complexity. The test set error behaves differently: it is high when $\delta$ is too small, because an overly complex model overfits the training data and fails to generalize, improves as $\delta$ grows, and then rises again once the model becomes too simple. Even though the true model is noisy, the right choice is the $\delta$ at which the test set error is lowest.

There is another method called K-fold cross-validation, which refines the idea above. For example, 5-fold cross-validation works like this: we split the data into five pieces, use each piece in turn as the test set (training on the other four), and average the five test set errors to choose $\delta$.
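The K-fold procedure can be sketched as follows (again using the ridge closed form; the function name is our own):

```python
import numpy as np

def kfold_select_delta(X, y, deltas, k=5, seed=0):
    """K-fold cross-validation: each fold serves once as the test set.
    Average the per-fold test MSEs and return the delta with the
    lowest average error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    d = X.shape[1]
    avg_mse = []
    for delta in deltas:
        errs = []
        for i in range(k):
            te = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            theta = np.linalg.solve(X[tr].T @ X[tr] + delta**2 * np.eye(d),
                                    X[tr].T @ y[tr])
            errs.append(np.mean((y[te] - X[te] @ theta) ** 2))
        avg_mse.append(np.mean(errs))
    return deltas[int(np.argmin(avg_mse))]
```

Averaging over all K folds makes the estimate of the test error less sensitive to any single unlucky split, at the cost of fitting the model K times per value of $\delta$.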
Finally, more data improves results, but only if the model has the right complexity.

We can use these methods because of the smoothness of the world: similar inputs tend to produce similar outputs, so error estimates on held-out data tell us something about future data. This smoothness is the underlying assumption that makes learning possible at all.
