Machine Learning (CPSC 540): Neural networks

In the previous lecture, we learned about the neuron and a recipe for learning. In this lecture, we apply that recipe to neural networks.
The steps themselves are quite abstract: build a neural network and attach a probabilistic model to it. Then derive the likelihood function and take its negative log to get the cost/objective function, which is the cross-entropy. Finally, compute the gradient and Hessian analytically (backpropagation) and use Newton's method to find the optimum.
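The recipe is easiest to see on a single neuron. The sketch below, a minimal illustration with made-up toy data, follows the steps exactly: a Bernoulli model on the sigmoid output, the negative log-likelihood (cross-entropy) as the objective, and its analytic gradient and Hessian fed to Newton's method. A small L2 penalty (an added assumption, not part of the recipe) keeps the Hessian well-conditioned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, lam=0.1, iters=20):
    """Fit one sigmoid neuron by Newton's method on the (regularized) cross-entropy."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)                        # model's P(y=1 | x)
        grad = X.T @ (p - y) + lam * theta            # gradient of neg. log-likelihood
        H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)  # analytic Hessian
        theta -= np.linalg.solve(H, grad)             # Newton step
    return theta

# Toy linearly separable data; first column of ones is the bias term.
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])
y = np.array([0., 0., 1., 1.])  # label = x1, so one neuron suffices
theta = newton_logistic(X, y)
print(sigmoid(X @ theta))       # predicted probabilities, close to the labels
```

For a full network the gradient and Hessian are obtained the same way, just with the chain rule applied through the layers (backpropagation).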

How does a neural network work, and why do we need one?
First, think about the XOR gate.
Can we implement this with a single neuron (logistic regression)? No, because the red and blue points cannot be separated by a line. So how can we get the orange curve? Let's write the orange curve as an implicit function $f(x_1,x_2)=0$ and build only the final part of the neural network.
All we have to do is approximate $f(x_1, x_2)$ by filling in the unknown orange part of the neural network. Let's use the following basis function.
$$\kappa(\mathbf{x}, a, b, c) = \frac{1}{1+e^{-a-bx_1-cx_2}}.$$If you want to approximate $f$ with two basis functions, you have to choose the $\theta$'s in $f(\mathbf{x}) = \theta_5 + \theta_6\kappa(\mathbf{x}, \theta_1, \theta_2, \theta_3) + \theta_7\kappa(\mathbf{x}, \theta_4, \theta_9, \theta_8)$. Then we get the following neural network.
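To see that two such basis functions really suffice for XOR, here is a sketch with hand-picked $\theta$'s (illustrative values I chose, not learned ones): one $\kappa$ approximates OR, the other approximates AND, and $f > 0$ exactly on the XOR inputs.

```python
import numpy as np

def kappa(x, a, b, c):
    """Sigmoid basis function: kappa(x, a, b, c) = 1 / (1 + exp(-a - b*x1 - c*x2))."""
    return 1.0 / (1.0 + np.exp(-(a + b * x[0] + c * x[1])))

def f(x):
    # Hand-picked parameters (an assumption for illustration):
    k1 = kappa(x, -10, 20, 20)      # saturates to OR(x1, x2)
    k2 = kappa(x, -30, 20, 20)      # saturates to AND(x1, x2)
    return -10 + 20 * k1 - 20 * k2  # theta_5 + theta_6*k1 + theta_7*k2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x)) > 0)    # True exactly when x1 XOR x2
```

In training, of course, these $\theta$'s are not hand-picked; they come out of the recipe above.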
Now we have done the first part of the recipe above, so we just need to do the rest. Here, we can control model complexity by changing the number of neurons, because the number of neurons is the number of basis functions in this case. To choose it, we can use regularization, cross-validation, or Bayesian optimization.

Finally, I'm going to comment on the vanishing gradient problem of backpropagation.
Consider the following Neural network.
Let $J(\theta) = (y - \hat{y})^2$. Then,
$$\begin{aligned}
\frac{\partial}{\partial \theta_2}J(\theta) &= -2(y-\hat{y})\frac{\partial \hat{y}}{\partial o_5}\frac{\partial o_5}{\partial u_4}\frac{\partial u_4}{\partial o_3}\frac{\partial o_3}{\partial u_3}\frac{\partial u_3}{\partial o_1}\frac{\partial o_1}{\partial u_1}\frac{\partial u_1}{\partial \theta_2}\\
&=-2(y-\hat{y})\, \theta_{16} [o_5(1-o_5)]\, \theta_{12} [o_3(1-o_3)]\, \theta_8 [o_1(1-o_1)]\, x_1.
\end{aligned}$$We know that $0<o_i<1$, so each factor satisfies $o_i(1-o_i) \le 1/4$. Thus the partial derivative can vanish because of the product $[o_5(1-o_5)][o_3(1-o_3)][o_1(1-o_1)]$, which is at most $1/64$ here and shrinks exponentially as the network gets deeper. This is the vanishing gradient problem, and we will deal with it in the next lecture.
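The exponential shrinkage is easy to demonstrate numerically. The sketch below (a hypothetical chain of sigmoid units with random weights, not the network from the figure) multiplies the local factors $\theta_k \cdot o_k(1-o_k)$ along a 20-layer path, exactly as in the chain rule above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Accumulate the chain-rule product along a deep path of sigmoid units.
depth = 20
x, grad = 0.5, 1.0
for _ in range(depth):
    w = rng.normal()          # a weight theta on this edge (illustrative)
    o = sigmoid(w * x)        # unit output
    grad *= w * o * (1 - o)   # each layer contributes at most 1/4 from o(1-o)
    x = o

print(abs(grad))  # vanishingly small after 20 layers
```

Since every $o(1-o)$ factor is at most $1/4$, the magnitude is bounded by $\prod_k |\theta_k| / 4^{20}$, which is tiny unless the weights are very large.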
