Machine Learning (CPSC 540): Unconstrained optimization
I'm going to present two views of the one-dimensional Newton's method and then extend them to the multivariate case.
In the one-dimensional case, the gradient descent update is simply $x_{n+1} = x_n - \eta f'(x_n)$. So how do we choose $\eta$? We can use the curvature, the rate of change of the gradient. As the figure below shows, if the curvature is high we have to proceed carefully, and if it is low we can move faster.
Thus it is reasonable to choose $\eta = \frac{1}{f''(x_n)}$. Then we get the following familiar formula.
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$We can see this in the following familiar figure.
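The update above is easy to verify numerically. Here is a minimal sketch; the test function $f(x) = (x-3)^2 + 1$, its derivatives, and the starting point are illustrative choices, not from the lecture.

```python
# Sketch of the 1D Newton update x_{n+1} = x_n - f'(x_n) / f''(x_n).

def newton_1d(f_prime, f_double_prime, x0, iters=10):
    """Iterate Newton's method for minimizing a 1D function."""
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Example: f(x) = (x - 3)^2 + 1, minimized at x = 3.
f_prime = lambda x: 2 * (x - 3)
f_double_prime = lambda x: 2.0
x_star = newton_1d(f_prime, f_double_prime, x0=0.0)
print(x_star)  # 3.0 -- exact in one step, since f is itself quadratic
```

Because the quadratic approximation is exact for a quadratic $f$, Newton's method lands on the minimizer in a single step here.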
The second view of Newton's method comes from the Taylor series. The second-order Taylor expansion at $x_n$ is $f(x) \approx f(x_n) + f'(x_n)(x - x_n) + \frac{1}{2}f''(x_n)(x - x_n)^2$. This is a quadratic curve that approximates $f(x)$ near $x_n$. So we can move from $x_n$ to the point $x_{n+1}$ at which $f(x_n) + f'(x_n)(x_{n+1} - x_n) + \frac{1}{2}f''(x_n)(x_{n+1} - x_n)^2$ attains its extreme value.
Setting the derivative of this quadratic with respect to $x_{n+1}$ to zero, we get the same formula.
$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$Next, let's extend those views to the multivariate case. The original multivariate Newton's method follows the second view above. We take the second-order Taylor expansion at $x_n$ and move to the $x_{n+1}$ that is the extreme point of the quadratic approximation, found by setting its gradient to zero.
$$f(x) \approx f(x_n) + \nabla f(x_n)^T (x-x_n) + \frac{1}{2} (x-x_n)^T \nabla^2 f(x_n) (x-x_n).\\
\nabla f(x_n) + \nabla^2 f(x_n) (x_{n+1}-x_n) = 0\\
x_{n+1} = x_n - [\nabla^2 f(x_n)]^{-1}\nabla f(x_n).$$However, I have a different idea. Since we move in the direction opposite to the gradient, we can reduce the multivariate case to a one-dimensional one:
$$F(t) = f(x_n + vt) \ \text{where} \ v = \frac{\nabla f(x_n)}{\left\|\nabla f(x_n)\right\|}.$$Then, applying the one-dimensional Newton step to $F$ at $t = 0$, we have
$$\begin{aligned}
t_0 &= 0 - \frac{F'(0)}{F''(0)}\\
&= -\frac{\nabla f(x_n) \cdot v}{v^T\nabla^2 f(x_n) v}.
\end{aligned}$$Finally, we get the following update, which is different from the original method.
$$\begin{aligned}
x_{n+1} &= x_n + vt_0 = x_n - \frac{\nabla f(x_n) \cdot v}{v^T\nabla^2 f(x_n) v}v\\
&= x_n - \frac{1}{v^T\nabla^2 f(x_n) v}\nabla f(x_n).
\end{aligned}$$In the lecture, there is a formula $\theta_{k+1} = \theta_k - \eta \nabla J(\theta_k, \mathbf{x}_k)$. The professor said there is a theoretical result that if we choose $\eta = \frac{1}{N}$, this method converges to the minimum. I will figure it out soon. (Question, Machine Learning (CPSC 540).)
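The difference between the two multivariate updates is easy to see numerically. The sketch below, using an illustrative quadratic $f(x) = \frac{1}{2}x^T A x$ with a hand-picked matrix $A$ (so the Hessian is constant and equals $A$), compares the full Newton step $x - [\nabla^2 f]^{-1}\nabla f$ with the directional step $x - \frac{1}{v^T \nabla^2 f\, v}\nabla f$ derived above.

```python
# Compare the full Newton step with the directional (1D-reduction) step.
# f(x) = 0.5 x^T A x, so grad f = A x and the Hessian is A everywhere.
import numpy as np

A = np.array([[4.0, 0.0],
              [0.0, 1.0]])  # illustrative Hessian with unequal curvatures

def grad(x):
    return A @ x

def newton_step(x):
    # x - H^{-1} grad: solve the linear system rather than inverting H.
    return x - np.linalg.solve(A, grad(x))

def directional_step(x):
    # Newton along the normalized gradient direction v only.
    g = grad(x)
    v = g / np.linalg.norm(g)
    return x - g / (v @ A @ v)

x0 = np.array([1.0, 1.0])
print(newton_step(x0))       # [0. 0.] -- exact minimizer in one step
print(directional_step(x0))  # a different point; only the step LENGTH uses curvature
```

The full Newton step rescales each component by the inverse Hessian and jumps straight to the minimizer of the quadratic, while the directional step keeps the gradient direction and only picks its length from the curvature along $v$, so in general it lands somewhere else.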