Machine Learning (CPSC 540): Logistic regression
Basically, logistic regression regresses a binary variable. I drew a diagram different from the one in the lecture, and I think it will help you understand it well.
We can see that logistic regression is just training a single neuron. I understand a neuron as a function with properties $\Pi, \theta$, and training means determining $\theta$ given the data and $\Pi$. In this view, $\theta_0$ is a bias term, which is a property of the neuron. The output shouldn't be forced to $0$ whenever the input is $0$, so including a bias is plausible.
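As a sketch, the "single neuron" view might look like the following (the names `sigmoid` and `neuron` are my own, not from the lecture; the bias is stored as the first entry of $\theta$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, theta):
    # theta[0] is the bias term theta_0; theta[1:] are the input weights.
    return sigmoid(theta[0] + x @ theta[1:])

x = np.array([0.0, 0.0])
theta = np.array([0.5, 1.0, -2.0])
# With a zero input, the output is sigmoid(bias) = sigmoid(0.5), not 0,
# which is exactly what the bias term buys us.
print(neuron(x, theta))
```

Without $\theta_0$, a zero input would always give $\sigma(0) = 0.5$ regardless of the data, which illustrates why the bias is part of the neuron's learnable properties.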
We can get a recipe for learning here. After we construct a probabilistic model, we derive the likelihood function $P(\mathbb{y} \mid \mathbb{X}, \theta)$ and take its negative log to get a cost/objective function. If we can compute the gradient and Hessian analytically, we can use Newton's method to find the optimum.
In the logistic regression case, the Hessian of the negative log-likelihood $-\log P(\mathbb{y} \mid \mathbb{X}, \theta)$ is everywhere positive semidefinite, and positive definite whenever $X$ has full column rank. Thus the objective function is convex (strictly convex in the full-rank case), and Newton's method finds the unique global minimum.
I'm going to prove the first statement here. Since $H = X^T \mathrm{diag}(\pi_i(1-\pi_i))X$, for any vector $v$ we have $v^THv = v^TX^T\mathrm{diag}(\pi_i(1-\pi_i))Xv = (Xv)^T\mathrm{diag}(\pi_i(1-\pi_i))(Xv) = \sum_i z_i^2\pi_i(1-\pi_i) \geq 0 \ \textit{where} \ z=Xv$. Each $\pi_i(1-\pi_i)$ is strictly positive, so the sum is strictly positive unless $Xv = 0$; hence $H$ is everywhere positive semidefinite, and positive definite when $X$ has full column rank. This is not a hard technique: the same quadratic-form argument is used to prove the well-known criterion for testing whether a symmetric matrix is positive definite.
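Putting the recipe together, here is a minimal Newton's-method sketch for logistic regression, using the gradient $X^T(\pi - y)$ and the Hessian $X^T\mathrm{diag}(\pi_i(1-\pi_i))X$ from above (the function names and the toy data are my own; $X$ is assumed to already contain a column of ones for the bias):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=15):
    """Fit logistic regression by Newton's method.

    X includes a column of ones for the bias theta_0; y is a 0/1 vector.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = sigmoid(X @ theta)
        grad = X.T @ (pi - y)            # gradient of the negative log-likelihood
        w = pi * (1 - pi)
        H = X.T @ (X * w[:, None])       # Hessian X^T diag(w) X
        theta -= np.linalg.solve(H, grad)
    return theta

# Toy, non-separable data (made up for illustration).
X = np.array([[1, 0.5], [1, -1.0], [1, 2.0],
              [1, 0.2], [1, -0.3], [1, 1.5]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0], dtype=float)
theta = newton_logistic(X, y)
```

Note that if the data are linearly separable, the optimum is at infinity and $H$ becomes nearly singular; a small ridge term added to $H$ is a common practical fix, omitted here to keep the sketch close to the derivation.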
A proof of the second statement can be found here: https://wj32.org/wp/2013/02/26/convex-functions-second-derivatives-and-hessian-matrices/.
Finally, I'm going to interpret the objective function as a cross-entropy, using a simple coin-toss model. The likelihood function of the coin toss is $P(\mathbb{y} \mid \theta) = \prod_i \theta^{y_i}(1-\theta)^{1-y_i}$, and the negative log-likelihood, which is the objective function, is $-\log P(\mathbb{y} \mid \theta) = \sum_i -y_i\log\theta - (1-y_i)\log(1-\theta)$. This measures how much information we gain after observing the data $\mathbb{y}$. If $\mathbb{y}$ and $\theta$ agree, this value is low, because we don't gain much information from an expected result. We want to minimize the cross-entropy because we want $\mathbb{y}$ and $\theta$ to agree, and this agrees exactly with the maximum-likelihood method.
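To make the coin-toss example concrete, here is a small numerical check (the data and the grid search are my own illustration): minimizing the negative log-likelihood over $\theta$ recovers the empirical frequency of heads, just as maximum likelihood predicts.

```python
import numpy as np

def nll(theta, y):
    # Negative log-likelihood of Bernoulli(theta) for observations y,
    # i.e. the cross-entropy between the data and the model.
    return np.sum(-y * np.log(theta) - (1 - y) * np.log(1 - theta))

y = np.array([1.0, 1.0, 0.0, 1.0, 0.0])   # 3 heads in 5 tosses
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmin([nll(t, y) for t in thetas])]
# The minimizer on this grid is the empirical frequency 3/5 = 0.6.
print(best)
```

This is the grid-search analogue of setting the derivative of the negative log-likelihood to zero, which gives $\hat\theta = \frac{1}{N}\sum_i y_i$ analytically.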
It is no surprise that the negative log of the likelihood is a cross-entropy: the likelihood is the probability of the observed output values, and the negative log of a probability is the information content, which can be interpreted as the degree of agreement between the parameter and the output values. It also makes sense to describe this quantity with the term entropy, because high agreement means low uncertainty and low agreement means high uncertainty.
