Machine Learning (CPSC 540): Maximum likelihood and linear regression
Suppose we have a training data set $(\mathbf{x}_i, y_i)_{i=1}^{n}$, where $\mathbf{x}_i = (x_0=1, x_1, \cdots, x_m)$ and $\theta = (\theta_0, \theta_1, \cdots, \theta_m)$, and each $y_i$ follows a Gaussian distribution $y_i \sim N(\mathbf{x}_i^T\theta, \sigma^2)$ with unknown $\theta$ and $\sigma^2$. We can then estimate $\theta$ and $\sigma^2$ by Maximum Likelihood Estimation (MLE)$^{1)}$.
$^{1)}$ Note that MLE is not mathematically proven to be the best estimator; it is just one estimation method. However, by the weak law of large numbers, the MLE can be shown to be consistent, so with a large amount of sample data it is meaningful.
$$P(Y \mid X, \theta, \sigma^2) = (2\pi\sigma^2)^{-n/2}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\theta)^2}.$$All we have to do is maximize $P(Y \mid X, \theta, \sigma^2)$, or equivalently minimize the negative log-likelihood. This gives the estimates $\hat{\theta} = (X^TX)^{-1}X^TY$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\hat{\theta})^2$.
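The closed-form estimates above can be checked numerically. A minimal sketch with NumPy, using synthetic data (the true parameter values and sample size below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i ~ N(x_i^T theta, sigma^2), with a bias column x_0 = 1.
n, m = 200, 2
true_theta = np.array([1.0, 2.0, -0.5])  # (theta_0, theta_1, theta_2), chosen arbitrarily
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ true_theta + rng.normal(scale=0.3, size=n)  # noise std sigma = 0.3

# MLE of theta: solve the normal equations (X^T X) theta_hat = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MLE of the noise variance: mean squared residual (note the 1/n, not 1/(n-1)).
sigma2_hat = np.mean((y - X @ theta_hat) ** 2)
```

Using `np.linalg.solve` on the normal equations avoids explicitly forming the inverse $(X^TX)^{-1}$, which is both cheaper and numerically safer.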
Here we can see that if each $y_i$ in the data set follows a Gaussian distribution, the least-squares cost function of linear regression coincides with the negative log-likelihood, so least squares is exactly MLE under a Gaussian noise model.
After the estimation, we can predict the output for a new input $\mathbf{x}_*$ with the plug-in probability distribution $y_* \sim N(\mathbf{x}_*^T\hat{\theta}, \hat{\sigma}^2)$.
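A sketch of this plug-in prediction, refitting the model on synthetic one-feature data (the numbers are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fit on synthetic data with a bias column x_0 = 1.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, 1.5])                     # arbitrary true parameters
y = X @ theta_true + rng.normal(scale=0.2, size=n)    # noise std sigma = 0.2

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.mean((y - X @ theta_hat) ** 2)

# Plug-in predictive distribution for a new input x_*:
# y_* ~ N(x_*^T theta_hat, sigma2_hat).
x_star = np.array([1.0, 2.0])   # bias term plus feature value 2.0
mean_star = x_star @ theta_hat  # predictive mean
std_star = np.sqrt(sigma2_hat)  # predictive standard deviation
```

Note this plug-in distribution ignores the uncertainty in $\hat{\theta}$ itself; a Bayesian treatment would widen the predictive variance accordingly.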
Frequentist learning assumes that a true model exists. In this sense, MLE estimates the true model by maximizing the probability of the observed data.
In information theory, entropy $H$ is a measure of the uncertainty associated with a random variable.
$$H(X) = -\sum_{x}p(x \mid \theta) \log p(x \mid \theta).$$For a Bernoulli variable $X$ with parameter $\theta$, the entropy is $H(X) = -[(1-\theta)\log(1-\theta)+\theta \log\theta]$.
Plotting this entropy against $\theta$ shows that it is maximized when the probability is $\frac{1}{2}$. This means that the less information we have, the higher the entropy and uncertainty. In this sense, MLE minimizes uncertainty.
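The shape of the Bernoulli entropy curve can be verified directly; a small sketch (`bernoulli_entropy` is a helper name introduced here, not from the notes):

```python
import numpy as np

def bernoulli_entropy(theta):
    """H(X) = -[(1-theta) log(1-theta) + theta log(theta)], in nats."""
    theta = np.asarray(theta, dtype=float)
    # Suppress log(0) warnings; the 0*log(0) = 0 convention is applied below.
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -((1 - theta) * np.log(1 - theta) + theta * np.log(theta))
    return np.where((theta == 0) | (theta == 1), 0.0, h)

thetas = np.linspace(0.0, 1.0, 101)
H = bernoulli_entropy(thetas)
# Entropy peaks at theta = 1/2, where H = log 2 nats, and is 0 at theta = 0 or 1.
```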
