Machine Learning (CPSC 540): Bayesian learning
In the frequentist view, probabilities describe events that can be repeated many times: the parameter $\theta$ is treated as a fixed constant, and inference relies on the likelihood alone. In the Bayesian view, $\theta$ is itself a random variable, and we place a prior (an initial belief) on it. This view is useful when an event cannot be repeated many times, and for solving inverse problems. Inverse problems are generally much harder; for example, it is easy to turn text into sound, but the reverse is hard. The Bayesian view also resembles how machines learn: they start with prior knowledge, which is updated as they encounter data. It is convenient to use a conjugate prior for the likelihood function, so that the posterior distribution lies in the same family as the prior; this is called conjugate analysis.
The Bayesian view is easy to understand through the example of a coin toss. We want to estimate $\theta$, the probability of heads, and we encode our prior knowledge about the coin's $\theta$ as a probability distribution $p(\theta)$.
Suppose we test the coin by tossing it ten times. For each value of $\theta$ this gives a likelihood $p(D \mid \theta)$, where $D$ is the data, e.g. $D = HHTH \cdots T$. Now suppose the ten tosses came out $HTTHHTHTTT$. We estimate $\theta$ through the posterior distribution $p(\theta \mid D=HTTHHTHTTT)$, which we compute with Bayes' rule.
$$p(\theta \mid D) = \frac{p(D \mid \theta)p(\theta)}{p(D)}.$$Note that if we choose the prior to be the (asymptotically) uniform distribution, this coincides with maximum likelihood estimation: both $p(\theta)$ and $p(D)$ are then constant in $\theta$, so $$\mathrm{argmax}_{\theta} \ p(\theta \mid D) = \mathrm{argmax}_{\theta} \ \frac{p(D \mid \theta)p(\theta)}{p(D)} = \mathrm{argmax}_{\theta} \ p(D \mid \theta).$$This can be illustrated by the following diagram.
First we have the prior knowledge; then we meet the data (the likelihood); finally we obtain a posterior distribution, which in turn updates the prior knowledge.
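For this coin example the conjugate prior is the Beta distribution: with a $\mathrm{Beta}(a, b)$ prior on $\theta$ and Bernoulli tosses, the posterior is again Beta. A minimal sketch of the update (not from the lecture; the uniform $\mathrm{Beta}(1, 1)$ prior is my choice for illustration):

```python
def coin_posterior(data, a=1.0, b=1.0):
    """Update a Beta(a, b) prior on theta (probability of heads)
    with a string of coin tosses; returns the posterior Beta parameters."""
    heads = data.count("H")
    tails = data.count("T")
    return a + heads, b + tails

# The data from the example: 4 heads, 6 tails
a_n, b_n = coin_posterior("HTTHHTHTTT")
post_mean = a_n / (a_n + b_n)   # posterior mean of a Beta(a_n, b_n)
mle = 4 / 10                    # maximum-likelihood estimate, for comparison
```

With the uniform $\mathrm{Beta}(1,1)$ prior the posterior is $\mathrm{Beta}(5, 7)$ with mean $5/12 \approx 0.417$, slightly shrunk toward $1/2$ relative to the MLE of $0.4$.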
To make this concrete, let's apply the reasoning above to the Monty Hall problem.
Suppose we chose the first door; our prior knowledge $p(\theta)$ is
$$p(\theta = i) = \frac{1}{3}, \quad i=1,2,3.$$Now suppose door 2 is opened and there is nothing inside. Having seen this data, our knowledge of the position of the car changes. We call this the posterior knowledge, and we obtain it by
$$p(\theta \mid d=2) = \frac{p(d=2 \mid \theta)p(\theta)}{\sum_{\theta}p(d=2 \mid \theta)p(\theta)}, \\ p(\theta = 1 \mid d=2) = \frac{1/2 \times 1/3}{1/2} = \frac{1}{3}, \\ p(\theta = 3 \mid d=2) = \frac{1 \times 1/3}{1/2} = \frac{2}{3}.$$We can also apply the Bayesian view to linear regression. The likelihood function is $p(\mathbb{y} \mid \theta) = N(\mathbb{y} \mid X\theta, \sigma^2I_n)$, and we take the prior to be the conjugate prior for this likelihood, $p(\theta) = N(\theta \mid \theta_0, V_0)$; the conjugate prior for a Gaussian likelihood is again Gaussian. Then, by completing the square, we obtain a Gaussian posterior distribution.
$$p(\theta \mid X, \mathbb{y}, \sigma^2) = \frac{1}{|2\pi V_n|^{1/2}}e^{-\frac{1}{2}(\theta - \theta_n)^TV_n^{-1}(\theta - \theta_n)},$$where $\theta_n = V_nV_0^{-1}\theta_0+\frac{1}{\sigma^2}V_nX^T\mathbb{y}$ and $V_n^{-1} = V_0^{-1} + \frac{1}{\sigma^2}X^TX$.
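The posterior update above can be written as a short sketch in NumPy (my own implementation, with variable names mirroring the formulas; the toy data are hypothetical):

```python
import numpy as np

def posterior(X, y, sigma2, theta0, V0):
    """Posterior N(theta_n, V_n) for the Bayesian linear model
    y ~ N(X theta, sigma2 I) with prior theta ~ N(theta0, V0)."""
    V0_inv = np.linalg.inv(V0)
    # V_n^{-1} = V_0^{-1} + X^T X / sigma^2
    Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)
    # theta_n = V_n (V_0^{-1} theta_0 + X^T y / sigma^2)
    theta_n = Vn @ (V0_inv @ theta0 + X.T @ y / sigma2)
    return theta_n, Vn

# Toy data: y = 3x + noise, fit with a standard-normal prior on theta
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)
theta_n, Vn = posterior(X, y, sigma2=0.25, theta0=np.zeros(1), V0=np.eye(1))
```

With 50 observations the posterior mean lands close to the true slope of 3, and $V_n$ quantifies the remaining uncertainty.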
Now, consider the special case where $\theta_0 = 0$ and $V_0 = \tau_0^2I_d$. Then the posterior mean reduces to $\theta_n = (\lambda I_d + X^TX)^{-1}X^T\mathbb{y}$, where we have defined $\lambda := \frac{\sigma^2}{\tau_0^2}$. We have therefore recovered ridge regression. Also, if we take the prior to be the asymptotically uniform distribution ($\tau_0 \to \infty$), then $\lambda \to 0$ and we recover the MLE. Note that this is the same result we obtained earlier by setting the gradient of the cost function to zero.
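The ridge equivalence can be verified numerically: the posterior mean under a zero-mean isotropic prior matches the ridge solution exactly (a self-contained check with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
sigma2, tau2 = 0.5, 2.0
lam = sigma2 / tau2  # lambda = sigma^2 / tau_0^2

# Posterior mean with theta_0 = 0 and V_0 = tau_0^2 I
Vn = np.linalg.inv(np.eye(3) / tau2 + X.T @ X / sigma2)
theta_n = Vn @ (X.T @ y) / sigma2

# Ridge solution (lambda I + X^T X)^{-1} X^T y
theta_ridge = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)

assert np.allclose(theta_n, theta_ridge)
```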
How do we predict new data? To predict, Bayesians marginalize over the posterior. Let $x_*$ be a new input. The prediction given the training data $D=(X, \mathbb{y})$ is
$$p(\mathbb{y} \mid x_*, D, \sigma^2) = \int N(\mathbb{y} \mid x_*^T\theta, \sigma^2)N(\theta \mid \theta_n, V_n)\mathit{d}\theta = N(\mathbb{y} \mid x_*^T\theta_n, \sigma^2 + x_*^TV_nx_*).$$In the MLE view, the distribution of $\theta$ is entirely concentrated at the estimate $\hat{\theta}$, i.e. $N(\theta \mid \theta_n, V_n)$ is replaced by $\delta_{\hat{\theta}}(\theta)$, which gives $p(\mathbb{y} \mid x_*, D, \sigma^2) = N(\mathbb{y} \mid x_*^T\hat{\theta}, \sigma^2)$.
The point of the Bayesian prediction is that under the MLE the predictive variance is the same $\sigma^2$ for every input, whereas the Bayesian variance $\sigma^2 + x_*^TV_nx_*$ depends on the input, so we report higher uncertainty for inputs far from the training data.
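This input-dependent uncertainty is easy to demonstrate: train on inputs near the origin, then compare the predictive variance inside and outside the training range (my own sketch; the data and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, tau2 = 0.1, 10.0
X = rng.uniform(-1, 1, size=(30, 1))   # training inputs near the origin
y = X @ np.array([2.0]) + rng.normal(scale=np.sqrt(sigma2), size=30)

# Posterior N(theta_n, V_n) with theta_0 = 0, V_0 = tau^2 I
Vn = np.linalg.inv(np.eye(1) / tau2 + X.T @ X / sigma2)
theta_n = Vn @ (X.T @ y) / sigma2

def predict(x_star):
    """Predictive mean x*^T theta_n and variance sigma^2 + x*^T V_n x*."""
    mean = x_star @ theta_n
    var = sigma2 + x_star @ Vn @ x_star
    return float(mean), float(var)

_, var_near = predict(np.array([0.5]))   # inside the training range
_, var_far = predict(np.array([10.0]))   # far outside it
assert var_far > var_near                # uncertainty grows away from the data
```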
Finally, I want to record a useful result about Gaussians.
$\textrm{For} \ p(\mathbb{x}) = N(\mathbb{x} \mid \mu_x, \Sigma_x) \\ p(\mathbb{y} \mid \mathbb{x}) = N(\mathbb{y} \mid A\mathbb{x} + \mathbb{b}, \Sigma_y),$
$\textrm{we have} \ p(\mathbb{x} \mid \mathbb{y}) = N(\mathbb{x} \mid \mu_{x|y}, \Sigma_{x|y}) \ \textit{where} \\ \Sigma_{x|y}^{-1} = \Sigma_x^{-1} + A^T\Sigma_y^{-1}A \\ \mu_{x|y} = \Sigma_{x|y}[A^T\Sigma_y^{-1}(\mathbb{y} - \mathbb{b}) + \Sigma_x^{-1}\mu_x]$
$\textrm{and}$
$p(\mathbb{y}) = N(\mathbb{y} \mid A\mu_x + \mathbb{b}, \Sigma_y + A\Sigma_xA^T)$.
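The marginal $p(\mathbb{y})$ above can be sanity-checked by Monte Carlo: sample $\mathbb{x}$ from the prior, pass it through the linear-Gaussian model, and compare the empirical mean and covariance of $\mathbb{y}$ with the closed form (my own check; the matrices below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_x = np.array([1.0, -1.0])
Sigma_x = np.array([[1.0, 0.3], [0.3, 2.0]])
A = np.array([[2.0, 0.0], [1.0, 1.0]])
b = np.array([0.5, -0.5])
Sigma_y = 0.25 * np.eye(2)

# Closed-form marginal: p(y) = N(A mu_x + b, Sigma_y + A Sigma_x A^T)
mu_y = A @ mu_x + b
cov_y = Sigma_y + A @ Sigma_x @ A.T

# Monte Carlo: x ~ p(x), then y = A x + b + noise with noise ~ N(0, Sigma_y)
n = 200_000
xs = rng.multivariate_normal(mu_x, Sigma_x, size=n)
ys = xs @ A.T + b + rng.multivariate_normal(np.zeros(2), Sigma_y, size=n)

assert np.allclose(ys.mean(axis=0), mu_y, atol=0.02)
assert np.allclose(np.cov(ys.T), cov_y, atol=0.05)
```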