Machine Learning (CPSC 540): Importance sampling

Importance sampling is all about distribution changing. I'm going to explain this first with simple coin toss example, and move on to the posterior case.

Suppose that we want to toss a coin which has the probability $P(H)=\frac{1}{3}, P(T)=\frac{2}{3}$. However, we don't have that kind of coin. Then, how to simulate this situation with an ordinary coin? Suppose that we tossed an ordinary coin $2N$ times and we get $N$ heads and $N$ tails. We want to regard $N$ heads as $\frac{1}{3}\cdot2N$ heads and $N$ tails as $\frac{2}{3}\cdot2N$ tails. If we assign weights to each event like $\frac{2}{3}$ for head and $\frac{4}{3}$ for tail, we can get desired result.

Let's extend this result to the posterior case. We want to simulate $P(\theta\mid D)$ with $q(\theta)$. With some calculations we can get
$$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{\int P(D \mid \theta)P(\theta)\mathit{d}\theta} = \frac{1}{z}\frac{P(D \mid \theta)P(\theta)}{q(\theta)}q(\theta) = \frac{w(\theta)}{z}q(\theta).$$Now consider small interval $d\theta$. If we sample $N$ points from $q$, and we get $n_q$ points in the $d\theta$, we can get $q(\theta')d\theta = \frac{n_q}{N}$. Then, how can we simulate sampling from $P(\theta \mid D)$? Suppose that we sampled $N$ points from $P(\theta \mid D)$ and got $n_p$ points in the $d\theta$. Then, we get
$$\frac{n_p}{N} = P(\theta' \mid D)d\theta = \frac{w(\theta')}{z}q(\theta')d\theta = \frac{w(\theta')}{z}\cdot\frac{n_q}{N}.$$Thus, if we assign weight $\frac{w(\theta')}{z}$ to points belong to the interval $d\theta$, we can get desired result.
On that account, I think it must be
$$P(d\theta \mid D) = \frac{\frac{1}{N}\sum w(\theta^i)\delta_{\theta^i}(d\theta)}{\frac{1}{N}\sum w(\theta^i)}.$$in page 4 of the lecture note.

Now, how can we choose good distribution $q$? Obviously, we have to choose $q$ that covers the domain of the posterior. Next, we should choose $q$ which has enough variance. If we chose $q$ which has small variance we wouldn't cover the posterior enough like in figure below. We can use cross-validation to choose suitable variance of $q$.

댓글