Machine Learning (CPSC 540): Gaussian process

First, I misunderstood the Gaussian process because of the name "process": I thought it was a statistical process like the Bernoulli process. However, the basic idea of the Gaussian process is a distribution over functions. We can't represent a distribution over functions directly. Instead, we look at the values of a function at arbitrary points $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$ and assume that those values follow a multivariate Gaussian, i.e. $(f(\mathbf{x}_1), f(\mathbf{x}_2), \cdots, f(\mathbf{x}_n)) \sim N(\mu, \Sigma)$. In my opinion, this is a marginal distribution: I imagine that there is an underlying distribution over functions, and although we don't know that distribution exactly, we do know its marginal over any finite set of points. We denote this by $f(\mathbf{x}) \sim GP(m(\mathbf{x}), \kappa(\mathbf{x}, \mathbf{x}'))$, where $m(\mathbf{x}) = E[f(\mathbf{x})]$ and $\kappa(\mathbf{x}, \mathbf{x}') = E[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$.
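This finite-dimensional view can be sketched in a few lines of NumPy: pick some points, build a mean vector and a covariance matrix (here a squared-exponential covariance, chosen just for illustration), and draw $(f(\mathbf{x}_1), \cdots, f(\mathbf{x}_n))$ as one sample from a multivariate Gaussian.

```python
import numpy as np

# Arbitrary input points at which we evaluate the "function"
x = np.linspace(0.0, 5.0, 6)

# Zero mean function m(x) = 0 and a squared-exponential covariance
# (assumptions for illustration; any valid kernel would do)
mu = np.zeros_like(x)
Sigma = np.exp(-(x[:, None] - x[None, :]) ** 2)

# One draw of (f(x_1), ..., f(x_n)) ~ N(mu, Sigma):
# a single sample is one set of function values at the chosen points
rng = np.random.default_rng(0)
f = rng.multivariate_normal(mu, Sigma)
print(f.shape)  # one value per input point: (6,)
```

Drawing many such samples and plotting each one against `x` gives the familiar picture of "random functions" drawn from a GP prior.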

A Gaussian process is a stochastic process; that is, we assume the data are dependent on each other. The main idea of the GP is smoothness: whenever we draw function values $f(\mathbf{x}_1), f(\mathbf{x}_2), \cdots, f(\mathbf{x}_n)$ from the distribution, if $\mathbf{x}_i$ and $\mathbf{x}_j$ are close, we want $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$ to be close as well. We can meet this desire through the covariance matrix: when the covariance between two random variables is high, they behave similarly. Therefore, if we build the covariance matrix of the multivariate Gaussian as $\Sigma_{ij} = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2}$, then whenever $\mathbf{x}_i$ and $\mathbf{x}_j$ are close, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$ are also close. Here we used the Gaussian kernel $\kappa(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2}$, but we can replace it with any positive-definite function that measures the closeness of points.
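A quick numerical check of this smoothness argument: with the Gaussian kernel above, nearby inputs get a covariance near 1 (so their function values are strongly coupled), while distant inputs get a covariance near 0 (so their function values are nearly independent). The specific points below are made up for illustration.

```python
import numpy as np

def sq_exp_kernel(xi, xj):
    # The Gaussian (squared-exponential) kernel from the text
    return np.exp(-np.abs(xi - xj) ** 2)

x = np.array([0.0, 0.1, 3.0])
K = sq_exp_kernel(x[:, None], x[None, :])  # full 3x3 covariance matrix

# Close pair (0.0, 0.1): covariance close to 1
print(K[0, 1])  # exp(-0.01), about 0.99
# Far pair (0.0, 3.0): covariance close to 0
print(K[0, 2])  # exp(-9), about 1.2e-4
```

So sampling from $N(0, K)$ forces $f(0.0)$ and $f(0.1)$ to move together while leaving $f(3.0)$ almost free, which is exactly the smoothness behavior we wanted.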

One thing that makes the GP plausible is its uncertainty estimate: near observed data points the uncertainty is low, and away from the data the uncertainty grows back toward the prior.
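This behavior falls out of conditioning one multivariate Gaussian on another. The sketch below uses the standard Gaussian conditioning formulas for the posterior mean and covariance (not derived in this note) with made-up, noise-free training points; the posterior standard deviation is near zero at an observed point and approaches the prior standard deviation far from the data.

```python
import numpy as np

def kernel(a, b):
    # Squared-exponential kernel, as in the text
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

# Hypothetical noise-free observations of f
x_train = np.array([1.0, 3.0])
y_train = np.array([0.5, -0.2])

# Test points: one on top of a data point, one between, one far away
x_test = np.array([1.0, 2.0, 5.0])

K = kernel(x_train, x_train) + 1e-8 * np.eye(2)  # tiny jitter for stability
K_s = kernel(x_test, x_train)
K_ss = kernel(x_test, x_test)

# Gaussian conditioning: posterior mean and covariance of f at x_test
mean = K_s @ np.linalg.solve(K, y_train)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print(std)  # ~0 at x=1.0, moderate at x=2.0, near the prior (1.0) at x=5.0
```

The `solve` calls avoid forming an explicit matrix inverse, which is the usual numerically safer way to apply $K^{-1}$.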
