Natural Language Processing (CS224N): Word Vectors
How do we represent the meaning of a word? Previously, we used a taxonomy such as WordNet, but it requires human labor, makes similarity hard to compute, and misses nuances. We also used localist representations such as one-hot encodings, but they do not capture similarity relationships. In contrast, distributional-similarity-based representations such as skip-gram and CBOW are good at computing similarity: they represent words with dense vectors, so similarity can be computed easily with a dot product. This idea comes from the maxim "You shall know a word by the company it keeps" by the English linguist John Rupert Firth. That is, we gain a lot of value by representing a word by means of its neighbors, and the skip-gram method really does compute a word's vector representation from its neighbors.
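To make the dot-product similarity concrete, here is a minimal sketch with made-up dense vectors (the words, dimension, and values are assumptions for illustration, not trained embeddings):

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; the values are made up for illustration.
vec = {
    "king":  np.array([0.8, 0.3, 0.1, 0.9]),
    "queen": np.array([0.7, 0.4, 0.2, 0.8]),
    "apple": np.array([0.1, 0.9, 0.8, 0.0]),
}

def cosine(a, b):
    """Cosine similarity: the dot product of the normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["king"], vec["queen"]))  # high: similar words
print(cosine(vec["king"], vec["apple"]))  # low: dissimilar words
```

With one-hot vectors every pair of distinct words would have dot product 0, which is exactly the missing-similarity problem dense vectors solve.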
There were some additional notes from the lecture. The skip-gram method cares about the neighbors of the center word, but it does not care how far away a context word is. There are two vector representations for each word, one as a center word and one as a context word, because this is practically and computationally convenient. Lastly, the $e$ in the softmax makes every value positive, which is appropriate for producing probabilities. Softmax is really a "soft" max because it exaggerates gaps, and because of this exaggeration there are other approaches that replace softmax.
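The "soft max" behavior can be seen in a tiny example (the scores are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
probs = softmax(scores)
# The scores are evenly spaced, but the probabilities are not:
# the largest score takes a disproportionate share of the mass (~0.67).
print(probs)
```

This gap-exaggerating behavior is why softmax tends to act like a soft version of argmax.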
The most important part of this lecture is computing gradient of skip-gram model. The following derivation will fill the missing part.
Firstly, we define the setting formally. The number of words in the vocabulary is $V$, the context word vectors are $u_1, u_2, \cdots, u_V \in \mathbb{R}^{d}$, and the center word vectors are $v_1, v_2, \cdots, v_V \in \mathbb{R}^{d}.$ With these notations, we define the parameter of the skip-gram model as $\theta = (u_1, u_2, \cdots, u_V, v_1, v_2, \cdots, v_V) \in \mathbb{R}^{2dV}.$ For convenience, we introduce additional notation: $u_n(x)$ denotes the $2dV$-dimensional vector formed by substituting the $d$-dimensional vector $x$ into the position of $u_n$ in $\theta$ and filling all other components with 0. Informally, $u_n(x) = (0_1, 0_2, \cdots, 0_{n-1}, x, 0_{n+1}, \cdots, 0_V, 0_1, 0_2, \cdots, 0_V),$ where each $0_i$ stands for a $d$-dimensional zero block. We define $v_n(x)$ similarly.
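As a concrete illustration of this notation, here is a minimal NumPy sketch of packing the vectors into $\theta$ and building $u_n(x)$ and $v_n(x)$ (the toy sizes and values are made up for illustration):

```python
import numpy as np

V, d = 3, 2  # toy vocabulary size and embedding dimension
U = np.arange(V * d, dtype=float).reshape(V, d)       # context vectors u_1..u_V (rows)
W = np.arange(V * d, dtype=float).reshape(V, d) + 10  # center vectors v_1..v_V (rows)

# theta = (u_1, ..., u_V, v_1, ..., v_V), a vector in R^{2dV}
theta = np.concatenate([U.ravel(), W.ravel()])

def u_n(n, x):
    """u_n(x): zeros everywhere in R^{2dV} except x in the slot of u_n (1-indexed n)."""
    out = np.zeros(2 * d * V)
    out[(n - 1) * d : n * d] = x
    return out

def v_n(n, x):
    """v_n(x): zeros everywhere in R^{2dV} except x in the slot of v_n (1-indexed n)."""
    out = np.zeros(2 * d * V)
    out[d * V + (n - 1) * d : d * V + n * d] = x
    return out
```

Keeping the gradient in this $2dV$-dimensional form is what lets the per-vector update rules below fall out cleanly.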
Next, we define the objective function $J: \mathbb{R}^{2dV} \to \mathbb{R}$ for a single (center, context) pair; since the differential operator is linear, the gradient of the full corpus objective is simply the sum of these per-pair gradients.
\begin{align}
J(\theta) &= -\log{p(o \mid c)}\\
&= -\log{ \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)} }
\end{align}The last part is computing the gradient of $J$. Here we must be careful with $u_o, u_w,$ and $v_c$: $J$ is a real-valued function on $\mathbb{R}^{2dV}$, so its gradient is a $2dV$-dimensional vector.
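Before differentiating, the objective itself can be sketched numerically. This is a toy NumPy example with random vectors; the vocabulary size, dimension, seed, and word indices are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))    # context word vectors u_1..u_V (rows)
W = rng.normal(size=(V, d))    # center word vectors v_1..v_V (rows)

def loss(o, c):
    """Naive-softmax loss J = -log p(o | c) for context word o and center word c."""
    logits = U @ W[c]                     # u_w^T v_c for every word w
    log_Z = np.log(np.exp(logits).sum())  # log of the softmax denominator
    return log_Z - logits[o]              # -log p(o | c)

print(loss(o=2, c=0))
```

Note that the loss is always positive, since the denominator's sum contains $\exp(u_o^T v_c)$ plus other strictly positive terms.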
\begin{align}
\frac{\partial}{\partial \theta} J(\theta) &= \frac{\partial}{\partial \theta} \left[ -\log{p(u_o \mid v_c)} \right]\\
&= \frac{\partial}{\partial \theta} \left[ -\log{ \frac{\exp(u_o^T v_c)}{ \sum_{w=1}^{V}\exp(u_w^T v_c)} } \right]\\
&= \frac{\partial}{\partial \theta} \log{\sum_{w=1}^{V}\exp(u_w^T v_c)} - \frac{\partial}{\partial \theta} u_o^T v_c\\
&= \sum_{k=1}^{V} \frac{\exp(u_k^T v_c) \cdot \frac{\partial}{\partial \theta} (u_k^T v_c)}{\sum_{w=1}^{V}\exp(u_w^T v_c)} - \frac{\partial}{\partial \theta} (u_o^T v_c)\\
&= \sum_{k=1}^{V} \frac{\exp(u_k^T v_c) (v_c(u_k) + u_k(v_c))}{\sum_{w=1}^{V}\exp(u_w^T v_c)} - (v_c(u_o) + u_o(v_c))\\
&= \sum_{k=1}^{V} (v_c(u_k) + u_k(v_c)) \cdot p( u_k \mid v_c) - (v_c(u_o) + u_o(v_c))\\
&= v_c( \mathbb{E}(u \mid v_c) - u_o ) + (\mathbb{E}(u_k(v_c) \mid v_c) - u_o(v_c)).
\end{align}The first part of the last line has the form "expectation minus observed," and it influences the center word vector $v_c$. We can also see that $v_c$ is updated only when the word $c$ is chosen as the center word. The second part influences all of the context word vectors. The appearance of "expectation minus observed" is not surprising: maximizing $p(u_o \mid v_c)$ amounts to pulling the expectation toward $u_o$.
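As a sanity check on the derivation, the gradient with respect to $v_c$ ("expectation minus observed") and the gradients with respect to each $u_k$ can be sketched in NumPy; the toy sizes, seed, and indices are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                    # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))    # context word vectors u_1..u_V (rows)
W = rng.normal(size=(V, d))    # center word vectors v_1..v_V (rows)

def gradients(o, c):
    """Gradients of J = -log p(o | c) for context word o and center word c."""
    logits = U @ W[c]                  # u_w^T v_c for every word w
    p = np.exp(logits - logits.max())
    p /= p.sum()                       # p(u_k | v_c) for all k
    grad_vc = U.T @ p - U[o]           # dJ/dv_c = E[u | v_c] - u_o
    grad_U = np.outer(p, W[c])         # each u_k receives p(u_k | v_c) * v_c ...
    grad_U[o] -= W[c]                  # ... and the observed word o also receives -v_c
    return grad_vc, grad_U
```

This mirrors the derivation: every context vector gets a small push proportional to its predicted probability, while only the observed context word $u_o$ and the active center word $v_c$ receive the "observed" correction.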
Finally, this is only the naive softmax method; there are more efficient alternatives such as hierarchical softmax and negative sampling. Anyone working seriously in this area should study them as well.