Adversarial Training for Community Question Answer Selection Based on Multi-scale Matching (Xiao Yang et al. 2018)
Summary
The main topic of this research is CQA answer selection, which aims to automatically retrieve archived answers that are relevant to a newly submitted question. Previous works on this task explicitly model the correlations between text fragments in questions and answers, predict a relevance score for each candidate answer, and feed it into a binary classifier (relevant/irrelevant). The candidate answers are then re-ranked to find the most appropriate one.
1. Novel contribution
1) Multi-scale matching model
While prior works model only word-to-word correlations, the proposed model inspects correlations between words and n-grams (word-to-ngram) at different levels of granularity. This allows the model to capture richer context and therefore better differentiate good answers from bad ones. It employs a deep convolutional neural network (CNN) to learn a hierarchical representation of each sentence.
2) Adversarial training framework
When treating CQA selection as a binary classification task, a practical issue is how to construct negative samples. Prior works usually construct them by sampling uniformly from the dataset. However, such a large number of low-quality negative samples can lead to a class-imbalance problem. In contrast, the proposed method presents an adversarial training framework that generates challenging negative samples to fool the classification model. A relatively small number of these high-quality, realistic negative samples is more effective at fooling, and thus training, the classifier.
2. Result
The proposed method is evaluated on the SemEval 2017 and Yahoo Answers datasets and achieves state-of-the-art performance.
Solution
Given a question-answer pair $Q = (q_1, \cdots, q_m)$ and $A = (a_1, \cdots, a_n)$, we obtain a relevance score $f_\theta (Q, A)$. Every word is represented by a $d$-dimensional GloVe embedding. We use $f_\theta$ to build a binary classifier (relevant/irrelevant); after training, we compare scores over the candidate set $\mathbf{A} = \{A_i\}$ and select the top-scoring answer.
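At inference time the trained scorer simply ranks the candidates. A minimal sketch, with made-up scores standing in for $f_\theta(Q, A_i)$:

```python
# Hypothetical relevance scores f_theta(Q, A_i) for three candidate answers.
scores = {"A1": 0.3, "A2": 1.7, "A3": -0.4}

# The candidate with the highest score is returned as the top answer.
best = max(scores, key=scores.get)
print(best)  # → A2
```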
1. Multi-scale Matching model
The multi-scale matching model defines $f_\theta$. To compress the semantic information at different levels of granularity, we transform a question-answer pair into $(Q, Q^1, Q^2, \cdots, Q^K)$ and $(A, A^1, A^2, \cdots, A^K)$, where $Q^k = \mathrm{conv\_block}^k (Q^{k-1})$ and $A^k = \mathrm{conv\_block}^k (A^{k-1})$. For example, $Q_i^0$ represents the $i$-th word embedding, while $Q_i^1$ represents the context information of a 5-gram, since the receptive field is 5.
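A toy numpy sketch of how stacked conv blocks grow the receptive field. The kernel size, random weights, and lack of pooling here are illustrative, not the paper's actual conv-block design; with kernel size 3, two blocks give the 5-gram receptive field mentioned above:

```python
import numpy as np

def conv_block(x, w):
    """Toy 1-D convolution (valid padding, stride 1) with ReLU over a
    (seq_len, d_in) input; w has shape (k, d_in, d_out)."""
    k = w.shape[0]
    n = x.shape[0] - k + 1
    out = np.stack([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                    for i in range(n)])
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
Q0 = rng.normal(size=(10, 8))                          # Q^0: 10 words, 8-dim embeddings
Q1 = conv_block(Q0, rng.normal(size=(3, 8, 8)) * 0.1)  # each position sees 3 words
Q2 = conv_block(Q1, rng.normal(size=(3, 8, 8)) * 0.1)  # each position sees a 5-gram
print(Q1.shape, Q2.shape)                              # (8, 8) (6, 8)
```

Each extra block widens the span of input words a single feature vector summarizes, which is what lets later matching compare a word against progressively longer n-grams.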
After that, we follow the "matching-aggregating" framework with multi-scale matching. For a specific pair of feature representations $Q^u$ and $A^v$, we define a matching function $\mathcal{M}(Q^u, A^v) = \mathcal{M}^{(u, v)}$ to measure the relation between them. Figure 1 illustrates the matching function.
Based on the matching function $\mathcal{M}$, the score function can be formulated as $f_\theta (Q, A) = \mathcal{G}([\{\mathcal{M}^{(u, v)}\}])$, where $[\{\mathcal{M}^{(u, v)}\}]$ denotes the concatenation of all possible matching results and $\mathcal{G}$ is a real-valued function realized by a two-layer fully connected neural network. To consider only word-to-word and word-to-ngram matchings, the score function simplifies to $f_\theta (Q, A) = \mathcal{G}([\{\mathcal{M}^{(0, v)}\}, \{\mathcal{M}^{(u, 0)}\}])$.
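A minimal sketch of one way to realize $\mathcal{M}$ and $\mathcal{G}$. The cosine-similarity matching and the tiny random-weight network here are stand-ins; the paper's actual matching function is the one in Figure 1:

```python
import numpy as np

def match(Qu, Av):
    """Toy M(Q^u, A^v): best cosine similarity of each question position
    against all answer positions, summarized as a fixed-size (mean, max)."""
    Qn = Qu / np.linalg.norm(Qu, axis=1, keepdims=True)
    An = Av / np.linalg.norm(Av, axis=1, keepdims=True)
    best = (Qn @ An.T).max(axis=1)
    return np.array([best.mean(), best.max()])

def f_theta(Q_scales, A_scales, W1, b1, W2, b2):
    """Concatenate word-to-ngram matchings {M(Q^0, A^v)} and {M(Q^u, A^0)},
    then apply G, a two-layer fully connected network, to get a scalar score."""
    feats = np.concatenate(
        [match(Q_scales[0], Av) for Av in A_scales] +
        [match(Qu, A_scales[0]) for Qu in Q_scales[1:]])
    h = np.maximum(feats @ W1 + b1, 0.0)
    return float(h @ W2 + b2)

rng = np.random.default_rng(1)
Q_scales = [rng.normal(size=(n, 8)) for n in (10, 8, 6)]   # Q^0, Q^1, Q^2
A_scales = [rng.normal(size=(n, 8)) for n in (12, 10, 8)]  # A^0, A^1, A^2
d_in = 2 * (len(A_scales) + len(Q_scales) - 1)             # 10 matching features
W1, b1 = rng.normal(size=(d_in, 4)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0
score = f_theta(Q_scales, A_scales, W1, b1, W2, b2)
print(score)
```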
2. Adversarial training framework
The proposed model uses an adversarial training framework to train $f_\theta$. Conditioned on the question sentence $Q$, the objective function can be written as:
\[
J(G, D) = \min_G \max_D \mathbb{E}_{A \sim p_{data}(A \mid Q)} [ \log{D(A \mid Q)} ] + \mathbb{E}_{A' \sim p_G (A' \mid Q)} [ \log{(1 - D(A' \mid Q))} ]
\] A multi-scale matching model is used for each of $G$ and $D$. Given a set of candidate answers $\mathbf{A} = \{A_i\}$ for a specific question $Q$, $D(A \mid Q)$ and $p_G (A \mid Q)$ are modeled by:
\begin{align*}
D(A \mid Q) &= \sigma (f_\theta (Q, A))\\
p_G (A_i \mid Q) &= \frac{\exp{(f_{\theta'}(Q, A_i) / \tau)}}{\sum_j \exp{(f_{\theta'}(Q, A_j) / \tau)}}
\end{align*}with $\sigma$ being a sigmoid function and $\tau$ a temperature hyper-parameter. $p_G$ is a discrete probability distribution; to make it cheaper to compute, we uniformly sample a smaller alternative answer set $\tilde{A}$. Although not stated in the paper, for this training to work $p_{data}(A \mid Q)$ should be the distribution of positive answers and $p_G (A' \mid Q)$ the distribution of negative answers. During training, $f_\theta$ thus tries to assign high scores to the positive answers, while the generator $f_{\theta'}$ tries to assign more probability to realistic negative answers.
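A small sketch of the two model outputs above; the scores are made up, standing in for the outputs of the two multi-scale matching networks:

```python
import numpy as np

def D(score):
    """Discriminator probability: sigmoid of f_theta(Q, A)."""
    return 1.0 / (1.0 + np.exp(-score))

def p_G(scores, tau=1.0):
    """Generator distribution over candidate answers: temperature softmax
    of f_theta'(Q, A_i)."""
    z = np.asarray(scores, dtype=float) / tau
    z -= z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [2.0, 1.0, -1.0]             # hypothetical f_theta'(Q, A_i)
print(D(2.0))                         # ≈ 0.88
print(p_G(scores, tau=1.0))
print(p_G(scores, tau=0.2))           # lower tau concentrates on hard negatives
```

Lowering $\tau$ sharpens $p_G$, so the generator proposes the most convincing negatives more often.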
There are two methods to handle the non-differentiable discrete sampling process: policy gradient and the Gumbel-Softmax trick. Here, the policy gradient (REINFORCE) method is used, which differentiates the objective directly and approximates the expectation by sampling.
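A sketch of the score-function (REINFORCE) estimator for a categorical generator, checked against the closed-form gradient. The reward values are arbitrary, standing in for the discriminator's feedback on each sampled answer:

```python
import numpy as np

def reinforce_grad(logits, rewards, n_samples=20000, rng=None):
    """Estimate d/d logits of E_{i ~ softmax(logits)}[rewards[i]] by sampling:
    the average of rewards[i] * grad log p(i)."""
    rng = rng or np.random.default_rng(0)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    idx = rng.choice(len(p), size=n_samples, p=p)
    grad = np.zeros_like(p)
    for i in idx:
        g_logp = -p.copy()
        g_logp[i] += 1.0              # gradient of log softmax w.r.t. logits
        grad += rewards[i] * g_logp
    return grad / n_samples

logits = np.array([0.0, 1.0, 0.5])
rewards = np.array([1.0, 0.0, 0.5])   # stand-in for discriminator feedback
est = reinforce_grad(logits, rewards)

# Closed form for a categorical distribution: grad_j = p_j * (r_j - E[r]).
p = np.exp(logits - logits.max()); p /= p.sum()
exact = p * (rewards - p @ rewards)
print(est, exact)                     # agree up to sampling noise
```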
Experiment & Result
We evaluate the proposed method on two benchmark datasets: SemEval 2017 and Yahoo Answers. Evaluation metrics are mean average precision (MAP), mean reciprocal rank (MRR), and precision at rank 1 (P@1).
Conclusion & Comment
Beyond prior CQA selection solutions, the proposed model inspects correlations between words and n-grams (word-to-ngram) to capture richer context. It also presents an adversarial training framework that generates challenging negative samples to make the classification model more robust. Finally, it achieves the best results on the SemEval 2017 and Yahoo Answers datasets. This research therefore contributes considerably to the CQA answer selection task.


