Adversarial Training for Community Question Answer Selection Based on Multi-scale Matching (Xiao Yang et al. 2018)

Summary The main topic of this research is CQA answer selection, which aims to automatically retrieve archived answers that are relevant to a newly submitted question. Previous works on this task explicitly model the correlations between text fragments in questions and answers, predict a relevance score for each candidate answer to a question, and feed it into a binary classifier (relevant/irrelevant). After that, they re-rank the answers to find the most appropriate one. 1. Novel contributions 1) Multi-scale matching model While prior works consider only word-to-word correlations, the proposed model inspects correlations between words and n-grams (word-to-n-gram) at different levels of granularity. This allows the proposed model to capture richer context, so it can better differentiate good answers from bad ones. The proposed model employs a deep convolutional neural network (CNN) to learn a hierarchical representation for each sentence. 2) Adversarial training framework ...
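A rough sketch of the word-to-n-gram matching idea, not the authors' architecture: a small 1-D CNN (PyTorch assumed, all dimensions made up) builds answer representations at increasingly coarse granularities, and each scale is matched against the question's word embeddings with a dot product before pooling into a relevance score.

```python
# Minimal multi-scale matching sketch (hypothetical dimensions, not the paper's model).
import torch
import torch.nn as nn

class MultiScaleMatcher(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each Conv1d widens the receptive field: ~unigram -> ~trigram -> ~5-gram.
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
            for _ in range(2)
        ])

    def forward(self, question_ids, answer_ids):
        q = self.embed(question_ids)                 # (B, Lq, D) question word vectors
        a = self.embed(answer_ids).transpose(1, 2)   # (B, D, La) answer word vectors
        reps = [a]
        for conv in self.convs:
            reps.append(torch.relu(conv(reps[-1])))  # progressively larger n-gram reps
        scores = []
        for r in reps:
            match = torch.bmm(q, r)                  # (B, Lq, La) word-to-ngram matching matrix
            scores.append(match.max(dim=2)[0].mean(dim=1))
        return torch.stack(scores, dim=1).mean(dim=1)  # one relevance score per Q-A pair

q = torch.randint(0, 10000, (4, 12))   # batch of 4 questions, 12 tokens each
a = torch.randint(0, 10000, (4, 30))   # 4 candidate answers, 30 tokens each
print(MultiScaleMatcher()(q, a).shape)  # torch.Size([4])
```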

Natural Language Processing (CS224N): Word Vectors

How do we represent the meaning of a word? Previously, we used a taxonomy like WordNet, but it requires human labor, makes similarity hard to compute, and misses nuances. We also used localist representations such as one-hot encoding, but they do not capture similarity relationships. In contrast, distributional-similarity-based representations such as skip-gram and CBOW are good at computing similarity because they represent words with dense vectors, and we can easily compute similarity with a dot product. This idea comes from the philosophy "You shall know a word by the company it keeps" by the English linguist John Rupert Firth. It means we can get a lot of value by representing a word by means of its neighbors, and the skip-gram method really does compute a word's vector representation from its neighbors. There were some notes from the lecture. They were the following: the skip-gram method cares about the neighbors of the center word, but it doesn't care how far...
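A toy sketch of the "dense vectors make similarity easy" point, with made-up 3-dimensional vectors (real word vectors have hundreds of dimensions):

```python
# Similarity between dense word vectors is just a (normalized) dot product.
import numpy as np

vecs = {
    "king":  np.array([0.8, 0.3, 0.1]),   # illustrative values only
    "queen": np.array([0.7, 0.4, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vecs["king"], vecs["queen"]))  # high: related words point in similar directions
print(cosine(vecs["king"], vecs["apple"]))  # lower: unrelated words
```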

Natural Language Processing (CS224N): Introduction

Lecture: Natural Language Processing (2017, Stanford) Professor: Christopher Manning and Richard Socher Link:  https://www.youtube.com/watch?v=OQQ-W_63UgQ&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6 Slides:  https://github.com/maxim5/cs224n-winter-2017 Goal of NLP For computers to process or "understand" natural language in order to perform useful tasks, e.g. machine translation and QA. Why is NLP hard? Human language is not a formal language. It depends on real-world, common-sense, and contextual knowledge. The large vocabulary and symbolic encoding of words lead to a sparsity problem. Traditional machine learning vs. deep learning Traditional approach: use human-designed representations. Deep learning: facts are stored in vectors. NLP progression 1. Mostly solved Spell checking Keyword search Finding synonyms Spam detection POS tagging Named entity recognition (NER) 2. Making good progress (thanks to DL) Sentiment analysis Coreference res...

Generative Adversarial Nets (Ian J. Goodfellow et al., 2014 NIPS)

Motivation Deep learning has seen its most striking successes in discriminative models. Deep generative models, however, have had less of an impact because it is hard to approximate the intractable probabilistic computations and to leverage the benefits of piecewise linear units. The proposed generative model sidesteps these difficulties by using a minimax two-player game. A good analogy for the two players is a team of counterfeiters versus the police. Solution Minimax two-player game: $\displaystyle \min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}} \left [ \log D(x) \right ] + \mathbb{E}_{z \sim p_z(z)} \left [ \log (1 - D(G(z))) \right ]$. - Discriminative model $D$: estimates the probability $D(x)$ that a sample came from the training data rather than from $G$. - Generative model $G$: transforms the prior distribution $p_z(z)$ into the generated data distribution $p_g$. At the unique solution, $p_g = p_{data}$ and $D(x) = \frac{1}{2}$ everywhere. This means the discriminator can't tell whether a sample came from $G...
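The $\frac{1}{2}$ comes from the paper's closed-form optimal discriminator: for a fixed $G$, maximizing $V(D, G)$ pointwise over $D$ gives

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)},$$

so at the global optimum, where $p_g = p_{data}$, we get $D^*_G(x) = \frac{1}{2}$ for all $x$, and the value of the game is $-\log 4$.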

Multimodal Deep Learning (Jiquan Ngiam et al., 2011 ICML)

Motivation - Learn features over multiple modalities (e.g. audio + video). - Humans understand speech using both audio and visual information. McGurk effect: a visual /ga/ paired with a voiced /ba/ is perceived as /da/ by most subjects. (McGurk, H. and MacDonald, J. Hearing lips and seeing voices. Nature, 264(5588):746–748, 1976.) Learning architecture - Input: audio + visual (lip motion) of single letters & digits. - Find correlations between visemes and phonemes, not between raw visual and audio signals. 1. Sparse RBM (Figure 2-a, b) Baseline; also used as pretraining models. Informally, these models transform raw data into visemes and phonemes. 2. Bimodal DBN (Figure 2-d) Using the RBM pretrained models, find correlations between visemes and phonemes. 3. Single Deep Autoencoder (Figure 3-a) The bimodal DBN has no reconstruction objective and is hard to train when only a single modality is present. The single deep autoencoder solves these two issues and focuses on training with a single modality. This model uses a RBM pretrain...
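A rough sketch of the bimodal autoencoder idea, with plain feed-forward layers standing in for the paper's RBM-pretrained stacks and all dimensions assumed: two encoders map audio and video into one shared code, two decoders reconstruct both modalities from it, and one modality can be zeroed out at the input so the shared code still learns from single-modality examples.

```python
# Hypothetical bimodal autoencoder sketch (not the paper's exact architecture).
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, hidden=64):
        super().__init__()
        self.enc_audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.enc_video = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.shared = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())  # shared representation
        self.dec_audio = nn.Linear(hidden, audio_dim)
        self.dec_video = nn.Linear(hidden, video_dim)

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.enc_audio(audio),
                                   self.enc_video(video)], dim=1))
        return self.dec_audio(h), self.dec_video(h)

model = BimodalAutoencoder()
audio, video = torch.randn(8, 100), torch.randn(8, 300)
# "Audio-only" training case: zero the video input but still reconstruct both modalities.
rec_a, rec_v = model(audio, torch.zeros_like(video))
loss = nn.functional.mse_loss(rec_a, audio) + nn.functional.mse_loss(rec_v, video)
loss.backward()
```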

Machine Learning (CPSC 540): Markov Chain Monte Carlo

I'm going to explain MCMC in the finite case. Strictly speaking, we don't need MCMC in the finite case, but it is a good example for understanding the method. Before this, I recommend reading  http://blog.kleinproject.org/?p=280 , which is a good introduction to Markov chains. Suppose that we want to sample from a target distribution $\pi$ by exploring the state space $X=\left\{1, 2, 3\right\}$ with a Markov chain mechanism. Let $T$ be a transition matrix that is aperiodic and irreducible. Then, for any initial probability vector $v$, $vT^t \to \pi$ as $t \to \infty$. Now, if $\pi = \left(\frac{1}{6}, \frac{1}{3}, \frac{1}{2} \right)$, we can sample from the target distribution by starting from any initial distribution and traversing the state space with $T$ until convergence. Thus, all we have to do is find a $T$ that is aperiodic, irreducible, and satisfies $\pi T = \pi$. This agrees exactly with the continuous case, which is our real aim. In this case, $T$ become...
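A small numeric check of the finite-state claim above. The matrix $T$ below is one valid choice, built with a Metropolis-Hastings rule (uniform proposal over the other two states), so it is aperiodic, irreducible, and leaves $\pi = \left(\frac{1}{6}, \frac{1}{3}, \frac{1}{2}\right)$ invariant:

```python
import numpy as np

pi = np.array([1/6, 1/3, 1/2])
T = np.array([[0.0,  1/2, 1/2],
              [1/4,  1/4, 1/2],
              [1/6,  1/3, 1/2]])

print(np.allclose(pi @ T, pi))   # True: pi is a stationary distribution of T

v = np.array([1.0, 0.0, 0.0])    # arbitrary initial distribution
for _ in range(50):
    v = v @ T
print(v)                         # ~[0.1667, 0.3333, 0.5], i.e. v T^t -> pi
```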

Machine Learning (CPSC 540): Importance sampling

Importance sampling is all about changing the sampling distribution. I'm going to explain it first with a simple coin-toss example and then move on to the posterior case. Suppose that we want to toss a coin with probabilities $P(H)=\frac{1}{3}$ and $P(T)=\frac{2}{3}$, but we don't have such a coin. How can we simulate this situation with an ordinary (fair) coin? Suppose we toss the ordinary coin $2N$ times and get $N$ heads and $N$ tails. We want to regard the $N$ heads as $\frac{1}{3}\cdot2N$ heads and the $N$ tails as $\frac{2}{3}\cdot2N$ tails. If we assign a weight to each outcome, $\frac{2}{3}$ for heads and $\frac{4}{3}$ for tails, we get the desired result. Let's extend this to the posterior case. We want to simulate $P(\theta\mid D)$ using samples from $q(\theta)$. With some calculation we get $$P(\theta \mid D) = \frac{P(D \mid \theta)P(\theta)}{\int P(D \mid \theta)P(\theta)\mathit{d}\theta} = \frac{1}{z}\frac{P(D \mid \theta)P(\theta)}{q(\theta)}q(\theta) = \frac{w(\th...
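A quick simulation of the coin example, assuming the weights $w(x) = p(x)/q(x)$ with a fair coin as the proposal, i.e. $w(H) = \frac{1/3}{1/2} = \frac{2}{3}$ and $w(T) = \frac{2/3}{1/2} = \frac{4}{3}$:

```python
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.integers(0, 2, size=100_000)   # fair coin: 1 = head, 0 = tail
weights = np.where(tosses == 1, 2/3, 4/3)   # importance weights w = p(x) / q(x)

# Importance-sampling estimate of P(H) under the biased coin.
print(np.mean(weights * (tosses == 1)))     # ~ 1/3
print(np.mean(weights))                     # ~ 1 (the weights average to one)
```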