Generative Adversarial Nets (Ian J. Goodfellow et al., 2014 NIPS)
Motivation
Deep learning has so far had its most striking successes with discriminative models. Deep generative models have had less impact because it is difficult to approximate the intractable probabilistic computations that arise in maximum likelihood estimation and to leverage the benefits of piecewise linear units in a generative setting. The proposed generative model sidesteps these difficulties by framing training as a minimax two-player game. A good analogy for the two players is a team of counterfeiters trying to produce fake currency versus the police trying to detect it.
Solution
Minimax two-player game: $\displaystyle \min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} \left [ \log D(x) \right ] + \mathbb{E}_{z \sim p_z(z)} \left [ \log (1 - D(G(z))) \right ]$.
- Discriminative model $D$: Estimates the probability $D(x)$ that a sample came from the training data rather than $G$.
- Generative model $G$: Maps the prior noise distribution $p_z(z)$ to the generator's distribution $p_g$ over the data space.
The unique global optimum is reached when $p_g = p_{data}$, at which point $D(x) = \displaystyle \frac{1}{2}$ everywhere: the discriminator can no longer tell whether a sample came from the data or from $G$.
In practice, we train $G$ to maximize $\log D(G(z))$ rather than to minimize $\log (1 - D(G(z)))$. This alternative objective provides much stronger gradients early in learning, when $D$ can reject samples from $G$ with high confidence.
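To make the objective and the practical trick concrete, here is a minimal PyTorch sketch of one adversarial training step, assuming flattened 784-dimensional inputs (e.g. MNIST), a 100-dimensional Gaussian noise prior, and small MLPs; the architectures, optimizer, and learning rates are illustrative assumptions rather than the paper's exact setup (the original experiments used, for example, maxout discriminators with dropout).

```python
# Minimal sketch of one GAN training step (Algorithm 1 with k = 1), assuming
# 784-dim flattened data and a 100-dim noise prior. Hyperparameters are
# illustrative, not the paper's.
import torch
import torch.nn as nn

latent_dim, data_dim = 100, 784

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Sigmoid())   # G: z -> x
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())          # D: x -> P(real)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()  # supplies the log-terms of V(D, G)

def train_step(x_real):
    """One step of the two-player game on a batch of real samples x_real."""
    batch_size = x_real.shape[0]

    # Discriminator: ascend E[log D(x)] + E[log(1 - D(G(z)))].
    z = torch.randn(batch_size, latent_dim)                 # z ~ p_z(z)
    x_fake = G(z).detach()                                  # no gradient into G
    d_real, d_fake = D(x_real), D(x_fake)
    loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: maximize log D(G(z)) (the non-saturating trick above).
    z = torch.randn(batch_size, latent_dim)
    d_gen = D(G(z))
    loss_G = bce(d_gen, torch.ones_like(d_gen))             # = -mean log D(G(z))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```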
The theoretical result shows that if $G$ and $D$ have enough capacity, and if at each step of Algorithm 1 the discriminator is allowed to reach its optimum given $G$, then $p_g$ converges to $p_{data}$.
This result does not guarantee convergence of the model as actually used, because in practice the parameters of a multilayer perceptron are optimized rather than $p_g$ itself, so only a limited family of distributions is represented. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite the lack of theoretical guarantees.
Experiment & Result
Dataset
- MNIST
- Toronto Face Database (TFD)
- CIFAR-10
Evaluation Tool
- Gaussian Parzen window (Used for various generative models for which the exact likelihood is not tractable.)
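To illustrate the evaluation protocol, below is a minimal NumPy/SciPy sketch of the Gaussian Parzen window log-likelihood estimate; the function name and arguments are my own, and the bandwidth `sigma` is assumed to have been chosen by cross-validation on a validation set, as in the paper.

```python
# Sketch of the Gaussian Parzen window estimate: fit an isotropic Gaussian
# kernel on samples drawn from the trained generator and report the mean
# log-likelihood of held-out test points.
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(generated, test, sigma):
    """Mean log-likelihood of `test` under a Parzen window centred on `generated`.

    generated: (n, d) samples from the generator
    test:      (m, d) held-out data points
    sigma:     kernel bandwidth (chosen on a validation set)
    """
    n, d = generated.shape
    # Squared distance from every test point to every generated sample.
    diffs = test[:, None, :] - generated[None, :, :]        # (m, n, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)                   # (m, n)
    # log p(x) = logsumexp_i(-||x - x_i||^2 / (2 sigma^2)) - log n - (d/2) log(2 pi sigma^2)
    log_kernel = -sq_dist / (2.0 * sigma ** 2)
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))
```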
Related Works & Comparison
- Prior work
1. DBN: Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
2. DBM: Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–455.
3. Score matching: Hyvarinen, A. (2005). Estimation of non-normalized statistical models using score matching. J. Machine Learning Res., 6.
4. Noise-contrastive estimation (NCE): Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS’2010.
5. Generative stochastic network (GSN): Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML’14.
6. Auto-encoding variational Bayes: Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
7. Stochastic backpropagation: Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.
- Novel contribution
The proposed model requires neither Markov chains nor unrolled approximate inference networks; only backpropagation is needed, which gives it computational advantages. It is also better able to leverage piecewise linear units. Furthermore, it can represent very sharp, even degenerate distributions, whereas methods based on Markov chains require the distribution to be somewhat blurry so that the chains can mix between modes.
Conclusion & Value
There are two main disadvantages: there is no explicit representation of $p_g(x)$, and $D$ must be kept well synchronized with $G$ during training to avoid the Helvetica scenario (mode collapse), in which $G$ maps too many values of $z$ to the same $x$. Nevertheless, this framework can yield specific training algorithms for many kinds of models and optimization algorithms.
Future works
1. Conditional generative model $p(x \mid c)$.
2. Predict $z$ given $x$.
3. Approximately model all conditionals $p(x_S \mid x_{\not S})$ where $S$ is a subset of the indices of $x$ by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM.
4. Semi-supervised learning when limited labeled data is available.
Comment
Undoubtedly, GAN is a very novel generative model. However, it does not work well for all distributions, and GAN hyperparameters are hard to tune because we do not understand what is going on inside the model. One of my goals is to uncover the mechanism of GANs and make them easier to tune when they fail to generate good data.
Questions
In Proposition 2, I wonder why $U(p_g, D)$ is convex in $p_g$. $U(p_g, D)$ is not easy to handle because it is a functional (a function of a function), yet the proof treats it as if it were an ordinary function.
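A possible way to see this (my own sketch, not spelled out in the paper): for a fixed $D$, $\displaystyle U(p_g, D) = \int p_{data}(x) \log D(x)\, dx + \int p_g(x) \log (1 - D(x))\, dx$ is linear in $p_g$, and $\displaystyle \sup_{D} U(p_g, D)$ is a pointwise supremum of linear (hence convex) functionals of $p_g$, which is itself convex.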