Multimodal Deep Learning (Jiquan Ngiam et al., 2011 ICML)
Motivation
- Learn features over multiple modalities (e.g. audio + video).
- Humans understand speech using both audio and visual information.
McGurk effect: a visual /ga/ paired with an audio /ba/ is perceived as /da/ by most subjects. (McGurk, H. and MacDonald, J. Hearing lips and seeing voices. Nature, 264(5588):746–748, 1976.)
Learning architecture
- Input: audio + visual (lip motion) recordings of isolated letters and digits.
- The goal is to find correlations between visemes (lip shapes) and phonemes (spoken sounds), not between raw video and raw audio.
1. Sparse RBM (Figure 2-a, b)
Serves as the baseline and as the pretraining model for the deeper architectures.
Informally, these models transform the raw data into viseme-like and phoneme-like features.
2. Bimodal DBN (Figure 2-d)
Built on top of the pretrained RBMs, it models the correlations between visemes and phonemes.
3. Video-Only Deep Autoencoder (Figure 3-a)
The bimodal DBN has two weaknesses: there is no training objective that explicitly encourages the model to discover cross-modal correlations, and it is unclear how to use it when only a single modality is present.
The video-only deep autoencoder addresses both issues by focusing on the single-modality setting.
It is initialized with the RBM pretraining models and the bimodal DBN weights, and is trained to reconstruct both modalities when given only a single modality.
4. Bimodal Deep Autoencoder (Figure 3-b)
The video-only deep autoencoder is awkward to use when both modalities are available at test time.
The bimodal deep autoencoder addresses this, and it matches the video-only model even when only a single modality is given. During training, the data is augmented with examples in which the absent modality is replaced by a zero vector while both modalities must still be reconstructed, similar in spirit to a denoising autoencoder (see the sketch after this list).
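A minimal sketch of this zero-masking training scheme (not the authors' implementation): the layer sizes, the MSE loss, and the optimizer below are illustrative assumptions, and the RBM pretraining the paper uses for initialization is omitted.

```python
# Sketch of a bimodal deep autoencoder trained with zero-masking augmentation.
# Layer sizes, loss, and optimizer are assumptions; RBM pretraining is omitted.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=512, shared_dim=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.Sigmoid())
        # Shared layer that fuses the two modality-specific representations.
        self.shared = nn.Sequential(nn.Linear(2 * hidden_dim, shared_dim), nn.Sigmoid())
        self.audio_dec = nn.Sequential(nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
                                       nn.Linear(hidden_dim, audio_dim))
        self.video_dec = nn.Sequential(nn.Linear(shared_dim, hidden_dim), nn.Sigmoid(),
                                       nn.Linear(hidden_dim, video_dim))

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)

model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

audio, video = torch.randn(32, 100), torch.randn(32, 300)  # stand-in minibatch
# Augmented inputs: both modalities, video only, audio only (absent modality zeroed).
for in_a, in_v in [(audio, video),
                   (torch.zeros_like(audio), video),
                   (audio, torch.zeros_like(video))]:
    rec_a, rec_v = model(in_a, in_v)
    loss = mse(rec_a, audio) + mse(rec_v, video)  # always reconstruct BOTH originals
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key point is that the reconstruction targets are always the clean audio and video, so the network is forced to infer the missing modality from the one it receives.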
Experiments & Results
- Dataset
CUAVE - Video + Audio
AVLetters / AVLetters2 - Video
Stanford Dataset - Video + Audio
TIMIT - Audio
- Classifier: linear SVM
(If the models learn good features, a linear classifier should be sufficient; see the sketch after this list.)
- In the feature-learning phase (pretraining of Figures 2-a, b, d), all available training data is used.
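A rough sketch of this evaluation protocol, assuming the arrays below stand in for features produced by the pretrained models; the labels and SVM settings are placeholders rather than the paper's values.

```python
# Sketch of the evaluation protocol: learned features -> linear SVM.
# train_feats/test_feats stand in for features from the pretrained models;
# labels and the C value are placeholders, not taken from the paper.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(500, 256)), rng.integers(0, 26, 500)
test_feats, test_labels = rng.normal(size=(100, 256)), rng.integers(0, 26, 100)

# For the fusion experiments below, audio and bimodal features would simply be
# concatenated along the feature axis, e.g. np.hstack([audio_feats, bimodal_feats]).

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(train_feats, train_labels)
print("accuracy:", clf.score(test_feats, test_labels))
```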
1. Cross-Modality Learning (Figure 3-a)
Video:
AVLetters: the video-only deep autoencoder performs clearly better than the baseline, the video RBM, and the bimodal deep autoencoder.
CUAVE: it also outperforms those models, though by a smaller margin.
Audio:
The audio RBM performs better; adding video information can sometimes hurt audio performance.
2. Multimodal Fusion
Two settings are evaluated: clean audio and noisy audio.
In the result tables, '+' denotes feature concatenation.
As in cross-modality learning, audio features perform well on their own, and concatenating video features can sometimes hurt performance. However, concatenating the best audio features with the bimodal features outperforms all other feature combinations, showing that the learned multimodal features complement the audio features.
3. McGurk effect
The bimodal deep autoencoder behaves consistently with the McGurk effect observed in humans: given a visual /ga/ paired with an audio /ba/, it tends to predict /da/.
The same effect was not observed with the other models.
4. Shared Representation Learning
Hearing to see: train the classifier on the shared representation computed from audio, test on the representation computed from video.
Seeing to hear: train on the video representation, test on the audio representation.
"Seeing to hear" performs better than "hearing to see".
The bimodal deep autoencoder does not perform as well as CCA on this task; however, CCA does not help on the other tasks (a CCA sketch follows this list).
5. Additional Control Experiments
Further ablations show that both network depth and the use of audio during feature learning are important for the video-only deep autoencoder.
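A sketch of the CCA shared-representation baseline in the "hearing to see" direction, using scikit-learn's CCA on synthetic stand-in features; the feature dimensions and the number of CCA components are assumptions.

```python
# Sketch of the CCA shared-representation baseline ("hearing to see"):
# fit CCA on paired audio/video features, train a classifier on the audio
# projection, test it on the video projection. Data is synthetic and the
# number of components is an assumption.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 100))   # stand-in audio features
video_feats = rng.normal(size=(500, 300))   # stand-in video features (paired)
labels = rng.integers(0, 10, 500)

cca = CCA(n_components=10)
cca.fit(audio_feats, video_feats)
audio_proj, video_proj = cca.transform(audio_feats, video_feats)

clf = LinearSVC(C=1.0).fit(audio_proj, labels)                    # train on the audio view
print("hearing-to-see accuracy:", clf.score(video_proj, labels))  # test on the video view
```

Swapping the roles of the two projections gives the "seeing to hear" direction.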
Related Work & Comparison
- Prior work on audio-visual speech recognition
1. Duchnowski, P., Meier, U., and Waibel, A. See me, hear me: Integrating automatic speech recognition and lipreading. In ICSLP, pp. 547–550, 1994.
2. Yuhas, B. P., Goldstein, M. H., and Sejnowski, T. J. Integration of acoustic and visual speech signals using neural networks. IEEE Comm. Magazine, pp. 65–71, 1989.
3. Meier, U., Hurst, W., and Duchnowski, P. Adaptive Bimodal Sensor Fusion For Automatic Speechreading. In ICASSP, pp. 833–836, 1996.
4. Bregler, C. and Konig, Y. "Eigenlips" for robust speech recognition. In ICASSP, 1994.
- Novel contribution
- Use the hidden units to build a new representation of the data.
- Do not model phonemes or visemes, which require expensive labeling efforts.
- Build deep bimodal representations by modeling the correlations across the learned shallow representations.
Conclusion
The paper shows how deep learning can be applied to the challenging task of discovering multimodal features.
Comment
Multimodal learning matters because humans do not learn from a single source of information, yet such integrated learning is largely absent from current AI systems. In this regard, this work proposes a novel way to handle multimodality with deep learning. Remarkably, the bimodal deep autoencoder exhibits the same McGurk effect as humans. On the downside, audio features do not seem to benefit much from multimodal learning, and for shared representation learning CCA outperforms the proposed model, which is inconsistent with the other experiments. This suggests that multimodality still has to be handled case by case.