The paper proposes a multimodal language understanding method called MMC-GAN for carry-and-place tasks. MMC-GAN uses a GAN to augment training data in the latent space, improving over single-modality and dialog-based approaches. It trains an extractor network on multimodal inputs (language instructions, context sentences, and depth images) to classify likely target areas. Evaluation shows that MMC-GAN outperforms baselines and that the multimodal approach is needed to understand ambiguous instructions in context.
A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions
1. Multimodal Language Understanding for Carry and Place Tasks
Aly Magassouba, Komei Sugiura and Hisashi Kawai
National Institute of Information and Communications Tech., Japan
2. Our target: service robots that understand ambiguous speech
Social Background
• Shortage of caregivers who can physically
support people with disabilities
Challenge
• Understanding ambiguous instructions
from the linguistic and visual context in an
end-to-end approach
Ambiguity
• “Put away the sugar and milk bottle”
• Meaning: “Put the sugar on the kitchen
shelf and the milk in the fridge”
3. Our approach differs from the literature in its Generative
Adversarial Network (GAN) data augmentation in latent space
Related work:
• Dialog-based approach [Kollar10]
– Time consuming
• End-to-end approach [Hatori18]
– Grasping task/Large dataset
• LAC-GAN [Sugiura17]
– Single modality
Novelty:
– Multimodal spoken language understanding with GAN data augmentation
• Key technology
– GAN data augmentation in latent space
– Different from the classic GAN [Goodfellow14] used for sample generation
(Figure: classic GAN architecture, in which a discriminator distinguishes real samples from generated (fake) ones [Bousmalis17, Zhang17])
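The key idea above, augmenting the classifier's training data in latent space rather than in input space, can be illustrated with a minimal sketch. This is not the authors' implementation: the extractor and generator are toy linear stand-ins with hypothetical dimensions, used only to show how real and generated latent features are combined into one training batch.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # hypothetical latent feature size

def extract(batch):
    """Stand-in extractor: project raw multimodal inputs to latent space."""
    W = rng.standard_normal((batch.shape[1], LATENT_DIM))
    return batch @ W

def generate(noise):
    """Stand-in generator: map noise vectors to synthetic latent features."""
    W = rng.standard_normal((noise.shape[1], LATENT_DIM))
    return noise @ W

# Latent features extracted from 32 real multimodal samples (64-dim raw inputs).
real_latents = extract(rng.standard_normal((32, 64)))
# Synthetic latent features generated from 8-dim noise.
fake_latents = generate(rng.standard_normal((32, 8)))

# The discriminator/classifier trains on the augmented latent batch;
# here we only mark real vs. generated (0 = real, 1 = fake).
augmented = np.concatenate([real_latents, fake_latents], axis=0)
labels = np.concatenate([np.zeros(32), np.ones(32)])

print(augmented.shape)  # (64, 16)
```

Because generation happens in the compact latent space, the generator never has to synthesize raw images or sentences, which is what makes this augmentation scheme tractable for multimodal inputs.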
4. Theoretical background of the MultiModal Classifier GAN (MMC-GAN)
• Cost function of the extractor
• Cost function of the generator, based on the Wasserstein method
• Cost function of the discriminator
• Data augmentation in latent space makes training more data-efficient [Sugiura17]
• The extractor in [Sugiura17] was fully connected and not adapted to visual and multimodal inputs
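The slide names the cost functions without showing them. As a sketch, the generator and discriminator objectives based on the Wasserstein method typically take the standard WGAN form below (the paper's exact formulation, including the extractor's classification loss and any additional terms, may differ). Here $D$ is the discriminator (critic), $G$ the generator, $z$ noise, and $x$ a real latent feature produced by the extractor.

```latex
\mathcal{L}_D = \mathbb{E}_{z}\!\left[D(G(z))\right] - \mathbb{E}_{x}\!\left[D(x)\right],
\qquad
\mathcal{L}_G = -\,\mathbb{E}_{z}\!\left[D(G(z))\right]
```

Minimizing $\mathcal{L}_D$ drives the critic to separate real from generated latent features, while minimizing $\mathcal{L}_G$ drives the generator to produce features the critic scores as real.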
6. Building a Carry-and-Place Multimodal Dataset for validating our method
Input (a)
• Instruction: “Put the coke bottle on the table”
• Context: “the bottle has been grasped”
• Depth image
Output label
• A1 = Very likely target area
Input (b)
• Instruction: “Bring this towel to the kitchen shelf”
• Context: “the robot is holding the towel”
• Depth image
Output label
• A4 = Unlikely target area
Dataset distribution:
Area   Samples
A1     212
A2     432
A3     398
A4     240
Total  1282
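The two examples above share one record layout: a language instruction, a context sentence, a depth image, and a likelihood label for the candidate target area. A minimal sketch of such a record follows; the field and class names are illustrative, not the authors' schema.

```python
from dataclasses import dataclass

# A1 (very likely) ... A4 (unlikely) target-area labels, as in the dataset.
LABELS = ["A1", "A2", "A3", "A4"]

@dataclass
class CarryPlaceSample:
    instruction: str    # spoken language instruction
    context: str        # sentence describing the robot's current state
    depth_image: bytes  # raw depth image of the candidate target area
    label: str          # one of LABELS

sample = CarryPlaceSample(
    instruction="Put the coke bottle on the table",
    context="the bottle has been grasped",
    depth_image=b"",    # placeholder; the real dataset stores depth images
    label="A1",
)
print(sample.label in LABELS)  # True
```

The instruction and context are text, so only the depth image requires a convolutional branch in the extractor; the three inputs are fused before classification.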
7. MMC-GAN is more accurate thanks to the data augmentation property
Test-set accuracy (%) by input modality:
Method          GAN type  Instruction  Instruction+Context  Image  Instruction+Context+Image
CNN (baseline)  -         59.4         60.2                 61.1   82.2
MMC-GAN         GAN       57.5*        59.5*                58.1   85.3
MMC-GAN         CGAN      56.4*        56.7*                58.2   86.2
MMC-GAN         WGAN      61.8         62.7                 59.7   84.4
*Not all trials converge
Key observations:
• MMC-GAN outperforms the classic DNN
• The multimodal approach is required to solve the carry-and-place task
• WGAN training is more stable
11. Sample results: MMC-GAN emphasizes the relationship between linguistic and visual features
(Figure: sample correct and incorrect predictions, with confusion matrix)
12. Summary
• Contribution
– Multimodal spoken language understanding with GAN data augmentation
• Method
– A GAN operating on latent-space features that classifies target areas
from ambiguous instructions
• Results
– Our method outperforms a conventional DNN baseline
– Multimodal inputs are required to solve carry-and-place tasks