Multimodal Language Understanding
for Carry and Place tasks
Aly Magassouba, Komei Sugiura and Hisashi Kawai
National Institute of Information and Communications Tech., Japan
Our target: service robots that understand ambiguous speech
Social Background
• Shortage of manpower that can physically
support people with disabilities
Challenge
• Understanding ambiguous instructions
from the linguistic and visual context in an
end-to-end approach
Ambiguity
• “Put away the sugar and milk bottle”
• Meaning: “Put the sugar on the kitchen
shelf and the milk in the fridge”
Our approach differs from the literature in its use of Generative
Adversarial Nets (GAN) data augmentation in latent space
Related work:
• Dialog-based approach [Kollar10]
– Time-consuming
• End-to-end approach [Hatori18]
– Grasping task/Large dataset
• LAC-GAN [Sugiura17]
– Single modality
Novelty:
– Multimodal spoken language
understanding with GAN data
augmentation
• Key technology
– GAN data augmentation in latent space
– Different from classic GAN [Goodfellow14] used for generation, e.g. [Bousmalis17] or [Zhang17]
[Figure: classic GAN with Generator and Discriminator (real/fake)]
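The idea of augmenting data in latent space, rather than generating raw inputs, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract`, `generate`, `augmented_batch`, the latent size, and the random stand-ins for the learned Extractor and Generator are all hypothetical.

```python
import random

LATENT_DIM = 4  # hypothetical latent feature size


def extract(raw_input):
    """Stand-in for a learned Extractor: maps a raw input to a latent vector."""
    rng = random.Random(raw_input)  # deterministic per input, for illustration only
    return [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]


def generate(label, rng):
    """Stand-in for a learned Generator: samples a fake latent vector.

    Taking a label as input is an assumption made for this sketch.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def augmented_batch(samples, n_fake, rng):
    """Mix real latent features with generated ones.

    Each item is (latent_vector, label, is_real); a discriminator/classifier
    would be trained on the mixed batch.
    """
    labels = [label for _, label in samples]
    real = [(extract(x), label, True) for x, label in samples]
    fake_labels = [rng.choice(labels) for _ in range(n_fake)]
    fake = [(generate(label, rng), label, False) for label in fake_labels]
    return real + fake


rng = random.Random(0)
batch = augmented_batch(
    [("Put the coke bottle on the table", "A1"),
     ("Bring this towel to the kitchen shelf", "A4")],
    n_fake=3,
    rng=rng,
)
```

The point of the sketch is only that the fake samples live in the same feature space as the Extractor output, so no realistic sentence or image has to be synthesized.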
Theoretical background of MultiModal Classifier GAN (MMC-GAN)
Cost function of Extractor
Cost function of Generator based on
Wasserstein method
Cost function of Discriminator
• Data augmentation in latent space
makes training more data-efficient [Sugiura17]
• Extractor was fully connected, not
adapted to visual and multimodal inputs
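The slide's cost functions are not reproduced in this transcript. As a hedged point of reference, the standard Wasserstein GAN losses (not necessarily the exact MMC-GAN costs, which may include additional terms such as a classification loss) can be written as follows; `critic_loss` and `generator_loss` are names chosen for this sketch.

```python
from statistics import fmean


def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss: minimise E[D(fake)] - E[D(real)]."""
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    """Standard WGAN generator loss: minimise -E[D(fake)]."""
    return -fmean(fake_scores)


# Critic scores for a toy batch (illustrative numbers only).
real_scores = [1.0, 3.0]
fake_scores = [0.0, 1.0]
d_loss = critic_loss(real_scores, fake_scores)  # 0.5 - 2.0
g_loss = generator_loss(fake_scores)            # -0.5
```

Unlike the classic GAN loss, these scores are unbounded real numbers rather than probabilities, which is commonly cited as a reason for more stable training.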
Structure of Extractor
Building Carry-and-Place Multimodal Dataset for validating our method
Input (a)
• Instruction: “Put the coke bottle on the table”
• Context: “the bottle has been grasped”
• Depth image
Output label
• A1 = Very likely target area
Input (b)
• Instruction: “Bring this towel to the kitchen shelf”
• Context: “the robot is holding the towel”
• Depth image
Output label
• A4 = Unlikely target area
Data set distribution
Label  Samples
A1     212
A2     432
A3     398
A4     240
Total  1282
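To put the label counts in perspective, the share of each label can be computed directly from the distribution above. The variable names are chosen for this sketch, and the figures describe the full data set only; the train/test split is not given in the slides.

```python
# Label counts taken from the data set distribution on the slide.
label_counts = {"A1": 212, "A2": 432, "A3": 398, "A4": 240}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Most frequent label and the accuracy of always predicting it
# (on the full data set, not necessarily on the test split).
majority_label = max(label_counts, key=label_counts.get)
majority_share = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The labels are unevenly distributed, so always predicting the most frequent label would be right on roughly a third of the full data set.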
MMC-GAN is more accurate thanks to the data augmentation property
Method          GAN type  Instruction  Instruction+Context  Image  Instruction+Context+Image
CNN (baseline)  -         59.4         60.2                 61.1   82.2
MMC-GAN         GAN       57.5*        59.5*                58.1   85.3
MMC-GAN         CGAN      56.4*        56.7*                58.2   86.2
MMC-GAN         WGAN      61.8         62.7                 59.7   84.4
*Not all trials converge
Metric = test-set accuracy
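The gains over the CNN baseline can be read off the table with a short script. The numbers are copied from the table; the variable and function names are chosen for this sketch.

```python
COLUMNS = ["Instruction", "Instruction+Context", "Image",
           "Instruction+Context+Image"]

# Test-set accuracies copied from the results table.
BASELINE = [59.4, 60.2, 61.1, 82.2]
MMC_GAN = {
    "GAN":  [57.5, 59.5, 58.1, 85.3],
    "CGAN": [56.4, 56.7, 58.2, 86.2],
    "WGAN": [61.8, 62.7, 59.7, 84.4],
}


def gains_over_baseline(scores):
    """Difference to the CNN baseline per input configuration."""
    return [round(s - b, 1) for s, b in zip(scores, BASELINE)]


gains = {name: gains_over_baseline(scores) for name, scores in MMC_GAN.items()}

# Variant with the highest accuracy on the full multimodal input.
best_full = max(MMC_GAN, key=lambda name: MMC_GAN[name][-1])

for name, deltas in gains.items():
    print(name, dict(zip(COLUMNS, deltas)))
```

With all three inputs, every MMC-GAN variant is above the baseline; with a single modality the picture is mixed, and only the WGAN variant improves on the two language-only configurations.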
MMC-GAN outperforms classic DNN
Multimodal approach is required to solve the carry-and-place task
WGAN is more stable
Sample results: MMC-GAN emphasizes the relationship between
linguistic and visual features
[Figure: examples of correct and incorrect predictions; confusion matrix]
Summary
• Contribution
– Multimodal spoken language understanding with GAN data augmentation
• Method
– A GAN-based network that uses latent-space features to classify target areas
from ambiguous instructions
• Results
– Our method outperforms DNN
– Multimodal inputs are required to solve carry-and-place tasks

A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions
