This document discusses requirements and technologies for personal robots. It describes two projects from Preferred Networks: an interactive picking robot that can be controlled by spoken language instructions, and an autonomous tidying robot. For picking, an interactive clarification approach achieved 92.7% accuracy in matching objects to instructions. For tidying, the system tidies at a rate of 1.9 objects per minute with a 90% grasp success rate, using computer vision and a tablet user interface. Remaining challenges include handling an unlimited variety of items and generalizing to unseen environments.
8. Key Technologies
● Computer Vision: Generalization to different environments and tasks
○ Object detection of thousands of categories
○ Support for unseen environments and unseen objects
● Human–Robot Interface: natural communication between humans and robots
○ Intuitive interface with spoken and visual language interpretation
○ Spoken and visual feedback from robots
11. Challenges
● Variety of Expressions
○ Referring to objects: “a bear doll”, “the animal plushie”,
“that fluffy thing”, “upside-down grizzly”
○ Describing actions: “grab X”, “bring together X and Y”,
“move X to a diagonal box”
● Ambiguity and errors
○ “that brown one”, “a dog doll?”
12. Human: hey, can you move that brown fluffy thing to the bottom right?
Robot: which one do you mean?
Human: the one next to the eraser box.
Robot: I got it.
15. Handling Ambiguous Commands
● Trained with hinge loss for correct sentence–object pairs [Yu+ 2017]
● An instruction is considered ambiguous if the margin between the 1st- and 2nd-ranked candidates is below a threshold (see the sketch below)
[Architecture figure: CNN image features and MLPs for the candidate objects, combined with an LSTM encoding of the instruction “pick the brown fluffy thing and put it in the lower right bin.”; the score margin between the 1st and 2nd candidates determines ambiguity.]
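Below is a minimal sketch of this margin-based ambiguity check. The scoring interface and the threshold value are illustrative assumptions; in the actual system the scores come from the CNN + LSTM model trained with the hinge loss.

```python
MARGIN_THRESHOLD = 0.5  # illustrative value; would be tuned on validation data


def select_or_clarify(scores):
    """Pick the best-matching object, or return None to ask a clarifying
    question when the instruction is ambiguous.

    `scores` maps candidate object ids to the model's matching score for
    the given instruction (hypothetical interface).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]  # gap between 1st and 2nd candidates
    if margin < MARGIN_THRESHOLD:
        return None  # ambiguous -> "which one do you mean?"
    return ranked[0][0]


# Two brown fluffy objects score similarly, so the robot asks back.
print(select_or_clarify({"bear_doll": 0.91, "dog_doll": 0.88, "red_can": 0.12}))
```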
16. Interactive Picking Dataset
Example instructions (truncated in the original figure):
● “grab the human face labeled object and …”
● “move the pop red can from the top …”
● “move the pink horse plushie …”
● “put the box with a 50 written on it that is …”
Publicly available as the PFN-PIC dataset:
https://github.com/pfnet-research/picking-instruction
● 1,200 scenes (26k objects in total)
● 100 types of commodities
● 73k unconstrained instructions (vocabulary size: 5,000)
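As a hedged illustration, iterating over instruction–object pairs might look like the following sketch; the file name and all field names (`image`, `objects`, `bbox`, `instructions`) are assumptions for illustration, not the actual PFN-PIC schema (see the repository README for the real format).

```python
import json

# Hypothetical annotation layout: one JSON file listing scenes, each with
# object bounding boxes and the free-form instructions referring to them.
# Field names are assumed for illustration; the real schema may differ.
with open("pfn_pic_annotations.json") as f:  # assumed file name
    scenes = json.load(f)

pairs = [
    (scene["image"], obj["bbox"], text)
    for scene in scenes
    for obj in scene["objects"]
    for text in obj["instructions"]
]
print(f"{len(pairs)} instruction-object pairs")  # ~73k if fully annotated
```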
19. Results
Accuracy of target object matching:
● single instruction: 88.0%
● interactive clarification: 92.7%
4.7-point improvement (39% relative error reduction) from interactive clarification
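As a quick sanity check of the reported relative error reduction (illustrative arithmetic, not from the slides):

```python
# Error rates implied by the accuracy numbers above.
err_single = 1 - 0.880        # 0.120
err_interactive = 1 - 0.927   # 0.073
reduction = (err_single - err_interactive) / err_single
print(f"{reduction:.0%} relative error reduction")  # -> 39%
```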
20. Summary
● We proposed an interactive picking system that can be controlled by
unconstrained spoken language instructions.
● We achieved an object matching accuracy of 92.7%.
● Accuracies for unseen objects are not sufficient (~70%).
* Hatori+ 2018. Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions. ICRA 2018 Best Paper Award on HRI.
25. Environment
● Furnished living room
○ Coffee table, couch, bookshelf, trash bins, laundry bag, toy box
● Two Toyota HSRs working in
parallel
26. Object Recognition
● Sensors
○ HSR’s head camera (RGBD)
○ 4 ceiling cameras (RGB)
● Supported objects: ~300
● PFDet as CNN base model
○ 2nd place at the Google AI Open Images Challenge – Object Detection track (Sep 2018)
27. PFDet: Basic Architecture [1]
● Feature Pyramid Network (FPN) (SENet-154 and SE-ResNeXt-101)
● Multi-node batch normalization
● Non-maximum weighted (NMW) suppression [2] (see the sketch below)
● Global context
○ Additional FPN block
○ PSP (pyramid spatial pooling) module
○ Context head [3]
[1] Akiba+ 2018. PFDet: 2nd Place Solution to Open Images Challenge 2018 Object Detection Track.
[2] Zhou+. CAD: Scale invariant framework for real-time object detection. ICCVW 2017.
[3] Zhu+. CoupleNet: Coupling global structure with local parts for object detection. ICCV 2017.
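A minimal NumPy sketch of NMW suppression as described in [2]: rather than discarding boxes that overlap the current top-scoring box (as plain NMS does), they are merged into a weighted average with weight = score × IoU against the top box. The IoU threshold here is an illustrative value.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)


def nmw(boxes, scores, iou_thresh=0.5):
    """Merge boxes overlapping the current top-scoring box into a weighted
    average (weight = score * IoU with the top box) instead of dropping them."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()[::-1]  # candidate indices, best first
    merged_boxes, merged_scores = [], []
    while order.size > 0:
        ious = iou_one_to_many(boxes[order[0]], boxes[order])
        keep = ious >= iou_thresh  # group to merge (includes the top box itself)
        w = scores[order[keep]] * ious[keep]
        merged_boxes.append((w[:, None] * boxes[order[keep]]).sum(axis=0) / w.sum())
        merged_scores.append(scores[order[0]])
        order = order[~keep]
    return np.array(merged_boxes), np.array(merged_scores)


# Two overlapping detections merge into one box; the distant one survives.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nmw(boxes, scores))
```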
28. PFDet: High Scalability
Hardware: in-house GPU cluster
● 512 × NVIDIA Tesla V100 (32 GB)
● InfiniBand interconnect
Scalability results
● Training for 16 epochs completed in 33 hours
● 83% scaling efficiency relative to 8 GPUs
Software framework: Chainer with ChainerMN for distributed training (see the sketch below)
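A minimal sketch of a ChainerMN data-parallel setup of the kind used here, assuming the Chainer/ChainerMN stack; the tiny classifier is a placeholder, not PFDet's actual model or training code.

```python
import chainer
import chainer.links as L
import chainermn

# Each MPI process drives one GPU; NCCL handles all-reduce over InfiniBand.
comm = chainermn.create_communicator('pure_nccl')
device = comm.intra_rank

model = L.Classifier(L.Linear(None, 10))  # stand-in for the FPN-based detector
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

# The multi-node optimizer all-reduces gradients across workers before
# every parameter update, giving synchronous data-parallel SGD.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.01), comm)
optimizer.setup(model)

# Multi-node batch normalization (slide 27) is available as
# chainermn.links.MultiNodeBatchNormalization(size, comm).
```

For reference, 83% efficiency over a 64× increase in GPUs (8 → 512) corresponds to roughly a 53× effective speedup.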
35. Human–Robot Interaction (HRI)
● From user to robot
○ Update where the current item should be stored
○ Inquire about object locations
● From robot to user
○ Spoken and audio feedback
○ Tablet App for monitoring
■ User can also provide feedback
■ AR-based visualization
● Technologies involved: speech recognition, NLP, gesture, AR
40. Remaining Challenges with Tidying-up
● Standalone computation (no external sensor or computer)
● Recognition of an unlimited variety of items in domestic environments
● Generalization to unseen environments
● Easy setup
41. Robots as Interface with Physical World
● Domestic robots can track household items while tidying up,
connecting everything in the physical world to the virtual world.
● Potential applications:
○ E-commerce
○ Recommendations on item purchase or disposal
42. Key Takeaways
● Robust computer vision and an intuitive human–robot interface are
prerequisites for successful personal robot applications.
● Some simple domestic tasks, such as tidying up, are getting close to
production level.
● Robots are an interface to the physical world, computerizing household
items and connecting them to online services.
43. Thank you!
Interactive picking: https://pfnet.github.io/interactive-robot/
Tidying-up robot: https://projects.preferred.jp/tidying-up-robot/en/
Related talks
● S9380 - The Frontier of Define-by-Run Deep Learning Frameworks
Wed, Mar 20, 11:00 AM - 11:50 AM – SJCC Room 210E
● S9738 - Using GPU Power for NumPy-syntax Calculations
Tue, Mar 19, 2:00 PM - 2:50 PM – SJCC Room 210F