This document presents RVOS, a fully end-to-end recurrent network for one-shot and zero-shot video object segmentation. RVOS extends recurrent semantic instance segmentation (RSIS) to the video domain, using spatio-temporal recurrence to segment the objects in each frame. The model is evaluated on YouTube-VOS and DAVIS-2017, achieving results comparable to the state of the art for one-shot VOS. Zero-shot VOS results remain low, however, as the model struggles to segment unseen object categories without any training examples. Overall, RVOS provides a fully trainable alternative to existing methods for multiple-object VOS.
3. Motivation
● End-to-End Trainable model
○ YouTube-VOS includes 3,471 training videos
○ No dependency on networks pre-trained for other tasks, such as optical flow estimation
● Recurrent Network
○ Extend RSIS (Recurrent Semantic Instance Segmentation) to Video Object Segmentation
● Spatial and Temporal Recurrence
○ Study whether spatio-temporal recurrence outperforms spatial-only and temporal-only recurrence
● One-shot and zero-shot video object segmentation
○ RSIS is able to discover the object instances in an image
○ No published results for zero-shot multiple object video object segmentation
● Fast method
○ No need for fine-tuning at inference (no online learning)
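The spatio-temporal recurrence motivated above can be illustrated with a minimal sketch: the hidden state for a given object instance at a given frame is computed from both the previous instance's hidden state in the same frame (spatial recurrence) and the same instance's hidden state in the previous frame (temporal recurrence). This is a simplified, non-convolutional toy cell; the weight names and gating are hypothetical and stand in for the paper's ConvLSTM decoder.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_recurrent_step(x, h_spatial, h_temporal, W):
    """One spatio-temporal recurrence step (toy sketch, not the paper's cell).

    x          : input features for (frame t, instance i)
    h_spatial  : hidden state of instance i-1 at frame t   (spatial recurrence)
    h_temporal : hidden state of instance i at frame t-1   (temporal recurrence)
    W          : dict of weight matrices (hypothetical names "h" and "g")
    """
    # The cell conditions on both recurrence directions at once by
    # concatenating the input with the two previous hidden states.
    z = np.concatenate([x, h_spatial, h_temporal])
    h_cand = np.tanh(W["h"] @ z)   # candidate hidden state
    g = sigmoid(W["g"] @ z)        # gate
    return g * h_cand              # new hidden state for (frame t, instance i)

# Toy dimensions: feature size 4, hidden size 3
rng = np.random.default_rng(0)
d, k = 4, 3
W = {"h": rng.standard_normal((k, d + 2 * k)),
     "g": rng.standard_normal((k, d + 2 * k))}

h = st_recurrent_step(rng.standard_normal(d), np.zeros(k), np.zeros(k), W)
print(h.shape)  # (3,)
```

Dropping `h_temporal` from the concatenation recovers a spatial-only variant, and dropping `h_spatial` a temporal-only one, which is how the ablation in the motivation can be framed.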
4. Related Work
● Sequence-to-Sequence (S2S) [1]
○ Drawbacks
■ Each instance is trained and segmented independently
■ Designed only for one-shot video object segmentation
[1] N. Xu et al., YouTube-VOS: Sequence-to-Sequence Video Object Segmentation. ECCV 2018
5. Related Work
● ConvGRU [2]
○ Drawbacks
■ Each instance is trained and segmented independently
■ Optical flow depends on a network trained for another task: model is not end-to-end trainable
[2] P. Tokmakov et al., Learning video object segmentation with visual memory. ICCV 2017
6. Related Work
● RSIS [3]
○ Drawbacks
■ Model is designed for image object segmentation
[3] A. Salvador et al., Recurrent Neural Networks for Semantic Instance Segmentation. arXiv 2017
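RSIS decodes one instance mask per recurrence step within a single image; RVOS adds a temporal dimension by also carrying each instance's hidden state across frames. The unrolling order can be sketched as a double loop: outer over frames (temporal), inner over instances (spatial). The helper name and the toy step function below are hypothetical placeholders for the model's ConvLSTM decoder.

```python
import numpy as np

def unroll(features, step_fn, n_instances, hidden_size):
    """Unroll the recurrence over frames (temporal) and instances (spatial).

    features : list of per-frame feature vectors (sketch: one vector per frame)
    step_fn  : (x, h_spatial, h_temporal) -> new hidden state
    Returns the grid of hidden states h[t][i], one per (frame, instance).
    """
    T = len(features)
    h = [[None] * n_instances for _ in range(T)]
    h0 = np.zeros(hidden_size)
    for t in range(T):                    # temporal recurrence: frame order
        for i in range(n_instances):      # spatial recurrence: instance order
            h_sp = h[t][i - 1] if i > 0 else h0   # instance i-1, same frame
            h_tm = h[t - 1][i] if t > 0 else h0   # same instance, frame t-1
            h[t][i] = step_fn(features[t], h_sp, h_tm)
    return h

# Toy run: 3 frames, 2 instances, hidden size 4
rng = np.random.default_rng(0)
feats = [rng.standard_normal(4) for _ in range(3)]
states = unroll(feats, lambda x, hs, ht: np.tanh(x + hs + ht), 2, 4)
print(len(states), len(states[0]))  # 3 2
```

Setting `T = 1` reduces this loop to RSIS's per-image instance recurrence, which is the sense in which RVOS extends it to video.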
25. Experiments: One-shot VOS on DAVIS-2017
● We leverage the model already trained on YouTube-VOS
○ Apply the pre-trained model directly
○ Fine-tune the pre-trained model on DAVIS-2017
● S2S model (SoA, also trained on YouTube-VOS)
○ Results reported on DAVIS-2016 (single-object, foreground-background video object segmentation)
○ No results reported on DAVIS-2017 (multiple-object)
28. Experiments: Zero-shot VOS on DAVIS-2017
● First results reported for zero-shot VOS on DAVIS-2017
● Results for zero-shot VOS on YouTube-VOS for unseen categories were also low
30. Conclusions
● Fully end-to-end trainable model for video object segmentation
● Designed for multiple object video object segmentation
● Designed for one-shot and zero-shot video object segmentation
● Spatio-temporal recurrence outperforms spatial-only and temporal-only recurrence
● One-shot video object segmentation:
○ YouTube-VOS: Comparable results to SoA techniques (S2S)
○ DAVIS-2017:
■ Outperform other SoA techniques that do not use online learning
■ Comparable results to some SoA techniques that use online learning
● Zero-shot video object segmentation:
○ No previously published results on either YouTube-VOS or DAVIS-2017; RVOS reports the first
31. Thank you for your attention
Carles Ventura Royo
cventuraroy@uoc.edu
https://imatge-upc.github.io/rvos/