1. Deep learning for semantic
analysis and annotation of
conventional and 360° video
Hannes Fassold
2. Who we are
• Smart Media Solutions Team
• CCM research group @ DIGITAL, JOANNEUM RESEARCH, Graz, Austria
• Content-based quality analysis & restoration of film and video
• http://vidicert.com
• http://www.hs-art.com
• Semantic video analysis
• Extraction of semantic information from a video
(with deep learning and classical methods)
• Shot & cadence detection
• Brand monitoring
• Object detection & recognition (faces, persons, …)
• Most components are real-time capable
3. Presentation overview
• Deep learning in a nutshell
• Face detection & recognition
• State of the art & issues
• Object detection & tracking
• State of the art & issues
• Applications
• Semi-automatic annotation of video
• Generating non-interactive version
of 360° video
4. Deep learning in a nutshell
• Deep neural networks (DNNs)
• Mimic the structure of the human brain
• Training
• Learn the weights for all layers
• A huge annotated ('ground truth') dataset
is needed for training 'from scratch'
• Inference
• Run the network (classify / detect / …) for one image
• Both training and inference are usually done on graphics cards (GPUs)
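In code, the inference step amounts to propagating an input through the learned layer weights. A minimal NumPy sketch with a hypothetical toy network (random weights stand in for trained ones):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Inference: propagate the input through each layer's learned weights."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:   # non-linearity between hidden layers
            x = relu(x)
    return x                      # raw class scores (logits)

# Toy network: 4 input features -> 8 hidden units -> 3 classes.
# In practice the weights come from training; here they are random.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),
          (rng.normal(size=(8, 3)), np.zeros(3))]
scores = forward(rng.normal(size=(1, 4)), layers)
predicted_class = int(np.argmax(scores))
```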
5. Face detection
• State of the art approaches
• Multi-Task CNN, RetinaFace, …
• Face detection is more or less 'solved'
• Works great even for small faces
and profile views of faces
• Accuracy of > 90 % (mAP)
• Real-time capable
(on the GPU)
Result of our face detection algorithm on a region of an image from a 360° video.
Content provided by Mediaset for Hyper360 project.
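Detectors such as RetinaFace produce many overlapping candidate boxes, which are typically reduced by non-maximum suppression (NMS). A minimal NumPy sketch of this standard post-processing step (not the exact variant any particular detector uses):

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box vs. an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.4):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < thresh]
    return keep
```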
6. Face recognition
• Most algorithms rely on a "closed world assumption"
• All faces occurring in the videos are known, meaning that the face recognition
algorithm has been trained on them
• State of the art approaches
• FaceNet, ArcFace, SphereFace, …
• Accuracy of > 98 % on the standard databases, processing in real-time
• Factors influencing the recognition result negatively
• Small face (or low resolution video)
• Profile view
• Bad lighting conditions
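Approaches like FaceNet and ArcFace map a face image to an embedding vector; recognition then reduces to a nearest-neighbour search over the enrolled identities. A minimal sketch with hypothetical 4-dimensional embeddings (real models output 128-512 dimensions):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def recognize(query_emb, gallery):
    """Closed-world matching: return the most similar enrolled identity."""
    names = list(gallery)
    embs = normalize(np.stack([gallery[n] for n in names]))
    sims = embs @ normalize(query_emb)   # cosine similarity per identity
    best = int(np.argmax(sims))
    return names[best], float(sims[best])

# Hypothetical gallery of enrolled identities and their embeddings
gallery = {"alice": np.array([1.0, 0.0, 0.0, 0.0]),
           "bob":   np.array([0.0, 1.0, 0.0, 0.0])}
name, sim = recognize(np.array([0.9, 0.1, 0.0, 0.0]), gallery)
```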
7. Face recognition – challenges & issues
• "Closed world assumption" is difficult to achieve in practice
• You do not want to retrain your DNN if you want to recognize a new person,
as training takes quite some time …
• Incremental training can help here
• Not easy – you first have to identify that a person is 'new' and then retrain the DNN on-the-fly
• We have added incremental training in our in-house face framework
• You may not have enough annotations (samples) for each person
• 50–100 annotations per person's face are usually employed in the databases
• Training with less data is an active research area ("few-shot learning")
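The incremental idea can be approximated at the embedding level without retraining the DNN: keep a gallery of embeddings and enroll a query as a new identity when its best similarity falls below a threshold. A simplified sketch of this idea (not our in-house implementation; the threshold and naming scheme are hypothetical):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def recognize_open_set(query_emb, gallery, threshold=0.8):
    """Return the best-matching identity, or enroll the query as a new person."""
    q = normalize(query_emb)
    best_name, best_sim = None, -1.0
    for name, emb in gallery.items():
        sim = float(normalize(emb) @ q)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name
    new_name = f"person_{len(gallery)}"   # hypothetical naming scheme
    gallery[new_name] = query_emb         # enroll without retraining the DNN
    return new_name
```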
8. Face recognition – challenges & issues
• Class imbalance
• Some classes are under-represented in the dataset used for training the DNN
• Ethnic bias
• Publicly available face datasets contain mostly faces of Caucasian people
• Error rates on African people are about twice as high as for Caucasian people [1]
• Few faces with glasses in most face datasets, although many Asian people wear glasses
• Active research on methods to mitigate class imbalance
• Better data augmentation strategies
• Data crawling
• Synthetic generation of additional training data samples ('face synthesis')
• Domain adaptation & unsupervised learning
[1] https://arxiv.org/pdf/1812.00194.pdf
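One common mitigation, complementary to the strategies above, is to re-weight (or over-sample) training samples by inverse class frequency, so under-represented classes contribute equally to the loss. A minimal sketch:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights so every class contributes equally in expectation."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return [total / (n_classes * counts[y]) for y in labels]

# Toy imbalanced dataset: 6 majority-class vs. 2 minority-class samples
labels = ["majority"] * 6 + ["minority"] * 2
weights = inverse_frequency_weights(labels)
```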
9. Object detection & tracking
• Task
• Detect an object of a certain class (e.g. person, dog, car, …)
and track it through its lifetime (each object gets a unique id)
• State of the art approaches
• RetinaNet, YoloV3, Faster R-CNN, …
• Usually detect 80 classes from MS COCO
• Our inhouse algorithm
• Detects & tracks general objects,
faces, text and logo in real-time
• << Demovideo >>
Result of our object detection & tracking algorithm
10. Object detection & tracking – challenges & issues
• Current state
• Algorithms are really usable in practice: robust (mAP > 60 %) and fast (real-time)
• Remaining issues
• Re-identification of objects is challenging
• E.g. persons which get occluded and then appear again (crowded scene)
• One can use the object's appearance, but what if all look the same (e.g. soccer players)?
• Simple strategy used in our framework: newly appearing objects get a new id
• Quite limited number of object classes
• E.g. the MS COCO dataset [1] has classes for a few animals
(dog, cat, horse, cow, …) but what if the subject
of your documentary video is dinosaurs?
[1] https://arxiv.org/pdf/1405.0312.pdf
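The "newly appearing objects get a new id" strategy can be illustrated with a greedy IoU tracker (a deliberately simplified sketch, not our actual framework; the IoU threshold is hypothetical):

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

class SimpleTracker:
    """Greedy IoU matching; unmatched detections always get a fresh id."""
    def __init__(self, iou_thresh=0.3):
        self.tracks = {}       # id -> box from the previous frame
        self.next_id = 0
        self.iou_thresh = iou_thresh

    def update(self, detections):
        assigned = {}
        free = dict(self.tracks)          # tracks not yet matched this frame
        for box in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, prev in free.items():
                o = iou(box, prev)
                if o > best_iou:
                    best_id, best_iou = tid, o
            if best_id is None:           # newly appearing object -> new id
                best_id = self.next_id
                self.next_id += 1
            else:
                free.pop(best_id)
            assigned[best_id] = box
        self.tracks = assigned
        return assigned
```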
11. Semi-automatic video annotation
• Automate the annotation process of archive videos
• Who appears in the video (and with whom), and in which video sections
• Other potentially useful metadata: facial emotion, what action he / she is doing,
what he / she is saying, what logos appear, what the 'video highlights' are, …
• Semi-automatic video annotation workflow
• Deep learning algorithms (face recognition, object detection & tracking, …)
do the first pass and generate the „raw metadata“
• Raw metadata is inspected and corrected (false detections, multiple ids for one
person, …) by a human operator with a convenient tool
• Hopefully the whole process is more efficient than the 'human-only' workflow ☺
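The raw metadata from the first pass is typically per-frame; turning it into "who appears in which video sections" amounts to merging nearby frame hits into intervals. A minimal sketch (frame rate and gap tolerance are assumptions):

```python
def appearance_sections(frames, fps=25.0, max_gap=12):
    """Merge per-frame hits of one person into time sections (in seconds).

    Gaps of up to `max_gap` frames (e.g. missed detections) are bridged.
    """
    sections = []
    for f in sorted(frames):
        if sections and f - sections[-1][1] <= max_gap:
            sections[-1][1] = f           # extend the current section
        else:
            sections.append([f, f])       # start a new section
    return [(start / fps, end / fps) for start, end in sections]

# Frames where a person's face was recognized, with a large gap in between
hits = list(range(0, 25)) + [27, 29] + list(range(250, 275))
sections = appearance_sections(hits)
```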
12. Non-interactive version of 360° video
• Generate non-interactive version of 360° video
• For archiving purposes, a preview version of the video
(in addition to the original 360° video) can be useful
• For consumption of 360° video on older TV sets, or as a
"lean-back mode" for users who do not want to interact
• Rough algorithm workflow
• Works iteratively, shot-per-shot
• Extract all scene objects (focusing on persons currently)
• Determine the most "interesting" person for the current
shot (based on size, movement, what we have seen in the
last shot, etc.) and track it
Non-interactive version of a 360° music video
(each row is one generated shot)
Content provided by RBB for Hyper360 project.
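Selecting the most "interesting" person could combine the mentioned cues (size, movement) into a scalar score. A toy sketch with hypothetical weights (not the actual algorithm):

```python
def interestingness(box, prev_box, frame_w, frame_h, w_size=1.0, w_motion=0.5):
    """Toy score for picking the person to follow: larger and faster is better.

    `box` is (x1, y1, x2, y2) in pixels; the weights are hypothetical.
    """
    size = (box[2] - box[0]) * (box[3] - box[1]) / (frame_w * frame_h)
    if prev_box is None:
        motion = 0.0
    else:
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        px, py = (prev_box[0] + prev_box[2]) / 2, (prev_box[1] + prev_box[3]) / 2
        motion = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 / max(frame_w, frame_h)
    return w_size * size + w_motion * motion

def most_interesting(tracks, prev_boxes, frame_w, frame_h):
    """Pick the track id with the highest score in the current shot."""
    return max(tracks, key=lambda tid: interestingness(
        tracks[tid], prev_boxes.get(tid), frame_w, frame_h))
```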
13. Non-interactive version of 360° video – outlook
• << Demovideo >>
• Currently working on addressing some limitations of the original algorithm
• More diverse shot types: close-up, wide-angle shot, panning shot, …
(currently, all shots are tracking shots with horizontal FOV of 75°)
• Employ best-practice rules for framing and "continuity editing"
• Avoid jump-cuts
• 180° rule
• …
• The goal is a "virtual director" which tries to mimic a certain human director's style
14. Acknowledgments
• Thanks to the “Hyper360” project partners RBB, Mediaset, Fraunhofer Fokus, Drukka for providing the
360° video sequences for research and development purposes within the project.
• The research leading to these results has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No. 761934 (Hyper360) and grant
agreement No. 761802 (MARCONI)
• http://www.hyper360.eu/
• https://www.projectmarconi.eu/
15. Thank you for your attention!
Contact:
hannes.fassold@joanneum.at
JOANNEUM RESEARCH
Forschungsgesellschaft mbH
DIGITAL– Institut für Informations-
und Kommunikationstechnologien
Steyrergasse 17, 8010 Graz
Tel. +43 316 876-5000
digital@joanneum.at
www.joanneum.at/digital