Ivan Laptev
ivan.laptev@inria.fr
WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision: Weakly-supervised learning from video and images
CS Club, Saint Petersburg, November 17, 2014
Joint work with:
Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Leon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic
Contacts:
Official website: http://visionlabs.ru/
Contact person: Alexander Khanin
E-mail: a.khanin@visionlabs.ru
Tel.: +7 (926) 988-7891
VisionLabs is a team of professionals with substantial knowledge and considerable practical experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening up new opportunities to change the world around us for the better.
About the company (Advertisement)
Team
Alexander Khanin: Chief Executive Officer
Alexey Nekhaev: Executive Officer
Slava Kazmin: Chief Technical Officer
Ivan Laptev: Scientific advisor
Sergey Milyaev: Senior CV engineer
Alexey Kordichev: Financial advisor
Ivan Truskov: Software developer
Sergey Cherepanov: Software developer
Our team is a symbiosis of science and business
Areas of activity
Face recognition technology
System for detecting fraudsters in banks
License plate recognition technology
System for vehicle access logging and automation
Technologies for a safe city
System for detecting violations and dangerous situations
State-scale projects
Achievements
We are looking for like-minded people
Creating and deploying intelligent systems
Solving interesting practical problems
Working in a friendly, ambitious team
Thank you for your attention!
What is Computer Vision?
What is the recent progress?
1990s:
Research: recognition at the level of a few toy objects (COIL 20 dataset)
Industry: automated quality inspection (controlled lighting, scale, …)
Now:
Industry: face recognition in social media
Research: ImageNet: 14M images, 21K classes; 6% Top-5 error rate in 2014 Challenge
Why image and video analysis?
Data:
~5K image uploads every min.
>34K hours of video uploaded every day
TV channels recorded since the 60's
~30M surveillance cameras in US => ~700K video hours/day
~2.5 billion new images / month
And even more with future wearable devices
Why looking at people?
How many person-pixels are in the video?
Movies, TV, YouTube
40%, 35%, 34%
How many person pixels in our daily life?
Wearable camera data: Microsoft SenseCam dataset
~4%
What are the difficulties?
• Large variations in appearance: occlusions, non-rigid motion, view-point changes, clothing…
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• Action vocabulary is not well-defined
Action Open: …
Action Hugging: …
This talk: 
Brief overview of recent techniques 
Weakly-supervised learning from video and scripts 
Weakly-supervised learning with convolutional neural networks
Standard visual recognition pipeline 
GetOutCar 
AnswerPhone 
Kiss 
HandShake 
StandUp 
DriveCar 
Collect image/video samples and corresponding class labels 
Design appropriate data representation, with certain invariance properties 
Design / use existing machine learning methods for learning and classification
Bag-of-Features action recognition
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
Extraction of local features: space-time patches
Feature description
Feature quantization: K-means clustering (k=4000)
Occurrence histogram of visual words
Non-linear SVM with χ2 kernel
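The quantization, histogram and kernel steps of this pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the vocabulary is assumed to be given (in the talk it comes from K-means with k=4000), the descriptors are toy 2-D points, and the function names and the `gamma` parameter are hypothetical. The exponential form of the χ2 kernel is also an assumption, a common choice for this kind of histogram.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(vocabulary)),
               key=lambda k: sq_dist(descriptor, vocabulary[k]))

def bof_histogram(descriptors, vocabulary):
    """L1-normalised occurrence histogram of visual words for one video."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms (empty bins are skipped)."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel; gamma is a hypothetical bandwidth."""
    return math.exp(-gamma * chi2_distance(h1, h2))

# Toy example: 3 visual words, 4 local descriptors from one clip.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
clip_hist = bof_histogram(descriptors, vocabulary)
```

The kernel values between all pairs of training clips form a matrix that an SVM accepting precomputed kernels can consume directly.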
Action classification 
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
Where to get training data?
• Shoot actions in the lab
KTH dataset, Weizmann dataset, …
- Limited variability
- Unrealistic
• Manually annotate existing content
HMDB, Olympic Sports, UCF50, UCF101, …
- Very time-consuming
• Use readily-available video scripts
www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
- Scripts are available for 1000's of hours of movies and TV series
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa... 
Script-based video annotation
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
• Scripts available for >500 movies (no time synchronization)
www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time info.) are available for most movies
• Can transfer time to scripts by text alignment
subtitles:
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
movie script (time transferred from subtitles: 01:20:17, 01:20:23):
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
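A rough sketch of how time can be transferred from subtitles to script lines by text matching. This is not the alignment used in the cited work: it matches each script line independently to its most similar subtitle using word-level `difflib.SequenceMatcher`, whereas a real system would align the two texts globally. The function names and the `min_ratio` threshold are hypothetical.

```python
import difflib
import re

def words(text):
    """Lower-cased word tokens, keeping apostrophes (weren't, why'd)."""
    return re.findall(r"[a-z']+", text.lower())

def transfer_times(subtitles, script_lines, min_ratio=0.6):
    """Give each script line the time span of its best-matching subtitle.

    subtitles: list of (start, end, text); script_lines: list of str.
    Returns one (start, end) tuple per script line, or None when no
    subtitle is similar enough (e.g. scene descriptions).
    """
    result = []
    for line in script_lines:
        line_words = words(line)
        best, best_ratio = None, min_ratio
        for start, end, text in subtitles:
            matcher = difflib.SequenceMatcher(None, line_words, words(text))
            ratio = matcher.ratio()
            if ratio >= best_ratio:
                best, best_ratio = (start, end), ratio
        result.append(best)
    return result

subtitles = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way."),
]
script_lines = [
    "Why weren't you honest with me? Why did you keep your marriage a secret?",
    "Rick sits down with Ilsa.",
    "Oh, it wasn't my secret, Richard. Victor wanted it that way.",
]
spans = transfer_times(subtitles, script_lines)
```

The scene description "Rick sits down with Ilsa." matches no subtitle, so it gets no time span of its own; its time has to be inferred from the dialogue lines around it, which is one source of the temporal uncertainty in script-based labels.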
Scripts as weak supervision
Challenges:
• Imprecise temporal localization (uncertainty: 24:25 to 24:51)
• No explicit spatial localization
• NLP problems, scripts ≠ training labels
“… Will gets out of the Chevrolet. …” “… Erin exits her new truck…” vs. Get-out-car
Previous work 
Sivic, Everingham, and Zisserman, "'Who are you?' - Learning Person Specific Classifiers from Video", In CVPR 2009.
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009.
Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009.
…wanted to know about the history of the trees
Joint Learning of Actors and Actions
[Bojanowski et al. ICCV 2013]
Rick walks up behind Ilsa
Rick? Walks?
Rick, Walks
Formulation: Cost function
Actor labels: Rick, Ilsa, Sam
Actor image features
Actor classifier
Weak supervision from scripts:
• Person p appears at least once in clip N (p = Rick)
• Action a appears at least once in clip N (a = Walk)
• Person p and action a appear in clip N
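The three script constraints can be read as checks on a candidate labelling of the person tracks in a clip. The sketch below is only an illustration under assumed data structures (one name and at most one action per track, stored in dictionaries); the function names are hypothetical, and the actual method optimizes a cost function under such constraints rather than checking them after the fact.

```python
def appears(labels, clip_tracks, wanted):
    """True if at least one track in the clip carries the wanted label."""
    return any(labels[n] == wanted for n in clip_tracks)

def appear_jointly(person_of, action_of, clip_tracks, person, action):
    """True if a single track in the clip is both `person` and doing `action`."""
    return any(person_of[n] == person and action_of[n] == action
               for n in clip_tracks)

def violated(person_of, action_of, constraints):
    """Indices of script constraints not satisfied by the labelling.

    Each constraint is (clip_tracks, person or None, action or None).
    """
    bad = []
    for i, (clip_tracks, person, action) in enumerate(constraints):
        if person is not None and action is not None:
            ok = appear_jointly(person_of, action_of, clip_tracks, person, action)
        elif person is not None:
            ok = appears(person_of, clip_tracks, person)
        else:
            ok = appears(action_of, clip_tracks, action)
        if not ok:
            bad.append(i)
    return bad

# Toy labelling of four person tracks (0..3) spread over two clips.
person_of = {0: "Rick", 1: "Ilsa", 2: "Sam", 3: "Ilsa"}
action_of = {0: "Walk", 1: None, 2: None, 3: "Walk"}
constraints = [
    ([0, 1], "Rick", None),    # Rick appears in the first clip
    ([0, 1], None, "Walk"),    # someone walks in the first clip
    ([0, 1], "Rick", "Walk"),  # Rick walks in the first clip
    ([2, 3], "Rick", None),    # Rick appears in the second clip
    ([2, 3], "Sam", "Walk"),   # Sam walks in the second clip
]
broken = violated(person_of, action_of, constraints)
```

The last constraint shows why the joint form is stronger than the two separate ones: the second clip contains Sam and contains a walking person, but not on the same track.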
Image and video features
Face features:
• Facial features [Everingham'06]
• HOG descriptor on normalized face image
Action features:
• Dense Trajectory features in person bounding box [Wang et al.,'11]
Results for Person Labelling 
American Beauty (11 character names)
Casablanca (17 character names)
Results for Person + Action Labelling 
Casablanca, 
Walking
Finding Actions and Actors in Movies 
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
Action Learning with Ordering Constraints
[Bojanowski et al. ECCV 2014]
Cost Function
Weak supervision from ordering constraints on Z:
[Figure: assignment matrix Z; axes: video time intervals, action index; action labels shown: 2, 4, 1, 2, 3, 2]
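One way to picture the ordering constraint: given a per-interval cost for each annotated action, the cheapest assignment that respects the annotated order can be found by dynamic programming over monotone paths. The sketch below assumes that every annotated action occupies at least one consecutive run of intervals and that the first and last intervals belong to the first and last action; those assumptions, the cost matrix and the function name are illustrative, not taken from the slides.

```python
def best_ordered_assignment(cost):
    """Cheapest monotone assignment of T time intervals to K ordered actions.

    cost[t][k] is the cost of giving interval t to the k-th annotated action.
    The path starts at action 0, ends at action K-1, and at each step either
    stays on the same action or moves to the next one.
    Returns (path, total_cost) with path[t] = action index of interval t.
    """
    T, K = len(cost), len(cost[0])
    if T < K:
        raise ValueError("need at least one interval per annotated action")
    INF = float("inf")
    best = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0][0] = cost[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            move = best[t - 1][k - 1] if k > 0 else INF
            if move < stay:
                best[t][k], back[t][k] = move + cost[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + cost[t][k], k
    # Follow the back-pointers from the last interval to recover the path.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[T - 1][K - 1]

# Five intervals, three annotated actions; low cost marks a good match.
cost = [
    [0, 5, 5],
    [0, 5, 5],
    [5, 0, 5],
    [5, 5, 0],
    [5, 5, 0],
]
path, total = best_ordered_assignment(cost)
action_labels = [2, 4, 1]            # labels of the ordered annotations
labels_per_interval = [action_labels[k] for k in path]
```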
Is the optimization tractable? 
•Path constraints are implicit 
•Cannot use off-the-shelf solvers 
•Frank-Wolfe optimization algorithm
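Frank-Wolfe fits this setting because each iteration needs only a linear minimization over the feasible set, never a projection onto it. The sketch below shows the algorithm on a stand-in problem: the feasible set is the probability simplex, where the linear step just picks the coordinate with the smallest gradient, and the objective is a simple quadratic. Both are assumptions made to keep the example self-contained. In the talk's problem the linear step would instead be a search over valid assignments.

```python
def frank_wolfe_simplex(grad, dim, iterations=500):
    """Minimise a smooth convex function over the probability simplex.

    grad(x) returns the gradient at x as a list of length dim.
    """
    x = [1.0 / dim] * dim
    for t in range(iterations):
        g = grad(x)
        # Linear minimisation oracle: the best vertex of the simplex.
        s = min(range(dim), key=lambda i: g[i])
        step = 2.0 / (t + 2.0)          # standard step-size schedule
        x = [(1.0 - step) * xi for xi in x]
        x[s] += step                    # move towards the chosen vertex
    return x

# Stand-in objective: squared distance to a target point in the simplex.
target = [0.2, 0.5, 0.3]

def grad(x):
    return [2.0 * (xi - bi) for xi, bi in zip(x, target)]

x = frank_wolfe_simplex(grad, 3)
```

Every iterate is a convex combination of vertices, so it stays feasible by construction; this is what makes the method usable when the constraints are only available implicitly through an oracle.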
Results
• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated by a sequence of n actions (2≤n≤11)
Object recognition
Convolutional Neural Networks
• ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
• Krizhevsky et al. ILSVRC12 results improve on other methods by a large margin
2012
2014: GoogLeNet: 6%
CNN of Krizhevsky et al. NIPS'12
• Learns low-level features at the first layer.
• Has some tricks but the main principle is similar to LeCun'88
• Has 60M parameters and 650K neurons.
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementation. Neither (a) nor (b) was available until very recently.
Approach
1. Design training/test procedure using sliding windows
2. Train adaptation layers to map labels
See also [Girshick et al.'13], [Donahue et al.'13], [Sermanet et al. '14], [Zeiler and Fergus '13], Transfer learning workshop at ICCV'13, ImageNet workshop at ICCV'13
Approach: sliding window training / testing
Results
Object localization
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Vision works?
VOC Action Classification Taster Challenge 
Given the bounding box of a person, predict whether they are performing a given action 
Playing Instrument? 
Reading? 
Goal: encourage research on still-image activity recognition, i.e. a more detailed understanding of the image
Nine Action Classes 
Phoning 
Playing Instrument 
Reading 
Riding Bike 
Riding Horse 
Running 
Taking Photo 
Using Computer 
Walking
CNN action recognition and localization
Qualitative results: reading
Qualitative results: phoning
Qualitative results: playing instrument
Results PASCAL VOC 2012 
Object classification 
Action classification 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Are bounding boxes needed for training CNNs? 
Image-level labels: Bicycle, Person 
[Oquab, Bottou, Laptev, Sivic, 2014]
Motivation: labeling bounding boxes is tedious
Motivation: image-level labels are plentiful
“Beautiful red leaves in a back street of Freiburg”
[Kuznetsova et al., ACL 2013]
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
“Public bikes in Warsaw during night”
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]
Example training images with bounding boxes
The locations of objects or their parts learnt by the CNN
NB: Related to multiple instance learning, e.g. [Viola et al.'05], and weakly supervised object localization, e.g. [Pandey and Lazebnik'11], [Prest et al.'12], [Oh Song et al. ICML'14], …
Approach: search over object's location
1. Efficient window sliding to find object location hypothesis
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allow multiple objects in image)
See also [Sermanet et al. '14] and [Chatfield et al.'14]
[Figure: C1-C2-C3-C4-C5 (9216-dim vector), FC6 (4096-dim vector), FC7 (4096-dim vector), FCa, FCb; per-class outputs (… motorbike, person, diningtable, pottedplant, chair, car, bus, train …); max-pool over image gives the per-image score]
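Step 2 can be illustrated with a small sketch: each class has a 2-D map of scores, one per sliding-window position, and the image-level score is simply the maximum over the map. The position of the maximum then serves as a location hypothesis. The dictionary layout and function name are assumptions for illustration; the scores are made up.

```python
def max_pool_scores(score_maps):
    """Global max-pooling of per-class score maps.

    score_maps: dict mapping class name -> 2-D list of window scores.
    Returns dict mapping class name -> (image_score, (row, col)), where
    (row, col) is the position of the highest-scoring window.
    """
    pooled = {}
    for cls, grid in score_maps.items():
        candidates = ((value, (r, c))
                      for r, row in enumerate(grid)
                      for c, value in enumerate(row))
        pooled[cls] = max(candidates, key=lambda pair: pair[0])
    return pooled

# Made-up scores on a 2x2 grid of window positions.
score_maps = {
    "car":    [[-1.0, 0.2], [0.1, 2.5]],
    "person": [[-3.0, -2.0], [-4.0, -2.5]],
}
pooled = max_pool_scores(score_maps)
```

Because only the maximum enters the image-level score, training with image-level labels pushes the network to raise the score at the most object-like location, which is what makes localization possible without bounding boxes.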
1. Efficient window sliding to find object location
Figure 2: Network architecture. The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio with respect to the input image.
Input: 3 maps
C1: 192 maps, norm, pool, 1:8
C2: 256 maps, norm, pool, 1:16
C3: 384 maps, 1:16
C4: 384 maps, 1:16
C5: 256 maps, pool, 1:32
FC6: 6144 maps, dropout, 1:32
FC7: 6144 maps, dropout, 1:32
FCa: 2048 maps, dropout, 1:32
FCb: 20 maps, 1:32
final-pool: 20
Convolutional feature extraction layers (C1 to FC7): trained on 1512 ImageNet classes (Oquab et al., 2014)
Adaptation layers (FCa, FCb): trained on Pascal VOC
2. Image-level aggregation using global max-pool 
3. Multi-label loss function
(to allow for multiple objects in image)
Sum of K (=20) log-loss functions, one for each of K classes:
K-vector of network output for image x
K-vector of (+1,-1) labels indicating presence/absence of each class
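The loss described here can be sketched as follows, assuming the standard logistic form log(1 + exp(-y·f)) for each class term (the slide's formula itself did not survive extraction, so that form is an assumption). Function names are hypothetical; the per-class term is written in a numerically stable way so that large scores do not overflow.

```python
import math

def log_loss_term(score, label):
    """log(1 + exp(-label * score)) for one class, computed stably.

    score: network output for the class; label: +1 (present) or -1 (absent).
    """
    margin = -label * score
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def multilabel_loss(scores, labels):
    """Sum of the K per-class log-losses for one image."""
    return sum(log_loss_term(s, y) for s, y in zip(scores, labels))

# Image containing a person and a bicycle but no car (K = 3 for brevity).
scores = [3.0, 2.0, -4.0]   # per-image scores after max-pooling
labels = [+1, +1, -1]       # presence/absence of each class
loss = multilabel_loss(scores, labels)
```

Each class contributes independently, so several classes can be present in the same image without competing for probability mass as they would under a softmax loss.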
Search for objects using max-pooling
Score maps: aeroplane map, car map (each followed by max-pool)
«Found something there!» Receptive field of the maximum-scoring neuron
Correct label: increase score for this class («Keep up the good work!»)
Incorrect label: decrease score for this class («Wrong!»)
Search for objects using max-pooling 
What is the effect of errors?
Multi-scale training and testing
Figure 3: Weakly supervised training
Figure 4: Multiscale object recognition
Rescale [0.7…1.4]; convolutional adaptation layers; class outputs: chair, diningtable, sofa, pottedplant, person, car, bus, train, …
Training videos
Test results on 80 classes in Microsoft COCO dataset

FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 

Computer Vision

  • 1. Ivan Laptev, ivan.laptev@inria.fr, WILLOW, INRIA/ENS/CNRS, Paris. Computer Vision: Weakly-supervised learning from video and images. CSClub, Saint Petersburg, November 17, 2014. Joint work with: Piotr Bojanowski, Rémi Lajugie, Maxime Oquab, Francis Bach, Leon Bottou, Jean Ponce, Cordelia Schmid, Josef Sivic
  • 2. (Advertisement) About the company. VisionLabs is a team of professionals with considerable knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems. We create and deploy computer vision technologies, opening up new opportunities to change the world around us for the better. Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin. E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891
  • 3. (Advertisement) Team: Alexander Khanin (Chief Executive Officer), Alexey Nekhaev (Executive Officer), Slava Kazmin (Chief Technical Officer), Ivan Laptev (Scientific advisor), Sergey Milyaev (Senior CV engineer), Alexey Kordichev (Financial advisor), Ivan Truskov (Software developer), Sergey Cherepanov (Software developer). Our team is a symbiosis of science and business. Areas of activity: face recognition technology (system for detecting fraudsters in banks); license plate recognition technology (system for vehicle access accounting and automation); technologies for a safe city (system for detecting violations and dangerous situations).
  • 4. (Advertisement) State-scale projects. Achievements
  • 5. (Advertisement) We are looking for like-minded people: creating and deploying intelligent systems; solving interesting practical problems; working in a friendly, ambitious team. Thank you for your attention! Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin. E-mail: a.khanin@visionlabs.ru Tel.: +7 (926) 988-7891
  • 7. What is Computer Vision?
  • 8.
  • 9. What is the recent progress? 1990s: Research: recognition at the level of a few toy objects (COIL 20 dataset). Industry: automated quality inspection (controlled lighting, scale, …). Now: face recognition in social media; ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge.
  • 10. Why image and video analysis? Data: ~5K image uploads every min.; >34K hours of video upload every day; TV channels recorded since the 60’s; ~30M surveillance cameras in US => ~700K video hours/day; ~2.5 billion new images / month. And even more with future wearable devices.
  • 11. Why looking at people? How many person-pixels are in the video? Movies, TV, YouTube
  • 12. Why looking at people? How many person-pixels are in the video? Movies TV YouTube: 40% 35% 34%
  • 13. How many person pixels in our daily life? Wearable camera data: Microsoft SenseCam dataset
  • 14. How many person pixels in our daily life? Wearable camera data: Microsoft SenseCam dataset: ~4%
  • 15. What are the difficulties? • Large variations in appearance: occlusions, non-rigid motion, view-point changes, clothing… • Manual collection of training samples is prohibitive: many action classes, rare occurrence • Action vocabulary is not well-defined. Examples: Action Open: … Action Hugging: …
  • 16. This talk: • Brief overview of recent techniques • Weakly-supervised learning from video and scripts • Weakly-supervised learning with convolutional neural networks
  • 17. Standard visual recognition pipeline: • Collect image/video samples and corresponding class labels (GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar) • Design appropriate data representation, with certain invariance properties • Design / use existing machine learning methods for learning and classification
  • 18. Bag-of-Features action recognition [Laptev, Marszałek, Schmid, Rozenfeld 2008]: space-time patches; extraction of local features; feature description; K-means clustering (k=4000); feature quantization; occurrence histogram of visual words; non-linear SVM with χ² kernel
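The last stages of the bag-of-features pipeline on slide 18 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes local descriptors and a visual vocabulary (for example K-means centres) are already available as plain lists of floats, and the function names and the `gamma` parameter are hypothetical.

```python
import math

def quantize(descriptor, vocabulary):
    """Index of the nearest visual word (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((d - v) ** 2 for d, v in zip(descriptor, vocabulary[i])),
    )

def bof_histogram(descriptors, vocabulary):
    """L1-normalized occurrence histogram of visual words for one video."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[quantize(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def chi2_distance(h1, h2):
    """Chi-square distance 0.5 * sum (a - b)^2 / (a + b), skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponentiated chi-square kernel; gamma is an assumed scale parameter."""
    return math.exp(-gamma * chi2_distance(h1, h2))
```

The resulting kernel values could be passed as a precomputed kernel matrix to any SVM solver; the vocabulary size of 4000 on the slide would simply be the length of `vocabulary`.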
  • 19. Action classification Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
  • 20. Where to get training data? Shoot actions in the lab • KTH dataset, Weizmann dataset, … - Limited variability - Unrealistic. Manually annotate existing content • HMDB, Olympic Sports, UCF50, UCF101, … - Very time-consuming. Use readily-available video scripts • www.dailyscript.com, www.movie-page.com, www.weeklyscript.com - Scripts are available for 1000’s of hours of movies and TV-series - Scripts describe dynamic and static content of videos
  • 21. As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
  • 25. Script-based video annotation [Laptev, Marszałek, Schmid, Rozenfeld 2008] • Scripts available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com … • Subtitles (with time info.) are available for most movies • Can transfer time to scripts by text alignment. Subtitles: … 1172 01:20:17,240 --> 01:20:20,437 Why weren't you honest with me? Why'd you keep your marriage a secret? 1173 01:20:20,640 --> 01:20:23,598 It wasn't my secret, Richard. Victor wanted it that way. 1174 01:20:23,800 --> 01:20:26,189 Not even our closest friends knew about our marriage. … Movie script: … RICK: Why weren't you honest with me? Why did you keep your marriage a secret? Rick sits down with Ilsa. ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage. … The scene description is thereby localized between 01:20:17 and 01:20:23.
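The time transfer on slide 25 can be illustrated with a small word-level alignment. This is only a sketch under stated assumptions: it uses Python's standard `difflib.SequenceMatcher` as the aligner (the slides do not say which alignment method was used), and the function names and input layout are hypothetical. Script lines with no matching subtitle words, such as scene descriptions, inherit the interval spanned by the neighbouring matched dialogue.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lower-cased word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z']+", text.lower())

def transfer_times(subtitles, script_lines):
    """subtitles: list of (start, end, text); script_lines: list of str.
    Returns one (line, start, end) triple per script line."""
    sub_words, sub_owner = [], []
    for i, (_, _, text) in enumerate(subtitles):
        for w in words(text):
            sub_words.append(w)
            sub_owner.append(i)
    scr_words, scr_owner = [], []
    for i, line in enumerate(script_lines):
        for w in words(line):
            scr_words.append(w)
            scr_owner.append(i)

    # Collect, for every script line, the subtitles sharing aligned words with it.
    matcher = SequenceMatcher(None, scr_words, sub_words, autojunk=False)
    hits = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            hits.setdefault(scr_owner[block.a + k], set()).add(sub_owner[block.b + k])

    times = [None] * len(script_lines)
    for i, subs in hits.items():
        times[i] = (subtitles[min(subs)][0], subtitles[max(subs)][1])

    # Unmatched lines take the start of the previous match and the end of the next.
    result = []
    for i, line in enumerate(script_lines):
        if times[i] is None:
            prev = next((times[j] for j in range(i - 1, -1, -1) if times[j]), None)
            nxt = next((times[j] for j in range(i + 1, len(times)) if times[j]), None)
            result.append((line, prev[0] if prev else None, nxt[1] if nxt else None))
        else:
            result.append((line, times[i][0], times[i][1]))
    return result
```

On the Casablanca excerpt from the slide, the line "Rick sits down with Ilsa." has no counterpart in the subtitles, so it is assigned the interval from the start of Rick's question to the end of Ilsa's answer.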
  • 26. Scripts as weak supervision. Challenges: • Imprecise temporal localization (uncertainty, e.g. between 24:25 and 24:51) • No explicit spatial localization • NLP problems, scripts ≠ training labels: “… Will gets out of the Chevrolet. …” vs. “… Erin exits her new truck…” for Get-out-car
  • 27. Previous work: Sivic, Everingham, and Zisserman, "''Who are you?'' Learning Person Specific Classifiers from Video", In CVPR 2009. Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009. Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009. Subtitle example: “…wanted to know about the history of the trees”
  • 28. Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]. Script: “Rick walks up behind Ilsa”. Rick? Rick? Walks? Walks?
  • 29. Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]. Script: “Rick walks up behind Ilsa”. Rick, Walks
  • 30. Formulation: Cost function. Actor labels (Rick, Ilsa, Sam); actor image features; actor classifier
  • 31. Formulation: Cost function. Weak supervision from scripts: person p appears at least once in clip N (p = Rick)
  • 32. Formulation: Cost function. Weak supervision from scripts: action a appears at least once in clip N (a = Walk)
  • 33. Formulation: Cost function. Weak supervision from scripts: person p appears in clip N; action a appears in clip N; person p and action a appear in clip N
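The three kinds of script constraints on slides 31 to 33 can be read as "at least one person track in the clip carries the mentioned label". The check below is a hypothetical illustration of that reading only: the data layout and the function name are assumptions, and it verifies a given labelling rather than optimizing the cost function from the talk.

```python
def satisfies_script_constraints(person_of, action_of, constraints):
    """person_of[n] / action_of[n]: label of person track n (None if unlabelled).
    constraints: list of (tracks, person, action), where tracks is the set of
    track indices inside the clip aligned with one script sentence, and person
    or action is None when the sentence does not mention it."""
    for tracks, person, action in constraints:
        found = any(
            (person is None or person_of[n] == person)
            and (action is None or action_of[n] == action)
            for n in tracks
        )
        if not found:
            return False  # the sentence is not supported by any track in the clip
    return True
```

A joint constraint such as ("Rick", "Walk") requires both labels on the same track, which is stricter than requiring the person and the action separately.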
  • 34. Image and video features. Face features: • Facial features [Everingham’06] • HOG descriptor on normalized face image. Action features: • Dense Trajectory features in person bounding box [Wang et al.,’11]
  • 35. Results for Person Labelling: American Beauty (11 character names), Casablanca (17 character names)
  • 36. Results for Person + Action Labelling: Casablanca, Walking
  • 37. Finding Actions and Actors in Movies [Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
  • 38. Action Learning with Ordering Constraints [Bojanowski et al. ECCV 2014]
  • 40. Cost Function Weak supervision from ordering constraints on Z: Action label Action index 2 4 1 2 3 2 Video time intervals
  • 43. Is the optimization tractable? • Path constraints are implicit • Cannot use off-the-shelf solvers • Frank-Wolfe optimization algorithm
  • 44. Results: • 937 video clips from 60 Hollywood movies • 16 action classes • Each clip is annotated by a sequence of n actions (2≤n≤11)
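Frank-Wolfe fits this setting because it only needs a routine that minimizes a linear function over the feasible set. The sketch below assumes that, for ordering constraints, this linear step amounts to finding a minimum-cost monotone assignment of time intervals to the ordered action list, which dynamic programming solves. The quadratic objective `0.5 * ||Z - target||^2` is a hypothetical stand-in for the talk's cost function, and all names are illustrative.

```python
def min_cost_monotone_path(cost):
    """cost[t][k]: cost of giving interval t the k-th action of the ordered list.
    Returns a non-decreasing assignment that starts at action 0, ends at the
    last action and advances by at most one action per interval."""
    T, K = len(cost), len(cost[0])
    if T < K:
        raise ValueError("need at least as many intervals as actions")
    INF = float("inf")
    best = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0][0] = cost[0][0]
    for t in range(1, T):
        for k in range(min(t, K - 1) + 1):
            stay = best[t - 1][k]
            move = best[t - 1][k - 1] if k > 0 else INF
            if move < stay:
                best[t][k], back[t][k] = move + cost[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + cost[t][k], k
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def path_to_matrix(path, K):
    """Binary T x K assignment matrix of a path."""
    return [[1.0 if a == k else 0.0 for k in range(K)] for a in path]

def frank_wolfe(target, n_iters=50):
    """Minimize 0.5 * ||Z - target||^2 over the convex hull of path matrices."""
    T, K = len(target), len(target[0])
    Z = path_to_matrix(min_cost_monotone_path([[0.0] * K for _ in range(T)]), K)
    for _ in range(n_iters):
        grad = [[Z[t][k] - target[t][k] for k in range(K)] for t in range(T)]
        S = path_to_matrix(min_cost_monotone_path(grad), K)  # linear step
        D = [[S[t][k] - Z[t][k] for k in range(K)] for t in range(T)]
        denom = sum(d * d for row in D for d in row)
        if denom == 0:
            break
        num = -sum(grad[t][k] * D[t][k] for t in range(T) for k in range(K))
        gamma = max(0.0, min(1.0, num / denom))  # exact line search for a quadratic
        if gamma <= 0:
            break
        Z = [[Z[t][k] + gamma * D[t][k] for k in range(K)] for t in range(T)]
    return Z
```

Each iterate stays a convex combination of valid paths, so the ordering constraints never have to be written out explicitly, which is the point of the "path constraints are implicit" remark.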
  • 45.
  • 47. Convolutional Neural Networks • ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images • Krizhevsky et al. ILSVRC12 results improve on other methods by a large margin. 2012 to 2014: GoogLeNet, 6%
  • 48. CNN of Krizhevsky et al. NIPS’12 • Learns low-level features at the first layer. • Has some tricks but the main principle is similar to LeCun’88 • Has 60M parameters and 650K neurons. • Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementation. Neither (a) nor (b) was available until very recently.
  • 49. Approach 1. Design training/test procedure using sliding windows 2. Train adaptation layers to map labels. See also [Girshick et al.’13], [Donahue et al.’13], [Sermanet et al. ’14], [Zeiler and Fergus ’13]; Transfer learning workshop at ICCV’13, ImageNet workshop at ICCV’13
  • 50. Approach: sliding window training / testing
  • 52. Results [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 55. Vision works? [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 56. VOC Action Classification Taster Challenge. Given the bounding box of a person, predict whether they are performing a given action (Playing Instrument? Reading?). Encourage research on still-image activity recognition: more detailed understanding of image
  • 57. Nine Action Classes: Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, Walking
  • 58. CNN action recognition and localization Qualitative results: reading
  • 59. CNN action recognition and localization Qualitative results: phoning
  • 60. CNN action recognition and localization Qualitative results: playing instrument
  • 61. Results, PASCAL VOC 2012: object classification, action classification [Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
  • 62. Are bounding boxes needed for training CNNs? Image-level labels: Bicycle, Person [Oquab, Bottou, Laptev, Sivic, 2014]
  • 63. Motivation: labeling bounding boxes is tedious
  • 64. Motivation: image-level labels are plentiful “Beautiful red leaves in a back street of Freiburg” [Kuznetsova et al., ACL 2013] http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
  • 65. Motivation: image-level labels are plentiful “Public bikes in Warsaw during night” https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
  • 66. Let the algorithm localize the object in the image [Oquab, Bottou, Laptev, Sivic, 2014]. Example training images with bounding boxes; the locations of objects or their parts learnt by the CNN. NB: Related to multiple instance learning, e.g. [Viola et al.’05] and weakly supervised object localization, e.g. [Pandey and Lazebnik’11], [Prest et al.’12], [Oh Song et al. ICML’14], …
  • 67. Approach: search over object’s location 1. Efficient window sliding to find object location hypothesis 2. Image-level aggregation (max-pool) 3. Multi-label loss function (allow multiple objects in image). See also [Sermanet et al. ’14] and [Chatfield et al.’14]. Architecture: C1-C2-C3-C4-C5 (9216-dim vector), FC6 (4096-dim vector), FC7 (4096-dim vector), FCa, FCb, per-class score maps (… motorbike, person, diningtable, pottedplant, chair, car, bus, train …), max-pool over image, per-image score
  • 68. 1. Efficient window sliding to find object location. Figure 2: Network architecture. The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), dropouts (dropout), and reports its subsampling ratio with respect to the input image. See [21, 26] and Section 3 for full details. Convolutional feature extraction layers trained on 1512 ImageNet classes (Oquab et al., 2014): input 3; C1: 192, norm, pool, 1:8; C2: 256, norm, pool, 1:16; C3: 384, 1:16; C4: 384, 1:16; C5: 256, pool, 1:32; FC6: 6144, dropout, 1:32; FC7: 6144, dropout, 1:32. Adaptation layers trained on Pascal VOC: FCa: 2048, dropout, 1:32; FCb: 20, 1:32; output: 20, final-pool.
  • 69. 2. Image-level aggregation using global max-pool (network architecture as in Figure 2)
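Global max-pooling reduces each per-class score map to a single image-level score, and the position of the maximum gives a rough location hypothesis. A minimal sketch, assuming the score maps are available as plain 2-D lists keyed by class name (the data layout and function name are hypothetical):

```python
def global_max_pool(score_maps):
    """score_maps: dict mapping class name -> 2-D list of window scores.
    Returns dict mapping class name -> (best score, (row, col) of the maximum)."""
    pooled = {}
    for cls, grid in score_maps.items():
        pooled[cls] = max(
            ((value, (r, c)) for r, row in enumerate(grid) for c, value in enumerate(row)),
            key=lambda item: item[0],
        )
    return pooled
```

The pooled score is what an image-level loss would see during training, so only the highest-scoring window of each class receives a gradient.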
  • 70. 3. Multi-label loss function (to allow for multiple objects in image): sum of K (=20) log-loss functions, one for each of K classes, computed from the K-vector of network output for image x and the K-vector of (+1,-1) labels indicating presence/absence of each class (network architecture as in Figure 2)
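The loss described on slide 70 can be written as a sum of K independent logistic losses. The sketch below assumes the standard form log(1 + exp(-y_k f_k(x))) for each class, with f_k(x) the pooled network output and y_k in {+1, -1}; the slide text does not show the formula itself, so this form and the function name are assumptions.

```python
import math

def multilabel_log_loss(scores, labels):
    """scores: K network outputs f_k(x); labels: K values in {+1, -1}.
    Returns the sum over classes of log(1 + exp(-y_k * f_k(x)))."""
    total = 0.0
    for f, y in zip(scores, labels):
        z = -y * f
        # Numerically stable softplus: log(1 + e^z) = max(z, 0) + log1p(e^-|z|)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total
```

Because every class has its own term, an image containing both a bicycle and a person is penalized for missing either one, unlike a softmax loss that forces the classes to compete.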
  • 71. Search for objects using max-pooling. Aeroplane map and car map, each followed by max-pool. Receptive field of the maximum-scoring neuron: «Found something there!» Correct label: increase score for this class («Keep up the good work!»). Incorrect label: decrease score for this class («Wrong!»).
  • 72. Search for objects using max-pooling. What is the effect of errors?
  • 73. Multi-scale training and testing. Figure 3: Weakly supervised training (rescale [0.7…1.4]; convolutional adaptation layers; class scores: chair, diningtable, sofa, pottedplant, person, car, bus, train, …). Figure 4: Multiscale object recognition (rescale; class scores: chair, diningtable, person, pottedplant, person, car, bus, train, …).
  • 75. Test results on 80 classes in Microsoft COCO dataset