Ivan Laptev 
ivan.laptev@inria.fr 
WILLOW, INRIA/ENS/CNRS, Paris 
Computer Vision: Weakly-supervised learning from video and images 
CS Club, Saint Petersburg, November 17, 2014 
Joint work with: 
Piotr Bojanowski – Rémi Lajugie – Maxime Oquab – Francis Bach – Léon Bottou – Jean Ponce – Cordelia Schmid – Josef Sivic
Contacts: 
Official website: http://visionlabs.ru/ 
Contact person: Alexander Khanin 
E-mail: a.khanin@visionlabs.ru 
Tel.: +7 (926) 988-7891 
VisionLabs is a team of professionals with deep knowledge and substantial hands-on experience in developing computer vision algorithms and intelligent systems. 
We create and deploy computer vision technologies, opening up new opportunities to change the world around us for the better. 
About the Company – Advertisement –
The Team 
Alexander Khanin, Chief Executive Officer 
Alexey Nekhaev, Executive Officer 
Slava Kazmin, Chief Technical Officer 
Ivan Laptev, Scientific advisor 
Sergey Milyaev, Senior CV engineer 
Alexey Kordichev, Financial advisor 
Ivan Truskov, Software developer 
Sergey Cherepanov, Software developer 
Our team is a symbiosis of science and business.
Areas of Activity 
Face recognition technology: a system for detecting fraudsters in banks 
License plate recognition technology: a system for vehicle tracking and automated access control 
Safe-city technologies: a system for detecting violations and dangerous situations 
– Advertisement –
– Advertisement – 
Nation-scale projects 
Achievements 
– Advertisement – 
We are looking for like-minded people: 
Creating and deploying intelligent systems 
Solving interesting practical problems 
Working in a friendly, ambitious team 
Thank you for your attention! 
Contacts: 
Official website: http://visionlabs.ru/ 
Contact person: Alexander Khanin 
E-mail: a.khanin@visionlabs.ru 
Tel.: +7 (926) 988-7891
What is Computer Vision?
Computer Vision
What is the recent progress? 
1990s: 
Recognition at the level of a few toy objects (COIL-20 dataset) 
Industry research: automated quality inspection (controlled lighting, scale, …) 
Now: 
Face recognition in social media 
ImageNet: 14M images, 21K classes 
6% Top-5 error rate in 2014 Challenge
Why image and video analysis? 
Data: 
~5K image uploads every minute 
>34K hours of video uploaded every day 
TV channels recorded since the 60's 
~30M surveillance cameras in the US => ~700K video hours/day 
~2.5 billion new images / month 
And even more with future wearable devices
Why looking at people? 
How many person-pixels are in the video? 
Movies: 40% 
TV: 35% 
YouTube: 34%
How many person-pixels in our daily life? 
Wearable camera data: Microsoft SenseCam dataset 
~4%
What are the difficulties? 
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing… 
• Manual collection of training samples is prohibitive: many action classes, rare occurrence 
• Action vocabulary is not well-defined (e.g., action "Open": …; action "Hugging": …)
This talk: 
Brief overview of recent techniques 
Weakly-supervised learning from video and scripts 
Weakly-supervised learning with convolutional neural networks
Standard visual recognition pipeline 
Classes: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar 
Collect image/video samples and corresponding class labels 
Design appropriate data representation, with certain invariance properties 
Design / use existing machine learning methods for learning and classification
Bag-of-Features action recognition 
[Laptev, Marszałek, Schmid, Rozenfeld 2008] 
Extraction of local features: space-time patches 
Feature description 
Feature quantization: K-means clustering (k=4000) 
Occurrence histogram of visual words 
Classification: non-linear SVM with χ² kernel 
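To make the pipeline concrete, here is a minimal Python/scikit-learn sketch of the same bag-of-features recipe (our illustration, not the authors' implementation); `train_descs`, `test_descs`, and `y_train` are hypothetical placeholders for pre-extracted space-time descriptors and clip labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def bof_histogram(descriptors, kmeans):
    """Quantize local descriptors and count visual-word occurrences."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalize

# 1) Build a visual vocabulary (the slide uses k=4000; 100 keeps the demo fast).
all_train = np.vstack(train_descs)            # list of (n_i, d) arrays per clip
kmeans = KMeans(n_clusters=100).fit(all_train)

# 2) Represent each clip as an occurrence histogram of visual words.
X_train = np.array([bof_histogram(d, kmeans) for d in train_descs])
X_test = np.array([bof_histogram(d, kmeans) for d in test_descs])

# 3) Non-linear SVM with a chi-squared kernel, passed in as a precomputed Gram matrix.
K_train = chi2_kernel(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(chi2_kernel(X_test, X_train))
```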
Action classification 
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
Where to get training data? 
Shoot actions in the lab 
• KTH dataset, Weizmann dataset, … 
- Limited variability 
- Unrealistic 
Manually annotate existing content 
• HMDB, Olympic Sports, UCF50, UCF101, … 
- Very time-consuming 
Use readily-available video scripts 
• www.dailyscript.com, www.movie-page.com, www.weeklyscript.com 
- Scripts are available for 1000's of hours of movies and TV series 
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
Subtitles: 
… 
1172 
01:20:17,240 --> 01:20:20,437 
Why weren't you honest with me? 
Why'd you keep your marriage a secret? 
1173 
01:20:20,640 --> 01:20:23,598 
It wasn't my secret, Richard. 
Victor wanted it that way. 
1174 
01:20:23,800 --> 01:20:26,189 
Not even our closest friends 
knew about our marriage. 
… 
Movie script: 
… 
RICK 
Why weren't you honest with me? Why 
did you keep your marriage a secret? 
Rick sits down with Ilsa. 
ILSA 
Oh, it wasn't my secret, Richard. 
Victor wanted it that way. Not even 
our closest friends knew about our 
marriage. 
… 
(aligned time window: 01:20:17 – 01:20:23) 
Script-based video annotation 
• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, … 
• Subtitles (with time info) are available for most movies 
• Time can be transferred to scripts by text alignment 
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
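As a hedged sketch of the alignment idea (our illustration; `difflib` stands in for the dynamic-programming text alignment used in the paper, and the data structures are illustrative), subtitle timestamps can be transferred to matching script words like this:

```python
from difflib import SequenceMatcher

def align_times(subtitles, script_words):
    """subtitles: list of (start_sec, end_sec, text); returns a time
    estimate for each script word that matches a subtitle word."""
    sub_words, sub_times = [], []
    for start, end, text in subtitles:
        toks = text.lower().split()
        for i, w in enumerate(toks):
            sub_words.append(w)
            # Interpolate a time for each word inside its subtitle window.
            sub_times.append(start + (end - start) * i / max(len(toks), 1))
    times = [None] * len(script_words)
    matcher = SequenceMatcher(a=sub_words, b=[w.lower() for w in script_words])
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.b + k] = sub_times[block.a + k]
    return times

# Subtitle 1172 above: 01:20:17,240 --> 01:20:20,437 in seconds.
subs = [(4817.24, 4820.437, "Why weren't you honest with me?")]
script = "RICK Why weren't you honest with me?".split()
print(align_times(subs, script))  # None for "RICK", interpolated times otherwise
```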
Scripts as weak supervision 
Challenges: 
• Imprecise temporal localization (uncertainty: e.g., somewhere between 24:25 and 24:51) 
• No explicit spatial localization 
• NLP problems, scripts ≠ training labels: 
"… Will gets out of the Chevrolet. …", "… Erin exits her new truck …" vs. Get-out-car
Previous work 
Sivic, Everingham, and Zisserman, "'Who are you?' – Learning Person Specific Classifiers from Video," CVPR 2009. 
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)," CVPR 2009. 
Duchenne, Laptev, Sivic, Bach, and Ponce, "Automatic Annotation of Human Actions in Video," ICCV 2009.
Joint Learning of Actors and Actions 
[Bojanowski et al. ICCV 2013] 
Script: "Rick walks up behind Ilsa" 
Which person track is Rick? Which track is walking?
Formulation: Cost function 
Actor labels: Rick, Ilsa, Sam 
Actor image features 
Actor classifier 
Weak supervision from scripts: 
• Person p appears at least once in clip N (e.g., p = Rick) 
• Action a appears at least once in clip N (e.g., a = Walk) 
• Person p and action a appear together in clip N
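The cost-function formulas on these slides are images in the original deck. As a hedged sketch (our notation, not necessarily the paper's), such "appears at least once" supervision is naturally written as linear constraints on binary assignment variables:

```latex
% Hypothetical notation: z_{tp} = 1 if person track t in clip N is person p,
% y_{ta} = 1 if track t performs action a.
\sum_{t \in N} z_{tp} \;\ge\; 1
  \quad \text{(person $p$ appears at least once in clip $N$),}
\qquad
\sum_{t \in N} z_{tp}\, y_{ta} \;\ge\; 1
  \quad \text{(person $p$ performs action $a$ somewhere in clip $N$).}
```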
Image and video features 
• Face features: facial features [Everingham '06]; HOG descriptor on the normalized face image 
• Action features: Dense Trajectory features in the person bounding box [Wang et al. '11]
Results for Person Labelling 
American Beauty (11 character names) 
Casablanca (17 character names)
Results for Person + Action Labelling 
Casablanca, Walking
Finding Actions and Actors in Movies 
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
Action Learning with Ordering Constraints 
[Bojanowski et al. ECCV 2014]
Cost Function 
Weak supervision from ordering constraints on Z: each video comes with an ordered sequence of action labels (e.g., 2, 4, 1, 2, 3, 2) that must be assigned, in order, to consecutive video time intervals. 
[Figure: action index vs. video time intervals]
Is the optimization tractable? 
• Path constraints are implicit 
• Cannot use off-the-shelf solvers 
• Frank-Wolfe optimization algorithm
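As a reminder of how the conditional-gradient step works, here is a generic Frank-Wolfe sketch (illustrative only; the paper's linear minimization oracle runs over the implicit path constraints, which the unit simplex stands in for here):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, n_iters=100):
    """Minimize a smooth convex f over a convex set C given a linear
    minimization oracle lmo(g) = argmin_{s in C} <g, s>."""
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)
        s = lmo(g)                       # solve the linear subproblem over C
        gamma = 2.0 / (k + 2.0)          # standard step-size schedule
        x = (1 - gamma) * x + gamma * s  # convex combination stays inside C
    return x

# Example: project a point y onto the probability simplex, f(x) = 0.5*||x - y||^2.
y = np.array([0.8, 0.6, -0.2])
grad = lambda x: x - y
def simplex_lmo(g):
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0  # best vertex of the simplex for this gradient
    return s
print(frank_wolfe(grad, simplex_lmo, np.ones(3) / 3))
```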
Results 
• 937 video clips from 60 Hollywood movies 
• 16 action classes 
• Each clip is annotated with a sequence of n actions (2 ≤ n ≤ 11)
Computer Vision
Object recognition
Convolutional Neural Networks 
• The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images 
• The ILSVRC'12 results of Krizhevsky et al. improve on other methods by a large margin 
2012 … 2014: GoogLeNet reaches 6% top-5 error
CNN of Krizhevsky et al., NIPS'12 
• Learns low-level features at the first layer 
• Has some tricks, but the main principle is similar to LeCun '88 
• Has 60M parameters and 650K neurons 
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementations; neither (a) nor (b) was available until very recently
Approach 
1. Design a training/test procedure using sliding windows 
2. Train adaptation layers to map labels (a code sketch follows below) 
See also [Girshick et al. '13], [Donahue et al. '13], [Sermanet et al. '14], [Zeiler and Fergus '13]; the Transfer Learning workshop at ICCV'13 and the ImageNet workshop at ICCV'13
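A minimal PyTorch sketch of step 2, under the assumption that torchvision's AlexNet stands in for the network pretrained on ImageNet (this is our illustration, not the authors' code):

```python
import torch.nn as nn
from torchvision.models import alexnet

# Freeze the source representation learned on ImageNet.
pretrained = alexnet(weights="IMAGENET1K_V1")
for p in pretrained.parameters():
    p.requires_grad = False

# Keep everything up to FC7 (drop the 1000-way ImageNet classifier).
features_to_fc7 = nn.Sequential(
    pretrained.features,
    pretrained.avgpool,
    nn.Flatten(),
    *list(pretrained.classifier.children())[:-1],
)

# Adaptation layers (the FCa/FCb role): the only part trained on target labels.
n_target_classes = 20  # e.g., Pascal VOC
adaptation = nn.Sequential(
    nn.Linear(4096, 2048), nn.ReLU(),
    nn.Linear(2048, n_target_classes),
)
model = nn.Sequential(features_to_fc7, adaptation)
```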
Approach: sliding window training / testing
Results: object localization 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179] 
Vision works? 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
VOC Action Classification Taster Challenge 
Given the bounding box of a person, predict whether they are performing a given action 
Playing instrument? Reading? 
Goal: encourage research on still-image activity recognition and more detailed image understanding
Nine Action Classes 
Phoning 
Playing Instrument 
Reading 
Riding Bike 
Riding Horse 
Running 
Taking Photo 
Using Computer 
Walking
CNN action recognition and localization 
Qualitative results: reading
CNN action recognition and localization 
Qualitative results: phoning
CNN action recognition and localization 
Qualitative results: playing instrument
Results: PASCAL VOC 2012 
Object classification 
Action classification 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Are bounding boxes needed for training CNNs? 
Image-level labels: Bicycle, Person 
[Oquab, Bottou, Laptev, Sivic, 2014]
Motivation: labeling bounding boxes is tedious
Motivation: image-level labels are plentiful 
“Beautiful red leaves in a back street of Freiburg” 
[Kuznetsova et al., ACL 2013] 
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
Motivation: image-level labels are plentiful 
“Public bikes in Warsaw during night” 
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
Let the algorithm localize the object in the image 
[Oquab, Bottou, Laptev, Sivic, 2014] 
Example training images with bounding boxes 
The locations of objects or their parts are learnt by the CNN 
NB: related to multiple instance learning, e.g. [Viola et al. '05], and weakly supervised object localization, e.g. [Pandey and Lazebnik '11], [Prest et al. '12], [Oh Song et al. ICML '14], …
Approach: search over the object's location 
1. Efficient window sliding to find object location hypotheses 
2. Image-level aggregation (max-pool) 
3. Multi-label loss function (allow multiple objects per image) 
See also [Sermanet et al. '14] and [Chatfield et al. '14]; a code sketch of these three steps follows the architecture figure below. 
[Figure: C1-C5 convolutional layers, FC6 (9216-dim input, 4096-dim output), FC7 (4096-dim), and adaptation layers FCa, FCb produce per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …); a max-pool over the image yields the per-image score.]
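A hedged PyTorch sketch of the three steps (illustrative, not the released implementation): a fully-convolutional backbone plus a 1×1 adaptation layer produce a K-channel score map (efficient window sliding), global max-pooling aggregates it to one score per class, and a per-class log-loss allows several labels per image. `conv_backbone` and its channel count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 20  # number of target classes (e.g., Pascal VOC)

class WeakLocNet(nn.Module):
    """Fully-convolutional scorer: K score maps, max-pooled to one score per class."""
    def __init__(self, conv_backbone, in_channels=2048):
        super().__init__()
        self.backbone = conv_backbone               # C1..FC7 rewritten as convolutions
        self.fc_adapt = nn.Conv2d(in_channels, K, kernel_size=1)  # FCa/FCb role

    def forward(self, x):
        maps = self.fc_adapt(self.backbone(x))      # (B, K, H', W') score maps
        return maps.amax(dim=(2, 3))                # global max-pool -> (B, K)

def multilabel_log_loss(scores, labels):
    """Sum over classes of log(1 + exp(-y_k * f_k(x))), labels in {+1, -1}.
    softplus(z) = log(1 + e^z), used here for numerical stability."""
    return F.softplus(-labels * scores).sum(dim=1).mean()
```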
1. Efficient window sliding to find object location 
[Figure 2: Network architecture C1-C5, FC6, FC7, FCa, FCb. The layer legend indicates the number of maps, whether a layer performs cross-map normalization (norm), pooling (pool), or dropout, and its subsampling ratio with respect to the input image (1:8 up to 1:32). Convolutional feature extraction layers are trained on 1512 ImageNet classes (Oquab et al., 2014); the adaptation layers are trained on Pascal VOC.]
2. Image-level aggregation using global max-pool 
[Same network figure as above.]
3. Multi-label loss function (to allow for multiple objects in the image) 
[Same network figure as above.]
The loss is a sum of K (= 20) log-loss terms, one per class, computed from the K-vector of network outputs f(x) for image x and the K-vector of (+1, -1) labels y indicating the presence/absence of each class.
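The formula itself is an image in the original slides; assuming the standard per-class log-loss for ±1 labels, it reads:

```latex
\ell\big(f(x),\,y\big) \;=\; \sum_{k=1}^{K} \log\!\left(1 + e^{-y_k f_k(x)}\right),
\qquad y_k \in \{+1, -1\}, \quad K = 20.
```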
Search for objects using max-pooling 
[Figure: per-class score maps (aeroplane map, car map); max-pooling selects the receptive field of the maximum-scoring neuron ("Found something there!").] 
Correct label: increase the score for this class ("Keep up the good work!") 
Incorrect label: decrease the score for this class ("Wrong!")
Search for objects using max-pooling 
What is the effect of errors?
Multi-scale training and testing 
[Figure 3: weakly supervised training. Figure 4: multi-scale object recognition: the input image is rescaled (scales roughly 0.7 to 1.4) and passed through the convolutional adaptation layers, producing per-class score maps (chair, diningtable, sofa, pottedplant, person, car, bus, train, …) at each scale.]
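A short sketch of the multi-scale testing step (our illustration), reusing the `WeakLocNet` model from the earlier sketch and max-pooling class scores across scales as well as locations:

```python
import torch
import torch.nn.functional as F

def multiscale_scores(model, image, scales=(0.7, 1.0, 1.4)):
    """image: (1, 3, H, W) tensor; returns (K,) per-class scores."""
    per_scale = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        per_scale.append(model(resized))   # (1, K) after global max-pool
    # Max over scales, mirroring the max over spatial locations.
    return torch.stack(per_scale).amax(dim=0).squeeze(0)
```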
Training videos
Test results on 80 classes in the Microsoft COCO dataset