Ivan Laptev 
ivan.laptev@inria.fr 
WILLOW, INRIA/ENS/CNRS, Paris 
Computer Vision: Weakly-supervised learning from video and images 
CS Club, Saint Petersburg, November 17, 2014 
Joint work with: 
Piotr Bojanowski – Rémi Lajugie – Maxime Oquab – Francis Bach – Léon Bottou – Jean Ponce – Cordelia Schmid – Josef Sivic
Contacts: 
Official website: http://visionlabs.ru/ 
Contact person: Alexander Khanin 
E-mail: a.khanin@visionlabs.ru 
Tel.: +7 (926) 988-7891 
VisionLabs is a team of professionals with deep knowledge and substantial hands-on experience in developing computer vision algorithms and intelligent systems. 
We create and deploy computer vision technologies, opening up new opportunities to change the world around us for the better. 
About the Company – Advertisement –
The Team 
Alexander Khanin, Chief Executive Officer 
Alexey Nekhaev, Executive Officer 
Slava Kazmin, Chief Technical Officer 
Ivan Laptev, Scientific advisor 
Sergey Milyaev, Senior CV engineer 
Alexey Kordichev, Financial advisor 
Ivan Truskov, Software developer 
Sergey Cherepanov, Software developer 
Our team is a symbiosis of science and business.
Areas of Activity 
Face recognition technology: a system for detecting fraudsters in banks 
License plate recognition technology: a system for vehicle tracking and automated access control 
Safe-city technologies: a system for detecting violations and dangerous situations 
– Advertisement –
– Advertisement – 
Nation-scale projects 
Achievements 
– Advertisement – 
We are looking for like-minded people: 
Creating and deploying intelligent systems 
Solving interesting practical problems 
Working in a friendly, ambitious team 
Thank you for your attention! 
Contacts: 
Official website: http://visionlabs.ru/ 
Contact person: Alexander Khanin 
E-mail: a.khanin@visionlabs.ru 
Tel.: +7 (926) 988-7891
What is Computer Vision?
Computer Vision
What is the recent progress? 
1990s: 
Recognition at the level of a few toy objects (COIL-20 dataset) 
Industry research: automated quality inspection (controlled lighting, scale, …) 
Now: 
Face recognition in social media 
ImageNet: 14M images, 21K classes 
6% Top-5 error rate in 2014 Challenge
Why image and video analysis? 
Data: 
~5K image uploads every minute 
>34K hours of video uploaded every day 
TV channels recorded since the 60's 
~30M surveillance cameras in the US => ~700K video hours/day 
~2.5 billion new images / month 
And even more with future wearable devices
Why looking at people? 
How many person-pixels are in the video? 
Movies: 40% 
TV: 35% 
YouTube: 34%
How many person-pixels in our daily life? 
Wearable camera data: Microsoft SenseCam dataset 
~4%
What are the difficulties? 
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing… 
• Manual collection of training samples is prohibitive: many action classes, rare occurrence 
• Action vocabulary is not well-defined (e.g., action "Open": …; action "Hugging": …)
This talk: 
Brief overview of recent techniques 
Weakly-supervised learning from video and scripts 
Weakly-supervised learning with convolutional neural networks
Standard visual recognition pipeline 
Classes: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar 
Collect image/video samples and corresponding class labels 
Design appropriate data representation, with certain invariance properties 
Design / use existing machine learning methods for learning and classification
Bag-of-Features action recognition 
[Laptev, Marszałek, Schmid, Rozenfeld 2008] 
Extraction of local features: space-time patches 
Feature description 
Feature quantization: K-means clustering (k=4000) 
Occurrence histogram of visual words 
Classification: non-linear SVM with χ² kernel 
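To make the pipeline concrete, here is a minimal Python/scikit-learn sketch of the same bag-of-features recipe (our illustration, not the authors' implementation); `train_descs`, `test_descs`, and `y_train` are hypothetical placeholders for pre-extracted space-time descriptors and clip labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def bof_histogram(descriptors, kmeans):
    """Quantize local descriptors and count visual-word occurrences."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalize

# 1) Build a visual vocabulary (the slide uses k=4000; 100 keeps the demo fast).
all_train = np.vstack(train_descs)            # list of (n_i, d) arrays per clip
kmeans = KMeans(n_clusters=100).fit(all_train)

# 2) Represent each clip as an occurrence histogram of visual words.
X_train = np.array([bof_histogram(d, kmeans) for d in train_descs])
X_test = np.array([bof_histogram(d, kmeans) for d in test_descs])

# 3) Non-linear SVM with a chi-squared kernel, passed in as a precomputed Gram matrix.
K_train = chi2_kernel(X_train, X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(chi2_kernel(X_test, X_train))
```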
Action classification 
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”
Where to get training data? 
Shoot actions in the lab 
• KTH dataset, Weizmann dataset, … 
- Limited variability 
- Unrealistic 
Manually annotate existing content 
• HMDB, Olympic Sports, UCF50, UCF101, … 
- Very time-consuming 
Use readily-available video scripts 
• www.dailyscript.com, www.movie-page.com, www.weeklyscript.com 
- Scripts are available for 1000's of hours of movies and TV series 
- Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
Subtitles: 
… 
1172 
01:20:17,240 --> 01:20:20,437 
Why weren't you honest with me? 
Why'd you keep your marriage a secret? 
1173 
01:20:20,640 --> 01:20:23,598 
It wasn't my secret, Richard. 
Victor wanted it that way. 
1174 
01:20:23,800 --> 01:20:26,189 
Not even our closest friends 
knew about our marriage. 
… 
Movie script: 
… 
RICK 
Why weren't you honest with me? Why 
did you keep your marriage a secret? 
Rick sits down with Ilsa. 
ILSA 
Oh, it wasn't my secret, Richard. 
Victor wanted it that way. Not even 
our closest friends knew about our 
marriage. 
… 
(aligned time window: 01:20:17 – 01:20:23) 
Script-based video annotation 
• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, … 
• Subtitles (with time info) are available for most movies 
• Time can be transferred to scripts by text alignment 
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
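As a hedged sketch of the alignment idea (our illustration; `difflib` stands in for the dynamic-programming text alignment used in the paper, and the data structures are illustrative), subtitle timestamps can be transferred to matching script words like this:

```python
from difflib import SequenceMatcher

def align_times(subtitles, script_words):
    """subtitles: list of (start_sec, end_sec, text); returns a time
    estimate for each script word that matches a subtitle word."""
    sub_words, sub_times = [], []
    for start, end, text in subtitles:
        toks = text.lower().split()
        for i, w in enumerate(toks):
            sub_words.append(w)
            # Interpolate a time for each word inside its subtitle window.
            sub_times.append(start + (end - start) * i / max(len(toks), 1))
    times = [None] * len(script_words)
    matcher = SequenceMatcher(a=sub_words, b=[w.lower() for w in script_words])
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.b + k] = sub_times[block.a + k]
    return times

# Subtitle 1172 above: 01:20:17,240 --> 01:20:20,437 in seconds.
subs = [(4817.24, 4820.437, "Why weren't you honest with me?")]
script = "RICK Why weren't you honest with me?".split()
print(align_times(subs, script))  # None for "RICK", interpolated times otherwise
```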
Scripts as weak supervision 
Challenges: 
• Imprecise temporal localization (uncertainty: e.g., somewhere between 24:25 and 24:51) 
• No explicit spatial localization 
• NLP problems, scripts ≠ training labels: 
"… Will gets out of the Chevrolet. …", "… Erin exits her new truck …" vs. Get-out-car
Previous work 
Sivic, Everingham, and Zisserman, "'Who are you?' – Learning Person Specific Classifiers from Video," CVPR 2009. 
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)," CVPR 2009. 
Duchenne, Laptev, Sivic, Bach, and Ponce, "Automatic Annotation of Human Actions in Video," ICCV 2009.
Joint Learning of Actors and Actions 
[Bojanowski et al. ICCV 2013] 
Script: "Rick walks up behind Ilsa" 
Which person track is Rick? Which track is walking?
Formulation: Cost function 
Actor labels: Rick, Ilsa, Sam 
Actor image features 
Actor classifier 
Weak supervision from scripts: 
• Person p appears at least once in clip N (e.g., p = Rick) 
• Action a appears at least once in clip N (e.g., a = Walk) 
• Person p and action a appear together in clip N
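The cost-function formulas on these slides are images in the original deck. As a hedged sketch (our notation, not necessarily the paper's), such "appears at least once" supervision is naturally written as linear constraints on binary assignment variables:

```latex
% Hypothetical notation: z_{tp} = 1 if person track t in clip N is person p,
% y_{ta} = 1 if track t performs action a.
\sum_{t \in N} z_{tp} \;\ge\; 1
  \quad \text{(person $p$ appears at least once in clip $N$),}
\qquad
\sum_{t \in N} z_{tp}\, y_{ta} \;\ge\; 1
  \quad \text{(person $p$ performs action $a$ somewhere in clip $N$).}
```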
Image and video features 
• Face features: facial features [Everingham '06]; HOG descriptor on the normalized face image 
• Action features: Dense Trajectory features in the person bounding box [Wang et al. '11]
Results for Person Labelling 
American Beauty (11 character names) 
Casablanca (17 character names)
Results for Person + Action Labelling 
Casablanca, Walking
Finding Actions and Actors in Movies 
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
Action Learning with Ordering Constraints 
[Bojanowski et al. ECCV 2014]
Cost Function 
Weak supervision from ordering constraints on Z: each video comes with an ordered sequence of action labels (e.g., 2, 4, 1, 2, 3, 2) that must be assigned, in order, to consecutive video time intervals. 
[Figure: action index vs. video time intervals]
Is the optimization tractable? 
• Path constraints are implicit 
• Cannot use off-the-shelf solvers 
• Frank-Wolfe optimization algorithm
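As a reminder of how the conditional-gradient step works, here is a generic Frank-Wolfe sketch (illustrative only; the paper's linear minimization oracle runs over the implicit path constraints, which the unit simplex stands in for here):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, n_iters=100):
    """Minimize a smooth convex f over a convex set C given a linear
    minimization oracle lmo(g) = argmin_{s in C} <g, s>."""
    x = x0.copy()
    for k in range(n_iters):
        g = grad_f(x)
        s = lmo(g)                       # solve the linear subproblem over C
        gamma = 2.0 / (k + 2.0)          # standard step-size schedule
        x = (1 - gamma) * x + gamma * s  # convex combination stays inside C
    return x

# Example: project a point y onto the probability simplex, f(x) = 0.5*||x - y||^2.
y = np.array([0.8, 0.6, -0.2])
grad = lambda x: x - y
def simplex_lmo(g):
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0  # best vertex of the simplex for this gradient
    return s
print(frank_wolfe(grad, simplex_lmo, np.ones(3) / 3))
```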
Results 
• 937 video clips from 60 Hollywood movies 
• 16 action classes 
• Each clip is annotated with a sequence of n actions (2 ≤ n ≤ 11)
Computer Vision
Object recognition
Convolutional Neural Networks 
• The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images 
• The ILSVRC'12 results of Krizhevsky et al. improve on other methods by a large margin 
2012 … 2014: GoogLeNet reaches 6% top-5 error
CNN of Krizhevsky et al., NIPS'12 
• Learns low-level features at the first layer 
• Has some tricks, but the main principle is similar to LeCun '88 
• Has 60M parameters and 650K neurons 
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementations; neither (a) nor (b) was available until very recently
Approach 
1. Design a training/test procedure using sliding windows 
2. Train adaptation layers to map labels (a code sketch follows below) 
See also [Girshick et al. '13], [Donahue et al. '13], [Sermanet et al. '14], [Zeiler and Fergus '13]; the Transfer Learning workshop at ICCV'13 and the ImageNet workshop at ICCV'13
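A minimal PyTorch sketch of step 2, under the assumption that torchvision's AlexNet stands in for the network pretrained on ImageNet (this is our illustration, not the authors' code):

```python
import torch.nn as nn
from torchvision.models import alexnet

# Freeze the source representation learned on ImageNet.
pretrained = alexnet(weights="IMAGENET1K_V1")
for p in pretrained.parameters():
    p.requires_grad = False

# Keep everything up to FC7 (drop the 1000-way ImageNet classifier).
features_to_fc7 = nn.Sequential(
    pretrained.features,
    pretrained.avgpool,
    nn.Flatten(),
    *list(pretrained.classifier.children())[:-1],
)

# Adaptation layers (the FCa/FCb role): the only part trained on target labels.
n_target_classes = 20  # e.g., Pascal VOC
adaptation = nn.Sequential(
    nn.Linear(4096, 2048), nn.ReLU(),
    nn.Linear(2048, n_target_classes),
)
model = nn.Sequential(features_to_fc7, adaptation)
```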
Approach: sliding window training / testing
Results: object localization 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179] 
Vision works? 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
VOC Action Classification Taster Challenge 
Given the bounding box of a person, predict whether they are performing a given action 
Playing instrument? Reading? 
Goal: encourage research on still-image activity recognition and more detailed image understanding
Nine Action Classes 
Phoning 
Playing Instrument 
Reading 
Riding Bike 
Riding Horse 
Running 
Taking Photo 
Using Computer 
Walking
CNN action recognition and localization 
Qualitative results: reading
CNN action recognition and localization 
Qualitative results: phoning
CNN action recognition and localization 
Qualitative results: playing instrument
Results: PASCAL VOC 2012 
Object classification 
Action classification 
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Are bounding boxes needed for training CNNs? 
Image-level labels: Bicycle, Person 
[Oquab, Bottou, Laptev, Sivic, 2014]
Motivation: labeling bounding boxes is tedious
Motivation: image-level labels are plentiful 
“Beautiful red leaves in a back street of Freiburg” 
[Kuznetsova et al., ACL 2013] 
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
Motivation: image-level labels are plentiful 
“Public bikes in Warsaw during night” 
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
Let the algorithm localize the object in the image 
[Oquab, Bottou, Laptev, Sivic, 2014] 
Example training images with bounding boxes 
The locations of objects or their parts are learnt by the CNN 
NB: related to multiple instance learning, e.g. [Viola et al. '05], and weakly supervised object localization, e.g. [Pandey and Lazebnik '11], [Prest et al. '12], [Oh Song et al. ICML '14], …
Approach: search over the object's location 
1. Efficient window sliding to find object location hypotheses 
2. Image-level aggregation (max-pool) 
3. Multi-label loss function (allow multiple objects per image) 
See also [Sermanet et al. '14] and [Chatfield et al. '14]; a code sketch of these three steps follows the architecture figure below. 
[Figure: C1-C5 convolutional layers, FC6 (9216-dim input, 4096-dim output), FC7 (4096-dim), and adaptation layers FCa, FCb produce per-class score maps (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …); a max-pool over the image yields the per-image score.]
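A hedged PyTorch sketch of the three steps (illustrative, not the released implementation): a fully-convolutional backbone plus a 1×1 adaptation layer produce a K-channel score map (efficient window sliding), global max-pooling aggregates it to one score per class, and a per-class log-loss allows several labels per image. `conv_backbone` and its channel count are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 20  # number of target classes (e.g., Pascal VOC)

class WeakLocNet(nn.Module):
    """Fully-convolutional scorer: K score maps, max-pooled to one score per class."""
    def __init__(self, conv_backbone, in_channels=2048):
        super().__init__()
        self.backbone = conv_backbone               # C1..FC7 rewritten as convolutions
        self.fc_adapt = nn.Conv2d(in_channels, K, kernel_size=1)  # FCa/FCb role

    def forward(self, x):
        maps = self.fc_adapt(self.backbone(x))      # (B, K, H', W') score maps
        return maps.amax(dim=(2, 3))                # global max-pool -> (B, K)

def multilabel_log_loss(scores, labels):
    """Sum over classes of log(1 + exp(-y_k * f_k(x))), labels in {+1, -1}.
    softplus(z) = log(1 + e^z), used here for numerical stability."""
    return F.softplus(-labels * scores).sum(dim=1).mean()
```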
1. Efficient window sliding to find object location 
[Figure 2: Network architecture C1-C5, FC6, FC7, FCa, FCb. The layer legend indicates the number of maps, whether a layer performs cross-map normalization (norm), pooling (pool), or dropout, and its subsampling ratio with respect to the input image (1:8 up to 1:32). Convolutional feature extraction layers are trained on 1512 ImageNet classes (Oquab et al., 2014); the adaptation layers are trained on Pascal VOC.]
2. Image-level aggregation using global max-pool 
[Same network figure as above.]
3. Multi-label loss function (to allow for multiple objects in the image) 
[Same network figure as above.]
The loss is a sum of K (= 20) log-loss terms, one per class, computed from the K-vector of network outputs f(x) for image x and the K-vector of (+1, -1) labels y indicating the presence/absence of each class.
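The formula itself is an image in the original slides; assuming the standard per-class log-loss for ±1 labels, it reads:

```latex
\ell\big(f(x),\,y\big) \;=\; \sum_{k=1}^{K} \log\!\left(1 + e^{-y_k f_k(x)}\right),
\qquad y_k \in \{+1, -1\}, \quad K = 20.
```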
Search for objects using max-pooling 
[Figure: per-class score maps (aeroplane map, car map); max-pooling selects the receptive field of the maximum-scoring neuron ("Found something there!").] 
Correct label: increase the score for this class ("Keep up the good work!") 
Incorrect label: decrease the score for this class ("Wrong!")
Search for objects using max-pooling 
What is the effect of errors?
Multi-scale training and testing 
[Figure 3: weakly supervised training. Figure 4: multi-scale object recognition: the input image is rescaled (scales roughly 0.7 to 1.4) and passed through the convolutional adaptation layers, producing per-class score maps (chair, diningtable, sofa, pottedplant, person, car, bus, train, …) at each scale.]
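A short sketch of the multi-scale testing step (our illustration), reusing the `WeakLocNet` model from the earlier sketch and max-pooling class scores across scales as well as locations:

```python
import torch
import torch.nn.functional as F

def multiscale_scores(model, image, scales=(0.7, 1.0, 1.4)):
    """image: (1, 3, H, W) tensor; returns (K,) per-class scores."""
    per_scale = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        per_scale.append(model(resized))   # (1, K) after global max-pool
    # Max over scales, mirroring the max over spatial locations.
    return torch.stack(per_scale).amax(dim=0).squeeze(0)
```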
Training videos
Test results on 80 classes in the Microsoft COCO dataset