Movies Emotions
Ivan Laptev
INRIA Rennes, IRISA
ivan.laptev@inria.fr
Marcin Marszałek
Cordelia Schmid
INRIA Grenoble, LEAR - LJK
marcin.marszalek@inria.fr
cordelia.schmid@inria.fr
Benjamin Rozenfeld
Bar-Ilan University
grurgrur@gmail.com
Abstract
The aim of this paper is to address recognition of natural
human actions in diverse and realistic video settings. This
challenging but important subject has mostly been ignored
in the past due to several problems one of which is the lack
of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the
use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action
retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in
video. We present a new method for video classification
that builds upon and extends several recent ideas including
local space-time features, space-time pyramids and multichannel non-linear SVMs. The method is shown to improve
state-of-the-art results on the standard KTH action dataset
by achieving 91.8% accuracy. Given the inherent problem
of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method
to learning and classifying challenging action classes in
movies and show promising results.
1. Introduction
In the last decade the field of visual recognition has undergone an outstanding evolution, moving from classifying instances of toy objects towards recognizing classes of objects and scenes
in natural images. Much of this progress has been sparked
by the creation of realistic image datasets as well as by the
new, robust methods for image description and classification. We take inspiration from this progress and aim to
transfer previous experience to the domain of video recognition and the recognition of human actions in particular.
Existing datasets for human action recognition (e.g. [15],
see figure 8) provide samples for only a few action classes
recorded in controlled and simplified settings. This stands
in sharp contrast with the demands of real applications focused on natural video, where human actions are subject to individual variations in appearance and motion, camera motion and viewpoint changes, illumination, occlusions, and cluttered backgrounds.
Building on the recent experience with image classification, we employ spatio-temporal features and generalize
spatial pyramids to the spatio-temporal domain. This allows
us to extend the spatio-temporal bag-of-features representation with weak geometry, and to apply kernel-based learning techniques (cf. section 3). We validate our approach
on a standard benchmark [15] and show that it outperforms
the state-of-the-art. We next turn to the problem of action
classification in realistic videos and show promising results
for eight very challenging action classes in movies. Finally,
we present and evaluate a fully automatic setup with action
learning and classification obtained for an automatically labeled training set.
2. Automatic annotation of human actions
Movie scripts have previously been used for the automatic naming of characters in videos by Everingham et al. [4]. Here we extend this idea and apply text-based script search to automatically collect video samples
for human actions.
Automatic annotation of human actions from scripts,
however, is associated with several problems. Firstly,
scripts usually come without time information and have to
be aligned with the video. Secondly, actions described in
scripts do not always correspond with the actions in movies.
Finally, action retrieval has to cope with the substantial variability of action expressions in text. In this section we address these problems in subsections 2.1 and 2.2 and use the
proposed solution to automatically collect annotated video
samples with human actions, see subsection 2.3. The resulting dataset is used to train and to evaluate a visual action
classifier later in section 4.
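The alignment problem mentioned above can be pictured as follows: subtitles carry timestamps and share their dialogue text with the script, so matching dialogue lines transfers time intervals to the script and, by interpolation, to the surrounding scene descriptions. The sketch below is a minimal illustration of this idea; the data structures, the difflib-based matching, and the handling of scene descriptions are our own simplifications, not the procedure used in the paper.

```python
import difflib
from dataclasses import dataclass

@dataclass
class Subtitle:
    """Time-stamped dialogue chunk from the subtitle file (times in seconds)."""
    start: float
    end: float
    text: str

@dataclass
class ScriptLine:
    """A line of the movie script: spoken dialogue or a scene description."""
    kind: str   # "dialogue" or "scene"
    text: str

def align_script(script, subtitles, min_ratio=0.5):
    """Assign a (start, end) interval to each script line.

    Dialogue lines are matched to the most similar subtitle by string
    similarity; scene descriptions then inherit the gap between the
    neighbouring matched dialogues.
    """
    times = [None] * len(script)
    if not subtitles:
        return times
    for i, line in enumerate(script):
        if line.kind != "dialogue":
            continue
        ratios = [difflib.SequenceMatcher(None, line.text.lower(),
                                          s.text.lower()).ratio()
                  for s in subtitles]
        best = max(range(len(subtitles)), key=lambda j: ratios[j])
        if ratios[best] >= min_ratio:
            times[i] = (subtitles[best].start, subtitles[best].end)
    # scene descriptions get the interval between neighbouring matched dialogues
    for i, line in enumerate(script):
        if line.kind == "scene":
            prev_end = max((t[1] for t in times[:i] if t), default=None)
            next_start = min((t[0] for t in times[i + 1:] if t), default=None)
            if prev_end is not None and next_start is not None:
                times[i] = (prev_end, next_start)
    return times
```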
[Figure: alignment of a movie script with subtitles, illustrated with a time-stamped subtitle chunk (1172, 01:20:17,240 --> 01:20:20,437, RICK: "Why weren't you honest with me? Why did you keep your marriage a secret?") matched to the corresponding script dialogue between RICK and ILSA, which transfers the time interval to the script.]
[Plots: precision of script-based annotation as a function of the number of retrieved samples (curves for different values of the parameter a, e.g. a=1.0), and per-class precision/recall curves for AllActions, <AnswerPhone>, <GetOutCar>, <HandShake>, <HugPerson>, <Kiss>, <SitDown>, <SitUp> and <StandUp>. The false positive example shown for <GetOutCar> is the scene caption "[1:13:41 - 1:13:45] A black car pulls up. Two army officers get out."]
Figure 3. Evaluation of script-based action annotation. Left: Precision of action annotation evaluated on visual ground truth. Right:
Example of a visual false positive for get out of a car.
[Table residue: counts of correct, false, and all automatically retrieved action labels per class (<AnswerPhone>, <GetOutCar>, <HandShake>, <HugPerson>, <Kiss>, <SitDown>, <SitUp>, <StandUp>), together with total label and sample counts, for the automatically annotated data and the test set.]
3. Action classification
This section presents our approach for action classification. It builds on existing bag-of-features approaches for
video description [3, 13, 15] and extends recent advances in
static image classification to videos [2, 9, 12]. Lazebnik et
al. [9] showed that a spatial pyramid, i.e., a coarse description of the spatial layout of the scene, improves recognition.
Successful extensions of this idea include the optimization
of weights for the individual pyramid levels [2] and the use
of more general spatial grids [12]. Here we build on these
ideas and go a step further by building space-time grids.
The details of our approach are described in the following.
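As a concrete illustration of such space-time grids, the sketch below (our own, not the authors' implementation) assigns quantized local space-time features to the cells of a spatial grid combined with a temporal subdivision and concatenates the per-cell visual-word histograms into one video descriptor; the grid orientation and the L1 normalization are our assumptions.

```python
import numpy as np

def spacetime_bof(features, vocab_size, video_shape, grid=(3, 1), t_bins=3):
    """Spatio-temporal bag-of-features over a space-time grid.

    features    : array of shape (N, 4) with columns (x, y, t, visual_word),
                  where visual_word indexes a vocabulary of size vocab_size
    video_shape : (width, height, length) of the video volume
    grid, t_bins: spatial and temporal subdivisions, e.g. a 3x1 grid with
                  3 temporal bins
    Returns the concatenated, L1-normalized per-cell histograms.
    """
    gx, gy = grid
    w, h, length = video_shape
    hist = np.zeros((gy, gx, t_bins, vocab_size))
    for x, y, t, word in features:
        cx = min(int(gx * x / w), gx - 1)               # spatial cell (x)
        cy = min(int(gy * y / h), gy - 1)               # spatial cell (y)
        ct = min(int(t_bins * t / length), t_bins - 1)  # temporal bin
        hist[cy, cx, ct, int(word)] += 1
    hist = hist.reshape(-1)
    return hist / max(hist.sum(), 1.0)
```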
[Figure: examples of spatio-temporal grids over the spatial (x) and temporal (t) extent of a video, e.g. 1x1 t1, 1x1 t2, h3x1 t1 and o2x2 t1.]

For a given channel, two histograms H_i = {h_{in}} and H_j = {h_{jn}} with V bins are compared with the chi-square distance

    D(H_i, H_j) = \frac{1}{2} \sum_{n=1}^{V} \frac{(h_{in} - h_{jn})^2}{h_{in} + h_{jn}}    (2)
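A minimal sketch of how this distance can be used in a multi-channel kernel for a non-linear SVM is given below. The exponential combination of per-channel chi-square distances and the normalization of each channel by, e.g., its mean training distance are common choices and are assumptions here, since the corresponding kernel formula is not reproduced above; the function names are ours.

```python
import numpy as np

def chi2_distance(hi, hj, eps=1e-10):
    """Chi-square distance between two histograms, as in eq. (2)."""
    return 0.5 * np.sum((hi - hj) ** 2 / (hi + hj + eps))

def multichannel_kernel(channels_a, channels_b, scales):
    """Exponential combination of per-channel chi-square distances.

    channels_a, channels_b : dicts mapping a channel name to its histogram
    scales                 : per-channel normalization constants, e.g. the
                             mean chi-square distance over the training set
                             (this normalization is an assumption here)
    """
    d = sum(chi2_distance(channels_a[c], channels_b[c]) / scales[c]
            for c in channels_a)
    return np.exp(-d)
```

A full Gram matrix built with this function can then be passed to an SVM that accepts a precomputed kernel, e.g. sklearn.svm.SVC(kernel='precomputed').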
4. Experimental results
In this section we evaluate whether spatio-temporal grids improve the classification accuracy and which grids perform best in our context. Previous results for static image classification have shown that the best combination depends on the class as well as on the dataset [9, 12]. The approach we take here is to select the overall most successful channels and then to choose the most successful combination for each class individually.
As some grids may not perform well by themselves,
but contribute within a combination [20], we search for
the most successful combination of channels (descriptor &
spatio-temporal grid) for each action class with a greedy
approach. To avoid tuning to a particular dataset, we find
the best spatio-temporal channels for both the KTH action
dataset and our manually labeled movie dataset. The experimental setup and evaluation criteria for these two datasets
are presented in sections 4.2 and 4.4. We refer the reader to
these sections for details.
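The greedy search can be pictured as a simple forward-selection loop: starting from an empty combination, the channel that most improves a validation score is added until no further channel helps. The sketch below leaves the scoring function abstract (e.g. cross-validated accuracy or average precision computed with the multi-channel kernel); the loop itself is a generic sketch, not the authors' exact procedure.

```python
def greedy_channel_selection(all_channels, score):
    """Greedy forward selection of a channel combination.

    all_channels : candidate channels, e.g. [("hog", "h3x1", "t3"), ...]
    score        : callable mapping a list of channels to a validation
                   score (accuracy or average precision); left abstract here
    """
    selected, best = [], float("-inf")
    while True:
        remaining = [c for c in all_channels if c not in selected]
        if not remaining:
            break
        gains = {c: score(selected + [c]) for c in remaining}
        best_c = max(gains, key=gains.get)
        if gains[best_c] <= best:
            break          # no remaining channel improves the combination
        selected.append(best_c)
        best = gains[best_c]
    return selected, best
```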
Figure 7 shows the number of occurrences for each of
our channel components in the optimized channel combinations for KTH and movie actions. We can see that HoG
descriptors are chosen more frequently than HoFs, but both
are used in many channels. Among the spatial grids the
horizontal 3x1 partitioning turns out to be most successful.
The traditional 1x1 grid and the center-focused o2x2 also perform very well. The 2x2, 3x3 and v1x3 grids occur less often and are dropped in the following. They are either redundant (2x2), too dense (3x3), or do not fit the geometry of natural scenes (v1x3). For temporal binning, no temporal subdivision of the sequence (t1) shows the best results, but t3 and t2 also perform very well and complement t1. The ot2 binning turns out to be rarely used in practice, as it often duplicates t2, and we drop it from further experiments.
Table 2 presents for each dataset/action the performance
of the standard bag-of-features with HoG and HoF descriptors, of the best channel as well as of the best combination
of channels found with our greedy search. We can observe
that the spatio-temporal grids give a significant gain over the
standard BoF methods. Moreover, combining two to three channels further improves the results.
[Figure 7: number of occurrences of each channel component (hog, hof descriptors; 1x1, h3x1, o2x2, 2x2, 3x3, v1x3 spatial grids; t1, t2, t3, ot2 temporal binnings) in the optimized channel combinations for KTH actions and for movie actions.]
Task                | HoG BoF | HoF BoF | Best channel | Best combination
KTH multi-class     | 81.6%   | 89.7%   | …            | 91.8% (…, hog 1 t3)
Action AnswerPhone  | 13.4%   | 24.6%   | …            | …
Action GetOutCar    | 21.9%   | 14.9%   | …            | …
Action HandShake    | 18.6%   | 12.1%   | …            | …
Action HugPerson    | 29.1%   | 17.4%   | …            | …
Action Kiss         | 52.0%   | 36.5%   | …            | …
Action SitDown      | 29.1%   | 20.7%   | …            | …
Action SitUp        |  6.5%   |  5.7%   | …            | …
Action StandUp      | 45.4%   | 40.0%   | …            | …
Table 2. Classification performance of different channels and their combinations. For the KTH dataset the average class accuracy is
reported, whereas for our manually cleaned movie dataset the per-class average precision (AP) is given.
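Since performance on the movie data is reported as per-class average precision over a ranked list of test samples, a short sketch of the metric may be useful. This is the standard non-interpolated AP; whether an interpolated variant is used by the authors is not specified above.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated average precision of a ranked list.

    scores : classifier confidence for each test sample
    labels : binary ground truth (1 = positive sample)
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                      # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    if labels.sum() == 0:
        return 0.0
    return float(np.mean(hits[labels == 1] / ranks[labels == 1]))
```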
[Comparison of average class accuracy on the KTH dataset:]
Method   | Schuldt et al. [15] | Niebles et al. [13] | Wong et al. [18] | ours
Accuracy | 71.7%               | 81.5%               | 86.7%            | 91.8%
Figure 8. Sample frames from the KTH actions sequences. All six
classes (columns) and scenarios (rows) are presented.
[Confusion matrix for the six KTH action classes (rows: ground truth, columns: classification results):]
          Walking  Jogging  Running  Boxing  Waving  Clapping
Walking     .99      .01      .00     .00     .00      .00
Jogging     .04      .89      .07     .00     .00      .00
Running     .01      .19      .80     .00     .00      .00
Boxing      .00      .00      .00     .97     .00      .03
Waving      .00      .00      .00     .00     .91      .09
Clapping    .00      .00      .00     .05     .00      .95
Action       | Clean | Automatic | Chance
AnswerPhone  | 32.1% | 16.4%     | 10.6%
GetOutCar    | 41.5% | 16.4%     |  6.0%
HandShake    | 32.3% |  9.9%     |  8.8%
HugPerson    | 40.6% | 26.8%     | 10.1%
Kiss         | 53.3% | 45.1%     | 23.5%
SitDown      | 38.6% | 24.8%     | 13.8%
SitUp        | 18.2% | 10.4%     |  4.6%
StandUp      | 50.5% | 33.6%     | 22.6%
Table 5. Average precision (AP) for each action class of our test
set. We compare results for clean (annotated) and automatic training data. We also show results for a random classifier (chance).
[Figure: classification performance for KTH actions as a function of the proportion of wrong labels in the training set (both axes from 0 to 1).]
[Figure 10 panels are organized by action class (AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp) and by outcome (TP, FP, TN, FN).]
Figure 10. Example results for action classification trained on the automatically annotated data. We show the key frames for test movies
with the highest confidence values for true/false positives/negatives.
5. Conclusion
Our script-based approach to automatic annotation of human actions achieves 60% precision and scales easily to a large number of action classes. It also provides a convenient semi-automatic tool for generating action samples with manual annotation. Our method for action classification extends recent successful image recognition methods to the spatio-temporal domain and achieves the best recognition performance to date on a standard benchmark [15]. Furthermore, it demonstrates high tolerance to noisy labels in the training set and is therefore appropriate for action learning in automatic settings. We demonstrate promising recognition results for eight difficult and realistic action classes in movies.
Future work includes improving the script-to-video
alignment and extending the video collection to a much
larger dataset. We also plan to improve the robustness of our
classifier to noisy training labels based on an iterative learning approach. Furthermore, we plan to experiment with a
larger variety of space-time low-level features. In the long term we plan to move away from bag-of-features-based representations by introducing detector-style action classifiers.
Acknowledgments. M. Marszałek is supported by the European Community under the Marie-Curie project VISITOR. This work was supported by the European research project CLASS. We would like to thank J. Ponce and A. Zisserman for discussions.
References
[1] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.
[2] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatial pyramid kernel. In CIVR, 2007.
[3] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.