HMDB: A Large Video Database For Human Motion Recognition
… identifiable through most of the clip, limited motion blur and limited compression artifacts; 2) Medium – large body parts like the upper and lower arms and legs identifiable through most of the clip; 3) Low – large body parts not identifiable due in part to the presence of motion blur and compression artifacts. The distribution of the meta tags for the entire dataset is shown in Figure 2.

Figure 2. Distribution of the meta tags over the entire dataset: a) visible body parts: full (56.3%), upper (30.5%), lower (0.8%), head (12.3%); b) camera motion (59.9%) vs. no motion (40.1%); c) viewpoint: front (40.8%), back (18.2%), left (22.1%), right (19.0%); d) clip quality: low (20.8%), medium (62.1%), good (17.1%).
To do this, a background plane is estimated by detecting and matching salient features in two adjacent frames. Corresponding features are computed using a distance measure that includes both the absolute pixel differences and the Euclidean distance of the detected points. Points with a minimum distance are then matched and the RANSAC algorithm is used to estimate the geometric transformation between all neighboring frames. This is done independently for every pair of frames. Using this estimated transformation, all frames of the clip are warped and combined to achieve a stabilized clip. We visually inspected a large number of the resulting stabilized clips and found that the image stitching techniques work surprisingly well. Figure 3 shows an example. For the evaluation of the action recognition systems, the performance was reported for the original clips as well as the stabilized clips.

Figure 3. Example of a clip stabilized over 50 frames showing, from top to bottom, the 1st, 30th and 50th frame of the original (left column) and stabilized clip (right column).
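In outline, this stabilization step can be sketched as follows. This is a minimal illustration using OpenCV, not the authors' implementation; the corner detector and the use of Lucas-Kanade tracking as the matching step (in place of the pixel-difference and point-distance measure described above) are assumptions.

import cv2
import numpy as np

def stabilize_clip(frames):
    """Rough sketch: match salient points between adjacent frames, fit a
    geometric transform with RANSAC, and warp every frame into the
    coordinate frame of the first one."""
    h, w = frames[0].shape[:2]
    stabilized = [frames[0]]
    cumulative = np.eye(3)  # accumulated current-frame -> first-frame transform
    for prev, cur in zip(frames[:-1], frames[1:]):
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        cur_gray = cv2.cvtColor(cur, cv2.COLOR_BGR2GRAY)
        # Salient features in the previous frame (detector choice is an assumption).
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=8)
        # Correspondences via pyramidal Lucas-Kanade tracking (stand-in for the
        # matching criterion described in the text).
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
        good = status.ravel() == 1
        # RANSAC fit of the geometric transform between neighboring frames
        # (maps current-frame coordinates to previous-frame coordinates).
        M, _ = cv2.estimateAffinePartial2D(nxt[good], pts[good], method=cv2.RANSAC)
        cumulative = cumulative @ np.vstack([M, [0, 0, 1]])
        stabilized.append(cv2.warpAffine(cur, cumulative[:2], (w, h)))
    return stabilized

Chaining the per-pair transforms, as above, corresponds to estimating the transformation independently for every pair of frames and then compositing the warped frames into one stabilized clip.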
3. Comparison with other action datasets

To compare the proposed HMDB51 with existing real-world action datasets such as Hollywood, Hollywood2, UCF Sports, and the UCF YouTube dataset, we evaluate the discriminative power of various low-level features. For an ideal unbiased action dataset, low-level features such as color should not be predictive of the high-level action category. For low-level features we considered the mean color in the HSV color space computed for each frame over a 12 × 16 spatial grid, as well as the combination of color and gray value and the use of PCA to reduce the feature dimension of those descriptors. Here we report the results for "color + gray + PCA".
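Concretely, the color descriptor amounts to something like the following sketch (our own illustration; the OpenCV HSV convention, the per-frame sampling, and the 12-row-by-16-column grid orientation are assumptions, and the gray-value variant and the PCA step are omitted):

import cv2
import numpy as np

def color_grid_descriptor(frame, grid=(12, 16)):
    """Mean HSV color over a 12 x 16 spatial grid for one frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
    rows, cols = grid
    h, w = hsv.shape[:2]
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell = hsv[r * h // rows:(r + 1) * h // rows,
                       c * w // cols:(c + 1) * w // cols]
            cells.append(cell.reshape(-1, 3).mean(axis=0))  # mean H, S, V per cell
    return np.concatenate(cells)  # 12 * 16 * 3 = 576-dimensional vector

A clip-level descriptor can then be obtained by pooling these per-frame vectors over the sampled frames and, as described above, reducing the dimensionality with PCA; the pooling choice is not specified here.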
We further considered the low-level global scene information (gist) [15] computed for three frames of a clip. Gist is a coarse orientation-based representation of an image that has been shown to capture well the contextual information in a scene and to perform quite well on a variety of recognition tasks; see [15]. We used the source code provided by the authors.

Lastly, we compare these low-level cues with a common mid-level spatio-temporal bag-of-words cue (HOG/HOF) by computing spatio-temporal interest points for all clips. A standard bag-of-words approach with 2,000, 3,000, 4,000, and 5,000 visual words was used for classification and the best result is reported. For evaluation we used the testing and training splits that came with the datasets; otherwise a 3- or 5-fold cross validation was used for datasets without specified splits. Table 2 shows the results sorted by the number of classes (N) in each dataset. Percent drop is computed as the relative drop in performance from the HOG/HOF features to each of the two types of low-level features. A small percent drop means that the low-level features perform as well as the mid-level motion features.

Table 2. The recognition accuracy of low-level color/gist cues for different action datasets.

Dataset       N    Color+Gray+PCA   Percent drop   Gist    Percent drop   HOG/HOF
Hollywood     8    26.9%            16.7%          27.4%   15.2%          32.3%
UCF Sports    9    47.7%            18.6%          60.0%   -2.4%          58.6%
UCF YouTube   11   38.3%            35.0%          53.8%    8.7%          58.9%
Hollywood2    12   16.2%            68.7%          21.8%   57.8%          51.7%
UCF50         50   41.3%            13.8%          38.8%   19.0%          47.9%
HMDB51        51    8.8%            56.4%          13.4%   33.7%          20.2%
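The percent-drop entries in Table 2 are consistent with defining the drop as the relative loss incurred when replacing the HOG/HOF accuracy by the accuracy of the low-level cue, for example:

def percent_drop(hog_hof_acc, low_level_acc):
    """Relative drop from the HOG/HOF accuracy to a low-level cue's accuracy."""
    return 100.0 * (hog_hof_acc - low_level_acc) / hog_hof_acc

print(round(percent_drop(32.3, 26.9), 1))   # 16.7, the Hollywood color entry
print(round(percent_drop(58.6, 60.0), 1))   # -2.4, gist beats HOG/HOF on UCF Sports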
Results obtained by classifying these very simple features show that the UCF Sports dataset can be classified by scene descriptors rather than by action descriptors, as gist is more predictive than mid-level spatio-temporal features. We conjecture that gist features are predictive of the sports actions (i.e., UCF Sports) because most sports are location-specific. For example, ball games usually occur on grass fields, swimming is always in water, and most skiing happens on snow. The results also reveal that low-level features are fairly predictive compared to mid-level features for the UCF YouTube and UCF50 datasets. This might be due to low-level biases for videos on YouTube, e.g., preferred vantage points and camera positions of amateur directors. For the datasets collected from general movies or Hollywood movies, the performance of the various low-level cues is on average lower than that of the mid-level spatio-temporal features. This implies that the datasets collected from YouTube tend to be biased and capture only a small range of colors and scenes across action categories compared to those collected from movies. The similar performance using low-level and mid-level features for the Hollywood dataset is likely due to the low number of source movies (12): clips extracted from the same movie usually have similar scenes.
4. Benchmark systems

To evaluate the discriminability of our 51 action categories we focus on the class of algorithms for action recognition based on the extraction of local space-time information from videos, which have become the dominant trend in the past five years [24]. The various local space-time based approaches mainly differ in the type of detector (e.g., the implementation of the spatio-temporal filters), the feature descriptors, and the number of spatio-temporal points sampled (dense vs. sparse). Wang et al. have grouped these detectors and descriptors into six types and evaluated their performance on the KTH, UCF Sports and Hollywood2 datasets in a common experimental setup [24].

The results have shown that Laptev's combination of a histogram of oriented gradient (HOG) and histogram of oriented flow (HOF) descriptors performed best for Hollywood2 and UCF Sports. As HMDB51 contains movies and YouTube videos, these datasets are considered the most similar in terms of video sources. Therefore, we selected the algorithm by Laptev and colleagues [11] as one of our benchmarks. To expand beyond [24], we chose as our second benchmark an approach developed by our group [21, 8], which uses a hierarchical architecture modeled after the ventral and dorsal streams of the primate visual cortex for the tasks of object and action recognition, respectively.

In the following we provide a detailed comparison between these algorithms, looking in particular at the robustness of the two approaches with respect to various nuisance factors including the quality of the video and the camera motion, as well as changes in the position, scale and viewpoint of the main actors.

4.1. HOG/HOF features

The combination of HOG, which has been used for the recognition of objects and scenes, and HOF, a 3D flow-based version of HOG, has been shown to achieve state-of-the-art performance on several commonly used action datasets [11, 24]. We used the binaries provided by [11] to extract features using Harris3D as the feature detector and the HOG/HOF feature descriptors. For every clip a set of 3D Harris corners is detected and a local descriptor is computed as a concatenation of the HOG and HOF around the corner.

For classification, we implemented a bag-of-words system as described in [11]. To evaluate the best codebook size, we sampled 100,000 space-time interest-point descriptors from the training set and applied k-means clustering to obtain a set of k = [2,000, 4,000, 6,000, 8,000] visual words. For every clip, each of the local point descriptors is matched to the nearest prototype returned by k-means clustering and a global feature descriptor is obtained by computing a histogram over the indices of the matched codebook entries. This results in a k-dimensional feature vector, where k is the number of visual words learned by k-means. These clip descriptors are then used to train and test a support vector machine (SVM) in the classification stage.

We used an SVM with an RBF kernel K(u, v) = exp(−γ ||u − v||²). The parameters of the RBF kernel (the cost term C and the kernel bandwidth γ) were optimized using a greedy search with 5-fold cross-validation on the training set.

The best result for the original clips was reached for k = 8,000 whereas the best result for the stabilized clips was reached for k = 2,000 (see Section 5.1). To validate our re-implementation of Laptev's system, we evaluated its performance on the KTH dataset and were able to reproduce the results reported in [24] for the HOG (81.4%) and HOF (90.7%) descriptors.
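A simplified stand-in for this classification stage, written with scikit-learn rather than the original binaries and scripts, might look as follows. The mini-batch variant of k-means, the particular grid of C and γ values, and a plain grid search in place of the greedy search are all assumptions of this sketch.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def build_codebook(train_descriptors, k=2000, n_samples=100000):
    """Cluster a sample of interest-point descriptors (an (N, d) array)
    into k visual words."""
    n = min(n_samples, len(train_descriptors))
    idx = np.random.choice(len(train_descriptors), n, replace=False)
    return MiniBatchKMeans(n_clusters=k).fit(train_descriptors[idx])

def bag_of_words(clip_descriptors, codebook):
    """Histogram over the indices of the nearest codebook entries (one vector per clip)."""
    words = codebook.predict(clip_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_classifier(clip_histograms, labels):
    """RBF-kernel SVM, K(u, v) = exp(-gamma * ||u - v||^2), with C and gamma
    chosen by cross-validated search (the grid values are illustrative)."""
    grid = {"C": [1, 10, 100, 1000], "gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    return search.fit(np.vstack(clip_histograms), labels)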
4.2. C2 features

Two types of C2 features have been described in the literature. One is from a model that was designed to mimic the hierarchical organization and functions of the ventral stream of the visual cortex [21]. The ventral stream is believed to be critically involved in the processing of shape information and in scale- and position-invariant object recognition. The model starts with a pyramid of Gabor filters (S1 units at different orientations and scales), which correspond to simple cells in the primary visual cortex. The next layer (C1) models the complex cells in the primary visual cortex by pooling together the activity of S1 units in a local spatial region and across scales to build some tolerance to 2D transformations (translation and size) of the inputs.

The third-layer (S2) responses are computed by matching the C1 inputs with a dictionary of n prototypes learned from a set of training images. As opposed to the bag-of-words approach, which uses vector quantization and summarizes the indices of the matched codebook entries, we retain the similarity (ranging from 0 to 1) with each of the n prototypes. In the top layer of the feature hierarchy, an n-dimensional C2 vector is obtained for each image by pooling the maximum of the S2 responses across scales and positions for each of the n prototypes. The C2 features have been shown to perform comparably to state-of-the-art algorithms applied to the problem of object recognition [21]. They have also been shown to account well for the properties of cells in the inferotemporal cortex (IT), which is the highest purely visual area in the primate brain.

Based on the work described above, Jhuang et al. [8] proposed a model of the dorsal stream of the visual cortex. The dorsal stream is thought to be critically involved in the processing of motion information and the perception of motion. The model starts with spatio-temporal Gabor filters that mimic the direction-sensitive simple cells in the primary visual cortex.
The dorsal stream model is a 3D (space-time) extension of the ventral stream model. The S1 units in the ventral stream model are tuned to orientations in space, whereas the S1 units of the dorsal stream model are tuned to directions of motion in space-time. It has been suggested that motion-direction sensitive cells in the primary visual cortex provide the input for two parallel channels of feature processing, one for motion in the dorsal stream and another for shape in the ventral stream.

Beyond the S1 layer, the dorsal stream model follows the same architecture as the ventral stream model. It contains the same sequence of layers as its ventral stream counterpart. The S2 units in the dorsal stream respond to combinations of directions of motion, whereas the ventral S2 units are tuned to shape patterns corresponding to combinations of orientations.
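Stripped of the Gabor-based S1/C1 stages and of the multi-scale details, the S2/C2 computation shared by both streams reduces to template matching against a dictionary of stored prototypes followed by a global maximum over positions and scales, roughly as in the following schematic sketch (the Gaussian similarity function and its bandwidth are assumptions, and the loop structure is deliberately naive):

import numpy as np

def c2_features(c1_maps, prototypes, sigma=1.0):
    """Schematic S2/C2 stage: compare C1 patches with each stored prototype
    (similarity in [0, 1]) and keep, per prototype, the maximum response
    over all positions and scale bands."""
    n = len(prototypes)
    c2 = np.full(n, -np.inf)
    for c1 in c1_maps:                        # one C1 map per scale band, shape (H, W, D)
        H, W, D = c1.shape
        for i, proto in enumerate(prototypes):
            ph, pw, _ = proto.shape
            for y in range(H - ph + 1):       # slide the prototype over the map
                for x in range(W - pw + 1):
                    patch = c1[y:y + ph, x:x + pw]
                    dist2 = np.sum((patch - proto) ** 2)
                    s2 = np.exp(-dist2 / (2 * sigma ** 2))   # S2: similarity response
                    c2[i] = max(c2[i], s2)                   # C2: global max pooling
    return c2

In the dorsal stream case the feature maps and prototypes would be spatio-temporal rather than purely spatial, but the matching-and-max-pooling structure sketched here is the same.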
5. Evaluation

5.1. Overall performance

The confusion matrix for both systems on the original clips is shown in Figure 4; errors are spread across category labels with no apparent trends. The most surprising result is that the performance of the two systems improved only marginally after stabilization for camera motion (Table 3).

Figure 4. Confusion matrices for the HOG/HOF (top) and the C2 (bottom) features on the set of original (not stabilized) clips.
As the recognition results for both systems appear relatively low compared to previously published results on other datasets [8, 11, 24], we conducted a simple experiment to find out whether this decrease in performance simply results from an increase in the number of action categories, and a corresponding decrease in chance-level recognition, or from an actual increase in the complexity of the dataset due, for instance, to the presence of complex background clutter and more intra-class variation. We selected 10 common actions in the HMDB51 that were similar to action categories in the UCF50 and compared the recognition performance of the HOG/HOF system on video clips from the two datasets. The following is the list of matched categories: basketball / shoot ball, biking / ride bike, diving / dive, fencing / stab, golf swing / golf, horse riding / ride horse, pull ups / pull-up, push-ups / push-up, rock climbing indoor / climb, as well as walking with dog / walk.

Overall, we found a mild drop in performance from 66.3% accuracy on the UCF50 down to 54.3% for the similar categories on the HMDB51 (chance level 10% for both sets). These results are also comparable to the performance of the same HOG/HOF system on similarly sized datasets of different actions, with 51.7% over 12 categories of the Hollywood2 dataset and 58.9% over 11 categories of the UCF YouTube dataset, as shown in Table 2. These results suggest that the relatively low performance of the benchmarks on the proposed HMDB51 is most likely a consequence of the increase in the number of action categories compared to older datasets.
Table 3. Performance of the benchmark systems on the HMDB51.

System     Original clips   Stabilized clips
HOG/HOF    20.44%           21.96%
C2         22.83%           23.18%

Table 4. Mean recognition performance as a function of camera motion and clip quality.

           Camera motion           Quality
           yes        no           low       med       high
HOG/HOF    19.84%     19.99%       17.18%    18.68%    27.90%
C2         25.20%     19.13%       17.54%    23.10%    28.62%

5.2. Robustness of the benchmarks

In order to assess the relative strengths and weaknesses of the two benchmark systems on the HMDB51 in the context of various nuisance factors, we broke down their performance in terms of 1) the visible body parts or, equivalently, the presence/absence of occlusions, 2) the presence/absence of camera motion, 3) the viewpoint/camera position, and 4) the quality of the video clips. We found that the presence/absence of occlusions and the camera position did not seem to influence performance. A major factor for the performance of the two systems was the clip quality. As shown in Table 4, from high- to low-quality videos the two systems registered a drop in performance of about 10% (from 27.90%/28.62% for the HOG/HOF and C2 features on the high-quality clips down to 17.18%/17.54% on the low-quality clips).

A factor that affected the two systems differently was camera motion: whereas the HOG/HOF performance was stable with the presence or absence of camera motion, surprisingly, the performance of the C2 features actually improved with the presence of camera motion. We suspect that camera motion might actually increase the response of the low-level S1 motion detectors. An alternative explanation is that the camera motion by itself might be correlated with the action category. To evaluate whether camera motion alone can be predictive of the action category, we tried to classify the mean parameters of the estimated frame-by-frame motion returned by the video stabilization algorithm. The resulting recognition rate of 5.29% shows that, at least in this case, camera motion alone does not provide significant information.

To further investigate how the various nuisance factors may affect the recognition performance of the two systems, we conducted a logistic regression analysis to predict whether each of the two systems will be correct vs. incorrect under specific conditions. The logistic regression model was built as follows: the correctness of the predicted label was used as the binary dependent variable; the camera viewpoints were split into one group for front and back views (because of their similar appearance; front, back = 0) and another group for side views (left, right = 1); the occlusion condition was split into full body view (= 0) and occluded views (head, upper or lower body only = 1); and the video quality label was converted into binary variables, where the codes 10, 01 and 00 correspond to a high-, medium-, and low-quality video, respectively.
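In outline, such an analysis can be run as follows. This is a sketch using statsmodels; the coding of the predictors follows the description above, but the data layout and the choice of software are assumptions, and the coefficient estimates, p-values and odds ratios it returns correspond to the columns reported in Table 5.

import numpy as np
import statsmodels.api as sm

def fit_nuisance_model(meta, correct):
    """Logistic regression of per-clip correctness on binary nuisance factors.
    `meta` holds one row per test clip with the binary codings described above:
    occluded, camera_motion, side_view, med_quality, high_quality;
    `correct` is 1 if the clip was classified correctly, else 0."""
    X = sm.add_constant(np.asarray(meta, dtype=float))   # intercept + 5 predictors
    model = sm.Logit(np.asarray(correct, dtype=float), X).fit(disp=0)
    odds_ratios = np.exp(model.params)   # e.g., exp(0.65) is roughly the 1.91 in Table 5
    return model.params, model.pvalues, odds_ratios

The odds ratios are simply exp(β), so a coefficient near 0.65 for the high-quality indicator translates into roughly 1.9 times higher odds of a correct prediction relative to the low-quality baseline.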
Table 5. Results of the logistic regression analysis on the key factors influencing the performance of the two systems.

HOG/HOF
Coefficient      Coef. est. β   p       Odds ratio
Intercept        -1.60          0.000   0.20
Occluders         0.07          0.427   1.06
Camera motion    -0.12          0.132   0.88
View point        0.09          0.267   1.09
Med. quality      0.11          0.254   1.12
High quality      0.65          0.000   1.91

C2
Coefficient      Coef. est. β   p       Odds ratio
Intercept        -1.52          0.000   0.22
Occluders        -0.22          0.007   0.81
Camera motion    -0.43          0.000   0.65
View point        0.19          0.009   1.21
Med. quality      0.47          0.000   1.60
High quality      0.97          0.000   2.65

The estimated β coefficients for the two systems are shown in Table 5. The largest factor influencing performance for both systems remained the quality of the video clips: on average, the systems were predicted to be nearly twice as likely to be correct on high- vs. medium-quality videos. This is by far the strongest factor. However, the regression analysis also confirmed the assumption that camera motion improves classification performance; consistent with the previous analysis based on error rates, this trend is only significant for the C2 features. The additional factors, occlusion and camera viewpoint, did not have a significant influence on the results of either the HOG/HOF or the C2 approach.

5.3. Shape vs. motion information

The role of shape vs. motion cues in the recognition of biological motion has been the subject of an intense debate. Computer vision could provide critical insight into this question, as various approaches have been proposed that rely not just on motion cues, like the two systems we have tested, but also on single-frame shape-based cues, such as posture [18] and shape [19], and on contextual information [13, 28].

Here we study the relative contributions of shape vs. motion cues for the recognition of actions on the HMDB51. We compared the HOG/HOF descriptor with a shape-only HOG descriptor and a motion-only HOF descriptor. We also compared the performance of the previously mentioned motion-based C2 features to that of shape-based C2 features. Table 6 shows the performance of the various descriptors.

In general we find that shape cues alone perform much worse than motion cues alone, and that their combination tends to improve recognition performance only moderately. This combination seems to benefit the recognition of the original clips rather than that of the stabilized clips.
Table 6. Average performance for shape vs. motion cues.

HOG/HOF      HOG+HOF        HOG      HOF
Original     20.44%         15.01%   17.95%
Stabilized   21.96%         15.47%   22.48%

C2           Motion+Shape   Shape    Motion
Original     22.83%         13.40%   21.96%
Stabilized   23.18%         13.44%   22.73%

An earlier study [19] suggested that "local shape and flow for a single frame is enough to recognize actions". Our results suggest that this statement might be true for simple actions, as is the case for the KTH dataset, but that motion cues do seem to be more powerful than shape cues for the recognition of complex actions like the ones in the HMDB51.

6. Conclusion

We described an effort to advance the field of action recognition with the design of what is, to our knowledge, currently the largest action dataset. With 51 action categories and just under 7,000 video clips, the proposed HMDB51 is still far from capturing the richness and the full complexity of video clips commonly found in movies or online videos. However, given the level of performance of representative state-of-the-art computer vision algorithms, with an accuracy of about 23%, this dataset is arguably a good place to start (performance on the Caltech-101 database for object recognition started around 16% [6]). Furthermore, our exhaustive evaluation of two state-of-the-art systems suggests that performance is not significantly affected by a range of factors such as camera position and motion as well as occlusions. This suggests that current methods are fairly robust with respect to these low-level video degradations but remain limited in their representational power to capture the complexity of human actions.

Acknowledgements

This paper describes research done in part at the Center for Biological & Computational Learning, affiliated with MIBR, BCS, CSAIL at MIT. This research was sponsored by grants from DARPA (IPTO and DSO), NSF (NSF-0640097, NSF-0827427), and AFOSR-THRL (FA8650-05-C-7262). Additional support was provided by Adobe, a King Abdullah University of Science and Technology grant to B. DeVore, NEC, Sony, and the Eugene McDermott Foundation. This work was also done at and supported by Brown University, the Center for Computation and Visualization, and the Robert J. and Nancy D. Carney Fund for Scientific Innovation, by DARPA (DARPA-BAA-09-31), and by ONR (ONR-BAA-11-001). H.K. was supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg, Germany.

References

[1] http://serre-lab.clps.brown.edu/resources/HMDB/.
[2] http://server.cs.ucf.edu/~vision/data.html.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. ICCV, 2005.
[4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) results. http://www.pascal-network.org/challenges/voc/voc2010/workshop/index.html.
[6] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVPR Workshop on Generative-Model Based Vision, 2004.
[7] H. Jhuang, E. Garrote, J. Mutch, X. Yu, V. Khilnani, T. Poggio, A. D. Steele, and T. Serre. Automated home-cage behavioral phenotyping of mice. Nature Communications, 1(5):1–9, 2010.
[8] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. ICCV, 2007.
[9] G. Johansson, S. Bergström, and W. Epstein. Perceiving Events and Objects. Lawrence Erlbaum Associates, 1994.
[10] I. Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2-3):107–123, 2005.
[11] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. CVPR, 2008.
[12] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". CVPR, 2009.
[13] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. CVPR, 2009.
[14] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. ECCV, 2010.
[15] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42:145–175, 2001.
[16] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. CVPR, 2008.
[17] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision, 77(1):157–173, 2008.
[18] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. CVPR, 2011.
[19] K. Schindler and L. Van Gool. Action snippets: How many frames does human action recognition require? CVPR, 2008.
[20] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. ICPR, 2004.
[21] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
[22] M. Thirkettle, C. Benton, and N. Scott-Samuel. Contributions of form, motion and task to biological motion perception. Journal of Vision, 9(3):1–11, 2009.
[23] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.
[24] H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. BMVC, 2009.
[25] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. ICCV, 2007.
[26] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Und., 115(2):224–241, 2010.
[27] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR, 2010.
[28] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. CVPR, 2010.