View-invariant action recognition
Yogesh Rawat (corresponding author) and Shruti Vyas
CRCV, University of Central Florida, Orlando, Florida, USA.
arXiv:2009.00638v1 [cs.CV] 1 Sep 2020
Synonyms
– Cross-view action recognition
– View-invariant action classification
– View-invariant activity recognition
Related Concepts
– View-invariance
– Action recognition
– Activity classification
Definition
Recognizing human actions from previously seen viewpoints is relatively easy
when compared with unseen viewpoints. View-invariant action recognition aims
at recognizing human actions from unseen viewpoints.
Background
Human action recognition is an important problem in computer vision. It
has a wide range of applications in surveillance, human-computer interaction,
augmented reality, video indexing, and retrieval. The varying pattern of spatio-temporal appearance generated by a human action is key to identifying the performed action, and much research has explored these dynamics to learn visual representations of human actions. However, most research in action recognition focuses on a few common viewpoints [1], and these approaches do not perform well when the viewpoint changes. Human actions are performed in a 3-dimensional environment and
are projected to a 2-dimensional space when captured as a video from a given
viewpoint. Therefore, an action will have a different spatio-temporal appearance from different viewpoints. As shown in Figure 1, observation o1 is different
from observation o2 and so on. The research in view-invariant action recognition
addresses this problem and focuses on recognizing human actions from unseen
viewpoints.
There are different data modalities which can be used for view-invariant representation learning and action recognition. These include RGB videos,
Fig. 1: An action captured from different viewpoints (v1, v2, and v3) providing
different observations (o1, o2, and o3) [15].
skeleton sequences, depth information, and optical flow. The skeleton sequences
and depth information require additional sensors and are comparatively more
difficult to capture when compared with RGB videos. Similarly, optical flow is computationally expensive and requires extra processing on RGB videos. These modalities can be used independently as well as in combination to solve the problem of view-invariant action recognition. Figure 2 shows sample instances of an activity performed by an actor, captured from three different viewpoints in three different modalities.
Human action recognition from video sequences involves extracting visual features and encoding the performed action in a meaningful representation which can be used for interpretation. The view-invariant encoding of
actions involves a lot of challenges, and there are different ways to address them.
One possible solution is to track the motion as it evolves with the performed action. The track in itself will be invariant to any change in the viewpoint and can
be used to extract a view-invariant representation of actions. Another approach
is to analyze the spatio-temporal volume, which is covered by a human while performing any action. It is interesting to observe that such spatio-temporal volumes
will have some similarities for the same action, and they can be useful in addressing changes in viewpoint. The tracking of a human body can be useful to some extent, but human joints can move independently while performing most activities. Therefore, tracking skeleton joints independently is very important
for understanding human actions. The availability of large-scale training data
has also enabled us to learn view-invariant representations using deep learning.
We will cover the details of these approaches in the following sections.
View-invariance in human actions
Fig. 2: Video frames showing action in different modalities as seen from three
different view-points [12], Row-1: RGB, Row-2: Skeleton, and Row-3: Depth.
The dynamics of body parts and the change in appearance play an important role in understanding human actions. These two properties can be effectively
used to determine human actions in a video stream if there is no change in viewpoint. However, it becomes challenging when there is a change in the viewpoint,
as the dynamics, as well as the appearance, will change with it significantly.
Therefore, it is important to represent human actions in such a way that the
representation is invariant to any change in viewpoint. The idea is to encode the human action with a representation that does not change as the viewpoint changes. We can observe in Figure 2 how the appearance of the activity changes when seen from different viewpoints. No matter how the action is captured, this variation is present in all the modalities, including RGB, skeleton, and depth.
Motion trajectories
An action sequence performed by any person will have a motion trajectory,
which will be different for different actions. These motion trajectories can be
useful in extracting rich view-invariant representation for action classification.
The actions are performed in a 3D environment, and therefore, the corresponding motion trajectories are in 3D space. The change in speed and direction of
the trajectory plays an important role in inferring the performed action. The
continuities and discontinuities in position, velocity, and acceleration in a 3D
trajectory are preserved in 2D trajectories under a continuous projection [11].
Therefore, we can use the 2D spatio-temporal curvature of these 3D motion tra-
jectories to represent actions which will capture the change in speed as well as
direction.
A spatio-temporal curvature can be represented using instants, which segment the motion trajectory into intervals. Instants indicate a significant change in speed and direction in the motion trajectory and define motion boundaries [11].
A special class of motion boundary, which is independent of starts and stops, is
the dynamic instant that happens while performing an action. A dynamic instant
represents a significant change in the motion characteristics and occurs only for
one frame. Dynamic instants define motion boundaries, and the time period between two dynamic instants, during which the motion characteristics do not change, is called an interval. In Figure 3a, we can observe a sample video frame for the activity
‘opening a cabinet’. If we track the hand motion of the person while performing
this action, we will get a spatio-temporal curvature shown in Figure 3b. It shows
the instants as well as corresponding intervals in the motion trajectory. Figure
3c shows the spatio-temporal curvature values along with the detected dynamic
instants in the motion trajectory.
Fig. 3: (a) A sample video frame showing ‘opening a cabinet’ action. A hand trajectory in white is superimposed on the image, (b) A representation of the trajectory in terms of instants and intervals, and (c) corresponding spatio-temporal
curvature values and detected maximums (dynamic instants) [11].
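The curvature-and-maxima computation described above can be sketched in a few lines. This is an illustrative finite-difference version, not the implementation of [11], and the detection threshold is a hypothetical parameter:

```python
# Detect dynamic instants as local maxima of the spatio-temporal curvature
# of a 2D trajectory r(t) = (x(t), y(t), t), with the frame index as time.

def spatio_temporal_curvature(xs, ys):
    """Curvature kappa = |r' x r''| / |r'|^3 at each interior sample,
    using central finite differences with a unit time step."""
    kappas = []
    for i in range(1, len(xs) - 1):
        # first derivatives of x and y (dt/dt = 1)
        dx = (xs[i + 1] - xs[i - 1]) / 2.0
        dy = (ys[i + 1] - ys[i - 1]) / 2.0
        # second derivatives (d^2 t / dt^2 = 0)
        ddx = xs[i + 1] - 2 * xs[i] + xs[i - 1]
        ddy = ys[i + 1] - 2 * ys[i] + ys[i - 1]
        # cross product of (dx, dy, 1) and (ddx, ddy, 0)
        cx, cy, cz = -ddy, ddx, dx * ddy - dy * ddx
        num = (cx * cx + cy * cy + cz * cz) ** 0.5
        den = (dx * dx + dy * dy + 1.0) ** 1.5
        kappas.append(num / den)
    return kappas

def dynamic_instants(kappas, threshold=0.1):
    """Indices (into the interior samples) of curvature maxima above a
    hypothetical significance threshold."""
    return [i for i in range(1, len(kappas) - 1)
            if kappas[i] > kappas[i - 1] and kappas[i] > kappas[i + 1]
            and kappas[i] > threshold]
```

For an L-shaped hand path (moving right, then turning to move up), the single sharp turn shows up as one dynamic instant.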
View-invariance: The discontinuities in a 3D motion trajectory, which we perceive as instants, always project to discontinuities in the 2D curvature [11]. These instants, which are maxima in the spatio-temporal curvature, will be view-invariant in most scenarios. Therefore, the number of instants in a spatio-temporal curvature is an important view-invariant characteristic. The only exception is the case of a perfect alignment
of the viewing direction with the plane where the action is being performed. In
these cases, the position of the trajectory in consecutive frames will be projected
to the same location in the frame resulting in a 2D trajectory, which is essentially
a single point. Another important characteristic of instants in spatio-temporal
curvature is the change in direction. This is known as sign characteristic, and it
defines the direction of turns in action. It is very useful in distinguishing different
actions captured from varying viewpoints. A clockwise turn is represented by a ‘+’ sign and a counter-clockwise turn by a ‘-’ sign.
These two characteristics of instants, number of instants and sign of instants,
in a spatio-temporal curvature provide a view-invariant representation of actions.
This representation can be used to determine whether two spatio-temporal curvatures belong to the same action class. The first requisite is a match between
the number of instants and the sign sequence. Thereafter, we can compute a
view-invariant similarity measure between the two trajectories. This can be performed using affine epipolar geometry, which will ensure viewpoint invariance in
the similarity measure [11]. However, there are some limitations with this approach. It requires an exact match between corresponding instants in the spatio-temporal curvatures, which can be difficult as there can be false or missed instant
detection. Also, this approach does not take into account the temporal information between the instants. These issues can be addressed using a view-invariant
dynamic time warping to measure the similarity between two spatio-temporal
curvatures [9]. The view-invariant time warping not only suppresses instant outliers, it also compensates for variation in execution style [10]. It can shrink slow-motion trajectories, which are longer along the temporal axis, and expand fast-motion trajectories, which are relatively shorter.
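Plain dynamic time warping between two curvature sequences can be sketched as follows. The view-invariant variant of [9] additionally builds its matching cost from epipolar geometry; this simplified sketch replaces that cost with an absolute difference:

```python
import math

def dtw_distance(a, b, cost=lambda u, v: abs(u - v)):
    """Classic dynamic time warping distance between two 1D sequences.
    Warping lets a slow and a fast execution of the same motion align."""
    n, m = len(a), len(b)
    # D[i][j] = minimal cumulative cost aligning a[:i] with b[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # skip in a
                              D[i][j - 1],      # skip in b
                              D[i - 1][j - 1])  # match
    return D[n][m]
```

A sequence and a twice-as-slow version of it align with zero cost, which is exactly the execution-rate compensation described above.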
Limitations: The idea of spatio-temporal curvature is simple yet effective in
extracting view-invariant representation for actions. However, there are certain
limitations in this approach. The assumption that an action can be represented by the trajectory of a single point in a video frame will not hold for actions in which the full body is involved. The skeleton joints of a human body will have different motions for the same performed action. Therefore, this approach is limited to actions which can be approximated by the trajectory of a single point at any time during the motion.
Tracking joints
A single point-based motion tracking of human actions cannot be generalized to actions where multiple body joints are involved. It carries only motion information, ignoring any shape or relative spatial information. Therefore, it is important to consider the multiple human joints involved in the action while tracking the motion. The movements of different joints while performing an action are not independent of each other. The human body has certain anthropometric proportions, and there exist some geometric constraints between multiple anatomical landmarks such as body joints. This allows us to analyze human actions performed
by different people using the semantic correspondence between human bodies.
The pose X̂ of an actor while performing an action can be represented
as a set of points in any given frame of a video. Here the pose is defined as
X̂ = {X1, X2, ..., Xn}, where Xi = (xi, yi, zi, Λ)^T are homogeneous coordinates and
there are n such landmarks. Each point represents the spatial coordinate of an
anatomical landmark on the human body, as shown in Figure 4. The body size
and proportion vary greatly between different people and age groups. However,
even though the human dimensional variability is substantial, it is not arbitrary
[2]. Therefore, geometric constraints can be used between the pose of two different actors performing the same action. The proportion between two sets of
points describing the pose of two different actors can be captured by a projective
transformation. Suppose we have a set of points X̂ and Ŷ describing actors A1
and A2 . Then the relationship between these two sets can be described by a
matrix M such that Xi = M Yi, where i = 1, 2, ..., n and M is a 4×4 non-singular
matrix. The transformation simultaneously captures the different pose of each
actor as well as the difference in size/proportions of the two actors. There are
two types of constraints useful for action recognition: postural constraints and action constraints [2]. The postures of two actors performing the
same action at any time instant should be the same. This constraint allows us
to recognize the action at each time instant by measuring the similarity between
the postures. Along with this frame-wise constraint, another global constraint
can be used on the point sets describing two actors if they are performing the
same action.
Fig. 4: A sample video frame showing a pose and corresponding point-based pose
representation [2].
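Given corresponding landmarks of two actors, the transformation M in Xi = M Yi can be recovered by linear least squares. The sketch below is an illustration under the assumption of clean, corresponding homogeneous points, not the exact estimator used in [2]:

```python
import numpy as np

def estimate_pose_transform(X, Y):
    """X, Y: 4 x n arrays of homogeneous landmark coordinates for two
    actors. Returns the least-squares 4x4 matrix M with X ≈ M Y."""
    # X = M Y  <=>  Y^T M^T = X^T, solved column-wise by least squares.
    Mt, *_ = np.linalg.lstsq(Y.T, X.T, rcond=None)
    return Mt.T

# Usage: 10 synthetic landmarks (hypothetical data) for actor A2, mapped
# onto actor A1 through a known ground-truth transform.
rng = np.random.default_rng(0)
M_true = rng.normal(size=(4, 4))
Y = np.vstack([rng.normal(size=(3, 10)), np.ones((1, 10))])  # homogeneous
X = M_true @ Y
M_est = estimate_pose_transform(X, Y)
```

With n ≥ 4 landmarks in general position the system is (over-)determined and the recovered M reproduces the point correspondence exactly.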
The anthropometric constraints in a human body allow the transformation of
pose between two actors. Moreover, utilizing the postural and action constraints
can help in recognizing action by measuring the similarity between two sets of
points. The transformation and geometric constraints will address the issue of
view-invariance in actions. However, this approach assumes the temporal alignment of poses for each frame in the video. This can be problematic as different
actors will have a unique style of executing an action. Therefore, a temporal
alignment is also required, which is invariant to temporal transformations. This
can be done using dynamic time warping, which is particularly suited to action
recognition as it is expected that different actors may perform portions of actions
at a varying rate [2].
The frame-wise representation of action decouples pose from its motion along
the temporal domain. This approach can be effective to some extent, but it ignores the temporal information, focusing only on the order of poses, which limits its potential. We can model an action as a spatio-temporal construct since it is a function of time as well. It can be represented as a set of points, Â = {X1, X2, ..., Xp}, where Xi = (xi, yi, zi, ti)^T are spatio-temporal coordinates and p = mn for m landmarks and n recorded postures of that action [14]. A sample action ‘walking’ is shown in Figure 5 in both xyz and xyt
space at different time steps.
Fig. 5: Representation of an action in xyzt 4D space. (a) Action in xyz space,
(b) action in xyt space [14].
The variability associated with the execution of an action can be closely
approximated by a linear combination of action basis in joint spatio-temporal
space. An instance of an action can be defined as a linear combination of a set of
action basis A1, A2, ..., Ak. Therefore, any instance of an action can be expressed as A′ = Σ_{i=1}^{k} a_i A_i, where a_i ∈ R is the coefficient associated with the action basis A_i ∈ R^{4×p}. The space of an action, A, is the span of all its action basis.
The variance captured by action basis can include different execution rates of
the same action, different individual styles of performances as well as the anthropometric transformations of different actions. These action basis can be used to
project a 4D pose point to its image (xyt) using a space-time projection matrix
[14]. These projections are useful in forming an action representation which can
further be used for recognition of new instances.
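One way to sketch action basis is via the singular value decomposition of flattened action instances; [14] derives its basis differently, so the SVD construction here is an assumption for illustration:

```python
import numpy as np

def learn_action_basis(instances, k):
    """instances: list of 4 x p action arrays. Returns k orthonormal
    basis arrays (4 x p) spanning the leading directions of the data."""
    D = np.stack([A.ravel() for A in instances])   # num_instances x 4p
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    shape = instances[0].shape
    return [Vt[i].reshape(shape) for i in range(k)]

def coefficients(A_new, basis):
    """Least-squares coefficients a_i such that sum_i a_i A_i ≈ A_new."""
    B = np.stack([Bi.ravel() for Bi in basis]).T   # 4p x k
    a, *_ = np.linalg.lstsq(B, A_new.ravel(), rcond=None)
    return a
```

If a new instance lies in the span of the training instances, its coefficients reconstruct it exactly, which is the A′ = Σ a_i A_i relation above.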
All the joints in the pose and all the time steps in a trajectory may not be
required to identify the action category. We can select the joints dynamically
based on the action and also choose fewer poses along the temporal domain
known as canonical poses [6]. These empirically selected joints and canonical
poses provide view-invariant trajectories called Invariance Space Trajectories (ISTs), which are invariant representations of human actions and are useful for action recognition.
Spatio-temporal volume
The tracking of joints in the motion trajectories can be effective for action representation. However, as we discussed earlier, not all joints may be useful in
recognizing the performed action. An alternative to joints is the contour
of the actor, which also captures the complete shape information. Tracking of
actor contours will consider both shape and motion of the actor performing the
action. When an actor performs some action in a 3D environment, the points on
the outer boundary of the actor are projected as 2D (x,y) contour in the image
plane. A sequence of 2D contours over time generates a Spatio-Temporal Volume (STV), which is a 3D object in (x, y, t) space [19]. The object contours in two consecutive frames can be tracked by finding point correspondences between them. This can be done with a graph-theoretical approach, by maximizing the matching of a weighted bipartite graph. In
Figure 6, we can observe some samples of generated spatio-temporal volumes for
various actions.
Fig. 6: Generated Spatio-Temporal Volumes (STVs) for sample actions: (a) falling, (b) tennis stroke, (c) walking, and (d) dancing. The color-coded action descriptors are also shown, corresponding to ridges (yellow), saddle ridges (white), peaks (red), valleys (pink), and saddle valleys (green) [19].
An STV can be considered as a manifold, and it will be nearly flat for small
scales defined by a small neighborhood around a point. Therefore, it can be
represented as a continuous action volume B by computing plane equations in
the neighborhood around a point. We define it using a 2D parametric representation by considering the arc length of the contour (s), which encodes the
object shape, and time (t) which encodes the motion. The action volume is defined as B = f (s, t) = [x(s, t), y(s, t), t]. The parameters s and t can be used to
generate trajectories and contours for any point in the object boundary. This
representation is important for computing action descriptors corresponding to
changes in direction, speed, and shape of parts of contours [19]. This can be
done by analyzing the surface type of a point in STV using Gaussian curvature
and mean curvature. These surface types are important action descriptors, and
for any given action they are collectively called the action sketch. In Figure 6, several of these
action descriptors are superimposed on the STVs for various actions. The underlying curves of the contour and point trajectory for each action descriptor in
the action sketch will have a maxima or a minima. It can be proved that the
maxima/minima of the contour and the trajectory will not change by changing
the viewpoint of the camera. Therefore, this representation is also invariant to
any change in viewpoint.
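The surface-type labeling from Gaussian curvature K and mean curvature H can be written as a small lookup. The sign convention (negative mean curvature for peaks, following the standard Besl-Jain surface-type taxonomy under a particular normal orientation) is an assumption here, not something fixed by [19]:

```python
def surface_type(K, H, eps=1e-9):
    """Surface-type label of an STV point from its Gaussian curvature K
    and mean curvature H; these labels serve as action descriptors."""
    k = 0 if abs(K) < eps else (1 if K > 0 else -1)
    h = 0 if abs(H) < eps else (1 if H > 0 else -1)
    table = {
        (1, -1): "peak", (1, 1): "pit",
        (0, -1): "ridge", (0, 1): "valley", (0, 0): "flat",
        (-1, -1): "saddle ridge", (-1, 1): "saddle valley",
        (-1, 0): "minimal surface",
    }
    # K > 0 with H = 0 cannot occur: K > 0 means both principal
    # curvatures share a sign, so their mean is nonzero.
    return table.get((k, h), "impossible")
```

The labels returned here match the descriptor names shown in Figure 6 (peak, ridge, valley, saddle ridge, saddle valley).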
Fig. 7: Illustration of constructing 4D-AFM. The first row shows videos from
four different views. The second row shows the STVs extracted from the videos
and the locations of the spatio-temporal action features on the surface of STVs.
These action features are mapped to the 4D action shape as shown in the third
row [18].
The 3D volume of an STV, along with the action descriptors derived from this volume, can be used for action recognition. The relation between two STVs is defined as x^T F x′ = 0, where x and x′ are points on two different action sketches and F is a 3×3 fundamental matrix. This relation determines
whether two different action sketches belong to the same action class and is a
useful property for action recognition. This 3D volume can further be utilized
to build a 4D action feature model(4D-AFM) [18]. This model elegantly encodes
the shape and motion of actors observed from multiple views. A sample example
for this model is shown in Figure 7. This model enhances the view-invariant
robustness of this approach, as features from multiple views are mapped to a
unified model. Action recognition can be performed with this model based on
the scores of matching action features from the action videos to the model points
by exploiting pairwise interactions of features.
Learning based methods
Classical approaches focused on finding good features and utilized simple matching-based techniques for action recognition. In learning-based methods, both feature extraction and action recognition are performed jointly. The availability of multi-view activity datasets, such as IXMAS [17], UWA3D Multiview Activity II [7], Northwestern-UCLA [16], and NTU RGB-D [12], has enabled the development of robust learning-based methods for view-invariant feature learning. These features can be learned from different modalities, such as RGB [4,15], skeleton [12,8], depth [3,15], and optical flow [3].
We have seen earlier how motion trajectories are effective for extracting view-invariant descriptions. Apart from point-based and joint-based trajectories, we
can also extract dense trajectories from any action sequence. These trajectories
provide dense points in each frame and track them using displacement information from a dense optical flow field. Given a dense trajectory of length L, a sequence S is formed from the displacement vectors ∆P_t = (P_{t+1} − P_t) = (x_{t+1} − x_t, y_{t+1} − y_t). The sequence is defined as the series of displacement vectors (∆P_t, ..., ∆P_{t+L−1}), normalized by the sum Σ_{i=t}^{t+L−1} ||∆P_i||. Any action sequence can be represented using this dense motion trajectory description.
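The normalization above makes the descriptor invariant to the overall scale of the motion, which a short sketch makes concrete (`trajectory_descriptor` is a hypothetical helper name):

```python
def trajectory_descriptor(points):
    """Shape-of-trajectory descriptor: displacement vectors normalized by
    the total displacement magnitude, as in dense-trajectory features.
    points: list of (x, y) positions over L + 1 frames."""
    disp = [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum((dx * dx + dy * dy) ** 0.5 for dx, dy in disp)
    if total == 0:
        return disp  # a static point carries no motion to describe
    return [(dx / total, dy / total) for dx, dy in disp]
```

Scaling all the tracked points by a constant leaves the descriptor unchanged, so trajectories of the same shape at different image scales compare equal.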
These representations can be used to learn action basis if we have a limited number of samples. We discussed earlier how these action basis can be useful
for view-invariant action recognition. However, if we have a large number of samples, computing action basis becomes computationally expensive. An alternative is to perform clustering of these action descriptions and use the cluster centers as action representatives [8]. These cluster centers can further be used to describe new action trajectories using a bag-of-words approach. If we have a sufficient number of samples from multiple views to compute these cluster centers, then we can train a fully connected neural network to perform a non-linear transformation and learn a view-invariant feature representation. These learned
view-invariant features can then be used to perform robust action classification
on novel and unseen views.
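The bag-of-words step can be sketched as a nearest-center histogram over trajectory descriptors. The cluster centers below are hypothetical stand-ins for centers learned by k-means over many videos, as in [8]:

```python
def nearest_center(desc, centers):
    """Index of the cluster center closest to a descriptor (squared
    Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((d - c) ** 2
                                 for d, c in zip(desc, centers[k])))

def bag_of_words(descriptors, centers):
    """L1-normalized histogram of nearest-center assignments; this is the
    fixed-length video representation fed to the classifier."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

A video is thus summarized by how often its trajectories fall near each action representative, regardless of how many trajectories it contains.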
Multi-task learning: Dense trajectories can be very effective for learning view-invariant representations for action recognition. However, they require additional computation time, which is not suitable for low-latency systems. Convolutional neural networks provide an efficient way to learn meaningful representations directly from the input video. These networks can be very effective at classifying actions in videos with previously seen views. However, they fail to generalize well to unseen views which are not present in the training data. This is due to the absence of view-invariance in the learned representation. This can be addressed by a multi-task approach where
the representation is also utilized for another task which forces the network to learn a view-invariant representation. This task could be prediction of optical flow from an unseen view [3] or cross-view video synthesis [15]. Both of these
approaches enable the network to learn view-invariant feature representations
and perform well on action recognition for unseen views.
Datasets and Experimental Results
The availability of public datasets for multi-view action recognition enables
the research community to benchmark progress. There are four main
public datasets which are widely used and provide action sequences captured
from multiple viewpoints.
IXMAS: The Inria Xmas Motion Acquisition Sequences (IXMAS) dataset [17] has videos of 11 action classes, with each action performed 3 times by 10 actors. All actions were captured from five camera views.
UWA3D MultiviewActivity II: This dataset [7] contains RGB videos of 30 human activities performed by 10 subjects at different scales. Each subject performed all the actions 4 times. Depth and skeleton data captured using Kinect are also available with this dataset.
Northwestern-UCLA: Northwestern-UCLA Multiview 3D event dataset
[16] contains RGB, depth and human skeleton data captured simultaneously by
three Kinect cameras. This dataset has 10 action categories, and each action is
performed by 10 actors. There are a total of 1493 action sequences. There are
two main data splits: cross-subject (CS) and cross-view (CV) as suggested by
[16]. View-invariant action recognition is examined with the CV split, where videos from the first two views are used for training and the third view is used for testing.
NTU RGB-D: This is the largest dataset for view-invariant human action recognition [12]. Along with RGB videos, depth and skeleton data are also available. It contains more than 56K videos and 4 million frames with 60 different action classes. There are a total of 40 different actors, and the actions are captured from 80 different viewpoints.
Northwestern-UCLA and NTU RGB-D datasets provide multiple modalities, such as RGB, depth, and skeleton, along with multiple views to perform
view-invariant action recognition. The performance of a method is evaluated by
computing an accuracy score based on whether it predicts the correct action
class or not. In Table 1 and Table 2, we can observe the performance of different
methods using different modalities on Northwestern-UCLA and NTU RGB-D
datasets. The cross-subject (CS) evaluation measures performance on unseen subjects, and the cross-view (CV) evaluation measures performance on unseen views. The cross-subject performance is better than the cross-view performance for most methods. This is mainly because a change in viewpoint causes higher variation in the observed action than a change in subject, which mostly affects appearance. Therefore, methods which focus on motion will perform better than those which focus more on the appearance of the actor.
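The accuracy score used for both CS and CV evaluation is simply the fraction of correctly classified test videos:

```python
def accuracy(predictions, labels):
    """Fraction of test videos whose predicted action class matches the
    ground truth; reported for both CS and CV splits."""
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```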
Method            Modality  CS    CV
Vyas et al. [15]  RGB       87.5  73.2
MST-AOG [16]      Depth     -     53.6
HOPC [7]          Depth     -     71.9
CNN-BiLSTM [3]    Depth     -     62.5
R-NKTM [8]        Skeleton  -     78.1
MST-AOG [16]      RGB-S     81.6  73.3

Table 1: A comparison of cross-subject (CS) and cross-view (CV) action recognition on the N-UCLA MultiviewAction3D dataset.

Method                 Modality  CS    CV
CNN-BiLSTM [3]         RGB       55.5  49.3
Vyas et al. [15]       RGB       88.9  86.3
CNN-BiLSTM [3]         Depth     68.1  63.9
Vyas et al. [15]       Depth     79.4  78.7
Shahroudy et al. [12]  Skeleton  62.9  70.3
CNN-BiLSTM [3]         Flow      80.9  83.4
DSSCA-SSLM [13]        RGB-DS    74.9  -

Table 2: A comparison of cross-subject (CS) and cross-view (CV) action recognition on the NTU RGB+D dataset.

Conclusion and open problems

Action recognition for unknown and unseen views is a challenging problem. The tracking of motion trajectories has been found successful so far, but trajectories are computationally expensive to extract and therefore do not scale to large-scale scenarios. The availability of large-scale multi-view datasets has enabled us to develop networks which are effective in recognition performance and can be trained end-to-end. However, their performance is limited by the variation of viewpoints in the available datasets. These datasets are lab curated, and the actions are performed in a controlled environment. Apart from this, they are also limited in terms of the number of available viewpoints. Therefore, it will be challenging to generalize methods trained on these datasets to real-world scenarios.
Recommended Readings
[1] Kevin Duarte, Yogesh Rawat, and Mubarak Shah. VideoCapsuleNet: A simplified network for action detection. In Advances in Neural Information
Processing Systems, pages 7610–7619, 2018.
[2] Alexei Gritai, Yaser Sheikh, and Mubarak Shah. On the use of anthropometry in the invariant analysis of human actions. In Proceedings of the
17th International Conference on Pattern Recognition, 2004. ICPR 2004.,
volume 2, pages 923–926. IEEE, 2004.
[3] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan Kankanhalli. Unsupervised learning of view-invariant action representations. In Advances in Neural Information Processing Systems, pages 1254–1264, 2018.
[4] Jingen Liu, M Shah, B Kuipers, and S Savarese. Cross-view action recognition via view knowledge transfer. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 3209–3216. IEEE
Computer Society, 2011.
[5] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization
for view invariant human action recognition. Pattern Recognition, 68:346–
362, 2017.
[6] Vasu Parameswaran and Rama Chellappa. View invariance for human action recognition. International Journal of Computer Vision, 66(1):83–101,
2006.
[7] Hossein Rahmani, Arif Mahmood, Du Huynh, and Ajmal Mian. Histogram
of oriented principal components for cross-view action recognition. IEEE
transactions on pattern analysis and machine intelligence, 38(12):2430–
2443, 2016.
[8] Hossein Rahmani, Ajmal Mian, and Mubarak Shah. Learning a deep model
for human action recognition from novel viewpoints. IEEE transactions on
pattern analysis and machine intelligence, 40(3):667–681, 2018.
[9] C Rao, A Gritai, M Shah, and T Syeda-Mahmood. View-invariant alignment
and matching of video sequences. In Proceedings Ninth IEEE International
Conference on Computer Vision, 2003.
[10] Cen Rao, Mubarak Shah, and Tanveer Syeda-Mahmood. Invariance in motion analysis of videos. In Proceedings of the eleventh ACM International
Conference on Multimedia, pages 518–527. ACM, 2003.
[11] Cen Rao, Alper Yilmaz, and Mubarak Shah. View-invariant representation and recognition of actions. International Journal of Computer Vision,
50(2):203–226, 2002.
[12] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages
1010–1019, 2016.
[13] Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, and Gang Wang. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 40(5):1045–
1058, 2018.
[14] Yaser Sheikh, Mumtaz Sheikh, and Mubarak Shah. Exploring the space
of a human action. In Tenth IEEE International Conference on Computer
Vision (ICCV’05) Volume 1, volume 1, pages 144–149. IEEE, 2005.
[15] Shruti Vyas, Yogesh S Rawat, and Mubarak Shah. Time-aware and view-aware video rendering for unsupervised representation learning. arXiv
preprint arXiv:1811.10699, 2018.
[16] Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Crossview action modeling, learning and recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2649–2656,
2014.
[17] Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Computer vision and image
understanding, 104(2-3):249–257, 2006.
[18] Pingkun Yan, Saad M Khan, and Mubarak Shah. Learning 4d action feature
models for arbitrary view action recognition. In 2008 IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2008.
[19] Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05)-Volume 1-Volume
01, pages 984–989. IEEE Computer Society, 2005.
[20] Jingjing Zheng, Zhuolin Jiang, P Jonathon Phillips, and Rama Chellappa.
Cross-view action recognition via a transferable dictionary pair. In BMVC,
volume 1, page 7, 2012.