Estimating Human Actions Affinities Across Views
Nicoletta Noceti1, Alessandra Sciutti2, Francesco Rea2, Francesca Odone1 and Giulio Sandini2
1 DIBRIS, Università di Genova, via Dodecaneso 35, 16146, Genova, Italy
2 RBCS, Istituto Italiano di Tecnologia, via Morego 30, 16163, Genova, Italy
Keywords:
Human Actions Understanding, View-invariant Representation, Trajectory Curvature.
Abstract:
This paper deals with the problem of estimating the affinity level between different types of human actions
observed from different viewpoints. We analyse simple repetitive upper body human actions with the goal
of producing a view-invariant model from simple motion cues inspired by studies on
human perception. We adopt a simple descriptor that summarizes the evolution of the spatio-temporal curvature
of the trajectories, which we use to evaluate the similarity between pairs of actions with a multi-level matching.
We experimentally verify the presence of semantic connections between actions across views, inferring a
relations graph that shows such affinities.
1 INTRODUCTION
Since birth, human neonates show a preference for
biological motion (Simion et al., 2008), and this is an
important trigger for social interaction. This inclination, a key element in human development research,
is also inspiring computer vision researchers to find computational models able to replicate it on artificial systems.
The cross-fertilization between the two research communities may also guide the choice of the most appropriate tools for a given computer vision task: if neonates
have this capability of “understanding” motion while
their visual perception is still rudimentary, it is likely
that the analysis is based on very simple motion cues,
such as sensitivity to local apparent motion, and on simple features such as velocity, acceleration or curvature (see
e.g. (Kuhlmeier et al., 2010; Simion et al., 2008)).
The goal of our research is to build computational
models for the iCub humanoid robot (Metta et al.,
2008) to simulate this phase of human development.
Our long term objective is to understand how much
we can infer about the nature of motion from such simple observations, as this skill seems to be at the basis of the human ability to interact with others. We
would also like to preserve some abilities typical of
a developing human being, such as a degree of tolerance to view-point changes (Troje and Westhoff,
2006; Goren et al., 1975; Farah et al., 1995).
In the shorter term, our research is focusing on the
identification of biological motion (some preliminary
results can be found in (Sciutti et al., 2014)) and on
the analysis of different types of biological motion.
The latter is the goal of this paper, in which we analyze simple repetitive upper body human actions with
the goal of producing a view-invariant model from
simple motion cues. We apply this model to the estimation of the affinity level between different types of
actions, focusing in particular on two main categories:
transitive actions, which involve object manipulation, and
intransitive actions.
Since we are primarily interested in capturing abilities
typical of the early months of human development, we
do not address classical action recognition tasks, which rely on abilities that are likely to be gained in later stages of
development (Camaioni, 2004; Kanakogi and Itakura,
2011), also thanks to the infant's prior motor experience.
Our model takes inspiration from the seminal
work of (Rao et al., 2002), where the authors discuss
the use of dynamic instants, i.e. meaningful action
units whose properties strongly characterize the
evolution of structured activities (e.g. pick up an object from the floor and put it on the desk) and that have
been shown to be of substantial relevance for human motion perception. Such points, a consequence
of a variation in the force applied during an activity,
correspond to the local maxima of the curvature of the
trajectory describing the activity evolution on the image plane. The authors formally prove that they also
have view-invariant properties.
Noceti N., Sciutti A., Rea F., Odone F. and Sandini G.
Estimating Human Actions Affinities Across Views.
DOI: 10.5220/0005307801300137
In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 130-137
ISBN: 978-989-758-090-1
Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)
In this paper we focus instead on intervals, meaning portions of trajectories between two dynamic instants, and investigate their potentially informative content to be exploited for cross-view recognition. For our investigation, we collected a data set
of videos, each one including repetitions of a given
atomic action. In our case, dynamic instants therefore
mainly refer to points separating one action instance
from the next. After an initial low-level analysis,
which aims at segmenting the moving arm of the user,
we extract corner points from the arm region and track
them over time to collect a set of trajectories. Then,
we measure the curvature and find its local maxima.
Finally, we describe the intervals with histograms of
curvature, which we adopt in a multi-level matching to
evaluate the level of affinity between two observed
events. We experimentally provide evidence of
the presence of action classes, inferred by estimating
the pairwise similarities of action sequences from the
same view and across different views.
Works related to the computational model we consider can be found in fields such as video surveillance,
video retrieval and robotics, where tasks such as gesture
and action recognition or behavior modeling have
been very fertile disciplines for many years, and still
are (Fanello et al., 2013; Malgireddy et al., 2012;
Mahbub et al., 2011; Noceti and Odone, 2012; Wang
et al., 2009). We refer the interested reader to a recent
survey (Aggarwal and Ryoo, 2011) for a complete account of the topic.
From the view-invariance standpoint, the problem has been addressed considering two different settings, i.e. observing the same dynamic event simultaneously from two (or more) cameras (Zheng et al.,
2012; Wu and Jia, 2012; Li and Zickler, 2012; Huang
et al., 2012a; Zheng and Jiang, 2013) or considering different instances of the same concept of dynamic
event (Lewandowski et al., 2010; Gong and Medioni,
2011; Junejo et al., 2011; Li et al., 2012; Huang
et al., 2012b). The latter are more related to our setting. In general, view-invariance may be addressed
at the descriptor level (Junejo et al., 2011; Li et al.,
2012; Huang et al., 2012b) or at the similarity estimate level. In the latter case machine learning (Wu and Jia,
2012; Huang et al., 2012a; Zheng and Jiang, 2013)
and, more recently, transfer learning (Zheng et al.,
2012; Li and Zickler, 2012) may be beneficial.
The approach we follow shares computational
tools and models with many of the above mentioned
works, but significantly differs in its intentions, in
that we are not interested in recognizing specific gestures, actions or activities. Instead, we consider a
more abstract task: to what extent are we able to infer
properties (if any) of the observed motion that persist across views, even from a very coarse representation?
The rest of the paper is organized as follows. Sec.
2 is devoted to the description of the approach we
follow, from the low-level analysis to the matching,
while Sec. 3 describes our experimental analysis. The
final section is left to conclusions.
2 CURVATURE-BASED MOTION REPRESENTATION
In this section we discuss our approach to motion
modeling, which builds on top of a sparse representation and then relies on the computation of histograms
of the spatio-temporal curvature. Then, we describe
the strategy we adopt to match image sequences.
2.1 Low-level Video Analysis
The very first step of our method relies on a widely accepted video analysis pipeline, which aims at segmenting each image of a sequence with respect to both
motion and appearance information. In
our case study, this corresponds to detecting the image region containing the moving arm of a subject
performing a given action. The intermediate steps of
the pipeline are reported in Fig. 1. We first perform
background subtraction (Zivkovic, 2004), then refine
the obtained binary map by applying skin detection in
the HSV color space to the foreground only. Finally,
assuming only the subject of interest is moving in the
observed scene, we keep the largest connected component of the final binary map as the region of interest
(ROI).
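As an illustration, the last two steps of this pipeline could be sketched as follows. This is a minimal sketch and not the implementation used in our experiments: the foreground mask is assumed to be already available (the Gaussian mixture background subtraction is not reproduced), the HSV thresholds in `is_skin` are illustrative guesses, and the function names `is_skin`, `largest_component` and `extract_roi` are hypothetical. Pure Python is used to keep the sketch free of dependencies.

```python
import colorsys
from collections import deque


def is_skin(rgb, h_max=0.14, s_min=0.15, v_min=0.35):
    """Rough HSV skin test on one 8-bit RGB pixel (thresholds are illustrative guesses)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h <= h_max and s >= s_min and v >= v_min


def largest_component(mask):
    """Return the (row, col) pixels of the largest 4-connected component of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = set()
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill of the component containing (r0, c0).
            comp, queue = set(), deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                comp.add((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc] and not seen[rr][cc]:
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            if len(comp) > len(best):
                best = comp
    return best


def extract_roi(image, foreground):
    """Keep foreground pixels that look like skin, then the largest connected blob."""
    mask = [[bool(foreground[r][c]) and is_skin(image[r][c])
             for c in range(len(image[0]))] for r in range(len(image))]
    return largest_component(mask)


# Toy 3x4 image: SKIN and WALL are made-up colors, foreground is a made-up motion mask.
SKIN, WALL = (220, 170, 140), (40, 60, 200)
image = [[SKIN, SKIN, WALL, SKIN],
         [SKIN, WALL, WALL, WALL],
         [WALL, WALL, SKIN, SKIN]]
foreground = [[1, 1, 1, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0]]
roi = extract_roi(image, foreground)
```

In the toy example the three connected skin pixels in the top-left corner form the ROI, while isolated skin-colored pixels and skin pixels outside the foreground are discarded.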
The second stage of our method relies on describing the visual dynamics of the moving region by
means of point trajectories (Fig. 2). To this purpose, we extract from the ROI the Harris corners (Shi
and Tomasi, 1994), which we describe using SIFT descriptors (Lowe, 2004). Then, we track the SIFT descriptors with
a Kalman filter (Welch and Bishop, 1995), using histogram intersection as a similarity measure between observations. To improve the quality of the
obtained spatio-temporal trajectories, we finally filter
them with anisotropic diffusion (Perona and Malik,
1990).
We collect spatio-temporal trajectories for
each video, and thus set the basis for the next step of
motion representation, based on the concept of curvature.
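To give an idea of the trajectory filtering step, the following is a minimal sketch of one-dimensional anisotropic diffusion in the style of Perona and Malik, with the exponential conductance function. Applying it independently to the x and y coordinates is an assumption made here for illustration, as are the parameter values (`n_iter`, `kappa`, `lam`) and the function names.

```python
import math


def perona_malik_1d(signal, n_iter=20, kappa=1.0, lam=0.25):
    """Smooth a 1D signal with Perona-Malik diffusion (exponential conductance)."""
    s = [float(v) for v in signal]
    for _ in range(n_iter):
        out = s[:]
        for i in range(len(s)):
            # Differences to the two neighbours; reflecting borders give a zero gradient.
            right = s[i + 1] - s[i] if i + 1 < len(s) else 0.0
            left = s[i - 1] - s[i] if i > 0 else 0.0
            # Conductance is close to 1 for small gradients and close to 0 across sharp jumps.
            c_right = math.exp(-(right / kappa) ** 2)
            c_left = math.exp(-(left / kappa) ** 2)
            out[i] = s[i] + lam * (c_right * right + c_left * left)
        s = out
    return s


def smooth_trajectory(points, **kwargs):
    """Filter the x and y coordinates of a trajectory independently (an assumption)."""
    xs = perona_malik_1d([p[0] for p in points], **kwargs)
    ys = perona_malik_1d([p[1] for p in points], **kwargs)
    return list(zip(xs, ys))


# Small jitter around 0 and around 10, separated by a sharp jump (made-up values).
noisy = [0.0, 0.1, 0.0, 0.1, 0.0, 10.0, 10.1, 10.0, 10.1, 10.0]
smoothed = perona_malik_1d(noisy)
```

The property of interest is that small oscillations are flattened while abrupt changes, which are the ones carrying information about dynamic instants, are preserved.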
2.2 Spatio-temporal Curvature
Figure 1: A visual sketch of the initial video analysis: starting from an image of the sequence (Fig. 1(a)), we first apply
background subtraction (Fig. 1(b)), then detect skin on the foreground (Fig. 1(c)), and finally keep only the largest connected
component (Fig. 1(d)) as the region of interest.
Figure 2: Examples of trajectories of corner points, which we subsampled for the sake of clarity of the figure.
The projection of the dynamic evolution of a 3D
point in the image plane is composed as a spatio-temporal trajectory of observations $T = \{P(t_i)\}_{i=1}^{n}$,
where $P(t_i) = (x(t_i), y(t_i), t_i)$ is an image point observed at time $t_i$ with coordinates $(x(t_i), y(t_i))$, which
are functions of time. The velocity of the points
can be expressed as the derivative of the positions,
i.e. $V(t_i) = (x'(t_i), y'(t_i), \Delta_t)$, where $\Delta_t$ is the temporal gap between adjacent images. Similarly, the acceleration is the derivative of the velocity, $A(t_i) = (x''(t_i), y''(t_i), 0)$. At this point, we can compute the
trajectory curvature as
$$C(t_i) = \frac{\|V(t_i) \times A(t_i)\|}{\|V(t_i)\|^3}. \qquad (1)$$
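For sampled trajectories, Eq. 1 can be evaluated with finite differences. The sketch below is one possible discretization, not necessarily the one used in our experiments: derivatives are central differences with respect to the frame index, the third velocity component is the temporal gap `dt`, and the function name `spatio_temporal_curvature` is hypothetical.

```python
import math


def spatio_temporal_curvature(xs, ys, dt=1.0):
    """Curvature C = ||V x A|| / ||V||^3 at the interior samples of a trajectory.

    Derivatives are central finite differences with respect to the frame index
    (an assumption); entry k of the result refers to sample k + 1.
    """
    curvature = []
    for i in range(1, len(xs) - 1):
        vx = (xs[i + 1] - xs[i - 1]) / 2.0
        vy = (ys[i + 1] - ys[i - 1]) / 2.0
        ax = xs[i + 1] - 2.0 * xs[i] + xs[i - 1]
        ay = ys[i + 1] - 2.0 * ys[i] + ys[i - 1]
        # Cross product of V = (vx, vy, dt) and A = (ax, ay, 0).
        cross = (-dt * ay, dt * ax, vx * ay - vy * ax)
        num = math.sqrt(sum(c * c for c in cross))
        den = (vx * vx + vy * vy + dt * dt) ** 1.5
        curvature.append(num / den)
    return curvature


# A point moving right, then turning upwards at sample 3 (made-up trajectory).
xs = [0, 1, 2, 3, 3, 3, 3]
ys = [0, 0, 0, 0, 1, 2, 3]
curv = spatio_temporal_curvature(xs, ys)
```

In this toy trajectory the curvature is zero along the two straight segments and peaks at the corner, which is the kind of point we refer to as a dynamic instant.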
Consider the corner trajectories of Fig. 2, one of
which is reported in the 3D space-time reference system of Fig. 3, left. On the right, we show the trend of
the velocity magnitude (above) and of the curvature (below) over time, highlighting their local maxima in red
and green, respectively. The corresponding space-time points are coherently marked on the 3D visualization on the left. As can be easily observed,
while the first type of points indicates time instants in
the middle of a segment of the space-time trajectory
(points in which the user starts to decelerate to finally
stop the movement), the second refers to instantaneous
changes in motion direction and/or dynamics.
The dynamic instants are the units on top of which
we build the representation of a trajectory. Following the notation introduced above, an observed
spatio-temporal sequence $T = \{(x(t_i), y(t_i), t_i)\}_{i=1}^{n}$ is
now associated with a sequence $DI = [\hat{t}_1, \ldots, \hat{t}_m]$ of
dynamic instants, where $\hat{t}_i \in \{t_1, \ldots, t_n\}$ and $m < n$.
According to (Rao et al., 2002), we define an
interval as a trajectory segment lying between two dynamic instants. We chose to represent the distribution of the curvature in it by means
of histograms. Therefore, at the end of the representation, each observed trajectory is associated with
a sequence of curvature histograms, i.e. $H(T) = [H(\hat{t}_1, \hat{t}_2), \ldots, H(\hat{t}_{m-1}, \hat{t}_m)]$, where $H(\hat{t}_i, \hat{t}_{i+1})$ denotes
the histogram computed between
a pair of adjacent dynamic instants.
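The construction of $H(T)$ from a sampled curvature profile can be sketched as follows. The number of bins, the histogram range `v_max`, the normalization to unit sum and the choice of treating each interval as a half-open index range are all assumptions made for this illustration, and the function names are hypothetical.

```python
def dynamic_instants(curvature):
    """Indices of strict local maxima of a sampled curvature profile."""
    return [i for i in range(1, len(curvature) - 1)
            if curvature[i - 1] < curvature[i] > curvature[i + 1]]


def histogram(values, n_bins=4, v_max=1.0):
    """Normalized histogram of values over [0, v_max]; larger values fall in the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v / v_max * n_bins), n_bins - 1)] += 1
    total = float(len(values))
    return [c / total for c in counts]


def interval_histograms(curvature, n_bins=4, v_max=1.0):
    """One curvature histogram per interval between adjacent dynamic instants."""
    instants = dynamic_instants(curvature)
    # Each interval is taken as the half-open index range [start, end): an assumption.
    return [histogram(curvature[a:b], n_bins, v_max)
            for a, b in zip(instants, instants[1:])]


# Made-up curvature profile with three peaks, hence two intervals.
curv = [0.0, 1.0, 0.0, 0.2, 0.9, 0.1, 0.0, 0.8, 0.0]
instants = dynamic_instants(curv)
hists = interval_histograms(curv)
```

With $m$ detected dynamic instants the sketch returns $m - 1$ histograms, matching the length of the sequence $H(T)$.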
2.3 Multi-level Matching between Image Sequences
Once we have detected the dynamic instants as the local maxima of the curvature and represented the curvature of the intervals as histograms, we may set up a multi-level procedure to match image sequences. Let us consider two videos and the two corresponding
sets of observed trajectories, $\mathcal{T}^1 = \{T_1^1, \ldots, T_N^1\}$ and
$\mathcal{T}^2 = \{T_1^2, \ldots, T_M^2\}$, described with the curvature histograms to collect the sets $\mathcal{H}^1 = \{H(T_1^1), \ldots, H(T_N^1)\}$
and $\mathcal{H}^2 = \{H(T_1^2), \ldots, H(T_M^2)\}$.
To match two image sequences, we start by comparing pairs of trajectories of type $(T_i^1, T_j^2)$, with
$1 \leq i \leq N$ and $1 \leq j \leq M$. For the sake of clarity,
we express the sequence of histograms with a more
compact style as $H(T_i^1) = [H^1_{i,1}, H^1_{i,2}, \ldots, H^1_{i,m_i-1}]$ and
$H(T_j^2) = [H^2_{j,1}, H^2_{j,2}, \ldots, H^2_{j,m_j-1}]$. Since the videos we
consider refer to repetitions of a given atomic action,
we consider the average similarity between all pairs
of histograms describing portions of the two trajectories. The similarity between two trajectories is thus
formalized as
$$S(T_i^1, T_j^2) = \frac{1}{(m_i - 1)(m_j - 1)} \sum_{k=1}^{m_i - 1} \sum_{h=1}^{m_j - 1} HI(H^1_{i,k}, H^2_{j,h}) \qquad (2)$$
where $HI$ denotes the intersection between histograms. Given a video, the observed trajectories describe the evolution of 3D points all related to the
same physical event, i.e. the motion of the user's arm.
Thus, it is convenient to summarize the contribution
of all the trajectories to end up with a single value
quantifying the global similarity between two videos,
i.e. two physical events. To this purpose, we select
the maximum similarity over all pairs of trajectories. More formally,
$$S(\mathcal{T}^1, \mathcal{T}^2) = \max_{i=1 \ldots N,\; j=1 \ldots M} S(T_i^1, T_j^2). \qquad (3)$$
Figure 3: Left: a space-time representation of a trajectory. Right: velocity magnitude (above) and curvature (below) as
functions of time. We denote local maxima of the velocity magnitude (red) and of the curvature (green). The latter correspond
to dynamic instants, relevant for the human perception of motion.
3 EXPERIMENTAL ANALYSIS
In the following we report on the experiments we performed to evaluate the level of view-invariant
information included in the intervals between dynamic instants, which we describe and compare according to the previous section. Our main objective
is to extract knowledge about properties of (classes
of) actions that might be captured across views with
this somewhat primitive representation. To this end
we performed a qualitative evaluation on a dataset we
collected in-house.
3.1 Experimental Setup
We acquired a set of image sequences of two subjects
observed from two different viewpoints. The acquisitions were made in an indoor environment to
favor the low-level analysis and thus allow a stronger
focus on the second step of motion representation and
matching. The variation of the viewpoint reflects the
application we have in mind, i.e. human-robot interaction, where we can assume the interacting subject
to be located in a limited radial spatial range in front
of the camera (i.e. the robot). Similarly, the actions
included in the data set (shown in Fig. 4) are suggested by the application. We considered 6 actions:
Pointing a finger towards a certain 3D location; Waving the hand from left to right and vice-versa; Lifting
an object from the table to a box placed on it; Throwing an object away (an action that we only simulated for
practical reasons); Transporting an object from and to
different positions on the table. The latter is instantiated in two versions, with left-right and random object
repositioning.
Each video consists of 20 repetitions of the same
atomic action (e.g. move the object from left to right);
for each subject we acquired two videos in each view
for each action.
3.2 Proof of Concepts
Spatio-temporal Curvature. After having extracted corner trajectories from each video, we
first detect the dynamic instants. Let us start our
analysis by providing experimental evidence of
the information carried by the dynamic instants in
the setting we consider. In Fig. 5 we show examples
of trajectories related to different actions observed
from the two viewpoints. On the same plot, we also
report the local maxima of velocity magnitude and
curvature. First, we may observe that the
dynamic instants are tolerant to view-point changes.
Furthermore, trajectories with diverse appearances in
the image plane (e.g. with different lengths or spatial
extents) present similar representations, showing that
the dynamic instants are also tolerant to variations
among the input data.
Figure 4: Samples from the acquisitions of a subject from a
single viewpoint: (a) Lifting, (b) Pointing, (c) Throwing, (d) Transporting left-right, (e) Transporting from and to random positions, (f) Waving.
Action Recognition. We consider different experimental configurations of increasing complexity:
• Same subject same view
• Different subject same view
• Matching across different views.
For each configuration, we take one test video at a time
and match it with videos of all the available actions, then rank the similarities we obtain. Such
rankings are reported in Tab. 1, 2 and 3, where actions are referred to as Li (Lifting), Po (Pointing),
Th (Throwing), T.LR (Transporting left-right), T.Rnd
(Transporting random), and Wa (Waving).
Table 1: Ranking from the comparison of videos of the
same subject and acquired from the same viewpoint.
Test      Ranking
          1st     2nd     3rd     4th     5th     6th
Lift      Li      T.LR    T.Rnd   Po      Th      Wa
Point     T.Rnd   Po      Th      T.LR    Wa      Li
Throw     Th      Po      T.Rnd   Li      Wa      T.LR
T. LR     T.LR    Li      T.Rnd   Th      Po      Wa
T. Rnd    T.Rnd   Po      T.LR    Wa      Th      Li
Wav       Wa      T.Rnd   T.LR    Po      Th      Li
Table 2: Ranking from the comparison of videos of different
subjects, acquired from the same viewpoint.
Test      Ranking
          1st     2nd     3rd     4th     5th     6th
Lift.     T.Rnd   T.LR    Li      Po      Wa      Th
Point.    Th      Po      T.LR    T.Rnd   Wa      Li
Throw.    Po      Th      Li      T.LR    Wa      T.Rnd
T. LR     T.Rnd   T.LR    Li      Th      Wa      Po
T. Rnd    Li      T.Rnd   T.LR    Po      Wa      Th
Wav.      Wa      Th      Po      T.Rnd   Li      T.LR
Table 3: Ranking from the comparison of videos
acquired from different viewpoints.
Test      Ranking
          1st     2nd     3rd     4th     5th     6th
Lift.     T.Rnd   T.LR    Li      Po      Wa      Th
Point.    Th      Li      T.LR    T.Rnd   Po      Wa
Throw.    Po      Th      T.Rnd   T.LR    Li      Wa
T. LR     T.Rnd   T.LR    Li      Th      Wa      Po
T. Rnd    T.LR    T.Rnd   Li      Po      Wa      Th
Wav.      Wa      Th      Po      T.Rnd   Li      T.LR
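The matching of Sec. 2.3 (Eqs. 2 and 3) and the ranking procedure used to build the tables can be sketched as follows. This is a minimal illustration under stated assumptions: histograms are normalized to unit sum, a video is represented as a list of trajectories and a trajectory as a list of interval histograms, the gallery holds one video per action, and all numeric values and function names are made up for the example.

```python
def histogram_intersection(h1, h2):
    """HI of two normalized histograms: sum of bin-wise minima, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def trajectory_similarity(hists_i, hists_j):
    """Eq. 2: average HI over all pairs of interval histograms of two trajectories."""
    total = sum(histogram_intersection(hk, hh) for hk in hists_i for hh in hists_j)
    return total / (len(hists_i) * len(hists_j))


def video_similarity(video_1, video_2):
    """Eq. 3: best trajectory-to-trajectory similarity between two videos."""
    return max(trajectory_similarity(ti, tj) for ti in video_1 for tj in video_2)


def rank_actions(test_video, gallery):
    """Sort gallery action labels by decreasing similarity to the test video."""
    scores = {label: video_similarity(test_video, video) for label, video in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (made up): a video is a list of trajectories, a trajectory a list of 2-bin histograms.
gallery = {
    "Wa": [[[0.9, 0.1], [0.8, 0.2]]],
    "Li": [[[0.5, 0.5]], [[0.4, 0.6], [0.6, 0.4]]],
    "Po": [[[0.1, 0.9]]],
}
test_video = [[[0.85, 0.15], [0.9, 0.1]]]
ranking = rank_actions(test_video, gallery)
```

Each row of the ranking tables corresponds to the output of such a ranking for one test action, with the most similar action in the first position.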
It is apparent how the increasing complexity of the
configurations is reflected in the ranking results.
While the matching performs accurately when considering videos of the same subject and from the same
viewpoint, the results already degrade when the comparisons involve different subjects, even if observed
from the same viewpoint. This is a consequence of
the subjectivity of the movements, which causes the presence of
possibly subtle properties in the motion that fail to be
captured by the basic representation we adopted.
Figure 5: Examples of dynamic instants extracted from two views and considering different actions: Lifting (Fig. 5(a) and
5(b)) and Waving (Fig. 5(c) and 5(d)). Local extrema of velocity magnitude (red) and curvature (green) are marked.
Actions Type Affinity. Interestingly, the comparison of the rankings of Tab.
2 and 3 suggests the presence of some equivalence
classes between actions. To clarify this point, we perform an analysis of the overall similarities between
actions and visualize the results in the similarity matrix of Fig. 6. Some remarks are in order. The first
is that the computed similarities are high on average
(all above 0.9), which reflects the complexity of
the problem. Second, some affinities between actions
can be inferred. Waving appears as the most distinctive action among the considered set. The Transporting actions (both versions) are highly similar to each
other. Moreover, they show affinities with Lifting.
Pointing seems to share properties with Transporting random (probably because of the variability in the direction of the movements), but also with Throwing. The latter,
not being influenced by the forces caused by the manipulation of objects, is thus more related to an intransitive
action type. We continue the analysis by
thresholding the similarity matrix with respect to the
lowest value on the diagonal (i.e. the lowest similarity between two actions that must be similar). After that, we inferred the graph-like structure in Fig.
7, where the surviving elements, which can be interpreted as affinities between the involved actions, are
represented as arrows. It is easy to observe how the
transitive actions (in green) form a cluster, which is
connected to an intransitive action (red), probably due
to the affinities between the two dynamics. Pointing
is also related to Throwing, which as mentioned above
can be considered (in our instance) as an intransitive
action.
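The thresholding step that leads to the affinity graph can be sketched as follows. The similarity values below are invented for illustration and are not the ones of Fig. 6; keeping the entries greater than or equal to the threshold and treating the graph as directed are assumptions, and the function name `affinity_edges` is hypothetical.

```python
def affinity_edges(labels, sim):
    """Keep off-diagonal entries that reach the lowest diagonal similarity."""
    n = len(labels)
    # The threshold is the lowest similarity between videos of the same action.
    threshold = min(sim[i][i] for i in range(n))
    edges = [(labels[i], labels[j])
             for i in range(n) for j in range(n)
             if i != j and sim[i][j] >= threshold]
    return threshold, edges


# Made-up similarity values for three hypothetical actions (not the paper's data).
labels = ["A", "B", "C"]
sim = [[0.99, 0.97, 0.92],
       [0.96, 0.98, 0.93],
       [0.91, 0.94, 0.95]]
threshold, edges = affinity_edges(labels, sim)
```

In this toy case only the pair A and B survives in both directions, so the two actions would be linked in the graph while C would remain isolated.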
Figure 6: Average mutual similarity.
Figure 7: A visual sketch of the action affinities inferred
by means of the analysis (in green transitive actions, in red
intransitive actions). Throwing has both colors since in our
instance we did not actually manipulate any object.
4 CONCLUSIONS
This work considered the problem of extracting
knowledge about affinities between (classes of) actions across different views, with specific reference
to the context of human-robot interaction. Starting
from investigations about human motion perception
and inspired by the seminal work of (Rao et al., 2002),
we considered a coarse motion description based on
the use of histograms of curvature, which we used to
match pairs of videos with a multi-level approach. We
experimentally inferred a set of semantic connections
that characterize groups of actions across views.
Our observations, made with computational tools,
confirm what has been observed directly in infants: they develop rather early the ability to grasp some aspects
of the meaning of actions, while it is likely that the capability of interpreting more specific properties of actions
is developed later. This sets the scene for replicating on an artificial system (a robot in our case)
the early stages of human development, in which the interpretation of the observed motion is progressively refined as the
perceptual capabilities strengthen. In general, an interactive robot
needs to be able to autonomously understand where to
focus its attention, for instance by perceiving the presence of motion, and of biological motion in particular,
as the effect of a potential interacting agent. Starting
from that, already by recognizing the class of actions
of the partner (e.g. some kind of object manipulation)
the robot could focus on the most relevant properties
of the event (e.g. the manipulated object or the effects on the context). Finally, the understanding of
the specific action and of its goal may guide the robot
to the selection of an appropriate reaction (e.g. being prepared to receive an object from the user). Following this line, our future investigations will
be devoted to the development of a multi-level system for action recognition, in which the complexity
of the computational model reflects the complexity of
the task.
From the point of view of the vision task, future
investigations will also be devoted to the design of
models able to cope with scenes of different complexity
(e.g. the presence of more than one moving
agent).
REFERENCES
Aggarwal, J. and Ryoo, M. (2011). Human activity analysis: A review. ACM Computing Surveys.
Camaioni, L. (2004). The role of declarative pointing in
developing a theory of mind. Infancy, 5:291–308.
Fanello, S. R., Gori, I., Metta, G., and Odone, F. (2013).
Keep it simple and sparse: Real-time action recognition. J. Mach. Learn. Res., 14(1):2617–2640.
Farah, M. J., Wilson, K. D., Drain, H. M., and Tanaka, J. R.
(1995). The inverted face inversion effect in prosopagnosia: Evidence for mandatory, face-specific perceptual mechanisms. Vision Research, 35(14):2089–2093.
Gong, D. and Medioni, G. (2011). Dynamic manifold warping for view invariant action recognition. ICCV, pages
571–578.
Goren, C. C., Sarty, M., and Wu, P. Y. K. (1975). Visual following and pattern discrimination of face-like stimuli
by newborn infants. Pediatrics, 56(4):544–549.
Huang, C.-H., Yeh, Y.-R., and Wang, Y.-C. (2012a). Recognizing actions across cameras by exploring the correlated subspace. In ECCV, volume 7583 of Lecture
Notes in Computer Science, pages 342–351.
Huang, K., Zhang, Y., and Tan, T. (2012b). A discriminative model of motion and cross ratio for view-invariant
action recognition. IEEE Transactions on Image Processing, 21(4):2187–2197.
Junejo, I. N., Dexter, E., Laptev, I., and Pérez, P. (2011).
View-independent action recognition from temporal
self-similarities. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):172–185.
Kanakogi, Y. and Itakura, S. (2011). Developmental correspondence between action prediction and motor ability in early infancy. Nat. Commun., 2:341.
Kuhlmeier, V. A., Troje, N. F., and Lee, V. (2010). Young
infants detect the direction of biological motion in
point-light displays. Infancy, 15(1):83–93.
Lewandowski, M., Makris, D., and Nebel, J.-C. (2010).
View and style-independent action manifolds for human activity recognition. In ECCV, volume 6316 of
Lecture Notes in Computer Science, pages 547–560.
Li, B., Camps, O. I., and Sznaier, M. (2012). Cross-view
activity recognition using hankelets. In CVPR, pages
1362–1369.
Li, R. and Zickler, T. (2012). Discriminative virtual views
for cross-view action recognition. In CVPR, pages
2855–2862.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110.
Mahbub, U., Imtiaz, H., Roy, T., Rahman, S., and Ahad, A.
(2011). Action recognition from one example. Pattern
Recognition Letters.
Malgireddy, M. R., Inwogu, I., and Govindaraju, V. (2012).
A temporal bayesian model for classifying, detecting and localizing activities in video sequences. In
CVPRW.
Metta, G., Sandini, G., Vernon, D., Natale, L., and Nori, F.
(2008). The icub humanoid robot: An open platform
for research in embodied cognition. In Proceedings of
the 8th Workshop on Performance Metrics for Intelligent Systems, PerMIS ’08, pages 50–56.
Noceti, N. and Odone, F. (2012). Learning common behaviors from large sets of unlabeled temporal series.
Image and Vision Computing, 30(11):875 – 895.
Perona, P. and Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. PAMI, 12(7):629–
639.
Rao, C., Yilmaz, A., and Shah, M. (2002). View-invariant
representation and recognition of actions. IJCV,
50(2):203–226.
Sciutti, A., Noceti, N., Rea, F., Odone, F., Verri, A., and
Sandini, G. (2014). The informative content of optical
flow features of biological motion. In ECVP.
Shi, J. and Tomasi, C. (1994). Good features to track. In
CVPR, pages 593 – 600.
Simion, F., Regolin, L., and Bulf, H. (2008). A predisposition for biological motion in the newborn baby.
Proceedings of the National Academy of Sciences,
105(2):809–813.
Troje, N. F. and Westhoff, C. (2006). The inversion effect
in biological motion perception: Evidence for a life
detector? Current Biology, 16(8):821 – 824.
Wang, X., Ma, X., and Grimson, W. (2009). Unsupervised
activity perception in crowded and complicated scenes
using hierarchical bayesian models. IEEE transactions on pattern analysis and machine intelligence,
31(3):539–555.
Welch, G. and Bishop, G. (1995). An introduction to the
kalman filter. Technical report.
Wu, X. and Jia, Y. (2012). View-invariant action recognition
using latent kernelized structural svm. In ECCV 2012,
volume 7576 of Lecture Notes in Computer Science,
pages 411–424.
Zheng, J. and Jiang, Z. (2013). Learning view-invariant
sparse representations for cross-view action recognition. In ICCV, pages 3176–3183.
Zheng, J., Jiang, Z., Phillips, P. J., and Chellappa, R. (2012).
Cross-view action recognition via a transferable dictionary pair. In British Machine Vision Conference,
pages 1–11.
Zivkovic, Z. (2004). Improved adaptive gaussian mixture
model for background subtraction. In ICPR, volume 2,
pages 28–31.