View-Invariant Action Recognition

Yogesh Rawat¹, CRCV, University of Central Florida, Orlando, Florida, USA.
Shruti Vyas, CRCV, University of Central Florida, Orlando, Florida, USA.
¹ corresponding author

arXiv:2009.00638v1 [cs.CV] 1 Sep 2020

Synonyms

– Cross-view action recognition
– View-invariant action classification
– View-invariant activity recognition

Related Concepts

– View-invariance
– Action recognition
– Activity classification

Definition

Recognizing human actions from previously seen viewpoints is relatively easy compared with recognizing them from unseen viewpoints. View-invariant action recognition aims at recognizing human actions from unseen viewpoints.

Background

Human action recognition is an important problem in computer vision. It has a wide range of applications in surveillance, human-computer interaction, augmented reality, and video indexing and retrieval. The varying pattern of spatio-temporal appearance generated by a human action is key to identifying the performed action. A large body of research explores these spatio-temporal dynamics to learn visual representations of human actions. However, most research in action recognition focuses on a few common viewpoints [1], and these approaches do not perform well when there is a change in viewpoint. Human actions are performed in a 3-dimensional environment and are projected to a 2-dimensional space when captured as video from a given viewpoint. Therefore, an action will have a different spatio-temporal appearance from each viewpoint. As shown in Figure 1, observation o1 is different from observation o2, and so on. Research in view-invariant action recognition addresses this problem and focuses on recognizing human actions from unseen viewpoints.

Fig. 1: An action captured from different viewpoints (v1, v2, and v3) providing different observations (o1, o2, and o3) [15].

There are different data modalities which can be used for view-invariant representation learning and action recognition. These include RGB videos, skeleton sequences, depth information, and optical flow. Skeleton sequences and depth information require additional sensors and are comparatively more difficult to capture than RGB videos. Similarly, optical flow is computationally expensive and requires extra processing of the RGB videos. These modalities can be used independently as well as in combination to solve the problem of view-invariant action recognition. Figure 2 shows sample instances of an activity performed by an actor, captured from three different viewpoints in three different modalities.

Human action recognition from video sequences involves extracting visual features and encoding the performed action in a meaningful representation which can be used for interpretation. The view-invariant encoding of actions involves many challenges, and there are different ways to address them. One possible solution is to track the motion as it evolves with the performed action. The track itself will be invariant to any change in viewpoint and can be used to extract a view-invariant representation of actions. Another approach is to analyze the spatio-temporal volume covered by a human while performing an action. It is interesting to observe that such spatio-temporal volumes have similarities for the same action, and they can be useful for handling changes in viewpoint.
The tracking of the human body as a whole can be useful to some extent, but human joints move independently while performing most activities. Therefore, tracking skeleton joints independently is very important for understanding human actions. The availability of large-scale training data has also enabled us to learn view-invariant representations using deep learning. We cover the details of these approaches in the following sections.

View-invariance in human actions

Fig. 2: Video frames showing an action in different modalities as seen from three different viewpoints [12]. Row 1: RGB, Row 2: skeleton, and Row 3: depth.

The dynamics of body parts and the change in appearance play an important role in understanding human actions. These two properties can be used effectively to determine human actions in a video stream if there is no change in viewpoint. However, the task becomes challenging when the viewpoint changes, as both the dynamics and the appearance change with it significantly. Therefore, it is important to represent human actions in such a way that the representation is invariant to any change in viewpoint. The idea is to encode the human action with a representation which does not change when the viewpoint changes. We can observe in Figure 2 how the appearance of the activity changes when seen from different viewpoints. No matter how the action is captured, this variation is present in all the modalities, including RGB, skeleton, and depth.

0.1 Motion trajectories

An action sequence performed by any person has a motion trajectory, which differs from action to action. These motion trajectories can be useful for extracting a rich view-invariant representation for action classification. Actions are performed in a 3D environment, and therefore the corresponding motion trajectories lie in 3D space. The changes in speed and direction along the trajectory play an important role in inferring the performed action.

The continuities and discontinuities in position, velocity, and acceleration of a 3D trajectory are preserved in the 2D trajectory under a continuous projection [11]. Therefore, we can use the 2D spatio-temporal curvature of these 3D motion trajectories to represent actions, which captures the changes in speed as well as direction. A spatio-temporal curvature can be represented using instants, which segment the motion trajectory into intervals. Instants indicate a significant change in the speed or direction of the motion trajectory and define motion boundaries [11]. A special class of motion boundary, which is independent of starts and stops, is the dynamic instant that occurs while performing an action. A dynamic instant represents a significant change in the motion characteristics and occurs only for one frame. It provides motion boundaries, called intervals, which represent the time periods between two dynamic instants when the motion characteristics are not changing.

In Figure 3a, we can observe a sample video frame for the activity 'opening a cabinet'. If we track the hand motion of the person while performing this action, we obtain the spatio-temporal curvature shown in Figure 3b, which shows the instants as well as the corresponding intervals in the motion trajectory. Figure 3c shows the spatio-temporal curvature values along with the detected dynamic instants in the motion trajectory.
Fig. 3: (a) A sample video frame showing the 'opening a cabinet' action, with the hand trajectory superimposed in white; (b) a representation of the trajectory in terms of instants and intervals; and (c) the corresponding spatio-temporal curvature values and detected maxima (dynamic instants) [11].

View-invariance: The discontinuities in a 3D motion trajectory, which we perceive as instants, are always projected as discontinuities in the 2D curvature [11]. These instants, which are maxima of the spatio-temporal curvature, are view-invariant in most scenarios. Therefore, the number of instants in a spatio-temporal curvature is an important characteristic which is view-invariant. The only exception is the case when the viewing direction is perfectly aligned with the plane in which the action is performed. In this case, the position of the trajectory in consecutive frames is projected to the same location in the frame, resulting in a 2D trajectory which is essentially a single point.

Another important characteristic of instants in the spatio-temporal curvature is the change in direction. This is known as the sign characteristic, and it defines the direction of turns in the action. It is very useful for distinguishing different actions captured from varying viewpoints. A clockwise turn is represented by a '+' sign and a counter-clockwise turn by a '-' sign. These two characteristics of instants, the number of instants and the sign of instants, provide a view-invariant representation of actions.

This representation can be used to determine whether two spatio-temporal curvatures belong to the same action class. The first requirement is a match between the number of instants and the sign sequence. Thereafter, we can compute a view-invariant similarity measure between the two trajectories. This can be performed using affine epipolar geometry, which ensures viewpoint invariance in the similarity measure [11]. However, there are some limitations to this approach. It requires an exact match between corresponding instants in the spatio-temporal curvatures, which can be difficult as there can be false or missed instant detections. Also, this approach does not take into account the temporal information between the instants. These issues can be addressed using a view-invariant dynamic time warping to measure the similarity between two spatio-temporal curvatures [9]. The view-invariant time warping not only suppresses instant outliers, it also compensates for variation in execution style [10]: it can shrink slow-motion trajectories, which are longer along the temporal axis, and expand fast-motion trajectories, which are relatively shorter.

Limitations: The idea of spatio-temporal curvature is simple yet effective for extracting a view-invariant representation of actions. However, there are certain limitations to this approach. The assumption that an action can be represented by the trajectory of a single point in the video frame does not hold for actions where the full body is involved. The skeleton joints of a human body have different motions for the same performed action. Therefore, this approach is limited to actions which can be approximated by the trajectory of a single point at any time during the motion.
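To make the instant-detection step above concrete, the following is a minimal sketch (an illustration only, not the implementation of [11]): it computes the spatio-temporal curvature of a tracked 2D point trajectory, treats local curvature maxima as candidate dynamic instants, and assigns each instant a turn sign. The function names and the prominence threshold are hypothetical choices.

```python
import numpy as np
from scipy.signal import find_peaks

def spatio_temporal_curvature(x, y):
    """Curvature of the space-time curve (x(t), y(t), t) of a tracked point,
    kappa = |r' x r''| / |r'|^3, with t taken as the frame index."""
    t = np.arange(len(x), dtype=float)
    x1, y1, t1 = np.gradient(x), np.gradient(y), np.gradient(t)
    x2, y2, t2 = np.gradient(x1), np.gradient(y1), np.gradient(t1)
    cross = np.stack([y1 * t2 - t1 * y2,      # components of r' x r''
                      t1 * x2 - x1 * t2,
                      x1 * y2 - y1 * x2])
    num = np.linalg.norm(cross, axis=0)
    den = (x1 ** 2 + y1 ** 2 + t1 ** 2) ** 1.5
    return num / np.maximum(den, 1e-8)

def dynamic_instants(x, y, prominence=0.05):
    """Candidate dynamic instants are local maxima of the curvature; the
    '+'/'-' turn sign depends on the image coordinate convention."""
    kappa = spatio_temporal_curvature(x, y)
    peaks, _ = find_peaks(kappa, prominence=prominence)
    x1, y1 = np.gradient(x), np.gradient(y)
    x2, y2 = np.gradient(x1), np.gradient(y1)
    turn = x1 * y2 - y1 * x2              # signed planar cross product
    signs = ['+' if turn[i] < 0 else '-' for i in peaks]
    return peaks, signs
```

Two trajectories would then be compared only if their numbers of instants and sign sequences match, after which the epipolar or dynamic-time-warping similarity described above can be applied.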
0.2 Tracking joints

A single-point motion track of a human action cannot be generalized to actions where multiple body joints are involved; it carries only motion information and ignores any shape or relative spatial information. Therefore, it is important to consider the multiple human joints involved in the action while tracking the motion. The movements of different joints while performing an action are not independent of each other. The human body has certain anthropometric proportions, and there exist geometric constraints between multiple anatomical landmarks such as body joints. This allows us to analyze human actions performed by different people using the semantic correspondence between human bodies.

The pose X̂ of an actor while performing an action can be represented as a set of points in any given frame of a video. Here the pose is defined as X̂ = {X1, X2, ..., Xn}, where Xi = (xi, yi, zi, Λ)T are homogeneous coordinates and there are n such landmarks. Each point represents the spatial coordinate of an anatomical landmark on the human body, as shown in Figure 4. Body size and proportions vary greatly between different people and age groups. However, even though the human dimensional variability is substantial, it is not arbitrary [2]. Therefore, geometric constraints can be imposed between the poses of two different actors performing the same action. The relation between two sets of points describing the poses of two different actors can be captured by a projective transformation. Suppose we have sets of points X̂ and Ŷ describing actors A1 and A2. Then the relationship between these two sets can be described by a matrix M such that Xi = M Yi, where i = 1, 2, ..., n and M is a 4×4 non-singular matrix. The transformation simultaneously captures the different pose of each actor as well as the difference in size and proportions of the two actors.

There are two types of constraints which are useful for action recognition: postural constraints and action constraints [2]. The postures of two actors performing the same action at any time instant should be the same. This constraint allows us to recognize the action at each time instant by measuring the similarity between the postures. Along with this frame-wise constraint, a global constraint can also be placed on the point sets describing two actors if they are performing the same action.

Fig. 4: A sample video frame showing a pose and the corresponding point-based pose representation [2].

The anthropometric constraints in a human body allow the transformation of pose between two actors, and utilizing the postural and action constraints can help in recognizing an action by measuring the similarity between two sets of points. The transformation and geometric constraints address the issue of view-invariance in actions. However, this approach assumes the temporal alignment of poses for each frame in the video. This can be problematic as different actors have their own styles of executing an action. Therefore, a temporal alignment is also required which is invariant to temporal transformations. This can be done using dynamic time warping, which is particularly suited to action recognition as different actors may be expected to perform portions of an action at varying rates [2].

The frame-wise representation of an action decouples pose from its motion along the temporal domain. This approach can be effective to some level, but it ignores the temporal information, focusing only on the order of poses, which limits its potential.
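As a rough illustration of the postural constraint, the sketch below (hypothetical helper names; a plain least-squares stand-in rather than the constrained formulation of [2]) fits a 4×4 transformation M between two actors' landmark sets given in homogeneous coordinates and uses the fitting residual as a frame-wise posture dissimilarity. Dynamic time warping over these frame-wise distances would then provide the temporal alignment discussed above.

```python
import numpy as np

def fit_pose_transform(X, Y):
    """Least-squares 4x4 matrix M with X_i ~ M Y_i, where X and Y are
    (n, 4) arrays of homogeneous landmark coordinates of two actors."""
    Mt, *_ = np.linalg.lstsq(Y, X, rcond=None)  # solves Y @ M.T ~ X
    return Mt.T

def posture_distance(X, Y):
    """Residual of the best-fitting transform; a small value suggests the
    two actors are in (approximately) the same posture."""
    M = fit_pose_transform(X, Y)
    return float(np.sqrt(((X - Y @ M.T) ** 2).mean()))
```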
We can model an action as a spatio-temporal construct, since it is a function of time as well. It can be represented as a set of points Â = {X1, X2, ..., Xp}, where Xi = (xi, yi, zi, ti)T are spatio-temporal coordinates and p = mn, where we have m landmarks and n recorded postures for that action [14]. A sample 'walking' action is shown in Figure 5 in both xyz and xyt space at different time steps.

Fig. 5: Representation of an action in xyzt 4D space: (a) the action in xyz space and (b) the action in xyt space [14].

The variability associated with the execution of an action can be closely approximated by a linear combination of action bases in the joint spatio-temporal space. An instance of an action can be defined as a linear combination of a set of action bases A1, A2, ..., Ak. Therefore, any instance of an action can be expressed as A′ = a1 A1 + a2 A2 + ... + ak Ak, where ai ∈ R is the coefficient associated with the action basis Ai ∈ R^(4×p). The space of an action, A, is the span of all its action bases. The variance captured by the action bases can include different execution rates of the same action and different individual styles of performance, as well as the anthropometric transformations of different actors. These action bases can be used to project a 4D pose point to its image (xyt) using a space-time projection matrix [14]. These projections are useful in forming an action representation which can further be used for recognition of new instances.

All the joints in the pose and all the time steps in a trajectory may not be required to identify the action category. We can select the joints dynamically based on the action and also choose fewer poses along the temporal domain, known as canonical poses [6]. These empirically selected joints and canonical poses provide view-invariant trajectories called Invariance Space Trajectories (IST). These trajectories are also invariant representations of human actions and are useful for action recognition.
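To illustrate the action-basis idea, the following sketch (hypothetical names; it assumes each action instance and basis has already been flattened into a vector of its 4D points and glosses over the alignment and projection details of [14]) recovers the coefficients ai of a new instance by least squares against each class's bases and scores the class by the reconstruction error.

```python
import numpy as np

def basis_coefficients(instance, bases):
    """Least-squares coefficients a with instance ~ sum_i a[i] * bases[i].
    `instance` is a flattened (4*p,) vector of spatio-temporal points and
    `bases` is a (k, 4*p) array of flattened action bases A_1 ... A_k."""
    a, *_ = np.linalg.lstsq(bases.T, instance, rcond=None)
    return a

def reconstruction_error(instance, bases):
    """Distance of the instance from the span of the action bases."""
    a = basis_coefficients(instance, bases)
    return float(np.linalg.norm(instance - bases.T @ a))

def classify(instance, bases_per_class):
    """bases_per_class maps an action label to its (k, 4*p) basis array;
    the label with the smallest reconstruction error is predicted."""
    return min(bases_per_class,
               key=lambda label: reconstruction_error(instance, bases_per_class[label]))
```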
0.3 Spatio-temporal volume

The tracking of joints as motion trajectories can be effective for action representation. However, as discussed earlier, not all joints may be useful for recognizing the performed action. An alternative to joints is the contour of the actor, which also captures the complete shape information. Tracking actor contours takes into account both the shape and the motion of the actor performing the action. When an actor performs an action in a 3D environment, the points on the outer boundary of the actor are projected as a 2D (x, y) contour in the image plane. A sequence of 2D contours over time generates a Spatio-Temporal Volume (STV), which is a 3D object in (x, y, t) space [19]. The object contours in two consecutive frames can be tracked by finding a point correspondence between them. This can be done using a graph-theoretic approach by maximizing the match of a weighted bipartite graph. In Figure 6, we can observe some samples of generated spatio-temporal volumes for various actions.

Fig. 6: Generated Spatio-Temporal Volumes (STVs) for some sample actions: (a) falling, (b) tennis stroke, (c) walking, and (d) dancing. The color-coded action descriptors are also shown, corresponding to ridges (yellow), saddle ridges (white), peaks (red), valleys (pink), and saddle valleys (green) [19].

An STV can be considered a manifold, and it will be nearly flat at small scales defined by a small neighborhood around a point. Therefore, it can be represented as a continuous action volume B by computing plane equations in the neighborhood around a point. We define it using a 2D parametric representation considering the arc length of the contour (s), which encodes the object shape, and time (t), which encodes the motion. The action volume is defined as B = f(s, t) = [x(s, t), y(s, t), t]. The parameters s and t can be used to generate trajectories and contours for any point on the object boundary. This representation is important for computing action descriptors corresponding to changes in direction, speed, and shape of parts of the contours [19]. This can be done by analyzing the surface type of a point on the STV using the Gaussian curvature and the mean curvature. These surface types are important action descriptors, and for any given action they are collectively called the action sketch. In Figure 6, several of these action descriptors are superimposed on the STVs for various actions. The underlying curves of the contour and the point trajectory for each action descriptor in the action sketch will have a maximum or a minimum. It can be proved that the maxima/minima of the contour and the trajectory do not change with a change in camera viewpoint. Therefore, this representation is also invariant to any change in viewpoint.

Fig. 7: Illustration of constructing the 4D-AFM. The first row shows videos from four different views. The second row shows the STVs extracted from the videos and the locations of the spatio-temporal action features on the surface of the STVs. These action features are mapped to the 4D action shape as shown in the third row [18].

The 3D volume of the STV, along with the action descriptors derived from this volume, can be used for action recognition. The relation between two STVs is defined as xT F x′ = 0, where x and x′ are points on two different action sketches and F is a 3×3 fundamental matrix defining this relation. This relation estimates whether two different action sketches belong to the same action class and is a useful property for action recognition. The 3D volume can further be utilized to build a 4D action feature model (4D-AFM) [18]. This model elegantly encodes the shape and motion of actors observed from multiple views; a sample illustration is shown in Figure 7. The model enhances the view-invariant robustness of this approach, as features from multiple views are mapped to a unified model. Action recognition can then be performed by scoring the matches of action features from an action video to the model points, exploiting pairwise interactions of features.
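The surface-type labelling behind the action sketch can be illustrated with the standard sign test on the Gaussian curvature K and the mean curvature H. The sketch below is a generic illustration (not the pipeline of [19]); it assumes, for simplicity, that a local patch of the STV is given as a height field z = f(s, t) sampled on a regular grid.

```python
import numpy as np

# Besl-Jain style surface labels from the signs of (K, H)
SURFACE_TYPES = {
    (1, -1): 'peak',          (1, 1): 'pit',
    (0, -1): 'ridge',         (0, 1): 'valley',         (0, 0): 'flat',
    (-1, -1): 'saddle ridge', (-1, 1): 'saddle valley', (-1, 0): 'minimal',
}

def hk_surface_types(z, eps=1e-6):
    """Per-point surface labels for a patch z = f(s, t) on a regular grid."""
    fs, ft = np.gradient(z)               # first derivatives along s and t
    fss, fst = np.gradient(fs)
    _, ftt = np.gradient(ft)
    g = 1.0 + fs ** 2 + ft ** 2
    K = (fss * ftt - fst ** 2) / g ** 2                          # Gaussian curvature
    H = ((1 + ft ** 2) * fss - 2 * fs * ft * fst
         + (1 + fs ** 2) * ftt) / (2 * g ** 1.5)                 # mean curvature
    sgn = lambda v: np.sign(np.where(np.abs(v) < eps, 0.0, v)).astype(int)
    label = np.vectorize(lambda k, h: SURFACE_TYPES.get((k, h), 'undefined'),
                         otypes=[object])
    return label(sgn(K), sgn(H))
```

Points labelled as peaks, ridges, valleys, and saddles correspond to the color-coded descriptors shown in Figure 6.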
0.4 Learning based methods

The classical approaches focused on finding good features and utilized simple matching-based techniques for action recognition. In learning based methods, feature extraction and action recognition are performed jointly. The availability of multi-view activity datasets, such as IXMAS [17], UWA3D Multiview Activity II [7], Northwestern-UCLA [16], and NTU RGB-D [12], has enabled the development of robust learning based methods for view-invariant feature learning. These features can be learned from different types of modalities, such as RGB [4,15], skeleton [12,8], depth [3,15], and optical flow [3].

We have seen earlier how motion trajectories are effective for extracting view-invariant descriptions. Apart from point-based and joint-based trajectories, we can also extract dense trajectories from any action sequence. These trajectories provide dense points from each frame and track them using displacement information from a dense optical flow field. Given a dense trajectory of length L, a sequence S is formed from the displacement vectors ∆Pt = (Pt+1 − Pt) = (xt+1 − xt, yt+1 − yt). The sequence is defined as the series of displacement vectors (∆Pt, ..., ∆Pt+L−1), normalized by the sum of their magnitudes, Σi=t..t+L−1 ||∆Pi||. Any action sequence can be represented using this dense motion trajectory description.

These representations can be used to learn action bases if we have a limited number of samples; we discussed earlier how such action bases can be useful for view-invariant action recognition. However, if we have a large number of samples, computing action bases becomes computationally expensive. An alternative is to cluster these action descriptions and use the cluster centers as action representatives [8]. These cluster centers can further be used for describing new action trajectories using a bag-of-words approach. If we have a sufficient number of samples from multiple views to compute these cluster centers, then we can train a fully connected neural network to perform a non-linear transformation and learn a view-invariant feature representation. The learned view-invariant features can then be used to perform robust action classification on novel and unseen views.
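A minimal sketch of this trajectory-based pipeline follows (hypothetical helper names; k-means via scikit-learn is one possible clustering choice, not necessarily the one used in [8]): normalize each trajectory's displacement sequence, cluster the training descriptors into a codebook, and describe every video as a histogram over the cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def trajectory_descriptor(points):
    """Normalized displacement sequence for one dense trajectory;
    `points` is an (L+1, 2) array of tracked (x, y) positions."""
    disp = np.diff(points, axis=0)                  # Delta P_t = P_{t+1} - P_t
    total = np.linalg.norm(disp, axis=1).sum()      # sum of displacement magnitudes
    return (disp / max(total, 1e-8)).ravel()        # flattened, shape (2L,)

def build_codebook(train_descriptors, k=2000, seed=0):
    """Cluster centers act as the vocabulary of trajectory 'words'."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(np.asarray(train_descriptors))

def bag_of_words(video_descriptors, codebook):
    """Histogram of cluster assignments over all trajectories of one video."""
    words = codebook.predict(np.asarray(video_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

These histograms (or the descriptors themselves) can then be fed to the fully connected network mentioned above to learn the view-invariant mapping.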
Multi-task learning: Dense trajectories can be very effective for learning view-invariant representations for action recognition. However, they require additional computation time, which is not suitable for low-latency systems. Convolutional neural networks provide an efficient way to learn meaningful representations directly from the input video. These networks can be very effective at classifying actions in videos with previously seen views. However, they fail to generalize well to unseen views which are not present in the training data, owing to the absence of view-invariance in the learned representation. This can be addressed by a multi-task approach in which the representation is also utilized for some other task that forces the network to learn a view-invariant representation. This task could be the prediction of optical flow from an unseen view [3] or cross-view video synthesis [15]. Both of these approaches enable the network to learn view-invariant feature representations and perform well on action recognition for unseen views.
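As a schematic of such a multi-task setup (a toy sketch only; the encoders, decoders, and losses used in [3,15] are considerably more elaborate), a shared video encoder can feed both an action classifier and a decoder that reconstructs a frame as seen from another viewpoint:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskActionNet(nn.Module):
    """Shared 3D-CNN encoder with two heads: action classification and
    cross-view frame synthesis, the auxiliary task that encourages a
    view-invariant representation."""
    def __init__(self, num_classes=60, feat_dim=256, frame_size=56):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.decoder = nn.Sequential(   # predicts one RGB frame from the target view
            nn.Linear(feat_dim, 3 * frame_size * frame_size),
            nn.Unflatten(1, (3, frame_size, frame_size)),
            nn.Sigmoid(),
        )

    def forward(self, clip):
        """clip: (batch, 3, T, H, W) video observed from the source view."""
        z = self.encoder(clip)
        return self.classifier(z), self.decoder(z)

def multitask_loss(logits, frame_pred, labels, target_view_frame, alpha=0.5):
    """Joint objective: classification loss + cross-view reconstruction loss."""
    return F.cross_entropy(logits, labels) + alpha * F.mse_loss(frame_pred, target_view_frame)
```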
Datasets and Experimental Results

The availability of public datasets for multi-view action recognition enables the research community to benchmark progress. There are four main public datasets which are widely used and provide action sequences captured from multiple viewpoints.

IXMAS: The INRIA Xmas Motion Acquisition Sequences (IXMAS) dataset [17] has videos for 11 action classes, with each action performed 3 times by 10 actors. All actions were captured using five camera views.

UWA3D Multiview Activity II: This dataset [7] contains RGB videos of 30 human activities performed by 10 subjects at different scales. Each subject performed all the actions 4 times. Depth and skeleton data captured using Kinect is also available with this dataset.

Northwestern-UCLA: The Northwestern-UCLA Multiview 3D event dataset [16] contains RGB, depth, and human skeleton data captured simultaneously by three Kinect cameras. This dataset has 10 action categories, and each action is performed by 10 actors. There are a total of 1493 action sequences. There are two main data splits, cross-subject (CS) and cross-view (CV), as suggested by [16]. View-invariant action recognition is examined with the CV split, where videos from the first two views are used for training and the third view is used for testing.

NTU RGB-D: This is the largest dataset for view-invariant human action recognition [12]. Along with RGB videos, depth and skeleton data are also available. It contains more than 56K videos and 4 million frames with 60 different action classes. There are a total of 40 different actors, who perform actions captured from 80 different viewpoints.

The Northwestern-UCLA and NTU RGB-D datasets provide multiple modalities, such as RGB, depth, and skeleton, along with multiple views, for view-invariant action recognition. The performance of a method is evaluated by computing an accuracy score based on whether it predicts the correct action class or not. In Table 1 and Table 2, we can observe the performance of different methods using different modalities on the Northwestern-UCLA and NTU RGB-D datasets. The cross-subject (CS) evaluation measures the performance on unseen subjects, and the cross-view (CV) evaluation measures the performance on unseen views. The cross-subject performance is better than the cross-view performance for most of the methods. This is mainly because an action varies more when seen from a different viewpoint than when performed by an unseen person, where the variation is mostly in appearance. Therefore, methods which focus on motion tend to perform better than those which focus more on the appearance of the actor.

Method              Modality   CS     CV
Vyas et al. [15]    RGB        87.5   73.2
MST-AOG [16]        Depth      -      53.6
HOPC [7]            Depth      -      71.9
CNN-BiLSTM [3]      Depth      -      62.5
R-NKTM [8]          Skeleton   -      78.1
MST-AOG [16]        RGB-S      81.6   73.3

Table 1: A comparison of cross-subject (CS) and cross-view (CV) action recognition on the N-UCLA Multiview Action3D dataset.

Method                  Modality   CS     CV
CNN-BiLSTM [3]          RGB        55.5   49.3
Vyas et al. [15]        RGB        88.9   86.3
CNN-BiLSTM [3]          Depth      68.1   63.9
Vyas et al. [15]        Depth      79.4   78.7
Shahroudy et al. [12]   Skeleton   62.9   70.3
CNN-BiLSTM [3]          Flow       80.9   83.4
DSSCA-SSLM [13]         RGB-DS     74.9   -

Table 2: A comparison of cross-subject (CS) and cross-view (CV) action recognition on the NTU RGB+D dataset.

Conclusion and open problems

Action recognition for unknown and unseen views is a challenging problem. The tracking of motion trajectories has been found to be successful so far, but such trajectories are computationally expensive to extract and therefore do not scale to large-scale scenarios. The availability of large-scale multi-view datasets has enabled us to develop networks which are effective in recognition performance and can be trained end-to-end. However, their performance is limited by the variation of viewpoints in the available datasets. These datasets are lab curated, and the actions are performed in a controlled environment. Apart from this, they are also limited in terms of the number of available viewpoints. Therefore, it will be challenging to generalize methods trained on these datasets to real-world scenarios.

Recommended Readings

[1] Kevin Duarte, Yogesh Rawat, and Mubarak Shah. VideoCapsuleNet: A simplified network for action detection. In Advances in Neural Information Processing Systems, pages 7610–7619, 2018.
[2] Alexei Gritai, Yaser Sheikh, and Mubarak Shah. On the use of anthropometry in the invariant analysis of human actions. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 923–926. IEEE, 2004.
[3] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan Kankanhalli. Unsupervised learning of view-invariant action representations. In Advances in Neural Information Processing Systems, pages 1254–1264, 2018.
[4] Jingen Liu, M. Shah, B. Kuipers, and S. Savarese. Cross-view action recognition via view knowledge transfer. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 3209–3216. IEEE Computer Society, 2011.
[5] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[6] Vasu Parameswaran and Rama Chellappa. View invariance for human action recognition. International Journal of Computer Vision, 66(1):83–101, 2006.
[7] Hossein Rahmani, Arif Mahmood, Du Huynh, and Ajmal Mian. Histogram of oriented principal components for cross-view action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(12):2430–2443, 2016.
[8] Hossein Rahmani, Ajmal Mian, and Mubarak Shah. Learning a deep model for human action recognition from novel viewpoints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):667–681, 2018.
[9] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood. View-invariant alignment and matching of video sequences. In Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003.
[10] Cen Rao, Mubarak Shah, and Tanveer Syeda-Mahmood. Invariance in motion analysis of videos. In Proceedings of the Eleventh ACM International Conference on Multimedia, pages 518–527. ACM, 2003.
[11] Cen Rao, Alper Yilmaz, and Mubarak Shah. View-invariant representation and recognition of actions. International Journal of Computer Vision, 50(2):203–226, 2002.
[12] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016.
[13] Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, and Gang Wang. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1045–1058, 2018.
[14] Yaser Sheikh, Mumtaz Sheikh, and Mubarak Shah. Exploring the space of a human action. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 1, pages 144–149. IEEE, 2005.
[15] Shruti Vyas, Yogesh S. Rawat, and Mubarak Shah. Time-aware and view-aware video rendering for unsupervised representation learning. arXiv preprint arXiv:1811.10699, 2018.
[16] Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2649–2656, 2014.
[17] Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257, 2006.
[18] Pingkun Yan, Saad M. Khan, and Mubarak Shah. Learning 4D action feature models for arbitrary view action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2008.
[19] Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 984–989. IEEE Computer Society, 2005.
[20] Jingjing Zheng, Zhuolin Jiang, P. Jonathon Phillips, and Rama Chellappa. Cross-view action recognition via a transferable dictionary pair.
In BMVC, volume 1, page 7, 2012.