Transfer Learning Enhanced Vision-Based Human Activity Recognition

Review

Keywords: Deep learning; Machine learning; Transfer learning; Human Activity Recognition

Abstract

The discovery of several machine learning and deep learning techniques has paved the way to extend the reach of humans in various real-world applications. Classical machine learning algorithms assume that training, validation, and testing data come from the same domain, with similar input feature spaces and data distribution characteristics. In some real-world exercises, where data collection has become difficult, the above assumption does not hold true. Even if possible, the scarcity of rightful data prevents the model from being successfully trained. Compensating for outdated data, reducing the need and hardship of recollecting the training data, avoiding many expensive data labeling efforts, and improving the foreseen accuracy on testing data are some significant contributions of transfer learning in real-world applications. The most cited transfer learning applications include classification, regression, and clustering problems in activity recognition, image and video classification, Wi-Fi localization, detection and tracking, sentiment analysis and classification, and web-document classification. Human activity recognition plays a cardinal role in human-to-human and human-to-object interaction and interpersonal relations. Paired with robust deep learning algorithms and improved hardware technologies, automatic recognition of human activity has opened the door toward constructing a smart society. To the best of our knowledge, our survey is the first to link machine learning, transfer learning, and vision sensor-based activity recognition under one roof. This survey exploits the above connection by reviewing around 350 related research articles from 2011 to 2021. Findings indicate an approximate 15% increment in research publications connected to our topic every year. Among these reviewed articles, we have selected around 150 significant ones that give insights into various activity levels, classification techniques, performance measures, challenges, and future directions related to transfer learning enhanced vision sensor-based HAR.
1. Introduction

Humans have evolved into an essential resource capable of handling cognitive tasks, even in many malicious applications. Human intervention is still inevitable in many industrial practices, even in this machinery-driven world of the twenty-first century. Recognition of human action Gupta (2021); Imran and Raman (2020) has become essential for individual performance appraisal. Manual bookkeeping of such activities can be an untidy and error-prone task. As a result, automatic recognition tools have become popular and an area of interest among the research fraternity. Automatic detection of any suspicious or unexpected human behavior will trigger an alarm for either self-correction or manual intervention. Auto-recognition of human activities is nowadays essential for smooth and error-free industrial and institutional operation.

Human Activity Recognition (HAR) datasets are manufactured by taking the knowledge of three fundamental domain-specific aspects: (i) data related to the sensor device, (ii) data related to the subject/actor, and (iii) data related to the sensing background. However, the mutable nature of the above three defies the conventional machine learning assumption that source and target data must belong to the same domain. Knowledge transfer came to the rescue by eliminating this conventional machine learning hypothesis. Apart from this, older training data are sometimes unsuitable for real-time recognition due to the mutable nature of the sensor and environment. Through the help of transfer learning, we can easily exploit the older samples and utilize the valuable information to enhance classification, regression, and recognition tasks.
It is even more difficult and expensive to collect an adequate number of training data samples and label them. Transfer learning significantly contributes to many real-world applications by compensating for old data, lowering the requirement and difficulty of recollecting training data, avoiding many expensive data labeling efforts, and boosting the accuracy on testing data.

Until now, most review articles from the activity recognition background have summarized the context related to either transfer learning or classifier-based machine/deep learning. The activities addressed in those surveys are vision-based or sensor-based. However, our survey enlists only those activity recognition articles where the machine/deep learning classifiers take advantage of transfer learning techniques to enhance recognition performance. In this work, we perform a data-centric and classifier-specific extensive survey on vision-based activity recognition. To the best of our knowledge, our survey is the first to review vision sensor-based human activity recognition using transfer learning enhanced machine/deep learning algorithms. Our paper gives insights into various activity levels, classification techniques, performance measures, challenges, and future directions related to transfer learning enhanced vision sensor-based HAR, and into various action recognition datasets with specifications and the levels of activity associated with them after inferring the context of the data. Our survey guides fresh researchers to become familiar with the information and management of existing datasets and learning methods, which helps to analyze the gaps and opportunities for future research work.

Many studies have reviewed transfer learning and activity recognition separately. However, few have reviewed activity recognition on transfer learning platforms, and the number becomes scarce when talking about sensor-based HAR on transfer learning enhanced platforms. To the best of our knowledge, Cook, Feuz and Krishnan (2013) is the last published review article that addresses HAR in the transfer learning domain. This paper enlists similar research work from 2011-2021 but is more oriented to HAR datasets and classification techniques. Deng, Zheng and Wang (2014) investigate sensor-based and vision-based HAR by extensively classifying different HAR methodologies based on their pros and cons. Our paper carries similar content but on the transfer learning platform. Unlike previous surveys, this survey paper does not abide by constraints like sensor-based modeling, architecture-based modeling, classifier-based modeling, or dataset-based modeling. This article combines all these models to forge a comprehensive package that will boost creativity for beginner and intermediate-level researchers. The scope of our paper can be further extended to wearable sensor-based and ambient sensor-based HAR Gupta (2021) in the transfer learning domain. This paper precisely explains the transfer learning technique, various steps, and datasets used in vision sensor-based HAR. This survey also introduces a novel hierarchical classification model related to this research domain.

We can find many studies depicting transfer learning and activity recognition separately. To the best of our knowledge, few analyzed HAR based on the transfer learning technique, and the number becomes scarce in our research domain. The contributions of our paper are summarized below.

1. To the best of our knowledge, we are the first to divide classification techniques for vision-based HAR in the form of three modular representations. We categorically discuss these classes in detail for future reference.
2. The frequently used visual datasets (source and target datasets) used in HAR are organized based on their year of evolution, mode of representation, frames per second, resolution, classes, subjects, and the number of videos compared.
3. We chronologically summarize the related research articles by comparing their underlying architecture, source/target datasets, the number of detected classes, and their respective accuracy.
4. We have tried to identify potential research gaps and future directions concerning vision-based HAR. We believe it will pilot new researchers in the right direction while saving their investigation time.

The rest of the paper is organized as follows. The research methodology is discussed in Section II. The overview of transfer learning, including its definition, significance, and architecture related to HAR, is demonstrated in Section III. Section IV introduces various HAR datasets, their classification, and a hierarchical tabular representation with specifications. Section V outlines the classification techniques used in vision-based HAR with a three-modular representation format. Performances of various significant articles are summarized in Section VI. The challenges and various aspects of future directions are briefly expressed in Section VII. In Section VIII, the contributions and practical implications are briefly discussed. Finally, Section IX concludes the paper along with the improvements that can be considered further.

2. Research methodology

We followed the Preferred Reporting Items for Systematic review and Meta-Analysis Protocols (PRISMA-P) Tricco et al. (2018) to single out relevant and significant articles related to our research domain. We accomplished this review by adopting three protocols: a searching protocol, an inclusion and exclusion protocol, and a scoping review protocol.

2.1. Searching protocol

First, we set the search platforms, i.e., search sites, libraries, or digital databases. Most articles included in this review were taken from the Web of Science, IEEE Xplore, and Google Scholar digital libraries. We reached out to the relevant articles by putting in the exact or relevant keywords or a combination of them. Some of the keywords are "human activity recognition," "video action classification," "transfer learning," "deep learning," "machine learning," "CNN," or the names of different activity recognition databases. Some of the searched phrases are combinations of more than one keyword with effective meanings. We downloaded around 350 articles during initial consideration for further processing.

2.2. Inclusion and exclusion protocol

We only included those vision-based activity recognition articles that adopt machine learning and transfer learning techniques for model designing. Non-English papers were excluded. We considered the date and type of publication (journal or conference), publishing house, and cite score during preliminary screening. Furthermore, we extended this screening procedure to the abstract composition level, where we validated searched articles' themes against our survey theme. Publications with appropriate matches were included. Finally, we filtered out the 150 most significant articles for further review.

2.3. Scoping review protocol

In this last step of the methodology, we systematically reviewed the selected papers after thoroughly apprehending many contextual factors in detail. First, we structured the summary observing the background, objective, source of evidence, eligibility criterion, databases, model algorithms, results, and conclusion from the abstract section. Afterward, we stepped into the detailed sketch of each paper considering the aforementioned factors along with some finer details, for example, computational complexity, real-time deployment possibility, limitations, research gaps, and opportunities.
3. Overview
Table 1
Popular HAR datasets with specifications.
Dataset (Reference)    FPS / Resolution    Classes / Subjects / Videos    Activity level
MP-II Cooking Rohrbach, Amin, Andriluka and Schiele (2012) 29.4/1624 × 1224 65/12/44 H-O Level
UCF-101 Soomro, Zamir and Shah (2012) 25/320 × 240 101/-/13,320 H-O/Group Level
DML Smart Action Mohsen Amiri et al. (2013) 30/2HD+1VGA 12/16/932 Atomic/H–O Level
Hollywood 3D Hadfield and Bowden (2013) 24/1920 × 1080 14/-/650 H-O/H–H Level
YouTube Sports 1 M Karpathy et al. (2014a) -/227 × 227 487/-/11,33,158 H-O/Group Level
Thumos’ 14 Thumos14 -/- 101/-/18,000 Atomic/H-O/Group Level
Northwestern-UCLA Wang, Nie, Xia, Wu and Zhu (2014) 30/640 × 480 10/10/1475 Atomic/H-O Level
UTD_MHAD Chen, Jafari and Kehtarnavaz (2015) 30/640 × 480, 320 × 240 27/8/861 Atomic/H-O Level
ActivityNet Caba Heilbron, Escorcia, Ghanem and Niebles (2015) 30/1280 × 720 203/-/27,801 H-O Level
THUMOS’15 Gorban et al. (2015) -/- 102/-/23,500 Atomic/H-O/Group Level
NTU RGB+D 60 Shahroudy, Liu, Ng and Wang (2016) 30/1920 × 1080, 512 × 424 60/40/56,880 Atomic/H-O/H-H Level
YouTube 8 M Abu-El-Haija et al. (2016) 1/- 480/-/82,64,650 H-O/Group Level
Kinetics400 Kay et al. (2017a) -/658 × 1022 400/-/3,06,245 H-H/H-O Level
PKU-MMD Liu, Hu, Li, Song and Liu (2017) 30/1920 × 1080, 512 × 424 51/66/20,000 H-O/H-H Level
Something-SomethingV2 Goyal et al. (2017) 12/96 × 96 174/1133/2,20,847 H-O Level
AVA Gu et al. (2018a) 1/451 × 808 80/-/230K Atomic/H-O Level
MLB-YouTube Piergiovanni and Ryoo (2018) 60/- 20/-/4290 H-O/Group Level
Kinetics600 Carreira, Noland, Banki-Horvath, Hillier and Zisserman (2018) -/658 × 1022 600/-/4,95,547 H-H/H-O Level
SoccerNet Zhou, Xu and Corso (2018) 25/1280 × 720 3/-/6637 H-O/Group Level
YouCook2 YouCook2 -/- 89/-/2000 H-O Level
NTU RGB+D 120 Liu et al. (2019) 30/1920 × 1080, 512 × 424 120/106/1,14,480 Atomic/H-O/H-H Level
Kinetics-700 Carreira, Noland, Hillier and Zisserman (2019) -/658 × 1022 700/-/650K H-H/H-O Level
MOD20 Perera, Law, Ogunwa and Chahl (2020) 29.97/720 × 720 20/-/2324 H-O/Group Level
HAA-500 Chung, Wuu, Yang, Tai and Tang (2021) -/1080 × 720 500/-/10,000 Atomic/H-O/Group Level
EduNet Sharma, Gupta, Kumar and Mishra (2021) 30/1280 × 720 20/-/7851 H-O Level
TAD-08 Gang et al. (2021) -/720 × 576 8/-/2048 H-O Level
Win-Fail Parmar and Morris (2022) -/1080 × 720 4/-/1634 Atomic/H-O Level
Based on the context of the data, human activities are broadly categorized into five levels of activity: gesture level activity, atomic level activity, Human-Object (H–O) interaction level activity, Human-Human (H–H) interaction level activity, and group level activity.
learning to exercise on the PALMAR and Benedek dataset for recognizing human-object interaction activity.
5.1.2. Gaussian mixture model

K-means clustering is considered a hard clustering method or distance-based clustering method. Hence, it cannot express its significance in undistinguishable or multi-label data environments. So, we shifted to a soft clustering model called the Gaussian Mixture Model (GMM), where a distribution-based clustering technique is adopted instead of a distance-based one. In GMM, a dataset of D features can have a mixture of k Gaussian distributions. Each distribution represents a cluster head defined by a D-length mean and a D × D co-variance matrix. The expectation-maximization technique determines these variables (means and co-variances) and sets the model parameters accordingly. Xing et al. (2019) use the GMM algorithm to segment the raw RGB images of small and unseen target datasets and send the segmented data to a CNN (AlexNet, GoogleNet, and ResNet-50) model for activity recognition. Xing et al. (2018) use GMM-based segmentation and only a pre-trained AlexNet model to implement the same inductive transfer learning through fine-tuning. Ntalampiras and Potamitis (2018) use temporal, spectral, and wavelet features to identify statistically closely located classes using GMM and the KL divergence algorithm. Class-specific HMM and universal HMM use these distance-based class features for class prediction. An ESN-based transfer learning technique is adopted to categorize seven human-object interaction level activities. Variational Bayesian Inference (VI) is the generalization of the expectation-maximization approach, which maximizes the likelihood iteratively Jänicke, Tomforde and Sick (2016a). VI is used to determine the latent features of GMM, is responsible for reducing the model complexity, and nullifies the need for a specific number of components a priori. Transductive transfer learning is used for self-improvisation, i.e., new node insertion.
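As a concrete illustration of the clustering just described, the sketch below fits a GMM with EM and reads back the soft assignments; scikit-learn is assumed to be available, and the feature matrix and component count are illustrative stand-ins rather than any cited setup.

```python
# Minimal GMM clustering sketch; X stands in for D-dimensional visual features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))          # 500 samples, D = 16 features

# k Gaussian components; EM estimates each component's D-length mean and
# D x D covariance matrix, as described above.
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(X)

labels = gmm.predict(X)                 # hard cluster assignment
responsibilities = gmm.predict_proba(X) # soft, per-component memberships
print(gmm.means_.shape, gmm.covariances_.shape)  # (5, 16) and (5, 16, 16)
```

The soft responsibilities are what distinguish the GMM from k-means: every sample keeps a graded membership in every component instead of a single hard label.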
5.1.3. Restricted Boltzmann machine

Restricted Boltzmann Machine (RBM) is an unsupervised generative network with fully connected nodes across layers (a bi-partite node configuration, hence the term 'restricted') that is capable of learning a probability distribution from seen data to make inferences about unseen data. It has a visible or input layer (v) associated with the seen data and one or multiple hidden layers (h) pointing out the unseen inference data, with no output layer. RBM is an energy-based model used in classification, regression, dimensionality reduction, feature learning, collaborative filtering, and topic modeling. The Boltzmann distribution (Gibbs distribution), derived from statistical mechanics in thermodynamics, is implemented to explain the impact of entropy on different states in RBM. It is associated with two biases: (i) a hidden bias that helps to produce activation on the forward pass, and (ii) an input bias that helps to produce activation on the backward pass. The gradient-based contrastive divergence algorithm is implemented to carry out learning during training. Multiple RBMs are stacked together to form a Deep Belief Network (DBN) Kolekar (2011) to perform layer-wise training. Roder et al. (2021) first introduce the spectral DBN on the HMDB-51 and UCF-101 HAR datasets using the domain adaptation technique. Gradient-DBN and Aggregative-DBN are proposed to employ image gradient and frame fusion in video-based HAR. A binary-binary RBM and a Gaussian-binary RBM are stacked together to optimize the weights and learn the informative features of triaxial accelerometer HAR data Alsheikh et al. (2016). To train and fit the model parameters, the underlying model should go through the pre-training stage (unsupervised and generative) and fine-tuning stage (supervised and discriminative).
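The contrastive divergence step mentioned above can be sketched in a few lines; this is a toy CD-1 update for a binary-binary RBM written from the standard formulation, with all sizes and the learning rate chosen for illustration only.

```python
# Toy CD-1 update for a binary-binary RBM (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 64, 32, 0.01
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)               # input (visible) bias: backward pass
b_h = np.zeros(n_hidden)                # hidden bias: forward pass

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    ph0 = sigmoid(v0 @ W + b_h)                        # positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b_v)                      # reconstruct visibles
    ph1 = sigmoid(pv1 @ W + b_h)                       # negative phase
    # Gradient estimate: data statistics minus reconstruction statistics.
    return v0.T @ ph0 - pv1.T @ ph1, (v0 - pv1).mean(0), (ph0 - ph1).mean(0)

v0 = (rng.random((10, n_visible)) < 0.5).astype(float) # toy binary batch
dW, db_v, db_h = cd1_step(v0)
W += lr * dW / len(v0)
b_v += lr * db_v
b_h += lr * db_h
```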
5.1.4. Autoencoder

Autoencoder (AE) is an unsupervised generative ANN model that is embodied with an encoder layer, a code layer, and a decoder layer (a mirror of the encoder layer). The encoder layer only considers the informative data representative of the input to generate a low-dimensional code and stores it in the code layer, which is the latent space representation of the input data. The decoder layer later collects these codes and reconstructs them back to generate output containing only valuable features. These generated outputs are identical and equidimensional to the input. Regularized (sparse, denoising, and contractive), concrete, and variational AE are the most common types used in many machine learning tasks like facial recognition, activity recognition, dimensionality reduction, anomaly detection, machine translation, drug discovery, and popularity prediction. Khan and Roy (2018) use a pre-trained transfer learning framework called UnTran that transfers the first two layers of the source-trained Deep Sparse Autoencoder (DSAE) to incorporate with an SVM classifier for recognizing human activity on the Opportunity, WISDM, and Daily and Sports datasets. This multi-layered classification model helps generalize the model to overcome user-related, sensor-related, and environment-related diversities. A combined model performs domain adaptation for re-annotation in the cross-dataset platform Sanabria and Ye (2020). The combined model fuses two learning techniques for human activity recognition: (i) a knowledge- and data-driven learning technique, and (ii) an Unsupervised Domain Adaptation technique. The Variational Auto-encoder (VAE) in UDAR has achieved encouraging outcomes while learning latent space representations that minimize the distance across the Aruba and Twor datasets. The discussed framework is effective and robust in adapting to divergence in training data count and sensor noise settings. The semi-supervised Inverse Autoregressive Flow (IAF) based VAE is associated with a Bi-Directional GAN (Bi-GAN) classifier to implement Zero-Shot Learning (ZSL) for HAR using synthesized features on the UCF101, HMDB51, and Olympic datasets Mishra, Pandey and Murthy (2020). The above model adopts a decoder with skip connections to stabilize the training and prevent overfitting. Khan and Roy (2018) employ the inductive transfer learning method, whereas Mishra et al. (2020); Sanabria and Ye (2020) use the transductive setting for transferring knowledge across datasets. Autoencoders are very suitable in unsupervised applications like anomaly activity recognition, where we define the data under either normal or abnormal categories.
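A minimal encoder-code-decoder sketch of this layout is given below, assuming PyTorch; the layer widths and input size are illustrative, not drawn from any model reviewed here.

```python
# Minimal autoencoder: encoder -> code layer -> mirrored decoder (PyTorch).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        # Encoder compresses the input into the low-dimensional code layer.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder mirrors the encoder and rebuilds an equidimensional output.
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.encoder(x)           # latent space representation
        return self.decoder(code)

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                  # illustrative input batch
loss = nn.functional.mse_loss(model(x), x)   # reconstruction objective
opt.zero_grad(); loss.backward(); opt.step()
```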
5.1.5. Generative adversarial network

GAN Aggarwal, Mittal and Battineni (2021) is a synchronous generative model that comprises two sub-models (generator and discriminator). A generator generates a random sample of target dimensions by taking a fixed-length vector as input and sending it over to a discriminator for binary classification (real or fake) along with an actual target domain sample. The generator tries to mislead the discriminator by generating random output close to the real input. Meanwhile, the discriminator tries to protect itself from being fooled by updating its weights. This process of "making fools" and "being fooled" is performed iteratively to accomplish recognition and generation tasks that come under unsupervised, semi-supervised, fully supervised, and reinforcement settings. Vondrick, Pirsiavash and Torralba (2016) use a Spatio-temporal convolutional GAN for unsupervised HAR in videos from Flickr. The Spatio-temporal convolutional architecture helps untangle a scene's foreground from its background. GAN is employed to generate and classify video samples by utilizing scene dynamics. The proposed conditional GAN framework is fed with a class prototype vector to implement Generalized FSL (GFSL) on the UCF-101, HMDB-51, and Olympic-Sports datasets. The GFSL sub-module addresses the inadequate-data and seen-data-biasing problems. Class Prototype Transfer Network (CPTN) generated class prototype vectors with random noise are fed to the generator module to produce synthetic features. The generator goes through an iterative update based on the discriminatory loss to make the random synthetic features close to the real features. A classifier is trained with both real and GAN-generated synthetic features to efficiently address novel data classification problems. A common latent semantic representation can be an excellent asset for generalizing a model in the zero-shot learning setting.
Zhang, Li and Ogunbona (2017a) take connotative and extensional relations for solving poor generalization problems on the UCF-101 and HMDB-51 datasets. A GAN-based model synthesizes action features and word vectors of unseen classes by exploiting this representation from seen examples. A knowledge-based graph is prepared by relating the word vectors to their corresponding objects. Finally, an attention-based Graph Convolutional Network (GCN) is employed to classify the novel samples with better accuracy and enhanced generalizability. Standard and generalized settings of transductive ZSL are realized in Ji et al. (2020) through a Bi-directional adversarial GAN and an Inverse Auto-regressive flow-based VAE on the UCF-101, HMDB-51, and Olympic-Sports datasets. The skip connection of the decoder in the VAE not only results in more stable training but also prevents overfitting.
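The adversarial loop described in this subsection reduces to two alternating updates; the sketch below shows that minimax recipe over a synthetic 16-dimensional feature space (PyTorch assumed; the MLPs, sizes, and iteration count are illustrative, not any cited GAN).

```python
# Alternating generator/discriminator updates (the "making fools" /
# "being fooled" loop) over a toy feature space.
import torch
import torch.nn as nn

z_dim, x_dim = 8, 16
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, x_dim)            # stand-in for real target-domain features
for _ in range(100):
    # Discriminator step: push real samples toward 1, generated ones toward 0.
    fake = G(torch.randn(32, z_dim)).detach()
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(G(torch.randn(32, z_dim))), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```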
5.1.6. Long-Short Term Memory

Long-Short Term Memory (LSTM) is an RNN variant where multiple layers stack together to perform time-series signal processing and preserve long-term dependencies between information sequences. The presence of more complex interactive layers in LSTM helps to realize the significance of previous sequential knowledge in manipulating future ones. The whole internal processing is carried out by passing the earlier information through three gates: (i) a forget gate that decides whether to completely forget or completely keep the past information, (ii) an input gate that allows only the relevant input information by discarding the others, and (iii) an output gate that replaces the old cell state with the new one after concatenating the concerning forget gate and input gate signals.
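In standard notation, the three gates just listed implement the following update, where σ is the logistic sigmoid, ⊙ the elementwise product, and [h_{t-1}, x_t] the concatenation of the previous hidden state with the current input; this is the textbook LSTM cell rather than any specific cited variant.

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```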
LSTM is used as a controller in Ma, Zhang, Wang, Qi and Chen (2020) that controls the gateway (read and write heads) between the received input signal and the external memory module. Memory encoding and retrieval are the primary goals of the read and write heads. Kay et al. (2017b) incorporate an LSTM layer and a batch normalization layer that receives spatial features from the CNN module to perform state encoding, temporal order capturing, and long-dependency exploring. A read-attention-based bi-directional LSTM is used in Shi, Zhang, Xu and Cheng (2020). The discriminative features from a CNN are fed to a bi-LSTM that contains a forward LSTM module and a backward LSTM module. Similarly, two stacked bi-directional LSTMs look forward and backward in time to garner fine-grained sequence information in Fu, Damer, Kirchbuchner and Kuijper (2021). A time-dependent video-level representation is generated by feeding aggregated fixed-length spatial features from a combined model of ResNet and AlexNet Careaga, Hutchinson, Hodas and Phillips (2019). A transformer architecture comprising an LSTM and a class-wise attention module helps re-weight the cross-domain data by assigning higher weights to more informative data. All of these references adopt inductive learning platforms for transferring knowledge. LSTM is an effective classification tool commonly used alongside CNN for Spatio-temporal exploration while interpreting video data in HAR.

Cheng et al. (2021) and Haresamudram et al. (2020) employ predictive pre-training methods that come under transfer learning to learn efficient and distinctive representations. Like Cheng et al. (2021); Haresamudram et al. (2020), Zaher Md Faridee et al. (2022) also use the transfer learning technique on the proposed STranGAN (spatial transformer-based GAN) model for inertial sensor-based HAR applications. This paper uses domain adaptation via feature alignment to transfer the knowledge between source and target without any labeled training data requirement. The transformer is a highly effective and popular video data interpretation technique despite its highly complex and data-hungry nature.

5.2. Discriminative-based approach

Discriminative models, also called conditional models, are a class of logistical models used for classification or regression. They distinguish decision boundaries through observed data, such as pass/fail, win/lose, alive/dead, or healthy/sick. Logistic regression, conditional random fields, and DT can be categorized under discriminative classifiers. The naive Bayes model, Gaussian mixture model, variational autoencoder, and GAN can be categorized under generative classifiers.

5.2.1. Decision tree

A DT is a flowchart-like tree structure comprising nodes and branches for classification and regression tasks. We can visualize these nodes and branches in three segments: internal connecting nodes, interconnecting branches, and leaf nodes. Each connecting node evaluates an attribute of a given classification or regression task. The branch corresponding to that particular node epitomizes the evaluation result of that attribute, and the terminal node (leaf node) holds a class label for that task. The input fed to the DT may be a discrete set of values or a continuous variable. Based on this, we can specify a DT as a classification tree or a regression tree, respectively. The superior clustering technique in DT promotes it as a good regressor or a well-performing classifier in a restricted data environment. Integration of new unseen sensors leads to an extension of the input space. The DT needs to be reformed to adopt this change by replacing specific leaf nodes of the original tree with a subtree Jänicke, Tomforde and Sick (2016b). An iterative semi-supervised training approach called En-Co-Training is endorsed in Bhattacharya, Nurmi, Hammerla and Plötz (2014). A pool of randomly sampled data generated from the unlabeled Opportunity challenge dataset is produced using this algorithm, which is later trained with a DT. A DT classifier is deployed on the features extracted from the last-layer-excluded ResNet-50 network Loey, Manogaran, Taha and Khalifa (2021), as sketched below. The DT classification model computes the output label based on information gain and an entropy function. Jänicke et al. (2016b) adopt transductive transfer learning, whereas Bhattacharya et al. (2014) and Loey et al. (2021) follow an inductive learning platform.
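A hedged sketch of the pipeline just described (a frozen pre-trained CNN feeding an entropy-driven decision tree) is given below; torchvision and scikit-learn are assumed, and the image batch, labels, and class count are placeholders rather than the cited experimental setup.

```python
# Deep features from a last-layer-excluded ResNet-50, classified by a DT
# that splits on information gain (criterion="entropy").
import torch
from torchvision import models
from sklearn.tree import DecisionTreeClassifier

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()        # drop the final FC layer
backbone.eval()

images = torch.rand(16, 3, 224, 224)     # stand-in for activity frames
labels = torch.randint(0, 4, (16,))      # stand-in activity classes
with torch.no_grad():
    feats = backbone(images).numpy()     # (16, 2048) deep features

clf = DecisionTreeClassifier(criterion="entropy").fit(feats, labels.numpy())
print(clf.predict(feats[:2]))
```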
is meant for transferring knowledge between domains. The same architecture is used in Chen, Wang, Huang and Yu (2019) to evaluate the transfer learning performance between various positions. One-shot inductive transfer learning is used in Cabrera and Wachs (2017), whereas Wang et al. (2018a) and Chen et al. (2019) follow the transductive platform on the OPPORTUNITY, PAMAP2, and UCI DSADS activity datasets.
5.2.3. Support vector machine

Before diving into the deep learning era, the use and popularity of the supervised SVM had become sky-high among classification and regression models. This learning paradigm projects each data sample into a point in n-dimensional space. Then, it sets a best-suited hyperplane or decision boundary by maximizing the distance from each category to that boundary. The position of a new sample relative to that hyperplane in n-dimensional space decides its class. A binary classification problem adopts a linear SVM, whereas a multi-class problem uses a kernel-based non-linear SVM architecture. The extracted Spatio-temporal features of gesture data from a 3D Inception-ResNet model with separable convolution are forwarded to an SVM classifier Li et al. (2021). The above model outperforms many state-of-the-art architectures in performance, computational cost, and efficiency. An SVM classifier with a radial basis function kernel is used to recognize gesture-level activity in Cabrera and Wachs (2017). Inductive transfer learning, or more specifically OSL, is used in Li et al. (2021) and Cabrera and Wachs (2017). Tran et al. (2015) use SVM as a classifier in the inductive transfer learning domain. A multi-class Hierarchical SVM (HSVM) is adopted to train on ambient sensor features of the synthetic, TU Darmstadt, and RCC datasets, which helps detect instant semantic attributes of test samples Alam and Roy (2017). The confidence score of the HSVM classifier is measured by Contextual Informativeness (CI). Local dense trajectory video features from the UCF101, FCVID, Sports1M, and ActivityNet datasets are aggregated into video-level feature vectors to train a linear SVM classifier Gan, Lin, Yang, De Melo and Hauptmann (2016). The above two models endorse transductive ZSL. Chen et al. (2019) also employ transductive transfer learning techniques on SVM-based classifiers. Rahmani and Mian (2015) train the SVM classifier on the IXMAS and N-UCLA datasets to perform transfer learning in cross-view and cross-dataset scenarios, respectively. In the era of machine learning, SVM became a very popular and effective classification tool in the HAR domain. However, its use has been restricted in the large-scale data domain.
5.2.4. K-Nearest neighbor

KNN is another class of supervised learning models proposed for classification and regression practices using the distance matrix. K denotes the number of nearest labeled data points or trained samples considered for evaluating the distance matrix. The respective Euclidean distances of the K neighbors from the test data point are aggregated to compose this matrix. So, a distance matrix illustrates the feature similarity index between the new unlabeled data and its K nearest available labeled data. The more congruent the features, the smaller the Euclidean distance, and the test data become more biased toward that class label. The KNN learning algorithm is sometimes termed non-parametric learning as no mapping function is involved, lazy learning as the whole dataset is stored for inference, and instance-based learning as weights are not learned. Depending on the context of use, the output may be a class membership value or an object property value, i.e., KNN classification or KNN regression.

The motion and texture features of the ChaLearn gesture challenge and NTU RGB+D datasets are extracted using a co-variance descriptor after building a Bag of Manifold Words (BoMW) representation Zhang et al. (2017b). These local features of the distinct category are passed through the KNN classifier to perform one-shot-learning gesture recognition. Key points around motion patterns of the ChaLearn gesture database are detected and tracked by the Shi-Tomasi corner detector and sparse optical flow Karn and Jiang (2016). The Gradient Location and Orientation Histogram feature descriptor is then activated to describe the concerning features of these key interest points. These visual features are clustered and subsequently classified by the k-means algorithm and KNN classifier, respectively, to implement OSL. Apart from Karn and Jiang (2016), Bhattacharya et al. (2014); Zhang et al. (2017b) adopt the same KNN-based architecture for inductive transfer learning implementation. Lang et al. (2018) propose a KNN-based domain adaptation model for micro-doppler data classification after fusing three domain-invariant features, i.e., low-level deep features from CNN, empirical features, and statistical features. Afterward, a KNN classifier is adapted to classify seven human activities. An Adaptive Spatial-Temporal Transfer Learning (ASTTL) approach is introduced in Qin, Chen, Wang and Yu (2019) to deal with negative transfer and domain-intensive transfer in cross-domain HAR. The spatial features are exploited by weighting the relative importance between the marginal and conditional probability distributions, and the temporal features by incremental manifold learning. KNN is used as a baseline classifier over the UCI DSADS, UCI-HAR, USC-HAD, and PAMAP2 datasets. Along with Lang et al. (2018); Qin et al. (2019), Xu, Hospedales and Gong (2016) also follow the same KNN-based transductive transfer learning mechanism. The use of KNN has been restricted in present days due to its low performance, but it is still prevalent in restricted data environments and unsupervised learning conditions.
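The distance-matrix mechanics described above amount to a few calls in practice; below is a sketch assuming scikit-learn, with illustrative feature vectors in place of real gesture descriptors.

```python
# KNN: store the labeled set, then let the K nearest Euclidean neighbors vote.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 64))     # labeled feature vectors
y_train = rng.integers(0, 3, size=150)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)                # "lazy": the whole dataset is stored

query = rng.normal(size=(1, 64))
dist, idx = knn.kneighbors(query)        # one row of the distance matrix
print(knn.predict(query), dist.round(2))
```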
5.2.5. Convolutional neural network

CNN is a deep learning architecture where spatial information of image and video (in vision-based) data is explored through repeated convolution operations for different vision-based applications. Convolutional layers have passed through various levels of transformation to fetch more significant features effectively. The activation, pooling, batch normalization, and dropout layers are other supporting layers that improve the features' quality and computational efficiency by suppressing noise and parameter count. Some transfer learning-based CNN architectures are either pre-trained with a definite large dataset or customized by the user according to their dataset and application. AlexNet, GoogleNet, Inception, VGG, ResNet, DenseNet, and EfficientNet are examples of pre-trained transfer learning architectures primarily trained on the ImageNet dataset.

Karpathy et al. (2014b) use the multiresolution CNN model to implement transfer learning on the large (487 classes) Sports-1M dataset. A two-stream CNN model is trained on the resolution images of this dataset by various fusion techniques. Five combinations of convolution and max-pooling layers followed by two fully connected (FC) layers are trained on the ILSVRC-12 dataset to get a final output of 1000 class distributions Liu, Mei, Zhang, Che and Luo (2015). Eight convolution layers, five pooling layers, and two FC layers are forged together to form a 3D ConvNets model called C3D and trained on the large Sports-1M dataset for weight initialization Zhu and Newsam (2017). This C3D architecture is later applied to the ActivityNet dataset for classification performance. The last pooling layer of ResNet-50 is fed to an LSTM network with batch normalization to get globally pooled Spatio-temporal features of the Kinetics dataset Kay et al. (2017a). The trained weight of this model is later applied to various small datasets to validate its performance. The output of a modified pre-trained ResNet-18 comprising 17 convolutions and a pooling layer is fed to another three-layered head model to compute the classification score Du, He and Jin (2018). An average pooling, an FC, and a softmax layer are stacked together to form the head model. The base model is fine-tuned with the micro-doppler dataset and validated on simulated micro-doppler data. Perrett, Masullo, Burghardt, Mirmehdi and Damen (2021) follow a CNN-based architecture in the inductive transfer domain for different vision-based HAR datasets. Akbari and Jafari (2019) follow a similar kind of CNN-based architecture in the transductive transfer learning platform for different vision-based HAR datasets. However, CNN has become a very effective and widespread model nowadays due to its high performance and low complexity in the image and video domain.
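The fine-tuning recipe these works share, reusing an ImageNet-trained backbone and retraining only a new head, looks roughly as follows; torchvision is assumed, and the class count, batch, and learning rate are illustrative.

```python
# Inductive transfer: freeze a pre-trained ResNet-18 backbone, train a new head.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():             # keep ImageNet-learned features fixed
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new 10-class head

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
x = torch.rand(8, 3, 224, 224)           # stand-in for target-domain frames
y = torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```

Unfreezing the deeper layers with a lower learning rate is the usual next step when the target dataset is large enough.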
5.3. Graph-based approach

Multiple modalities are involved in recognizing a human activity, such as pose, appearance, optical flow, depth, and skeleton. Although the dynamic human skeleton modality is a powerful descriptor for identifying human action irrespective of illumination change and background dynamics, it has received relatively less attention. Yan et al. (2022) proposed a sensor-based graph model called HAR-ResGCNN that combines a graph neural residual structure with transfer learning in a cross-dataset setting to validate the performance on the PAMAP-2, mHealth, and TNDA action datasets. The use of deep transfer learning on the data derived from the accelerometer, gyroscope, and magnetometer makes the convergence speed faster and the learning curve better. A Spatio-Temporal GCN trained on the NTU-RGB+D 60 dataset combined with a zero-shot learning module can predict unseen activities on which it was never trained Jasani and Mazagonwalla (2019). Although the graph-based HAR algorithm is very effective against view-point variation and background changes, it has received comparatively less attention than other classification techniques.
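For intuition, a single graph-convolution step of the kind these skeleton models stack is sketched below; the five-joint adjacency, features, and weights are toy assumptions, not taken from HAR-ResGCNN or the cited ST-GCN.

```python
# One normalized graph-convolution step over a toy joint graph (NumPy only).
import numpy as np

A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]], float)   # 5 joints connected by bone edges
X = np.random.default_rng(0).normal(size=(5, 3))  # per-joint (x, y, z) features
W = np.random.default_rng(1).normal(size=(3, 8))  # learnable projection

A_hat = A + np.eye(5)                    # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W   # aggregate neighbors, project
print(H.shape)                           # (5, 8): new per-joint features
```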
7.1. Transfer learning

Sufficient data availability and outdated data management are two significant issues when dealing with deep learning architectures. Transfer learning becomes very effective for problems like outdated data compensation, training data recollection, expensive data labeling, and accuracy enhancement. However, the most compelling feature selection criteria and techniques for a successful knowledge transfer are still yet to be explored even in these modern days. Again, powerful transfer learning techniques like ZSL and unsupervised transfer learning need more attention to make transfer learning more effective in the HAR and classification domain. The hidden negotiable relationships between HAR datasets can potentially enhance the performance of the transfer learning-based HAR model. How can a platform be built to promote devalued learning concepts such as relational knowledge transfer?

7.2. Explorable video model
Table 2
Chronological performance comparison for human activity recognition.
Year / Reference    Model    Source / Target datasets    Accuracy (%)
2011/Duan, Xu, Tsang and Luo (2011) SVM web video/consumer video 57.9
2011/Liu, Shah, Kuipers and Savarese (2011) SVM IXMAS (cross-view) 75.3
2011/Wei and Pal (2011) RBM UIUC(cross-action) 82.9
2012/Li, Camps and Sznaier (2012) SVM IXMAS (cross-view) 90.57(High)
2013/Rohrbach, Ebert and Schiele (2013) kNN Script data/MP-II composite 36.2
2014/Zhu and Shao (2014) SVM HMDB51+YouTube/UCF YouTube 91.11
HMDB51+YouTube/Caltech101 79.02
HMDB51+YouTube/Caltech256 42.8
HMDB51+YouTube/Kodak consum 62.6
2014/Yamada, Sigal and Raptis (2013) kNN Poser/HUMANEVA-1 –
2014/Bhattacharya et al. (2014) CNN Sports1M/UCF-101 65.4
2015/Yue-Hei Ng et al. (2015) CNN - LSTM + optical flow Sports 1 M/ImageNet 90.4
(AlexNet and GoogleNet) UCF101/ImageNet 88.6
2016/Zhang, Chao, Sha and Grauman (2016) CNN + key frame + Subshot Kodak consumer/seed image data 82.3
2016/Wang, Farhadi and Gupta (2016a) CNN(two stream) UCF 101/ImageNet 92.4
HMDB 51/ImageNet 63.4
ACT/ImageNet 80.6
2016/Wang et al. (2016b) ConvNet + TSN HMDB 51/ImageNet 69.4
UCF101/ImageNet 94.2
2017/Wang, Chen, Hu, Peng and Philip (2018b) Deep CNN + Stratified TL OPPORTUNITY/Intra-class 83.96
PAMAP2/Intra-class 43.47
UCI DSADS/Intra-class 81.6
2017/Bux Sargano et al. (2017) AlexNet + SVM/KNN KTH/ImageNet 98.15
UCF Sports/ImageNet 91.47
2017/Qiu, Yao and Mei (2017) Pseudo 3D ResNet Sports 1 M/ImageNet 87.4(Top5)
UCF101/ImageNet 93.7(Top3)
ActivityNet/ImageNet 87.71(Top3)
ASLAN/ImageNet 80.8
2018/Alwassel et al. (2018) PCA + LSTM AVA/ImageNet 91
THUMOS14/ImageNet 91
2018/Carreira and Zisserman (2017) I3D(two-stream) HMDB 51/Kinetics 80.9
UCF101/Kinetics 98
2018/Wang, Zheng, Chen and Huang (2018c) USSAR and TNNAR OPPORTUNITY/Intra-class 87.43
UCI DSADS/Intra-class 86.76
2018/Ntalampiras and Potamitis (2018) TL-CHMM Imbalance audio data/Intra class 94.6
2018/Tran et al. (2018) kinetics /Seed videos 95.0
UCF 101/Seed videos 97.3
2019/Ghadiyaram, Tran and Mahajan (2019) R(2 + 1)D-d kinetics /(ImageNet1K+IG-Kinetics) 95.3
something-something 79
EPIC-Kitchen/(ImageNet1K+IG-Kinetics) 42.7
2019/Korbar, Tran and Torresani (2019) ir-CSN-152 Sports1M/(ImageNet + AudioNet) 84
2020/An, Bhat, Gumussoy and Ogras (2020) CNN + TL UCI HAR/HAPT, UniMiB, WISDM up to 43%
2021/Coskun et al. (2021) AMAML EPIC/EGTEA 60.7(10shots)
2021/Zhu et al. (2021) PAL(CNN) ImageNet/Kinetics-100 74.1
SSV2–100/ImageNet 62.6
HMDB-51/ImageNet 75.8
UCF-101/ImageNet 85.3
2021/Sabater et al. (2021) TCN NTU RGB+D-120/therapy dataset 46.5(1shot)
2021/Ben-Ari, Shpigel Nacson, Azulai, Barzelay and Rotman (2021) C3D+I3D Sports1m, ActivityNet V1.2/Kinetics-400 83.12
2021/Perrett et al. (2021) ResNet-50+Transformer ImageNet/Kinetics-100 85.9
SSV2/ImageNet 64.6
HMDB-51/ImageNet 75.6
UCF-101/ImageNet 96.1
The depth feature carries significant information about activity-related physical factors like distance, movement type, and gait pattern. So, it needs more attention in exploring H–H and H–O interaction level activity. Both dataset- and architectural-level development are required to successfully comprehend the physical aspect of activities. Apart from these, many other physical aspects, such as acceleration, direction, and movement style, go unnoticed while analyzing complex and interaction-level activity.
7.6. Labeling strategy in HAR

Some benchmark datasets comprise large numbers of classes with millions of video sequences, such as Kinetics, HAA, and YouTube 8M. Class labeling is not required for recorded, generated, and crowdsourced datasets as they get labeled at the time of origination. Accurate and precise labeling followed by thorough verification is imperative for supervised learning schemes. These manual labeling strategies are expensive and sometimes faulty due to human error. Effective crowdsourcing platforms come to the rescue. Amazon Mechanical Turk (AMT) is a widespread annotating strategy for automatic and quick labeling during dataset creation. However, due to a lack of generalization, it has its own limitations. So, extra attention should be given to creating more AMT-like annotation strategies. Zero-shot learning is a knowledge transfer approach where activities are classified without any prior training. Besides these, the pseudo-labeling strategy sometimes plays a pivotal role in annotating large-scale data in one-shot learning, few-shot learning, and semi-supervised learning. Here, we approximate the labels in unannotated data based on the previously annotated data, as sketched below. Pseudo-labeling reduces overfitting and improves the speed of the model. But this strategy fails to make an impact when there is not enough labeled data present, when labeled data for a particular class are absent, or when adding more data does not help the model performance.
Robustness and universality are still debatable research topics for this approach to improve classification accuracy.
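One round of that pseudo-labeling loop can be written compactly; the sketch below assumes scikit-learn, and the arrays and the 0.9 confidence threshold are illustrative choices.

```python
# Pseudo-labeling: train on labeled data, adopt confident predictions on the
# unlabeled pool as labels, and retrain on the union.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 32)), rng.integers(0, 4, 100)
X_unlab = rng.normal(size=(500, 32))     # large unannotated pool

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
probs = clf.predict_proba(X_unlab)
keep = probs.max(axis=1) >= 0.9          # keep only confident predictions
pseudo_y = probs.argmax(axis=1)[keep]

X_aug = np.vstack([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, pseudo_y])
clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)  # iterate as needed
```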
7.7. Feature engineering in HAR

To improve model performance on unseen data, we need to extract valuable features from the raw data that better represent the underlying problem to the predictive model. The flexibility, complexity, and performance of a model are profoundly dependent on these extracted and transformed features. Data decomposition and aggregation are two important operations encountered while transforming the raw data. "How do we decompose or aggregate the raw data for a better description of the underlying problem?" is a challenging question during this extraction process. Apart from this, researchers are still trying to find effective solutions to some smaller queries like "How do we automate this transformation process?", "How do we identify and select the problem-dependent useful features?", and "What are the manual feature selection criteria?". DT, random forest, regularization variants, principal component analysis, and structural risk minimization are some widely used feature engineering approaches applied in HAR tasks.
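As one concrete instance of the approaches named above, the sketch below decomposes raw feature windows with PCA; scikit-learn is assumed and the data shape is illustrative.

```python
# PCA as a feature-engineering step: decorrelate and compress raw features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 512))        # raw per-window feature vectors

pca = PCA(n_components=0.95)             # keep 95% of the explained variance
feats = pca.fit_transform(raw)
print(raw.shape, "->", feats.shape)      # reduced, decorrelated representation
```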
7.8. Limited hardware computation

Well-performing HAR models are very hard to implement in real time due to constrained computing power (a hardware constraint). As a result, we are forced to compromise either on input data or on computationally expensive techniques. For this, analysts adopt many data reduction techniques like cropping, compression, key-frame extraction, sub-shots, and thresholding. Another method is to embrace those sensing devices that can provide relatively more uncomplicated data forms. Most wearable sensors can be an example of that kind, where we collect mainly 1-D data. Both compromising techniques lead the model to decline in performance. We reach a similar result while adjusting the classification technique. We need a model that comes up with an acceptable trade-off between computational burden and performance in a constrained computational environment. To make this viable, our research focus should rest upon informative sensing techniques, efficient descriptor extraction methods, and high-performing model architectures.
7.9. Contextual information gathering

Our model may not be able to recognize high-level behavior or activity properly. For example, our model may fail to recognize a "group discussion" activity. Instead, it may be wrongly interpreted as "sitting and talking." Similarly, "running on the road" or "running on the track" can get mixed and misclassified under a more superficial activity, i.e., "running." This shallow activity classification is apparently due to the lack of background knowledge. Relating the semantic features with the logical description between action and behavior through Natural Language Processing (NLP) may be a possible solution for recognizing these complex activities. This contextual information may provide additional knowledge that helps classify complex activities correctly.
7.10. Negative transfer

In transfer learning, the source domain data representation leverages target domain data for enhancing target domain performance accuracy. However, sometimes, leveraging source domain-specific knowledge reduces the transfer learning performance on the target data. So, we need to keep knowledge about the origin of negative transfer, factors influencing negative transfer, and tranquilizing algorithms to prevent negative transfer before applying transfer learning to any task. Rosenstein, Marx, Kaelbling and Dietterich (2005) introduce negative transfer after finding that incongruence between the source and target data beyond a certain bound undesirably hampers the performance of the underlying model instead of boosting it. A similar definition is also illustrated in Pan and Yang (2009), which compares the relatedness between the source domain data and the target domain data to introduce negative transfer while applying transfer learning. Governing factors like domain divergence, transfer learning algorithms, and source and target data quality need to be reviewed to rule out the existence of negative transfer in the learning mechanism. Even after negative transfer gets infused into the transfer model, it can be overcome through many preventive approaches like secure transfer mechanisms, domain similarity estimation algorithms, and distant transfer assessment Zhang, Deng, Zhang and Wu (2020b). So, negative transfer is a longstanding and formidable concern that needs to be thoroughly reviewed for the vision-based transfer learning model for HAR.

8. Contribution to literature and implication for practice

We have shown an extensive sketch of vision sensor-based HAR using transfer learning. To understand transfer learning, we briefly explain the difference between source and target, domain and task. We also represent five steps followed in HAR: activity types, sensors, transduction, different approaches, and performance measures. We label the vision-based HAR datasets from 2011 to 2021 with detailed specifications. We classify and discuss the learning algorithms used for this task in three different categories: the generative model, the discriminative model, and the graph-based model. To the best of our knowledge, we are the first to divide classification techniques for vision-based HAR into three modular representations. We conclude our review by exchanging views on various challenges and future directions. To the best of our knowledge, we are the first to conduct a decade-long review on transfer learning enhanced vision-based HAR, where we discuss related datasets with specifications and three classification formats relevant to our topic. This paper transfers in-depth information about different datasets from 2011 to 2021, intended to be managed under various application scenarios. The detailed depiction of classification algorithms under transfer learning scenarios enhances the ideation of researchers for future implementation in this domain.

9. Conclusion

In this extensive survey, we emphasize the idea of using state-of-the-art transfer learning methods that reduce the difficulty and effort behind data collection, data extinction, data labeling, and accuracy enhancement in the action recognition domain. This paper focuses on vision-based HAR in context-aware applications and emphasizes its diversity with transfer learning functionality. This paper's whole-length depiction, investigation, and highlights help the researcher achieve in-depth knowledge in vision-based activity recognition using transfer learning techniques.

Apart from transfer learning and its all-pervasive applications in vision-based activity recognition, various other orientations remain to be investigated and discovered in subsequent research, such as detection, tracking, design, and classification. This all-inclusive survey is supposed to strengthen further research in the activity recognition grassland.

Declaration of Competing Interest

No conflict of interest.

References

Abu-El-Haija, Sami, Kothari, Nisarg, Lee, Joonseok, Natsev, Paul, Toderici, George, Varadarajan, Balakrishnan, et al. (2016). YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
Aggarwal, Alankrita, Mittal, Mamta, & Battineni, Gopi (2021). Generative adversarial network: An overview of theory and applications. International Journal of Information Management Data Insights, 1(1), Article 100004.
Akbari, Ali, & Jafari, Roozbeh (2019). Transferring activity recognition models for new wearable sensors with deep generative domain adaptation. In Proceedings of the 18th International Conference on Information Processing in Sensor Networks (pp. 85–96).
Alam, Mohammad Arif Ul, & Roy, Nirmalya (2017). Unseen activity recognitions: A hierarchical active transfer learning approach. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (pp. 436–446). IEEE.
Alsheikh, Mohammad Abu, Selim, Ahmed, Niyato, Dusit, Doyle, Linda, Lin, Shaowei, & Tan, Hwee-Pink (2016). Deep activity recognition models with triaxial accelerometers. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence.
Al-Sulaiman, Talal (2022). Predicting reactions to anomalies in stock movements using a feed-forward deep learning network. International Journal of Information Management Data Insights, 2(1), Article 100071.
Alwassel, Humam, Heilbron, Fabian Caba, & Ghanem, Bernard (2018). Action search: Spotting actions in videos and its application to temporal action localization. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 251–266).
An, Sizhe, Bhat, Ganapati, Gumussoy, Suat, & Ogras, Umit (2020). Transfer learning for human activity recognition using representational analysis of neural networks. arXiv preprint arXiv:2012.04479.
Anand, Kartik, Urolagin, Siddhaling, & Mishra, Ram Krishn (2021). How does hand gestures in videos impact social media engagement - insights based on deep learning? International Journal of Information Management Data Insights, 1(2), Article 100036.
Arif Ul Alam, Mohammad, Mahmudur Rahman, Md, & Widberg, Jared Q. (2021). Palmar: Towards adaptive multi-inhabitant activity recognition in point-cloud technology. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications (pp. 1–10). IEEE.
Aslam, Nazia, & Kolekar, Maheshkumar H. (2022). Unsupervised anomalous event detection in videos using spatio-temporal inter-fused autoencoder. Multimedia Tools and Applications, 1–26.
Aslam, Nazia, Rai, Prateek Kumar, & Kolekar, Maheshkumar H. (2022). A3N: Attention-based adversarial autoencoder network for detecting anomalies in video sequence. Journal of Visual Communication and Image Representation, 87, Article 103598.
Ben-Ari, Rami, Shpigel Nacson, Mor, Azulai, Ophir, Barzelay, Udi, & Rotman, Daniel (2021). TAEN: Temporal aware embedding network for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2786–2794).
Bhattacharya, Sourav, Nurmi, Petteri, Hammerla, Nils, & Plötz, Thomas (2014). Using unlabeled data in a sparse-coding framework for human activity recognition. Pervasive and Mobile Computing, 15, 242–262.
Bux Sargano, Allah, Wang, Xiaofeng, Angelov, Plamen, & Habib, Zulfiqar (2017). Human action recognition using transfer learning with deep representations. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 463–469). IEEE.
Cabrera, Maria E., Sanchez-Tamayo, Natalia, Voyles, Richard, & Wachs, Juan P. (2017). One-shot gesture recognition: One step towards adaptive learning. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 784–789). IEEE.
Cabrera, Maria Eugenia, & Wachs, Juan Pablo (2017). A human-centered approach to one-shot gesture learning. Frontiers in Robotics and AI, 4(8).
Careaga, Chris, Hutchinson, Brian, Hodas, Nathan, & Phillips, Lawrence (2019). Metric-based few-shot learning for video action recognition. arXiv preprint arXiv:1909.09602.
Carreira, Joao, Noland, Eric, Banki-Horvath, Andras, Hillier, Chloe, & Zisserman, Andrew (2018). A short note about Kinetics-600. arXiv preprint arXiv:1808.01340.
Carreira, Joao, Noland, Eric, Hillier, Chloe, & Zisserman, Andrew (2019). A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987.
Carreira, Joao, & Zisserman, Andrew (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299–6308).
Chatterjee, Subhamoy, Bhandari, Piyush, & Kolekar, Maheshkumar H. (2016). A novel Krawtchouk moment zonal feature descriptor for user-independent static hand gesture recognition. In 2016 IEEE Region 10 Conference (TENCON) (pp. 387–392). IEEE.
Chen, Chen, Jafari, Roozbeh, & Kehtarnavaz, Nasser (2015). UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In 2015 IEEE International Conference on Image Processing (ICIP) (pp. 168–172). IEEE.
Chen, Yiqiang, Wang, Jindong, Huang, Meiyu, & Yu, Han (2019). Cross-position activity recognition with stratified transfer learning. Pervasive and Mobile Computing, 57, 1–13.
Cheng, Yi-Bin, Chen, Xipeng, Chen, Junhong, Wei, Pengxu, Zhang, Dongyu, & Lin, Liang (2021). Hierarchical transformer: Unsupervised representation learning for skeleton-based human action recognition. In 2021 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1–6). IEEE.
Chung, Jihoon, Wuu, Cheng-hsin, Yang, Hsuan-ru, Tai, Yu-Wing, & Tang, Chi-Keung (2021). HAA500: Human-centric atomic action dataset with curated videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13465–13474).
Cook, Diane, Feuz, Kyle D., & Krishnan, Narayanan C. (2013). Transfer learning for activity recognition: A survey. Knowledge and Information Systems, 36(3), 537–556.
Coskun, Huseyin, Zia, M. Zeeshan, Tekin, Bugra, Bogo, Federica, Navab, Nassir, Tombari, Federico, et al. (2021). Domain-specific priors and meta learning for few-shot first-person action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Deng, Wan-Yu, Zheng, Qing-Hua, & Wang, Zhong-Min (2014). Cross-person activity recognition using reduced kernel extreme learning machine. Neural Networks, 53, 1–7.
Du, Hao, He, Yuan, & Jin, Tian (2018). Transfer learning for human activities classification using micro-doppler spectrograms. In 2018 IEEE International Conference on Computational Electromagnetics (pp. 1–3). IEEE.
Duan, Lixin, Xu, Dong, Tsang, Ivor Wai-Hung, & Luo, Jiebo (2011). Visual event recognition in videos by learning from web data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1667–1680.
Fu, Biying, Damer, Naser, Kirchbuchner, Florian, & Kuijper, Arjan (2021). Generalization of fitness exercise recognition from doppler measurements by domain-adaption and few-shot learning. In International Conference on Pattern Recognition (pp. 203–218). Springer.
Gan, Chuang, Lin, Ming, Yang, Yi, De Melo, Gerard, & Hauptmann, Alexander G. (2016). Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
Gang, Zhao, Wenjuan, Zhu, Biling, Hu, Jie, Chu, Hui, He, & Qing, Xia (2021). A simple teacher behavior recognition method for massive teaching videos based on teacher set. Applied Intelligence, 51(12), 8828–8849.
Ghadiyaram, Deepti, Tran, Du, & Mahajan, Dhruv (2019). Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12046–12055).
Ghosal, Deepanway, & Kolekar, Maheshkumar H. (2018). Music genre recognition using deep neural networks and transfer learning. In Interspeech (pp. 2087–2091).
Gonegandla, Pranesh, & Kolekar, Maheshkumar H. (2022). Automatic song indexing by predicting listener's emotion using EEG correlates and multi-neural networks. Multimedia Tools and Applications, 81, 1–11.
Gorban, A., Idrees, H., Jiang, Y.-G., Roshan Zamir, A., Laptev, I., Shah, M., et al. (2015). THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/.
Goyal, Raghav, Kahou, Samira Ebrahimi, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, et al. (2017). The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5842–5850).
Gu, Chunhui, Sun, Chen, Ross, David A., Vondrick, Carl, Pantofaru, Caroline, Li, Yeqing, et al. (2018a). AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6047–6056).
Gupta, Saurabh (2021). Deep learning based human activity recognition (HAR) using wearable sensor data. International Journal of Information Management Data Insights, 1(2), Article 100046.
Hadfield, Simon, & Bowden, Richard (2013). Hollywood 3D: Recognizing actions in 3D natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3398–3405).
Haresamudram, Harish, Beedu, Apoorva, Agrawal, Varun, Grady, Patrick L., Essa, Irfan, Hoffman, Judy, et al. (2020). Masked reconstruction based self-supervision for human activity recognition. In Proceedings of the 2020 International Symposium on Wearable Computers (pp. 45–49).
Heilbron, Fabian Caba, Escorcia, Victor, Ghanem, Bernard, & Niebles, Juan Carlos (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961–970).
Imran, Javed, & Raman, Balasubramanian (2020). Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition. Journal of Ambient Intelligence and Humanized Computing, 11(1), 189–208.
Jänicke, Martin, Tomforde, Sven, & Sick, Bernhard (2016). Towards self-improving activity recognition systems based on probabilistic, generative models. In 2016 IEEE International Conference on Autonomic Computing (ICAC) (pp. 285–291). IEEE.
Jasani, Bhavan, & Mazagonwalla, Afshaan (2019). Skeleton based zero shot action recognition in joint pose-language semantic space. arXiv preprint arXiv:1911.11344.
Ji, Zhong, Liu, Xiyao, Pang, Yanwei, & Li, Xuelong (2020). SGAP-Net: Semantic-guided attentive prototypes network for few-shot human-object interaction recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 11085–11092.
Karn, Nabin Kumar, & Jiang, Feng (2016). Improved GLOH approach for one-shot learning human gesture recognition. In Chinese Conference on Biometric Recognition (pp. 441–452). Springer.
Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, & Fei-Fei, Li (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1725–1732).
Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Khan, Md Abdullah Al Hafiz, & Roy, Nirmalya (2018). UnTran: Recognizing unseen activities with unlabeled data using transfer learning. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI) (pp. 37–47). IEEE.
Kolekar, Maheshkumar H. (2011). Bayesian belief network based broadcast sports video indexing. Multimedia Tools and Applications, 54(1), 27–54.
Kolekar, Maheshkumar H., & Sengupta, Somnath (2015). Bayesian network-based customized highlight generation for broadcast soccer videos. IEEE Transactions on Broadcasting, 61(2), 195–209.
Korbar, Bruno, Tran, Du, & Torresani, Lorenzo (2019). SCSampler: Sampling salient clips from video for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6232–6242).
Lang, Yue, Wang, Qing, Yang, Yang, Hou, Chunping, Huang, Danyang, & Xiang, Wei (2018). Unsupervised domain adaptation for micro-doppler human motion classification via feature fusion. IEEE Geoscience and Remote Sensing Letters, 16(3), 392–396.
Li, Binlong, Camps, Octavia I., & Sznaier, Mario (2012). Cross-view activity recognition using hankelets. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1362–1369). IEEE.
Li, Lianwei, Qin, Shiyin, Lu, Zhi, Zhang, Dinghao, Xu, Kuanhong, & Hu, Zhongying (2021). Real-time one-shot learning gesture recognition based on lightweight 3D Inception-ResNet with separable convolutions. Pattern Analysis and Applications, 24, 1–20.
Liu, Chunhui, Hu, Yueyu, Li, Yanghao, Song, Sijie, & Liu, Jiaying (2017). PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475.
Liu, Jingen, Shah, Mubarak, Kuipers, Benjamin, & Savarese, Silvio (2011). Cross-view action recognition via view knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3209–3216). IEEE.
Liu, Jun, Shahroudy, Amir, Perez, Mauricio, Wang, Gang, Duan, Ling-Yu, & Kot, Alex C. (2019). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2684–2701.
Liu, Wu, Mei, Tao, Zhang, Yongdong, Che, Cherry, & Luo, Jiebo (2015). Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3707–3715).
Loey, Mohamed, Manogaran, Gunasekaran, Taha, Mohamed Hamed N., & Khalifa, Nour Eldeen M. (2021). A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement, 167, Article 108288.
Luo, Manman, & Mu, Xiangming (2022). Entity sentiment analysis in the news: A case study based on negative sentiment smoothing model (NSSM). International Journal of Information Management Data Insights, 2(1), Article 100060.
Ma, Chunyong, Zhang, Shengsheng, Wang, Anni, Qi, Yongyang, & Chen, Ge (2020). Skeleton-based dynamic hand gesture recognition using an enhanced network with one-shot learning. Applied Sciences, 10(11), 3680.
Mishra, Ashish, Pandey, Anubha, & Murthy, Hema A. (2020). Zero-shot learning for action recognition using synthesized features. Neurocomputing, 390, 117–130.
Mohsen Amiri, S., Pourazad, Mahsa T., Nasiopoulos, Panos, & Leung, Victor C. M. (2013). Non-intrusive human activity monitoring in a smart home environment. In 2013 IEEE 15th International Conference on e-Health Networking, Applications and Services (Healthcom 2013) (pp. 606–610). IEEE.
Mutegeki, Ronald, & Han, Dong Seog (2019). Feature-representation transfer learning for human activity recognition. In 2019 International Conference on Information and Communication Technology Convergence (ICTC) (pp. 18–20). IEEE.
Ng, Joe Yue-Hei, Hausknecht, Matthew, Vijayanarasimhan, Sudheendra, Vinyals, Oriol, Monga, Rajat, & Toderici, George (2015). Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4694–4702).
Ntalampiras, Stavros, & Potamitis, Ilyas (2018). Transfer learning for improved audio-based human activity recognition. Biosensors, 8(3), 60.
Pan, Sinno Jialin, & Yang, Qiang (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Parmar, Paritosh, & Morris, Brendan (2022). Win-Fail action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 161–171).
Perera, Asanka G., Law, Yee Wei, Ogunwa, Titilayo T., & Chahl, Javaan (2020). A multiviewpoint outdoor dataset for human action recognition. IEEE Transactions on Human-Machine Systems, 50(5), 405–413.
Perrett, Toby, Masullo, Alessandro, Burghardt, Tilo, Mirmehdi, Majid, & Damen, Dima (2021). Temporal-relational crosstransformers for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 475–484).
Piergiovanni, A. J., & Ryoo, Michael S. (2018). Fine-grained activity recognition in baseball videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1740–1748).
Qin, Xin, Chen, Yiqiang, Wang, Jindong, & Yu, Chaohui (2019). Cross-dataset activity recognition via adaptive spatial-temporal transfer learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(4), 1–25.
Qiu, Zhaofan, Yao, Ting, & Mei, Tao (2017). Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5533–5541).
Rahmani, Hossein, & Mian, Ajmal (2015). Learning a non-linear knowledge transfer model for cross-view action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2458–2466).
Roder, Mateus, Almeida, Jurandy, Rosa, Gustavo H. de, Passos, Leandro A., Rossi, André L. D., & Papa, João P. (2021). From actions to events: A transfer learning approach using improved deep belief networks. In 2021 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–8). IEEE.
Rodriguez, Mario, Orrite, Carlos, Medrano, Carlos, & Makris, Dimitrios (2017). Fast simplex-HMM for one-shot learning activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 41–48).
Rohrbach, Marcus, Amin, Sikandar, Andriluka, Mykhaylo, & Schiele, Bernt (2012). A database for fine grained activity detection of cooking activities. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1194–1201). IEEE.
Rohrbach, Marcus, Ebert, Sandra, & Schiele, Bernt (2013). Transfer learning in a transductive setting. Advances in Neural Information Processing Systems, 26.
Rosenstein, Michael T., Marx, Zvika, Kaelbling, Leslie Pack, & Dietterich, Thomas G. (2005). To transfer or not to transfer. In NIPS'05 Workshop, Inductive Transfer: 10 Years Later.
Sabater, Alberto, Santos, Laura, Santos-Victor, Jose, Bernardino, Alexandre, Montesano, Luis, & Murillo, Ana C. (2021). One-shot action recognition in challenging therapy scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2777–2785).
Sanabria, Andrea Rosales, & Ye, Juan (2020). Unsupervised domain adaptation for activity recognition across heterogeneous datasets. Pervasive and Mobile Computing, 64, Article 101147.
Shahroudy, Amir, Liu, Jun, Ng, Tian-Tsong, & Wang, Gang (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1010–1019).
Sharma, Vijeta, Gupta, Manjari, Kumar, Ajai, & Mishra, Deepti (2021). EduNet: A new video dataset for understanding human activity in the classroom environment. Sensors, 21(17), 5699.
Shi, Zhenguo, Zhang, Jian Andrew, Xu, Yi Da Richard, & Cheng, Qingqing (2020). Environment-robust device-free human activity recognition with channel-state-information enhancement and one-shot learning. IEEE Transactions on Mobile Computing.
Soomro, Khurram, Zamir, Amir Roshan, & Shah, Mubarak (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Tran, Du, Bourdev, Lubomir, Fergus, Rob, Torresani, Lorenzo, & Paluri, Manohar (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489–4497).
Tran, Du, Wang, Heng, Torresani, Lorenzo, Ray, Jamie, LeCun, Yann, & Paluri, Manohar (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6450–6459).
Tricco, Andrea C., Lillie, Erin, Zarin, Wasifa, O'Brien, Kelly K., Colquhoun, Heather, Levac, Danielle, et al. (2018). PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Annals of Internal Medicine, 169(7), 467–473.
Vondrick, Carl, Pirsiavash, Hamed, & Torralba, Antonio (2016). Generating videos with scene dynamics. Advances in Neural Information Processing Systems, 29, 613–621.
Wang, Jiang, Nie, Xiaohan, Xia, Yin, Wu, Ying, & Zhu, Song-Chun (2014). Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2649–2656).
Wang, Jindong, Chen, Yiqiang, Hu, Lisha, Peng, Xiaohui, & Yu, Philip S. (2018a). Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom) (pp. 1–10). IEEE.
Wang, Jindong, Zheng, Vincent W., Chen, Yiqiang, & Huang, Meiyu (2018c). Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering (pp. 1–8).
Wang, Limin, Xiong, Yuanjun, Wang, Zhe, Qiao, Yu, Lin, Dahua, Tang, Xiaoou, et al. (2016b). Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (pp. 20–36). Springer.
Wang, Xiaolong, Farhadi, Ali, & Gupta, Abhinav (2016a). Actions ~ transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2658–2667).
Wei, Bin, & Pal, Christopher (2011). Heterogeneous transfer learning with RBMs. In Proceedings of the AAAI Conference on Artificial Intelligence, 25 (pp. 531–536).
Wen, Jiahui, & Zhong, Mingyang (2015). Activity discovering and modelling with labelled and unlabelled data in smart environments. Expert Systems with Applications, 42(14), 5800–5810.
Xing, Yang, Lv, Chen, Wang, Huaji, Cao, Dongpu, Velenis, Efstathios, & Wang, Fei-Yue (2019). Driver activity recognition for intelligent vehicles: A deep learning approach. IEEE Transactions on Vehicular Technology, 68(6), 5379–5390.
Xing, Yang, Tang, Jianlin, Liu, Hong, Lv, Chen, Cao, Dongpu, Velenis, Efstathios, et al. (2018). End-to-end driving activities and secondary tasks recognition using deep convolutional neural network and transfer learning. In 2018 IEEE Intelligent Vehicles Symposium (IV) (pp. 1626–1631). IEEE.
Xu, Xun, Hospedales, Timothy M., & Gong, Shaogang (2016). Multi-task zero-shot action recognition with prioritised data augmentation. In European Conference on Computer Vision (pp. 343–359). Springer.
Yamada, Makoto, Sigal, Leonid, & Raptis, Michalis (2013). Covariate shift adaptation for discriminative 3D pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 235–247.
Yan, Yan, Liao, Tianzheng, Zhao, Jinjin, Wang, Jiahong, Ma, Liang, Lv, Wei, et al. (2022). Deep transfer learning with graph neural network for sensor-based human activity recognition. arXiv preprint arXiv:2203.07910.
Zaher Md Faridee, Abu, Chakma, Avijoy, Misra, Archan, & Roy, Nirmalya (2022). STranGAN: Adversarially-learnt spatial transformer for scalable human activity recognition. Smart Health, 23.
Zhang, Hongguang, Zhang, Li, Qi, Xiaojuan, Li, Hongdong, Torr, Philip H. S., & Koniusz, Piotr (2020a). Few-shot action recognition with permutation-invariant attention. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V (pp. 525–542). Springer.
Zhang, Jing, Li, Wanqing, & Ogunbona, Philip (2017a). Joint geometrical and statistical alignment for visual domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1859–1867).
Zhang, Ke, Chao, Wei-Lun, Sha, Fei, & Grauman, Kristen (2016). Summary transfer: Exemplar-based subset selection for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1059–1067).
Zhang, Lei, Zhang, Shengping, Jiang, Feng, Qi, Yuankai, Zhang, Jun, Guo, Yuliang, et al. (2017b). BoMW: Bag of manifold words for one-shot learning gesture recognition from Kinect. IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 2562–2573.
Zhang, Wen, Deng, Lingfei, Zhang, Lei, & Wu, Dongrui (2020b). A survey on negative transfer. arXiv preprint arXiv:2009.00909.
Zhou, Luowei, Xu, Chenliang, & Corso, Jason J. (2018). Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.
Zhu, Fan, & Shao, Ling (2014). Weakly-supervised cross-domain dictionary learning for visual recognition. International Journal of Computer Vision, 109(1–2), 42–59.
Zhu, Xiatian, Toisoul, Antoine, Perez-Rua, Juan-Manuel, Zhang, Li, Martinez, Brais, & Xiang, Tao (2021). Few-shot action recognition with prototype-centered attentive learning. arXiv preprint arXiv:2101.08085.
Zhu, Yi, & Newsam, Shawn (2017). Efficient action detection in untrimmed videos via multi-task learning. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 197–206). IEEE.