
Transfer Learning for Cross-dataset Isolated Sign Language Recognition in Under-Resourced Datasets

Ahmet Alp Kindiroglu1,2,*, Ozgur Kara3,*, Ogulcan Ozdemir1 and Lale Akarun1
1 Department of Computer Engineering, Bogazici University, Istanbul, Turkey
2 Huawei Turkey R&D Center, Istanbul, Turkey
3 Georgia Institute of Technology, USA
The numerical calculations reported in this paper were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
Abstract

Sign language recognition (SLR) has recently achieved a breakthrough in performance thanks to deep neural networks trained on large annotated sign datasets. Of the many different sign languages, these annotated datasets are only available for a select few. Since acquiring gloss-level labels on sign language videos is difficult, learning by transferring knowledge from existing annotated sources is useful for recognition in under-resourced sign languages. This study provides a publicly available cross-dataset transfer learning benchmark from two existing public Turkish SLR datasets. We use a temporal graph convolution-based sign language recognition approach to evaluate five supervised transfer learning approaches and experiment with closed-set and partial-set cross-dataset transfer learning. Experiments demonstrate that improvement over finetuning based transfer learning is possible with specialised supervised transfer learning methods.

* Equal contribution

I Introduction

Sign languages (SLs) are visual languages that use movements and expressions of hands, arms, and faces to convey meaning. Having developed naturally out of local deaf communities, more than 103 sign languages are known to exist and show considerable variability from one another [47].

Isolated sign language recognition (SLR) from sign language videos is an active research topic. In recent years, deep learning methods have shown great success in challenging video classification problems such as activity recognition owing to the emergence of huge annotated datasets. Similar high recognition accuracies were observed in isolated SLR for some large annotated public datasets [22, 7, 38]. For sign languages, annotating videos is a time-consuming task. Glosses, which are transcribed approximations of signs in spoken languages, are annotated by expert signers in the order they appear in the video. This difficult task consumes 10-30 minutes for a minute-long sign video. While these annotation efforts exist for some languages such as German Sign Language (DGS) [21], British Sign Language (BSL) [3] or American Sign Language (ASL) [33], most of the remaining languages are still considered under-resourced languages. In order to train successfully working isolated SLR models with such languages, the most commonly used approach is to utilize transfer learning from similar datasets or tasks to improve recognition accuracy.

Transfer learning approaches aim to use data from data-rich sources to improve task performance on a data-poor target task. For isolated SLR, our focus is on Turkish Sign Language (TID), where two medium-scale public datasets exist. These are the BosphorusSign22k (BSign22k) [34] dataset with 22k videos for 744 distinct signs and the Ankara University Turkish Sign Language Dataset (AUTSL) [41] with 32k videos for 216 signs. Compared to common video classification datasets such as Kinetics [25] or MomentsInTime [32], which have thousands of samples per class, these datasets, along with most other isolated SLR datasets, can be considered under-resourced.

It is reported that transferring knowledge from larger datasets often improves accuracy, while little benefit is observed when transfer is attempted between two similarly sized datasets [38, 22]. Although transfer learning datasets exist for video classification [12], there is no SL transfer dataset on which different transfer strategies can be tested.

In this paper, we establish a baseline for cross-dataset sign language recognition using two public isolated TID datasets. These datasets share 57 common signs. However, not all signs are identical, as their performance varies from dataset to dataset due to differences in interpretation and style. Cross-dataset sign language recognition can approximate and measure such differences in sign and signer characteristics. Of these two datasets, BSign22k contains fewer samples per class, so transfer learning from AUTSL to BSign22k is chosen as the more sensible use case. Using this setup, we experiment with two different transfer learning settings. We construct a transfer learning pipeline using the sign language graph convolution network (SL-GCN) algorithm as our baseline feature extractor.

Existing transfer learning methods make use of combined training and finetuning-based methods [22, 38]. In contrast, to transfer signs between two supervised source domains, we use five different transfer learning techniques, namely, finetuning, Domain Adversarial Neural Network (DANN), Minimum Class Confusion (MCC), Joint Adaptation Network (JAN) and Domain Specific Batch Normalization (DSBN), under the settings described below. In setting 1, common signs between source and target datasets are unknown. In setting 2, a closed set transfer learning problem is observed where source and target class labels are identical and known. Setting 3 is a partial set transfer learning problem where the target set contains a subset of class labels belonging to the source set. In all settings, we try to find transfer learning approaches that yield the largest improvement compared to baseline approaches. The contributions of this study can be summarized as follows:

  • We propose a shared vocabulary sign language subset from two publicly available Turkish Sign Language (TID) datasets [40, 34] on which supervised transfer learning approaches can be tested.

  • Based on the SL-GCN based sign language recognition model proposed by [22], we propose a baseline transfer learning experiment protocol. We use Minimum Class Confusion (MCC), Domain Adversarial Neural Network (DANN), Joint Adaptation Network (JAN) and Domain Specific Batch Normalization (DSBN) as well as finetuning approaches. Results are presented for closed-set and partial-set transfer learning cases.

  • Signer-independent experiments using MCC, DANN, JAN and DSBN show that transferring cross-dataset class knowledge with the proposed approach outperforms baseline finetuning and combined training approaches. Our work can serve as a baseline study for researchers working on under-resourced sign language recognition and video classification tasks.

This paper is organized as follows: In Section II, we review the sign language literature. The implemented transfer learning methods and our SL-GCN-based sign language recognition framework are presented in Section III. Section IV describes the proposed dataset as well as the experiments and analysis. Finally, we conclude the paper in Section V.

II Related Work

Isolated SLR involves the task of recognizing performed signs from a controlled vocabulary of signs in short video snippets. In the recent decade, several large isolated SLR datasets such as Devisign [10], BosphorusSign22k [34], WL-ASL [29], AUTSL [40], CSL [20], and MS-ASL [24] have been collected. Sign language recognition may be performed using different modalities from these datasets, such as RGB frames, depth images, pose information, and motion flow. Among the RGB-based state-of-the-art methods, 3D convolution-based methods such as I3D [24, 29] and 3D ResNets [18], or temporal models that model frame-level features such as HMMs [27, 5] or LSTMs [42, 4], are popular. In this study, we focus on pose-based sign language recognition methods. There exist a variety of pose-based methods that can model sign languages as well as RGB-based recognition approaches. These methods often use pose information extracted through popular open-source pose extraction libraries that can extract hand joints, such as OpenPose [9], MMPose [14] or MediaPipe Holistic [19]. In isolated SLR, there are pose-based methods that render coordinate locations onto images and recognize them with 2D convolutions. Among these, SSTCN [22] uses temporal convolution networks and Temporal Accumulative Features [26] constructs motion energy features from hand and body joints. In contrast, methods that model coordinates directly have recently become popular: [24] uses hierarchical co-occurrence networks (HCN) to model signs, [7] uses transformers, while several other works use Graph Convolutional Networks (GCN). One GCN-based method that is currently state-of-the-art in SLR is SL-GCN [22, 38]. It is based on the spatio-temporal graph convolutional network (ST-GCN) method from the action recognition domain, which combines temporal convolutional neural networks with spatial graph-based convolutions of joints [49, 16].
A method called drop-graph [13] enhances ST-GCN with an attention-based node drop mechanism for regularization. In isolated SLR, SL-GCN [22, 38] further adds spatial and temporal attention mechanisms to each layer. Variations of the method such as [1] have shown promising SLR performance on other datasets using 3DGCN layers with a spatial attention mechanism. Other studies on improving graph convolution networks include Modulated GCN, where weights and node affinities are modulated to optimize edges between nodes beyond skeletal connections [50]. GCN-based methods have also been extended to continuous sign language recognition [35] and have shown promising results on datasets such as RWTH-PHOENIX Weather.

Research on isolated SLR with the current state-of-the-art video classification methods yields high recognition accuracies when there is a sufficiently large dataset [38]. Transfer learning approaches may make it possible to carry this success to under-resourced sign languages where large annotated datasets are not available. Transfer learning is a machine learning approach that utilizes labeled data from relevant source domains to train a model in the target domain [46]. In this work, we focus on supervised cross-dataset transfer learning where the performed tasks are similar, but the datasets differ in data distribution, label space, or the number of available training samples per class. Deep domain adaptation methods aim to create representations that are robust to transfer by embedding domain adaptation in the deep learning pipeline [17, 31, 23]. In most of these methods, an encoder and a classifier layer are the primary deep learning model components whose weights are shared for both domains. Various distribution distance minimization approaches such as loss functions [45], layer mechanisms [11], and optimization functions have been developed to promote domain confusion in the feature space. One line of research uses maximum mean discrepancy (MMD) [45] and correlation alignment [44] to reduce distribution shift between models learned from different sources. One problem with these approaches is that distribution alignment of the data does not automatically lead to semantic alignment of the different domains. Another line of research uses domain-specific weights to maximize target domain performance. Bousmalis et al. [8] jointly learn a domain-shared encoder and domain-specific private encoders with domain separation networks. Chang et al. [11] create domain-specific batch normalization layers for the feature extractor model while sharing all other model components.
Recent studies such as MCC [23] and Batch nuclear-norm maximization [15] have shown that loss functions based on regularization of unlabeled data can improve transfer learning performance without modifying the base feature extractor of the model.

Compared to image-based domain adaptation (DA), video-based DA has attracted fewer studies. In video classification, models have to take into account temporal variations on top of variations in the image space. Only a few works focus on small-scale video DA with only a few overlapping categories [12, 48]. Several methods aim to use image-based domain adaptation methods with 3D classification networks, such as adding a gradient reversal layer for domain invariance [6]. The TA3N study [12] proposes a method using domain-specific attention while learning to align frames across domains. They demonstrate the improvement in performance by introducing the UCF-HMDBfull and Kinetics-Gameplay datasets, providing a benchmark for cross-dataset transfer learning in action recognition.

In isolated SLR, the use of transfer learning can be seen in cross-task transfer (from image recognition, action recognition [37], pose estimation [2] or continuous SLR [30]) or cross-lingual [38] settings. However, as noted in several studies such as [38], finetuning-based cross-dataset transfer between similarly sized isolated SLR datasets of different sign languages shows little to no benefit. In order to understand what information can be transferred successfully in isolated SLR and with which method, a dataset that allows experiments with closed-set and partial-set transfer learning settings is necessary. With such a dataset, it becomes easier to independently test cross-dataset sign language transfer using several state-of-the-art transfer learning approaches. Such datasets exist for similar domains such as image classification [36] and video classification [12], but not for sign language recognition. As such a dataset exists for video classification with the HMDB [28] and UCF-101 [43] datasets, we follow similar procedures to construct a baseline subset of the BSign22k and AUTSL datasets.

III Technical Approach

III-A SL-GCN based Isolated Sign Language Recognition

In this section, we describe the individual components of our isolated SLR pipeline. We perform coordinate-based Sign Language Recognition using coordinates extracted with the OpenPose library [9] from each dataset.

In our isolated sign language recognition datasets, we have a set of videos of varying lengths. Each video contains an RGB component showing a single user and a set of coordinates belonging to different joints of that user. Each coordinate consists of a triplet of values for each timestep: the X and Y coordinates within the image and the confidence level of the joint detection. We obtain $J=30$ joints belonging to fingertips, finger bases, wrists, arms, neck, mouth, nose, and eyes for each frame of a sign language video.

Having obtained the joints for each sign, we apply several normalizations and augmentations to make our models more robust to small changes in user performance. A typical property of the isolated SLR datasets we use is that they share the same rest pose, in which the hands rest at the sides of the user's legs. Frames that contain stationary hands in this rest pose are trimmed from the beginning and end of each video segment. In addition, the remaining frames are resampled to bring the total number of frames sampled from each video to $J$ frames. This is done using random uniform sampling from a set of fixed intervals. This approach creates duplicate frames for shorter clips and provides temporal variation when sampling from longer clips.
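As a rough illustration of the interval-based sampling described above, the following minimal sketch splits a trimmed clip into equal-width intervals and draws one random frame index from each. The function name `sample_frame_indices` and its arguments are hypothetical and are not taken from the original implementation; the exact interval boundaries used by the authors are not specified in the text.

```python
import random


def sample_frame_indices(num_frames, num_samples, rng):
    """Draw one frame index uniformly from each of `num_samples` fixed intervals.

    Assumption: the clip is divided into equal-width intervals over
    [0, num_frames). When the clip is shorter than `num_samples`, several
    intervals map to the same frame, so frames are duplicated.
    """
    indices = []
    for k in range(num_samples):
        lo = (k * num_frames) // num_samples
        # Last frame index that still belongs to interval k (never below lo).
        hi = max(lo, ((k + 1) * num_frames - 1) // num_samples)
        indices.append(rng.randint(lo, hi))  # randint is inclusive on both ends
    return indices


rng = random.Random(0)
short_clip = sample_frame_indices(num_frames=4, num_samples=8, rng=rng)
long_clip = sample_frame_indices(num_frames=100, num_samples=10, rng=rng)
```

Because each interval contributes exactly one index, the sampled indices stay in temporal order, and repeated calls on a long clip give slightly different frame selections.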

For each coordinate, we apply a spatial normalization where the origin of the 2D coordinate system is moved to the temporal mean of the neck joint for that sign. In addition, random horizontal mirroring and random spatial coordinate translation augmentations were found to improve recognition performance and were thus added to our pipeline.
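The normalization and augmentation steps can be sketched as follows, assuming each frame is a list of `(x, y, confidence)` triplets. The helper names, the neck joint index, and the shift range are illustrative assumptions rather than values reported in the paper; the sketch also omits swapping left and right joint indices after mirroring, which a full implementation may need.

```python
import random


def center_on_neck(sequence, neck_index):
    """Move the 2D origin to the temporal mean of the neck joint."""
    n = len(sequence)
    mean_x = sum(frame[neck_index][0] for frame in sequence) / n
    mean_y = sum(frame[neck_index][1] for frame in sequence) / n
    return [[(x - mean_x, y - mean_y, c) for (x, y, c) in frame]
            for frame in sequence]


def augment(sequence, rng, mirror_prob=0.5, max_shift=0.1):
    """Random horizontal mirroring plus one random translation per clip.

    `mirror_prob` and `max_shift` are hypothetical hyperparameters.
    """
    flip = rng.random() < mirror_prob
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    return [[((-x if flip else x) + dx, y + dy, c) for (x, y, c) in frame]
            for frame in sequence]


# Toy clip: two frames, two joints each; joint 0 plays the role of the neck.
clip = [[(1.0, 2.0, 0.9), (2.0, 2.0, 0.8)],
        [(3.0, 4.0, 0.9), (4.0, 4.0, 0.7)]]
centered = center_on_neck(clip, neck_index=0)
augmented = augment(centered, random.Random(0))
```

Centering is applied once per sign, while the augmentation is meant to be redrawn every time a clip is loaded for training.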

The classifier base of our model is the SL-GCN model proposed by Jiang et al. [22]. The model takes as input a sequence of fixed length coordinates and outputs class prediction probabilities for sign language gestures.

We model the problem so that nodes of the graph correspond to joint landmark locations. The spatio-temporal graph adjacency matrix $A$ is constructed in the spatial domain according to anatomical spatial ordering, where neighboring joints are assigned a value of 1 and all other joints are assigned a value of 0. In the temporal domain, all the joints are only connected to themselves.
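A minimal sketch of the spatial part of such an adjacency matrix is given below. The four-joint chain in `TOY_BONES` is a hypothetical example and does not reproduce the 30-joint layout used in the paper.

```python
def build_adjacency(num_joints, bones):
    """Binary, symmetric spatial adjacency: 1 for anatomically linked joints."""
    adjacency = [[0] * num_joints for _ in range(num_joints)]
    for a, b in bones:
        adjacency[a][b] = 1
        adjacency[b][a] = 1
    return adjacency


# Hypothetical chain: neck(0) - shoulder(1) - elbow(2) - wrist(3).
TOY_BONES = [(0, 1), (1, 2), (2, 3)]
A = build_adjacency(4, TOY_BONES)
```

The temporal connections (each joint linked to itself in adjacent frames) are typically handled by the temporal convolution rather than stored in this matrix.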

The architecture of the model consists of ten SL-GCN blocks for node and edge processing. Each SL-GCN block consists of a spatial convolution layer, multiplication with the adjacency matrix, and a temporal convolution layer. The proposed architecture can be seen in Figure 1.

Figure 1: Representative image of the proposed SL-GCN architecture. The proposed baseline method takes coordinates as input and outputs classification results for each video.

The system uses the video modality to obtain coordinates that provide joint information for both hands. As shown in the figure, the system obtains coordinates from the OpenPose library, which are then used for training. During training, target domain videos are used, and in transfer learning settings, source domain videos are also added to the training process by adding an equal number of videos from each domain to each batch. The baseline model is then trained using cross-entropy loss to perform classification.
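The balanced batching described above can be sketched as follows. The function name `mixed_batch` and the use of sampling with replacement (so that a small target set can still fill its half of the batch) are assumptions made for illustration.

```python
import random


def mixed_batch(source_items, target_items, batch_size, rng):
    """Build a batch with an equal number of source and target samples.

    Each element is returned as (item, domain_tag) so that later stages
    can tell the two domains apart.
    """
    half = batch_size // 2
    batch = [(x, "source") for x in rng.choices(source_items, k=half)]
    batch += [(x, "target") for x in rng.choices(target_items, k=batch_size - half)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
batch = mixed_batch(list(range(100)), list(range(5)), batch_size=8, rng=rng)
```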

The model includes the decoupling with drop-graph module proposed in [13] and the spatio-temporal channel (STC) attention modules proposed in [39]. The drop-graph module increases recognition power by dropping random joints along with their neighbors closer than an adjacency of $K$. Similarly, the STC attention module increases recognition power by focusing on important joints, frames, and coordinates for certain signs.

III-B Supervised Transfer Methods

For each different transfer learning setup, this baseline feature extractor is used with different loss functions and classifiers in our experiments. We use several different transfer learning methods that aim to perform transfer learning by focusing on different characteristics of the data.

The Domain Adversarial Neural Network (DANN) approach proposed in [17] uses a gradient reversal layer to learn domain-independent features. In the transfer learning setting, the model takes half of the samples in each mini-batch from the source domain and half from the target domain. The model has two classifiers, one for class prediction and the other for dataset prediction. The model learns to make accurate label predictions from samples of both datasets while adversarially trying to suppress any features that would allow it to distinguish between samples from different datasets, such as user physical or performance-related characteristics.

Figure 2: Using gradient reversal on the domain prediction loss, sign language recognition accuracy across source and target domains is improved.
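The behaviour of a gradient reversal layer can be illustrated with a framework-free sketch: the forward pass is the identity, while the backward pass negates (and optionally scales) the gradient coming from the domain classifier. The class below is a hypothetical stand-in; in practice this would be written as a custom autograd function in a deep learning framework, and the scaling factor `lam` is an assumed hyperparameter.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed trade-off weight for the adversarial signal

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated gradient, so it is pushed
        # to make the two domains harder to tell apart.
        return [-self.lam * g for g in grad_from_domain_classifier]


grl = GradientReversal(lam=0.5)
passed = grl.forward([0.2, -1.0, 3.0])
reversed_grad = grl.backward([1.0, -2.0, 0.0])
```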

The Domain Specific Batch Normalization (DSBN) approach shares all model parameters except batch normalization layers within the feature encoder. The batch normalization layer in a neural network regularizes feature representations from different domains without taking into account class or domain information. However, when domain discrepancy is significant, the effect of batch normalization is diminished. In this approach, individual batch normalization layers keep track of unique normalization parameters and batch statistics for each domain as seen in Figure 3.

Figure 3: Domain specific batch normalization layers are learned to perform batch specific normalization.
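To make the idea concrete, the sketch below keeps a separate set of affine parameters and running statistics for each domain and selects one by a domain index. It normalizes a single scalar feature for brevity; the class name, momentum, and epsilon values are illustrative assumptions and not the configuration used in the experiments.

```python
import math


class DomainSpecificBatchNorm:
    """Batch normalization over one scalar feature with per-domain state."""

    def __init__(self, num_domains, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.state = [
            {"gamma": 1.0, "beta": 0.0, "running_mean": 0.0, "running_var": 1.0}
            for _ in range(num_domains)
        ]

    def __call__(self, values, domain, training=True):
        s = self.state[domain]  # only this domain's statistics are touched
        if training:
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            s["running_mean"] = (1 - m) * s["running_mean"] + m * mean
            s["running_var"] = (1 - m) * s["running_var"] + m * var
        else:
            mean, var = s["running_mean"], s["running_var"]
        scale = s["gamma"] / math.sqrt(var + self.eps)
        return [scale * (v - mean) + s["beta"] for v in values]


bn = DomainSpecificBatchNorm(num_domains=2)
normalized = bn([0.0, 2.0], domain=0)  # e.g. domain 0 = source, 1 = target
```

All other layers would remain shared between domains; only the normalization state is duplicated.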

The Joint Adaptation Network (JAN) approach proposed in [31] maps features from different domains into a new data space where inter-class features have a more significant similarity. The method proposed, named joint maximum mean discrepancy (JMMD), minimizes the joint probability distribution distance of the source and target class-specific layers. The approach adds task-specific layers on top of the base SL-GCN network to learn mapping to a common domain as seen in Figure 4.

Figure 4: JMMD loss is used to learn domain specific weights for the final layers of the network.
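A simplified, biased estimate of a joint MMD between source and target activations can be sketched as below. Each sample is a tuple of per-layer activation vectors, and the joint kernel is the product of Gaussian kernels over the layers. The single fixed bandwidth and the quadratic-time estimator are simplifying assumptions; published implementations commonly use multiple bandwidths and more efficient estimators.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def joint_kernel(sample_a, sample_b):
    """Product of per-layer kernels; each sample is a tuple of layer vectors."""
    value = 1.0
    for layer_a, layer_b in zip(sample_a, sample_b):
        value *= gaussian_kernel(layer_a, layer_b)
    return value


def jmmd(source, target):
    """Biased joint MMD estimate: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""
    n, m = len(source), len(target)
    ss = sum(joint_kernel(a, b) for a in source for b in source) / (n * n)
    tt = sum(joint_kernel(a, b) for a in target for b in target) / (m * m)
    st = sum(joint_kernel(a, b) for a in source for b in target) / (n * m)
    return ss + tt - 2 * st


# Toy activations from two layers (feature layer, classifier layer).
src = [([0.0], [0.0]), ([1.0], [1.0])]
tgt = [([5.0], [5.0]), ([6.0], [6.0])]
same = jmmd(src, src)
shifted = jmmd(src, tgt)
```

The value is close to zero when both domains produce the same joint activations and grows as they drift apart, which is what the loss penalizes.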

In a different approach, rather than using straightforward entropy minimization, the Minimum Class Confusion (MCC) approach proposed by [23] attempts to utilize the structure of the classification output matrix to perform transfer learning. The approach minimizes classification error by minimizing pairwise class confusion among unlabeled target data within a mini-batch. The model takes as input the logit matrix and multiplies the logit matrix by its transpose to approximate a confusion matrix where correlated classes yield higher pairwise scores for each other more often. Probability rescaling and uncertainty reweighting are used to emphasize samples that are more important for classification, while category normalization is used to balance the effect of each class in a mini-batch. Finally, a loss function is calculated that maximizes the diagonal of the correlation matrix and minimizes the confusion of each class as seen in Figure 5.

Figure 5: Calculation of minimum class confusion loss for supervised SLR. The class confusion matrix is calculated by multiplying the logit matrix by its transpose. Confident predictions increase the weight of values on the matrix diagonal.
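The steps of the MCC loss can be sketched in plain Python as follows: temperature-scaled softmax (probability rescaling), entropy-based sample weights (uncertainty reweighting), a class-by-class correlation matrix, row normalization (category normalization), and the mean off-diagonal mass as the loss. The temperature value and function names are assumptions for illustration, not settings reported in this paper.

```python
import math


def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mcc_loss(logit_rows, temperature=2.5):
    """Mean off-diagonal mass of the normalized class correlation matrix."""
    probs = [softmax(row, temperature) for row in logit_rows]
    batch, classes = len(probs), len(probs[0])
    # Uncertainty reweighting: confident (low-entropy) samples count more.
    entropies = [-sum(p * math.log(p + 1e-12) for p in row) for row in probs]
    raw = [1.0 + math.exp(-h) for h in entropies]
    weights = [batch * r / sum(raw) for r in raw]
    # Class correlation: entry (j, k) accumulates p_ij * p_ik over the batch.
    corr = [[sum(weights[i] * probs[i][j] * probs[i][k] for i in range(batch))
             for k in range(classes)] for j in range(classes)]
    # Category normalization: each row sums to one.
    norm = [[corr[j][k] / sum(corr[j]) for k in range(classes)]
            for j in range(classes)]
    off_diagonal = sum(norm[j][k] for j in range(classes)
                       for k in range(classes) if j != k)
    return off_diagonal / classes


confident = mcc_loss([[20.0, 0.0, 0.0], [0.0, 20.0, 0.0], [0.0, 0.0, 20.0]])
confused = mcc_loss([[0.0, 0.0, 0.0]] * 3)
```

Confident, well-separated predictions concentrate mass on the diagonal and give a small loss, whereas uniform predictions spread mass evenly across classes.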

IV Experiments

We evaluate our methods on the BSign22k and AUTSL datasets. Below we introduce and detail the transfer learning subsets derived from these datasets for each experiment.

IV-A Datasets and Setup

In this study, we have created a selection of similar isolated signs from two publicly available sign language datasets. Going over the signs based on meaning and visual similarity, we selected 57 common signs from among 216 AUTSL and 744 BSign22k signs. For BosphorusSign22k, we also divided the dataset by number of users to create two distinct target training settings: single-user (BSign22ksingle) and multi-user (BSign22kshared). The details of each dataset are given in Table I.

TABLE I: Number of identical sign pairs and total number of videos from the AUTSL and BSign22k datasets.

| Dataset | # signs | # train videos | # val. videos | # train users | # val. users |
| --- | --- | --- | --- | --- | --- |
| BSign22k | 744 | 17090 | 5452 | 4 | 2 |
| AUTSL | 216 | 28139 | 3742 | 31 | 6 |
| BSign22kshared | 57 | 1496 | 498 | 4 | 2 |
| BSign22ksingle | 57 | 377 | 498 | 1 | 2 |
| AUTSLshared | 57 | 7076 | 935 | 31 | 6 |

The training and validation splits of the datasets follow the conventions proposed in their original papers in order to make experimental results comparable with other studies on those datasets [34, 40]. In the dataset selection process, 40 of the 57 common signs are performed in nearly perfect similarity. To increase the domain difference between the two sets, the remaining 17 signs selected from the datasets show slight differences in hand orientation, movement direction, or the presence of additional morphemes as part of a compound sign. The dataset subsets will be made publicly available at https://github.com/alpk/tid-supervised-transfer-learning-dataset/.

Figure 6: Representative Images from the BSign22kshared (images to the left with green background) and AUTSLshared (images to the right with white & complex backgrounds) datasets

IV-B Experimental Results and Discussion

First, we evaluate the existing datasets and baseline recognition methods to choose hyperparameters and optimize our isolated SLR pipeline. Next, we evaluate the cross-domain adaptation experiments on the BSign22kshared, BSign22ksingle and AUTSLshared datasets. Then, we move on to transfer from the larger dataset and evaluate whether partial-set transfer learning from the AUTSL dataset is beneficial.

IV-B1 Baseline Isolated Recognition Results

To thoroughly evaluate the isolated recognition performance, we first evaluate the baseline recognition performance of the SL-GCN-based recognition method without involving transfer learning. We use several different single source and finetuning-based approaches to obtain baseline results.

TABLE II: Baseline accuracy values for isolated sign language recognition when no transfer learning is used during training.

| Method | AUTSL | BSign | AUTSLshared | BSignshared |
| --- | --- | --- | --- | --- |
| Target Only | 91.22 | 89.25 | 97.78 | 92.97 |
| Finetuning on AUTSL | - | 89.86 | - | 95.78 |
| Finetuning on BSign22K | 91.38 | - | 97.96 | - |

In Table II, single-dataset and cross-dataset accuracy results from each dataset are reported. Since we assume that no knowledge of shared classes is present, the final fully connected layer of the model, which acts as the classifier, is discarded when transferring the pretrained layers. Results show that improvement is observed when transferring from the larger AUTSL dataset to the smaller BSign22k dataset. The improvement is more significant on the BSign22kshared dataset, where the total number of training videos is even smaller. On the other hand, transfer from the smaller BSign22k to AUTSL yields minimal improvement.

IV-B2 Supervised Closed Set Transfer Learning on BSign22k

In this experimental setting, both the source and target datasets contain identical labels from 57 classes. Knowing the shared classes between the two datasets enables us to use supervised transfer learning approaches to facilitate better knowledge transfer between the datasets. The target only setting explores the accuracy of the baseline method without any input from the larger source domain. Likewise, the source only setting reports the accuracy of the model trained on the source domain without seeing the target domain. The two initial baseline transfer methods are the combined and finetuning approaches. In combined, half of the samples in each mini-batch are sampled from respective source and target domains, and the model is trained with a single cross-entropy loss. In finetuning, differing from IV-B1 where we discarded the classifier fully connected layer, all layers are utilized during transfer. During finetuning, we initially freeze the feature extractor for ten epochs before unfreezing all layers and resuming training.
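The freeze-then-unfreeze schedule used for finetuning can be expressed as a small helper. The sketch assumes zero-based epoch counting, and the function and key names are hypothetical; in a real training loop the returned flags would be used to enable or disable gradient updates for the corresponding parameter groups.

```python
def trainable_parts(epoch, freeze_epochs=10):
    """Return which model parts receive gradient updates at a given epoch."""
    return {
        "feature_extractor": epoch >= freeze_epochs,  # frozen during warm-up
        "classifier": True,                           # always trained
    }


schedule = [trainable_parts(e)["feature_extractor"] for e in range(12)]
```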

TABLE III: Accuracy results for closed set transfer learning where only shared signs are present in training and evaluation. Each method uses AUTSLshared as source and BSign22kshared as target domains. The performance of different transfer learning methods is evaluated. The last two columns report validation accuracy for each train-target subset.

| Method | Train source | Train target | Validation (train-target: BSign22ksingle) | Validation (train-target: BSign22kshared) |
| --- | --- | --- | --- | --- |
| target only | - | BSign22kshared | 68.28 | 92.97 |
| source only | AUTSLshared | - | 65.26 | 65.26 |
| combined | AUTSLshared | BSign22kshared | 85.11 | 95.8 |
| finetuning | AUTSLshared | BSign22kshared | 85.34 | 96.19 |
| DANN | AUTSLshared | BSign22kshared | 86.41 | 96.54 |
| MCC | AUTSLshared | BSign22kshared | 86.84 | 97.15 |
| JAN | AUTSLshared | BSign22kshared | 88.48 | 94.23 |
| DSBN | AUTSLshared | BSign22kshared | 84.34 | 96.82 |

In Table III, we report closed set transfer learning results for the setting where AUTSLshared is the source task and the BSign22kshared and BSign22ksingle subsets are the targets. The benefit of closed set transfer learning is higher when the number of samples in the target training set is minimal (the single-user case). In such a setting, the JAN method reaches 88.48%, surpassing the baseline finetuning accuracy. Although the baseline performance on BSign22kshared is higher, transfer learning still yields significant gains. In that setting, MCC beats the baseline finetuning and combined approaches, achieving 97.15% accuracy, while methods such as JAN and DANN show a negligible increase.

IV-B3 Partial Set Transfer Learning on BSign22k

Utilizing transfer from a dataset with a larger vocabulary is also a common use case in supervised transfer learning. The source domain contains 216 signs, of which 57 are common with the target dataset and 159 are additional. Class labels for both datasets are aligned so that all methods in this setting use the same 159 class classifier. The results of these experiments are presented in Table IV. With all baseline and transfer learning approaches, partial-set transfer from the larger AUTSL training set surpasses the transfer results from the AUTSLshared subset. For the BSign22ksingle and BSign22kshared subsets, the accuracy figures reach 90.56% and 98.63%, respectively, with the MCC algorithm. These results show improvements of roughly five and one percentage points over the transfer learning baselines with combined and finetuning-based training approaches in the same experimental setting.
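One way to align class labels across the two datasets is to map every target sign onto the index of the matching source sign, so that both domains share one classifier output space. The sketch below uses hypothetical gloss names and a hypothetical helper `align_labels`; the actual sign correspondence is defined by the released dataset subsets.

```python
def align_labels(source_classes, target_classes):
    """Map each target class name to its index in the source label space."""
    source_index = {name: i for i, name in enumerate(source_classes)}
    missing = [name for name in target_classes if name not in source_index]
    if missing:
        raise ValueError(f"target classes absent from source: {missing}")
    return {name: source_index[name] for name in target_classes}


# Hypothetical glosses: the source has extra signs the target never uses.
source_glosses = ["HELLO", "THANKS", "WATER", "SCHOOL", "DOCTOR"]
target_glosses = ["WATER", "HELLO"]
mapping = align_labels(source_glosses, target_glosses)
```

In the partial-set case the target only ever activates a subset of the output indices, while the remaining indices are trained from source samples alone.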

TABLE IV: Accuracy results for partial-set transfer learning. Each method uses AUTSL as source and BSign22kshared as target domains. The performance of different transfer learning methods is evaluated. The last two columns report validation accuracy for each train-target subset.

| Method | Train source | Train target | Validation (train-target: BSign22ksingle) | Validation (train-target: BSign22kshared) |
| --- | --- | --- | --- | --- |
| target only | - | BSign22kshared | 68.28 | 92.97 |
| source only | AUTSL | - | 62.79 | 71.23 |
| combined | AUTSL | BSign22kshared | 85.11 | 98.12 |
| finetuning | AUTSL | BSign22kshared | 88.75 | 97.14 |
| DANN | AUTSL | BSign22kshared | 88.75 | 98.19 |
| MCC | AUTSL | BSign22kshared | 90.56 | 98.63 |
| JAN | AUTSL | BSign22kshared | 90.16 | 96.72 |

Among the baseline methods in Table IV, the source-only method achieves lower accuracy than the same experiment in the closed set transfer learning setting. As the source-only classifier in partial-set transfer learning has more classes, it can make more mistakes. In addition, as the size of the source dataset increases, the transfer approaches DANN, MCC and JAN yield scores that are closer to those of baseline approaches such as combined training and finetuning. In Table V, we explore the combinations of these classifiers with finetuning and with each other on the BSign22kshared dataset. The fusion of these algorithms is achieved by initializing the feature extractor and classifier layers of the algorithm, freezing them for the first five epochs, and then applying the respective model architecture and loss combinations to a single model. Of the attempted methods, the finetuning and MCC approaches yield the most significant gains. In a greedy fashion, we combined this pair with several other methods, which gave us 98.8% accuracy on the BSign22kshared task.

TABLE V: Top-1 accuracy results for fusion of transfer learning methods.

| Method | Training source | Training target | Validation (BSign22kshared) |
| --- | --- | --- | --- |
| finetuning + DANN | AUTSL | BSign22kshared | 97.14 |
| finetuning + MCC | AUTSL | BSign22kshared | 98.59 |
| finetuning + JAN | AUTSL | BSign22kshared | 96.91 |
| finetuning + DSBN | AUTSL | BSign22kshared | 98.19 |
| finetuning + DANN + MCC | AUTSL | BSign22kshared | 92.57 |
| finetuning + JAN + MCC | AUTSL | BSign22kshared | 84.33 |
| finetuning + DSBN + MCC | AUTSL | BSign22kshared | 98.8 |

V Conclusions

In this paper, we establish a common sign language vocabulary subset from two publicly available Turkish Sign Language datasets and introduce experimental protocols for supervised transfer learning experiments. We believe the dataset will be a useful benchmark for testing novel supervised and unsupervised transfer learning methods for video classification.

We also propose a sign language classification method that uses graph convolutional neural networks and deep transfer learning mechanisms to improve isolated SLR performance on under-resourced sign language datasets. We experiment with two protocols, namely, closed set transfer learning and partial-set transfer learning. Experimental results show that when shared class knowledge is present, supervised transfer learning techniques improve the performance of isolated SLR. The observed improvement is more significant in the single-user test. In the case of closed set transfer learning, the improvements in performance over baseline methods can be attributed to the increased number of samples per class. Improvements with transfer methods such as MCC, DSBN, DANN, and JAN show that improved transfer efficiency from same-class samples of a different domain also leads to further improvement. In addition, in the case of partial-set transfer learning, the benefit of adding source data belonging to out-of-vocabulary signs becomes apparent as the number of signs in the training set increases and class knowledge is preserved during transfer.

Finally, the improvements observed with the fusion of MCC, finetuning, and DSBN show that combining transfer methods that target different aspects of neural network models, such as normalization layers, initialization, and loss functions, further increases the benefit gained from shared class knowledge.
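
To make the loss-function side of such a fusion concrete, the following is a minimal, framework-free sketch of a class-confusion penalty in the spirit of MCC [23], added to a standard cross-entropy term. It is an illustration under stated assumptions rather than the implementation used in this paper: the function names, the temperature of 2.5, and the trade-off weight are hypothetical, and the DSBN layers and the finetuning initialization are not shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mcc_loss(batch_logits, temperature=2.5):
    """Class-confusion penalty: small when predictions are confident and distinct."""
    probs = [softmax(row, temperature) for row in batch_logits]
    batch_size, num_classes = len(probs), len(probs[0])

    # Weight each sample by its prediction certainty (low entropy -> high weight).
    entropies = [-sum(p * math.log(p + 1e-12) for p in row) for row in probs]
    raw = [1.0 + math.exp(-h) for h in entropies]
    raw_total = sum(raw)
    weights = [batch_size * w / raw_total for w in raw]

    # Weighted class correlation matrix: entry (j, k) measures how often
    # classes j and k receive probability mass on the same samples.
    corr = [
        [sum(weights[i] * probs[i][j] * probs[i][k] for i in range(batch_size))
         for k in range(num_classes)]
        for j in range(num_classes)
    ]

    # Normalise each row, then average the off-diagonal (cross-class) mass.
    normalised = [[v / sum(row) for v in row] for row in corr]
    confusion = sum(
        normalised[j][k]
        for j in range(num_classes)
        for k in range(num_classes)
        if j != k
    )
    return confusion / num_classes


def cross_entropy(batch_logits, labels):
    """Mean negative log-likelihood of the correct class."""
    losses = [-math.log(softmax(row)[y] + 1e-12) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def fused_objective(labeled_logits, labels, target_logits, trade_off=1.0):
    """Hypothetical combined objective: classification loss plus confusion penalty."""
    return cross_entropy(labeled_logits, labels) + trade_off * mcc_loss(target_logits)


# Toy logits for three classes (made-up values).
confident = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
ambiguous = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
loss_confident = mcc_loss(confident)
loss_ambiguous = mcc_loss(ambiguous)
total = fused_objective(confident, [0, 1, 2], ambiguous)
```

In this sketch the penalty is lowest when each sample places its probability mass on a single class, and it approaches (C - 1) / C for C classes when predictions are uniform. A normalization-based method such as DSBN changes the network layers rather than the loss, so in principle the two can be combined without modifying either one.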

References

  • [1] M. Al-Hammadi, M. A. Bencherif, M. Alsulaiman, G. Muhammad, M. A. Mekhtiche, W. Abdul, Y. A. Alohali, T. S. Alrayes, H. Mathkour, M. Faisal, et al. Spatial attention-based 3d graph convolutional neural network for sign language recognition. Sensors, 22(12):4558, 2022.
  • [2] S. Albanie, G. Varol, L. Momeni, T. Afouras, J. S. Chung, N. Fox, and A. Zisserman. Bsl-1k: Scaling up co-articulated sign language recognition using mouthing cues. In European conference on computer vision, pages 35–53, 2020.
  • [3] S. Albanie, G. Varol, L. Momeni, H. Bull, T. Afouras, H. Chowdhury, N. Fox, B. Woll, R. Cooper, A. McParland, and A. Zisserman. BOBSL: BBC-Oxford British Sign Language Dataset. 2021.
  • [4] S. Aly and W. Aly. Deeparslr: A novel signer-independent deep learning framework for isolated arabic sign language gestures recognition. IEEE Access, 8:83199–83212, 2020.
  • [5] S. G. Azar and H. Seyedarabi. Trajectory-based recognition of dynamic persian sign language using hidden markov model. Computer Speech & Language, 61:101053, 2020.
  • [6] G. Bellitto, F. Proietto Salanitri, S. Palazzo, F. Rundo, D. Giordano, and C. Spampinato. Hierarchical domain-adapted feature learning for video saliency prediction. International Journal of Computer Vision, 129(12):3216–3232, 2021.
  • [7] M. Boháček and M. Hrúz. Sign pose-based transformer for word-level sign language recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 182–191, 2022.
  • [8] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. Advances in neural information processing systems, 29, 2016.
  • [9] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [10] X. Chai, H. Wang, and X. Chen. The devisign large vocabulary of chinese sign language database and baseline evaluations. In Technical report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS). Institute of Computing Technology, 2014.
  • [11] W.-G. Chang, T. You, S. Seo, S. Kwak, and B. Han. Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 7354–7362, 2019.
  • [12] M.-H. Chen, Z. Kira, G. AlRegib, J. Yoo, R. Chen, and J. Zheng. Temporal attentive alignment for large-scale video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6321–6330, 2019.
  • [13] K. Cheng, Y. Zhang, C. Cao, L. Shi, J. Cheng, and H. Lu. Decoupling gcn with dropgraph module for skeleton-based action recognition. In European Conference on Computer Vision, pages 536–553. Springer, 2020.
  • [14] MMPose Contributors. Openmmlab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose, 2020.
  • [15] S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [16] C. C. de Amorim, D. Macêdo, and C. Zanchettin. Spatial-temporal graph convolutional networks for sign language recognition. In International Conference on Artificial Neural Networks, pages 646–657. Springer, 2019.
  • [17] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
  • [18] Ç. Gökçe, O. Özdemir, A. A. Kındıroğlu, and L. Akarun. Score-level multi cue fusion for sign language recognition. In European Conference on Computer Vision, pages 294–309. Springer, 2020.
  • [19] I. Grishchenko and R. Valentin Bazarevsky. Mediapipe holistic—simultaneous face, hand and pose prediction, on device. Retrieved June, 15:2021, 2022.
  • [20] J. Huang, W. Zhou, H. Li, and W. Li. Attention-based 3d-cnns for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9):2822–2832, 2018.
  • [21] E. Jahn, R. Konrad, G. Langer, S. Wagner, and T. Hanke. Publishing deutsche gebärdensprache (dgs) corpus data: Different formats for different needs. In Proceedings of the Workshop on the Representation and Processing of Sign Languages at LREC, volume 2, 2018.
  • [22] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, and Y. Fu. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3413–3423, 2021.
  • [23] Y. Jin, X. Wang, M. Long, and J. Wang. Minimum class confusion for versatile domain adaptation. In European Conference on Computer Vision, pages 464–480. Springer, 2020.
  • [24] H. R. V. Joze and O. Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. arXiv preprint arXiv:1812.01053, 2018.
  • [25] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [26] A. A. Kindiroglu, O. Ozdemir, and L. Akarun. Temporal accumulative features for sign language recognition. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 1288–1297. IEEE Computer Society, 2019.
  • [27] O. Koller, S. Zargaran, and H. Ney. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4297–4305, 2017.
  • [28] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In 2011 International conference on computer vision, pages 2556–2563. IEEE, 2011.
  • [29] D. Li, C. Rodriguez, X. Yu, and H. Li. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In The IEEE Winter Conference on Applications of Computer Vision, pages 1459–1469, 2020.
  • [30] D. Li, X. Yu, C. Xu, L. Petersson, and H. Li. Transferring cross-domain knowledge for video sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [31] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, pages 2208–2217. PMLR, 2017.
  • [32] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019.
  • [33] C. Neidle, A. Thangali, and S. Sclaroff. Challenges in development of the american sign language lexicon video dataset (asllvd) corpus. In 5th workshop on the representation and processing of sign languages: interactions between corpus and Lexicon, LREC, Istanbul, Turkey, May 2012.
  • [34] O. Özdemir, A. A. Kındıroğlu, N. C. Camgöz, and L. Akarun. Bosphorussign22k sign language recognition dataset. arXiv preprint arXiv:2004.01283, 2020.
  • [35] M. Parelli, K. Papadimitriou, G. Potamianos, G. Pavlakos, and P. Maragos. Spatio-temporal graph convolutional networks for continuous sign language recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8457–8461. IEEE, 2022.
  • [36] T. Ringwald and R. Stiefelhagen. Adaptiope: A modern benchmark for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 101–110, January 2021.
  • [37] N. Sarhan and S. Frintrop. Transfer learning for videos: From action recognition to sign language recognition. In 2020 IEEE International Conference on Image Processing (ICIP), pages 1811–1815, 2020.
  • [38] P. Selvaraj, G. NC, P. Kumar, and M. Khapra. Openhands: Making sign language recognition accessible with pose-based pretrained models across languages. arXiv preprint arXiv:2110.05877, 2021.
  • [39] L. Shi, Y. Zhang, J. Cheng, and H. Lu. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29:9532–9545, 2020.
  • [40] O. M. Sincan and H. Y. Keles. Autsl: A large scale multi-modal turkish sign language dataset and baseline methods. IEEE Access, 8:181340–181355, 2020.
  • [41] O. M. Sincan and H. Y. Keles. Using motion history images with 3d convolutional networks in isolated sign language recognition. arXiv preprint arXiv:2110.12396, 2021.
  • [42] O. M. Sincan, A. O. Tur, and H. Y. Keles. Isolated sign language recognition with multi-scale features using lstm. In 2019 27th Signal Processing and Communications Applications Conference (SIU), pages 1–4. IEEE, 2019.
  • [43] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [44] B. Sun and K. Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450. Springer, 2016.
  • [45] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [46] M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
  • [47] B. Woll, R. Sutton-Spence, and F. Elton. Multilingualism: The global approach to sign languages. The Sociolinguistics of Sign Languages, pages 8–32, Dec. 2001.
  • [48] Y. Xu, J. Yang, H. Cao, K. Mao, J. Yin, and S. See. Aligning correlation information for domain adaptation in action recognition. arXiv preprint arXiv:2107.04932, 2021.
  • [49] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-second AAAI conference on artificial intelligence, 2018.
  • [50] Z. Zou and W. Tang. Modulated graph convolutional network for 3d human pose estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11477–11487, 2021.