
Received August 8, 2021, accepted September 2, 2021, date of publication September 7, 2021, date of current version September 20, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3110912

Deep Learning for Sign Language Recognition: Current Techniques, Benchmarks, and Open Issues

MUHAMMAD AL-QURISHI, (Member, IEEE), THARIQ KHALID, AND RIAD SOUISSI
Research and Innovation Division, Research Department, Elm Company, Riyadh 12382, Saudi Arabia
Corresponding author: Muhammad Al-Qurishi (mualqurishi@elm.sa)
This work was supported by the Research Department in Elm Company under the initiative of developing a Saudi Sign Language recognition system for emergency use.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna D'Ulizia.

ABSTRACT People with hearing impairments are found worldwide; therefore, the development of effective local-level sign language recognition (SLR) tools is essential. We conducted a comprehensive review of automated sign language recognition based on machine/deep learning methods and techniques published between 2014 and 2021 and concluded that the current methods require conceptual classification to interpret all available data correctly. Thus, we turned our attention to elements that are common to almost all sign language recognition methodologies. This paper discusses their relative strengths and weaknesses, and we propose a general framework for researchers. This study also indicates that input modalities bear great significance in this field; it appears that recognition based on a combination of data sources, including vision-based and sensor-based channels, is superior to a unimodal analysis. In addition, recent advances have allowed researchers to move from simple recognition of sign language characters and words towards the capacity to translate continuous sign language communication with minimal delay. Many of the presented models are relatively effective for a range of tasks, but none currently possess the necessary generalization potential for commercial deployment. However, the pace of research is encouraging, and further progress is expected if specific difficulties are resolved.

INDEX TERMS Sign language, deep learning, continuous model, machine learning, pose estimation.

I. INTRODUCTION
For millions of people, sign language communication is the primary means of interacting with the world, and it is not difficult to imagine the potential applications involving effective sign language recognition (SLR) tools [1], [2]. For example, we could translate broadcasts that include sign language, create devices that react to sign language commands, or even design advanced systems to assist impaired people in conducting routine jobs. In particular, deep neural networks (DNNs) have emerged as a potentially groundbreaking asset for researchers, and the full impact of their application to the problem of SLR will likely be felt in the near future [3], [4]. SLR is a field dedicated to the automated interpretation of hand gestures and other signs used in communications between people with a speech or hearing impairment. Because hardware and software components have evolved to the point where developing advanced systems with real-time translation capacities appears to be within reach, a large number of exciting and innovative solutions have been proposed and tested in recent years [5]–[9] with the objective of building fully functional systems that can understand sign language and respond to commands given in this format. However, before any truly practical applications can be considered, it is imperative to perfect the interpretation algorithms to the point where false positives are rare [6], [10]–[13].

Owing to the numerous challenges inherent in this task, at this stage it is not yet possible to design SLR tools that approach 100% accuracy on a large vocabulary [14], [15]. Thus, it is very important to continue developing new methods and evaluate their relative merits, gradually arriving at increasingly reliable solutions. While most researchers agree that deep learning models are the most suitable approach, the optimal network architecture remains a point of contention, with several competing designs achieving promising results. Detailed experimental evaluations are the only way to identify the best performing algorithms and refine these
further using discoveries from other research teams when applicable. As most countries use their own variations of sign language, much of the research is conducted locally with persons skilled in using regional signs. With this in mind, it is not surprising that a large number of scientific papers are targeting SLR problems and that the performance level of the proposed solutions is rapidly increasing from year to year [16], [17].

In the current literature, the various SLR solutions can essentially be divided into two major groups, depending on the primary data collection method. One group of methods relies on external sensors to gather insights regarding the actions of the signer, for example, through data gloves worn by the signer. Starner et al. [18] provided an early example of a system based on wearable sensors, while many other authors have exploited this concept since then. However, there are practical considerations regarding sensor-based techniques, and therefore a majority of recent research has been directed toward vision-based methods, which rely on images, video, and depth data to determine the semantic content of hand signs. For example, Chen et al. [19] pioneered a hand gesture recognition method based on skin color, while many alternative techniques have since been proposed, some of which are based on filtering principles [20].

In particular, the commercial launch of the Microsoft Kinect device has unlocked a completely new level of insight [21]–[23], and researchers are still exploring how to leverage the power of depth vision to develop more accurate SLR tools. In terms of the type of neural network most suitable for SLR purposes, the convolutional neural network (CNN) model [24] was one of the first to gain major attention [25]–[28]. In addition to CNNs, other architectures such as hidden Markov models (HMMs) [19] and recurrent neural networks (RNNs) are frequently applied [29]. The support vector machine (SVM) model is frequently used for this purpose as well [30], [31], while random forest (RF) and K-nearest neighbor (k-NN) are sometimes chosen for the classification task [29], [32]. We summarize our work contributions in this paper as follows:

1) Comprehensive review and taxonomy of automated sign language recognition (ASLR) literature: We conducted a comprehensive review of automated sign language recognition using machine/deep learning methods and techniques published between 2014 and 2021. We concluded that several SLR methods currently in existence require some conceptual classification to make sense of all available data. Thus, we focus on elements that are common to almost all sign language recognition methodologies and discuss their relative strengths and weaknesses regarding specific SLR tasks and functionalities as part of this study.

2) Establishment of a general framework for creating SLR models: We propose a general framework based on the challenges and limitations we have identified in the literature. At this point, the value of machine learning/deep learning (ML/DL) methodologies for sign language recognition is beyond question, although discussions regarding the most promising directions of research continue. There is consensus that deeper models hold more promise for the eventual development of real-life SLR applications than traditional machine learning approaches, but at present, even the most sophisticated models fall considerably short of the necessary reliability.

3) Benchmark datasets and performance: An analysis of the benchmark datasets and performance used in the literature is conducted. The quality of available sign language datasets is essential for ensuring that SLR tools built and tested with them return relevant predictions. However, the availability of high-quality datasets of this kind is limited, and in some cases barely sufficient for serious testing. Some of the datasets mentioned in the literature include the Corpus VGT, consisting of over 140 hours of video input and including approximately 100 classes; the PHOENIX14T dataset, with video recordings of 9 different signers using more than 1000 unique signs; PHOENIX-Weather2014T, with vocabulary related to weather; and ASLG-PC12, which includes various English-language versions of signs. Datasets are usually split into training, validation, and testing portions, so the models can be evaluated with the same type of input that was used to optimize them. However, due to the different datasets used in different studies, direct comparison of the results across studies is not possible.

4) Identifying open issues and challenges: After analyzing and discussing the existing methodologies, we draw some conclusions with respect to their limitations, open issues, and potential challenges. Differences between regional variations of sign language alphabets and vocabularies greatly complicate cross-border collaboration, especially considering the scarcity of high-quality datasets for languages with smaller numbers of speakers. This also makes it very difficult to develop and test more advanced applications, which require much larger training vocabularies. Most of the proposed methods are conceptually sound, yet they lack the level of accuracy and reliability that would be desired for a final solution. These problems are exacerbated in the continuous SLR sub-field, where semantic content is far more complex and thus more difficult to capture through statistical analysis.

The remainder of this paper is organized as follows. In Section II, we provide a brief background regarding some of the basic concepts discussed in this paper, such as deep learning and machine learning. Section III presents the review method used in this study. Machine learning and deep learning methods to design sign language recognition models are discussed in detail in Section IV along with the proposed framework. Types of models and languages related to the recognition process are discussed in Section V and Section VI, respectively. The related studies and surveys have
been discussed in Section VII. Section VIII introduces the benchmark SLR datasets used for ML/DL and provides a comparative analysis of the ML/DL methods' performance for sign language recognition. Section IX discusses open issues, challenges, and opportunities for future research. Finally, the conclusions of our study are presented in Section X.

II. BACKGROUND
In recent years, there have been ongoing efforts to develop automated methods for the completion of numerous linguistic tasks using advanced algorithms that can 'learn' based on past experience [33]. Sign language recognition (SLR) is an area where automation can provide tangible benefits and improve the quality of life for a significant number of people who rely on sign language to communicate on a daily basis [34]. The successful introduction of such capabilities would allow for the creation of a wide array of specialized services, but it is paramount that automated SLR tools are sufficiently accurate to avoid creating confusing or dysfunctional responses. In this section, we provide a brief background regarding some important approaches that have been utilized for automated SLR.

A. MACHINE LEARNING (ML)
The machine learning concept encompasses a number of stochastic procedures that can be used to predict the value of a certain parameter based on similar examples that the algorithm was previously exposed to. A simple example, illustrated by Algorithm 1, shows how a general formalization of the learning process takes place. There are many different methodologies that belong to this group; some of the best-known methods include naïve Bayes, random forest, K-nearest neighbor, logistic regression, and the support vector machine [33], [35]. All of these methods undergo a training phase, which can be either supervised (using labeled input data) or unsupervised (without labeled data), and use input features to establish connections among variables and acquire predictive power. However, owing to their simplicity, such methods have limitations when there is a need to capture nuanced semantic hints, as is the case with most linguistic tasks. On the other hand, they can often provide the foundation for the development of more powerful analytic tools and serve as a measuring stick to evaluate progress.

Algorithm 1 Learning Process
Input: x, a d-dimensional vector of features
Output: y, the output decision
1: Target function f : X → Y (the ideal formula, unknown)
2: Data: (x1, y1), (x2, y2), ..., (xN, yN) (training examples)
3: Hypothesis g : X → Y (formula to be used)
4: Learning algorithm: g ≈ f (final hypothesis)

Machine learning techniques are used to aid in sign language recognition and have achieved some degree of success. Some of the earliest studies in this field were based on data input from wearable sensors, which provide a very direct translation of a user's movements. The data can be filtered using techniques such as SVM to provide a reasonably accurate recognition of the intended sign. Some of the aforementioned machine learning methods are used primarily to analyze static content (i.e., individual signs isolated in time and space), while in some cases, there have been attempts to interpret continuous segments of sign language speech, necessitating the use of dynamic models such as dynamic time warping or relevance vector machines. In general, basic stochastic models are better suited for simple SLR tasks, which is why they were extensively used in the early stages of research. These statistical models typically require less computing power than more complex architectures, although this depends on the number of analyzed features as well as the size of the dataset. As more complex ASLR applications naturally require the inclusion of additional variables and sometimes additional modalities, the simplicity of basic models remains attractive. Thus, simpler machine learning methods remain valuable tools and often serve as comparison benchmarks that can be used to evaluate the properties of newly proposed methods.

B. DEEP LEARNING
Recently, basic machine learning approaches have been largely replaced with deeper architectures that employ several layers and pass information in vector format between layers, gradually refining the estimation until positive recognition is achieved. Such algorithms are usually described as "deep learning" systems or deep neural networks, and they operate on principles similar to the machine learning strategies described above, although with far greater complexity. Based on the structure of the network, two architectures are commonly used for a number of different tasks: recurrent neural networks (RNNs), which include at least one recurrent layer, and convolutional neural networks (CNNs), which include at least one convolutional layer. Depending on the number and type of layers, these networks can exhibit different properties and are generally suitable for different types of tasks, while the training phase decisively impacts the performance of the algorithm. The general rule is that larger and more specific datasets allow for more robust network training, and therefore the quality of the training set is an important factor. Additional fine-tuning of a model can usually be achieved by changing some of the relevant hyper-parameters that define the training procedure [36].

The majority of research involving the automation of SLR tasks is currently based on methods that rely on a combination of images and depth data, which generate a tremendous amount of information that often requires analysis in real time (or at least taking the temporal dimension into account). With larger and more diverse datasets, simple machine learning methods tend to underperform, which is why many of the more sophisticated models are based either on RNN or CNN design.
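To make the CNN-based approach concrete, the following is a minimal sketch of an isolated-sign image classifier, written in PyTorch purely for illustration; the 64 × 64 grayscale input, the 26-class output, and all hyper-parameters are assumptions for this example rather than settings taken from any surveyed system.

import torch
import torch.nn as nn

class SignCNN(nn.Module):
    # Two convolution/pooling stages followed by a linear classifier.
    def __init__(self, num_classes: int = 26):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)              # (N, 32, 16, 16) for 64 x 64 input
        return self.classifier(x.flatten(1))

model = SignCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # tunable hyper-parameter
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for real sign images.
images = torch.randn(8, 1, 64, 64)
labels = torch.randint(0, 26, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

In practice, the surveyed systems differ mainly in which modality feeds such a backbone (RGB, depth, or skeletal input) and in how per-frame outputs are aggregated over time.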

Deep networks can be trained using multimodal input (e.g., skeletal data combined with depth images from Microsoft Kinect), and in some applications, they can achieve a recognition accuracy of over 98% under optimal conditions. The advantages of deep learning were demonstrated by Konstantinidis et al. [37], who successfully used data from disparate sources to identify sign language words in isolated form, although their model displayed uneven performance depending on the dataset used. More demanding SLR tasks, such as interpretation of continuous speech or real-time translation, require even more sophisticated models, which in some cases require an increased number of layers (depth). While deep models appear to be a safe choice for the role of empowering automated SLR applications in the future, it remains to be seen whether the current architectures will survive in their present form or will evolve into new models that can 'understand' the semantic aspects of sign communication more astutely. Possible models that could be more widely used in the future include deep belief networks with a very large number of layers, as well as networks based on autoencoders.

III. REVIEW METHOD
In this study, we summarize and organize scientific data about the subject of Sign Language Recognition (SLR) for the benefit of the entire research community. In order to assist anyone interested in the fundamental knowledge in this field, we complemented the basic facts about each study with an impartial assessment of its quality and potential for positive contributions. We attempt to answer the following main research questions:
Question 1 – Which studies have been conducted addressing automated Sign Language Recognition, and what are the available datasets?
Question 2 – What techniques in Automatic Sign Language Recognition for various languages are applied to date?
Question 3 – Which challenges remain unsolved in this scientific field?
One of the ultimate objectives of this paper is to lay the groundwork for future inquiries about SLR and clarify any ambiguous elements that might confuse some researchers. We accomplished this in three phases – preparatory, execution, and presentation – with each stage including several steps. These steps included 1) selecting the most relevant research questions, 2) setting fundamental rules for the evaluation procedure, 3) formalizing the selection threshold, 4) assessing the quality of the work's premises and results, 5) looking into the methodological setup of the experiments, and 6) extracting any bits of information that contribute to answering the central questions.

A. REVIEW PROTOCOL
We followed a defined procedure during the literature review, allowing for a more objective evaluation of the paper content. This procedure consisted of numerous tasks, starting with selecting relevant variables, isolating the authors' strategic approaches, analyzing the methods and techniques used to obtain the results, sorting out quantitative output, and defining the principles for generalization and summarization.

B. INCLUSION AND EXCLUSION CRITERIA
During the collection of scientific works, a set of well-defined parameters was used to decide which works to include. Since the subject of this paper is SLR, only papers from this field were taken into consideration. The period covered is between 2014 and 2021, as shown in Fig. 1, as the idea was to provide a systematization of contemporary research. In Table 1, we provide a complete set of rules for selecting the research papers in a succinct format.

FIGURE 1. Number of publications on sign language recognition by year.

C. SEARCH STRATEGY
Finding the most relevant research material required an arduous process combing publicly available sources using a combination of automated tools and a human workforce. Specific keywords drove the automated segment, and these are displayed in Table 2.

This collection of studies is continually expanded through the addition of individual papers that match the same level of relevance as those found by the algorithm. We included all of the most significant online repositories of scientific content in the search, from Google Scholar, MDPI, Springer, Elsevier, and IEEE Xplore to ACM and arXiv. The proportion of papers from each source is shown in Fig. 2.

The overall objective at this stage was to discover as many works that address the topic of SLR as possible. After completing this stage, we carefully analyzed the entire corpus of collected material using the forward/back technique. This allowed for a more detailed understanding of each paper, with the ability to track all references and follow the significant lines of research. In this way, it was possible to ensure that no foundational studies are missing from the study and that the final collection of SLR papers is truly representative of the most successful research directions. We then processed the collection based on the Mendeley method, which made it possible to easily identify and remove identical items from the list, making the content more readily searchable. We noted several trends in this part of the process, which included a
breakdown of collected works based on the local variation of sign language they refer to. A majority of works in the collection (more than 30%) were related to the American variation, but French, Argentinian, Arabic, and many other SL variations are also represented, as seen in Fig. 3. Another factor that was used to differentiate between papers is the type of architecture of the proposed solution. A full overview is available in Fig. 4.

TABLE 1. Inclusion and exclusion criteria for SLR studies.

TABLE 2. Keywords for searching stage.

FIGURE 2. Number of publications on sign language recognition by publisher.

FIGURE 3. Number of publications on sign language recognition by language.

FIGURE 4. Number of publications on sign language recognition by architecture.

D. STUDY SELECTION PROCESS
During the initial search, we found 196 different papers, although 11 of them were duplicates that we immediately disqualified. All original papers were reviewed using the principles outlined in Table 5 and the information available on the first page of the paper. In this manner, we removed all works not connected to the research field, collected from unreliable sources, or with other weaknesses. Examination of this kind identified 47 entries that did not meet the inclusion criteria; 138 core and relevant studies remained. We examined the full text of the studies next, and removed any that failed to directly address SLR or to support their hypothesis with high-quality data. Next, we took the quality of all quotes and the correct naming of sources into account, and performed online checks to ascertain the authorship of source papers. In the last phase, we performed a qualitative evaluation to determine which studies deserved to be reported on. The entire selection procedure cut down the number of included works to 84, but their level of
scientific value and importance regarding the main research questions leaves nothing to be desired.

IV. AUTOMATED SIGN LANGUAGE RECOGNITION FRAMEWORK
Most automated SLR research is concerned with similar problems, namely the need to interpret hand and body movements associated with sign language characters in a clear and unambiguous manner. Because the main objectives are similar, the studies in this area also share similar methodology, even if their procedures may not be identical. Figure 5 presents the general model shared among the majority of researchers in this area. The input layer of the solution consists of an input device based on SLR data collection methods, as shown in Figure 6, and includes a visual display to present hand signs. The second layer is the pre-processing layer that performs gesture data filtering and can decode a sign into the required data format. In some cases, there are additional steps, such as sample normalization or merging information contained in successive frames of a video. The first procedure performed by the system after receiving sign data is feature extraction. All proposed methods have to provide solutions for the two most important tasks: extraction of relevant features, and classification of entries to determine the most likely sign being presented.

There are many different types of features that can be used as the primary source of information, such as visual features, hand movement features, 3D skeletal features, and facial features, among others. The selection of features to be included for algorithm training is one of the most important factors that determine the success of the SLR method. The data are typically processed and transformed into a vector format before being input to the modeling layer, and multiple channels may be fused together to analyze their joint contribution to sign recognition.

FIGURE 5. Automated sign language recognition framework.

A. DATA COLLECTION
The interactive computing domain has evolved extensively in recent times. Consequently, a need for efficient human–computer interaction techniques has arisen. Sign language recognition is among the methods that can support further development of this domain. Sign language recognition enables the transfer of well-known gestures to a receiver. Techniques used to collect sign language recognition data can be hardware-based, vision-based, or hybrid.

1) HARDWARE-BASED
Hardware-based approaches are designed to circumvent computer vision problems during sign language recognition. These challenges may develop when recognizing signs from a video, for example. In many cases, hardware-based approaches use devices or wearable sensors. Wearable devices used in sign language recognition often use sensors attached to the user or implement a glove-based approach. These devices (whether sensors, gloves, or rings) can convert sign language into text or speech. With respect to wearable sensors and devices, the authors in [38]–[41] describe how they capture depth and intensity images obtained from a Microsoft Kinect sensor and a SOFTKINECT sensor. A similar category of observations features direct measurement methods that involve the use of sensors attached to the hands or body, as well as motion capture systems [42]. Huang et al. [43] observed that sensor-based approaches are never natural because burdensome instruments must be worn. Instead, they propose a novel approach, Real-Sense, which can detect and track hand locations naturally.

In recent years, the vast popularity of device-based approaches has resulted in renewed interest toward developing human gesture and action recognition methods. Among the device-based methods, Kinect is more commonly applied than the Leap motion controller (LMC) or Google Tango [13], [38]–[40], [43]–[50]. Wang et al. [51] identified Leap Motion as an excellent product that uses computer vision to achieve a useful interactive function. The significance of LMC is reinforced by the fact that learning and practicing sign language is not common in society, as discussed in [52]. Some other methods rely on specially designed gloves for input such
as [53]–[55], while a range of other technological devices, such as accelerometers [56] and depth recording devices [57], have also been used. Some of the most basic sensor configurations include coloration of the fingers on gloves, as in [58], [59], allowing for easy and inexpensive motion tracking.

Gloves equipped with digital capture capacities were introduced by [60] and utilized to deduce hand signs of the Arabic sign language variation with a reduced number of sensors. While the cost of creating and using special equipment of this kind is considerable, it is still many times cheaper than purchasing some of the high-tech products available in the market. The authors in [61] chose a motion controller as the primary input device, which allowed them to track objects in three dimensions with extreme precision at a 120 fps rate. The controller they used was developed for the purpose of tracking hand motion, so the researchers were able to follow many key points on the hands from one frame to the next. The same device was used by [62] to differentiate between 50 unique isolated hand signs, with absolute precision attained.

2) VISION-BASED
In recent years, research on sign language recognition systems has focused more on vision-based methods because they provide little to no restraints on users, unlike sensor-based approaches. In vision-based techniques, depth and pose estimation data are collected from users. A discussion regarding depth data and pose estimation can be found in Section V. Some of the recent SLR studies rely on input in the visual format. For example, depth information and RGB are some of the formats that can be commonly encountered in this field, as demonstrated by [17]. Previous research by Rioux-Maldague and Giguere [44] indicates that the use of depth data has increased because of the increased number of 3D sensors available in the market. A Microsoft Kinect sensor was used in their experiment, which has an image resolution of 640 × 480 and uses a traditional intensity camera to obtain depth images. Recent publications have also obtained depth data using vision-based approaches [40], [63], [64]. Depth data can be in the form of video sequences [65]–[70] or images [40], [71]–[74] obtained using a normal camera or a mobile device. Oyedotun and Khashman [74] used hand gesture grayscale images measuring 248 × 256 pixels. According to Zheng et al. [17], the use of depth data is advantageous to maintain privacy and to streamline the human body extraction process. Furthermore, depth data are invariant to changes in illumination, hair, clothing, skin, and background [17].

Aside from depth data, pose estimation has been used to facilitate vision-based techniques. Rioux-Maldague and Giguere [44] used a combination of regular intensity images and depth images to group different hand poses. They tracked the hands using functions that are publicly available in the OpenNI+NITE framework. While using pose estimation, computationally heavy heat maps for 2D joint locations were generated, and a 3D hand pose was inferred based on inverse kinematics and depth channels. Koller et al. [63] further described the state-of-the-art aspect of hand shape recognition, where the configuration of a hand pose is determined by the positions and angles of the joints. Currently, many experiments use these joint positions and angles because they can be estimated based on depth images and pixel-wise hand segmentation. Other experiments, such as those by Zimmermann and Brox [75], use a hand pose estimation system combined with a classifier trained to recognize hand gestures.

While vision-based methods are non-invasive, they are constrained by the inadequate performance of conventional cameras. Another challenge is that uncomplicated hand features can cause ambiguities, while advanced features require extra processing time [39].

3) HYBRID
In some instances, hybrid approaches have been used to collect sign language recognition data. Hybrid methods exhibit similar or better performance compared to other methods with respect to proportional automatic speech or handwriting recognition. In hybrid approaches, vision-based cameras together with other types of sensors, such as infrared depth sensors, are combined to acquire multi-mode information regarding the shapes of the hands [76]. This approach requires calibration between the hardware and vision-based modalities, which can be particularly challenging. The fact that this method does not require retraining means that it is faster and can be used to examine the impact of deep learning techniques. Koller et al. [77] conducted an experiment and opted for the cleaner hybrid method, otherwise referred to as automatic speech recognition (ASR), to examine the direct impact of this type of data on a CNN.

Using still photos or continuous recordings in RGB format has the advantage of good resolution, but depth imaging does a better job at determining how far an item might be located from a fixed point. There are certain algorithms that use both types of visual data in combination [72]. Thermal imaging is also an intriguing possibility, even if it is used more rarely than the previous two formats. IR heat sensors can leverage the emission of radio waves or the reflection of light rays to construct an image as well. This type of information has been used with success for tasks such as facial recognition or body contouring, but has not yet found its way into SLR studies [78]. Skeletal data can also be used as a source of input, mostly in the form of hand joint positions during SLR gestures. Another type of input is derived from motion capture, where information changes are tracked from one image to another. Models of this kind usually define the optical sequence as a vector describing the movement of pixels in a series of still images, while a so-called scene sequence can be tracked in video materials, referring to the motion of three-dimensional objects within the scene relative to the distance from the camera lens [79].

While all of the input devices can be effective in the right scenario, their performance significantly fluctuates depending on the context.
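As an illustration of the multi-mode capture idea used by hybrid setups, the following minimal sketch fuses an RGB frame with an aligned depth map into a single four-channel array using NumPy; the array shapes and the simple channel-stacking scheme are assumptions for this example, and real hybrid systems additionally require calibration between the modalities, as noted above.

import numpy as np

rgb = np.random.rand(480, 640, 3).astype(np.float32)        # stand-in camera frame
depth = np.random.rand(480, 640).astype(np.float32)         # stand-in aligned depth map

# Scale depth to [0, 1] so the extra channel is comparable to the color channels.
depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)      # (480, 640, 4) RGB-D input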

Still, more advanced input sources such as depth sensors and Real Sense/Kinect recording systems can create three-dimensional representations which carry far more information than simple two-dimensional images from a fixed angle [78], [80].

B. SLR DATA PRE-PROCESSING AND FEATURE EXTRACTION FOR DEEP NEURAL NETWORKS
SLR data pre-processing plays a critical role in sign language recognition engineering. As such, data processing may involve sign representation, normalization and filtering, data formatting and organization, feature extraction, and feature selection.

1) SIGN REPRESENTATION
Sign language is a type of visual language that utilizes grammatically structured manual and non-manual sign representations to facilitate the communication process. These representations may range from the hand shape to the orientation of the palm, finger or hand movement and location, as well as head tilting, mouthing, and other aspects of facial expression. Tang et al. [39] used eight representative frames organized in a time sequence. Their representations showed the movement of two hands that began by moving closer to each other before moving apart. In [40], all gestures used in an experiment were represented by the hand of the signer. A hand segmentation phase was also used to represent the shape of the hand sign. Similarly, Koller et al. [63] represented 60 hand shape classes using a double state, while the garbage class was represented by a single state. Another experiment by Zhou et al. [81] evaluated only right-handed signers. In this case, the right hand was used to represent the dominant hand, while the left hand was the submissive hand. Hossen et al. [45] focused on the Bengali Sign Language, which has 51 letters that were represented in the experiment using 38 signs, developed by combining related sound alphabets into single signs.

In the Bahasa Indonesia language, one word is represented by at most five signs, as discussed in [69]. This means that every word and affix has an independent signed Indonesian (SIBI) representation and is represented by one sign that is consistently performed. Another experiment by Huang et al. [43] used 66 input units and 26 output units to represent 26 signs.

Past experiments have also attempted to compare body and hand features. In [15], it was observed that body features make up a somewhat better representation compared to hand features for sign language recognition. In essence, using body features improved the recognition of sign language by 2.27% [70]. These observations can be attributed to the fact that body joints are more dependable and robust than hand joints.

2) NORMALIZATION AND FILTERING
In machine learning and deep learning, normalization refers to all actions and procedures aimed at standardizing the input based on a set of predefined rules, with the ultimate objective of improving the performance of the AI tool. This procedure is typically performed during the data pre-processing stage, and may include various statistical operations or media processing tasks. The exact type of normalization procedure that is optimal for the current implementation depends on the format of the input (i.e., text, image, or video), the level of variability within the sample, the type of machine learning architecture, the purpose of the automation tool, etc. Due to its impact on performance, normalization is commonly included in most contemporary sign language recognition methodologies, and its contributions are empirically verified [59], [82]. As SLR studies use many different input modalities and pursue a range of different objectives, it is logical that the scope of normalization techniques found in this field is quite broad. Most of the techniques are visual in nature, and involve changes of images to fit them into a standard format that can be readily interpreted by the algorithm. This is frequently done by altering the data on the level of pixels, since this is how information is encoded in the machine learning models during the feature extraction and network training stages.

Some of the simplest examples of normalization methods used in SLR are image resizing and re-shaping, as demonstrated in Kratimenos et al. [83] and several other works [59], [84]. Garurel et al. [85] also normalize the size of each frame to fit feature map dimensions, using mean values and standard deviations obtained during training to find the most optimal size. Cropping is another frequently used method that can improve the quality of visual data and make sign recognition more reliable by removing sources of possible confusion for the algorithm. Input images are typically cropped in such a way as to eliminate all regions except those depicting the hands and face, which are crucial for sign language communication. In [86], cropped images are normalized based on the average length of the neck, thus negating the impact of the distance from the camera for every image. In [87], a benchmark signer is selected and input from other signers is standardized based on the positions of key joints. Contour extraction is used to this end as well, for example in [88], with the main focus on the areas corresponding to the hands, with the background removed from the image. For SLR methods that rely primarily on video for raw input, frame downsampling is frequently used to standardize the quality of various clips and reduce computational demands.

In [44], normalization and filtering processes were applied. The intensity histogram of an image was equalized and all pixels were normalized to the [0, 1] interval. Gabor filters were then applied to the processed images using four different scales and orientations. An attempt was made to apply bar filters to the depth and intensity images to obtain the primary contours of the hands. Gabor filters were also used in an experiment by Li et al. [76] to obtain hand features that could be used for classification. While using Gabor filters, images were normalized to a size of 96 × 96 pixels. In another experiment in [40], principal component analysis (PCA) filter convolutions learned from input images were used.
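A minimal sketch of the kind of image normalization steps described in this subsection (cropping to the hand region, resizing to a fixed resolution, scaling pixels to [0, 1], and simple mean removal) is given below, assuming OpenCV and NumPy; the crop box and the 96 × 96 target size are illustrative values rather than parameters taken from any particular study.

import cv2
import numpy as np

def normalize_frame(frame: np.ndarray, crop_box=(0, 0, 200, 200)) -> np.ndarray:
    x, y, w, h = crop_box                       # hypothetical hand bounding box
    hand = frame[y:y + h, x:x + w]              # crop away irrelevant regions
    hand = cv2.resize(hand, (96, 96))           # standardize the spatial size
    hand = hand.astype(np.float32) / 255.0      # scale pixel values to [0, 1]
    return hand - hand.mean()                   # simple per-image mean removal

frame = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)   # stand-in camera frame
normalized = normalize_frame(frame)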

As part of the preprocessing, Koller et al. [63] applied a per-pixel mean normalization to images and used pre-trained convolutional filters located in the lower layers of their CNN model. Zhou et al. [81] did not conduct any normalization process in their experiment because the extracted features occurred naturally in the range of [−1, 1]. Another experiment by Yang and Zhu [65] set a threshold to filter the minor skin-color area, an approach that enhanced the robustness of the system by using the second layer of their CNN model as a filter.

FIGURE 6. Primary data collection methods employed for SLR.

Other examples of normalization can be observed in the experiments by [67], [70], [71]. Balayn et al. [67] normalized Japanese sign language (JSL) motion sentences and used them as inputs and outputs for Seq2Seq models. Konstantinidis et al. [70] normalized hand positions, which were used as inputs for the classifier together with cropped hand regions. In their attempt to examine Chinese Sign Language, [71] obtained a total of 1,260 images of basic signs in Chinese, which were normalized to 256 × 256 optimized background samples. Their model used 16 filters in the first convolutional layer. The filters had a width and height of 7 and a channel width of 3. Similarly, Koller et al. [77] applied a global mean normalization process to images before fine-tuning their CNN model.

Experiments to format and organize data in various ways have been reported. Tang et al. [39] organized the hidden layers of their models using various planes within which all units shared similar weights. In another experiment by Jiang and Zhang [71], the data were divided into training and test sets, with the training set containing 80% of the total images and the test set containing the remaining 20%. In a different experiment that used a Kinect sign language dataset, Huang et al. [41] formatted and organized their data into 25 vocabularies that were extensively used in daily life. Each word was played by nine signers, and each signer repeated each word three times. Using this approach, each word was organized into 27 samples, yielding a total of 25 × 27 samples. Eighteen samples were selected for training, and the remaining samples were used for testing.

Many studies from this field also include filtering and data augmentation steps, which have the purpose of improving the quality of input and consequently boosting the accuracy of the model. Random sampling or discarding of frames is one of the most straightforward techniques found in the literature, where approximately 20% of input is eliminated. In [89], this technique is complemented by random changes of brightness, saturation, and other image parameters. Some of the data augmentation methods used in [90] include Gaussian Noise, Just Counter, and Future Prediction. The PoseLSTM tool also employs some operations aimed at augmenting the input images, with rotation of the hands around fixed points in the wrists as one of the most original ideas. As with normalization, the choice of filtering and data augmentation techniques is directly related to the properties of the model and the type of input, so it must be made with full understanding of each individual implementation and its objectives.

3) FEATURE EXTRACTION
Feature extraction is a crucial step in all SLR models, since it impacts how the models are trained and consequently how quickly they can become effective at distinguishing between different signs/words. In all cases, features are derived from raw data, and they refer to positions of body parts (key points in hands and face) relevant for sign language communication. Features are calculated based on statistical operations, and assigned weights proportional to their discriminatory value [90]. In effect, features are expressed as vectors in the latent space and allow the neural model to learn the probabilities of their association with particular classes.
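The following is a minimal sketch of how such key-point positions can be turned into a fixed-length feature vector; the 21-joint hand layout and the wrist-relative normalization are illustrative assumptions rather than the procedure of any specific surveyed method.

import numpy as np

def keypoints_to_features(joints: np.ndarray) -> np.ndarray:
    # joints: (21, 2) array of (x, y) hand-joint positions in pixel coordinates.
    wrist = joints[0]                           # use the wrist as a reference point
    centered = joints - wrist                   # translation invariance
    scale = np.linalg.norm(centered, axis=1).max() + 1e-8
    return (centered / scale).flatten()         # 42-dimensional feature vector

hand_joints = np.random.rand(21, 2) * 640       # stand-in output of a joint detector
features = keypoints_to_features(hand_joints)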

Several different feature engineering schemes are discussed, and in some cases a special tool was used for their extraction. The final number of features as well as the weight distribution between them is typically optimized based on their impact on the accuracy and scalability of the model [40], [81]. Various authors [38], [39] conducted feature extraction processes in their sign language recognition experiments. Wu and Shao [38] carried out high-level feature extraction by fixing the architecture of the network as [NX, N2, 1000, 1000, 1000, 1000, NTC], where NX represents the dimension of the observation domain and N2 represents the number of hidden nodes. In another experiment, Rioux-Maldague and Giguere [44] presented a novel feature extraction method for recognizing hand poses using depth and intensity images. Images were de-interlaced by maintaining every other line in an image to resize them from 128 × 64 to 64 × 64. Each resulting 64 × 64 image was unrolled as a 1 × 4096 intensity vector. Tang et al. [39] extracted hand features by considering the two hands as a whole, making the recognition process much more accurate. A similar experiment in [40] used PCANet for feature extraction to solve the challenges associated with processing different image modalities. Li et al. [42] exemplified the process of feature extraction by transferring sensor signals from both hands into feature vectors. Such an approach bypasses the need to reconstruct the precise shape of the hand, its orientation, and position.

Likewise, Camgoz et al. [64] used 2D CNNs to conduct spatial feature extraction. The 2D convolution layers obtained feature maps by using weights to convolve images. Additionally, observations from [21] also demonstrated that various stages of convolution and subsampling can be used to extract spatial-temporal features. Based on these principles, Huang et al. [41] extracted hand-crafted features from a video containing sign language and used the features to train a Gaussian mixture model-hidden Markov model (GMM-HMM). Unlike Huang et al. [41], who oversaw the feature extraction process manually, features such as finger length, finger width, and angle of the finger were input directly to the DNN in a separate study [43]. Instead of using 2D CNNs, some experiments have used 3D CNNs owing to their capability to consider spatial and temporal relationships. For instance, the authors in [11] used a ResNet model rooted in a 3D CNN model to generate a representation of each video clip considered. Within the same domain, the authors in [45] developed a neural network that uses a stack of layers to extract features. In [71], a convolution layer was used to extract various features of the input. The authors in [72] used a trained CNN as the feature extractor for an SVM.

Another experiment by Konstantinidis et al. [37] extracted a mixture of video and skeletal features from video sequences. The video features were the image and optical flow, while the skeletal features were the body, hand, and face. The VGG-16 network pre-trained on ImageNet was used to extract video features, whereas FlowNet2 was used for the optical flow images. A similar experiment by Konstantinidis et al. [70] used a mixture of the ImageNet VGG-19 network and conv44 for feature extraction. The key features extracted during the experiment included 18 body and 21 hand joints in 2D. Rao and Kishore [68] conducted human-like feature extraction and recognition. These features are those used by human interpreters to recall signs accurately.

There have been a few experiments that seek to avoid or simplify the feature extraction process. For instance, Yang and Zhu [65] used a CNN owing to its capability to avoid complicated feature extraction processes. Therefore, it allowed direct image input into the sign language recognition system.

4) FEATURE SELECTION
Feature selection is a crucial step in the design of practically all machine-learning-based sign language recognition models. Basically, it involves a reduction of data to a limited number of relevant statistical parameters, which are then fed into the machine learning network as input [91]. The idea is to include only those features that significantly contribute to the ability of the algorithm to distinguish between different classes, effectively limiting the number of computations necessary to obtain an accurate prediction. Thus, the exact number of selected features may vary from one model to the next, depending on the type of algorithm used, the structure and volume of raw data, and the main tasks that the machine learning classifier will be expected to complete [92].

Researchers use many different methodologies to rank features based on their relevance and select those that deserve to be included. Broadly speaking, there are two major types of feature selection techniques – supervised and unsupervised [91]. In terms of the principles used to rank the features, we can talk about Filter methods (such as variance threshold, correlation coefficient, or Chi-square test), which capture some of the native properties of each feature, and Wrapper methods (i.e., forward feature selection or backward feature elimination), which measure how a proposed set of features works with a particular algorithm [92]. There are also Embedded (LASSO regularization or random forest importance) and Hybrid approaches, which combine some of the main strengths of both Filter and Wrapper methods. With so many possibilities for feature selection, researchers need to take the specifics of their project into account and use the scheme that best suits the classifier, the key tasks, and the data [93].

Some experiments that conduct feature selection include those in [39], [81]. In [39], a deep neural network was used, reducing the need to manually select certain features. The deep neural network autonomously detects and obtains useful features. Another example of the feature selection process was presented in [81], where 215 distinct test sentences were selected to represent conventional conversations in sign language. Another experimental work by Konstantinidis et al. [70] selected only 12 out of the total of 18 features extracted from body skeleton joints. The selection was based on the fact that the signers in a sign language dataset are usually in a sitting position, and the skeleton joints of their legs are usually not visible. Apart from CNN, some experiments also used PCA to facilitate the feature selection process. The use of PCA is guided by the fact that PCA is a conventional dimensionality reduction approach that can be useful when processing image data, which typically involves high-dimensional space. For instance, the authors in [76] used PCA to conduct feature selection and dimensional reduction.
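As a concrete illustration of the filter-style selection methods mentioned above (a variance threshold followed by a univariate ranking), the following minimal sketch uses scikit-learn; the feature dimensions, class count, and the choice of 50 retained features are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X = np.random.rand(300, 200)                    # stand-in extracted feature vectors
y = np.random.randint(0, 10, size=300)          # stand-in sign labels

X = VarianceThreshold(threshold=1e-3).fit_transform(X)     # drop near-constant features
X = SelectKBest(f_classif, k=50).fit_transform(X, y)        # keep the 50 most informative
print(X.shape)                                               # (300, 50)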

and selection of features. In essence, a DNN has the capability experiment showed that the rate of sign language recognition
to autonomously analyze and generate features from raw data. for 26 letters was 80.30% when using SVM and 93.81% when
using DNN. It was also observed that the recognition rates
C. SIGN LANGUAGE MODELING AND RECOGNITION for a grouping of 26 letters and 10 digits were slightly lower
Sign language modeling focuses on developing an articulate at 72.79% for SVM and 88.79% for DNN. The performance
model from the phonetic to the semantic level for language of the SVM was inferior to that of the DNN in sign language
representation. The modeling process covers various aspects, ranging from the use of the signing space to the synchronization of manual and non-manual features such as eye gaze and facial expressions. Sign language recognition, on the other hand, entails pattern matching, computer vision, linguistics, and other aspects of natural language processing [94]. The objective of sign language recognition is to establish methods and algorithms that can recognize already developed signs and perceive their meaning. The techniques for sign language modeling and recognition discussed in this section include classic machine learning, deep learning, continuous SLR models, and isolated SLR models.

1) MACHINE LEARNING
Machine learning refers to the science of using computers to complete a task without programming them explicitly. Machine learning algorithms are usually provided with general guidelines that characterize the model along with the necessary data, which contain the information the model needs to complete a given task; the algorithm achieves its task when the model is adjusted based on those data. Examples of machine learning algorithms include SVM, PCA, and LDA, among many others.

a: SUPPORT VECTOR MACHINES
A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. Feeding new sets of labeled training data into an SVM model can result in groupings of new examples. Past experiments have used SVMs for this and many other purposes. Nguyen and Do [72] applied multiclass SVMs to learn extracted data; the validation accuracy of the CNN-SVM model was lower than that of the HOG-LBP-SVM model, but the CNN-SVM model had a better chance of avoiding overfitting. The demand for real-time performance was evaluated in [76] by comparing the most popular classifiers, softmax and linear SVM. SVM and softmax performed better than other advanced classifiers in terms of accuracy, and it was observed that an SVM classifier with a linear kernel required more training time but performed better than the softmax-based classifier. Similarly, an experiment by [43] compared the performance of a DNN and an SVM on the same dataset; the results indicated that the DNN had a better recognition rate than the SVM. The authors in [46] likewise identified SVM as a suitable classifier for real-time sign language recognition. While exploring American Sign Language, Chong and Lee [52] used an SVM and a deep neural network as sign language recognizers. Similarly, Huang et al. [49] applied SVM to their approach for recognizing a large-vocabulary sign language. The SVM scheme used in the experiment facilitated the process of mean pooling over clipped features to produce a fixed-dimension vector as the video feature representation. Huang et al. [49] trained an SVM for classification based on video features, but noted that their machine learning approach disregards temporal information during the mean-pooling process. The effectiveness of the SVM in a hybrid system was also evaluated in [95]. The experiment examined the classification accuracy of a HOG+SVM system, which included a HOG feature extractor that produced 64-dimensional features and an SVM classifier that was fed canonical handshapes. Improvements in accuracy with the HOG+SVM system were between 14.18% and 18.33% compared to SVM alone.

b: PRINCIPAL COMPONENT ANALYSIS
PCA is used in computer vision to reduce dimensionality or to extract features, and many recent experiments have used it in sign language recognition as a dimensionality reduction mechanism. PCA can best be described as an orthogonal linear transformation that converts the original data into a new coordinate system with a reduced number of dimensions. In [40], a fingerspelling recognition system based on PCA was proposed; the convolutional layer of the proposed PCANet system features PCA. Another investigation focused on training a CNN on 1 million hand images using PCA [63]: Koller et al. [63] utilized 1024-dimensional feature maps and applied PCA to reduce the dimensionality to 200. Another experiment by [67] used PCA to select data streams exhibiting high variance, represented by approximately 492 dimensions; the use of PCA on Kinect data also helped to reduce overfitting. In a different experiment, [51] used PCA to expand a matrix into a 210-dimensional vector, which is useful in the creation of an enhanced mel frequency cepstral coefficient (MFCC) scheme for sign language recognition.

Some experiments have compared their proposed approaches to a hybrid version of PCA. In [96], the proposed method was compared to other methods, including SAE+PCA; the comparison indicated that SAE+PCA performed better than the proposed method, achieving 99.05% accuracy. Other experiments have shown interest in a variation of PCA, referred to as recursive principal component analysis (RPCA), for feature extraction. While exploring the features of SLR systems, [97] reported that using RPCA achieved a classification rate of 98%.
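To make the PCA-plus-SVM workflow described above concrete, the following is a minimal sketch (not any specific study's implementation) of reducing hand-shape descriptors with PCA and classifying them with a linear-kernel SVM. The feature dimensionality, the 200 retained components, the 24-letter label set, and the random stand-in data are illustrative assumptions.

```python
# Hypothetical PCA + linear SVM pipeline for static hand-shape classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))   # stand-in for 1024-D per-frame descriptors (e.g., HOG)
y = rng.integers(0, 24, size=500)  # stand-in labels for 24 static letters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize, keep 200 principal components, then fit a linear SVM classifier.
clf = make_pipeline(StandardScaler(), PCA(n_components=200),
                    LinearSVC(C=1.0, max_iter=5000))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```

With real descriptors the PCA step mainly trades a small amount of accuracy for a much smaller feature vector and reduced overfitting, which mirrors the reported use of PCA as a pre-classification dimensionality reducer.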

2) HIDDEN MARKOV MODEL (HMM)
This method relies on statistical operations that can reveal trends in the complex interaction of motions within a space-time continuum. It was first applied to the field of SLR by [98] in 1996, while [99] used it in 1997 to classify isolated hand gestures based on visual input, achieving solid performance with optimal parameters. Variations such as dual HMM [100] and factorial HMM [101] were suggested at approximately the same time, seeking to build on the promising performance of the base model. Those studies confirmed that the model requires a lot of data during the training stage in order to arrive at sound statistical projections. Soon after, Wilson and Bobick [102] proposed a parameter-based improvement of this method, while the authors in [103] proposed using parallel computing within this paradigm. The same principle was developed further by [104] to solve language-based problems. This approach was demonstrably more cost-efficient than any of the earlier HMM implementations and reached accuracy in excess of 94% for static signs and over 84% for dynamic signs in continuous signing when 80% of the sample was used for training and 20% for testing. Another class of models from this group, called input & output HMM, was first developed by [105] to deal with less homogeneous material. The same concept can be applied successfully to track hand positions during sign language communication, as demonstrated by [106], with output accuracy of more than 70% when distinguishing between 16 signs based exclusively on hand movement. Further development of the input/output HMM model was achieved in 2009 by [107], who introduced a cut-off point and thus managed to push the accuracy above 90%, albeit only when the total number of signs to be recognized was smaller than 20. An alternative was proposed by [108] in 2003, who called their method Left & Right HMM but were unable to significantly improve SLR performance over earlier versions. A combination of HMM and GMM models can be useful for hand sign recognition even when the available data are scarce, as shown by [109], although the reliability of the system decreases in this case. Hidden Markov models were also used by [110] to analyze data collected with the help of multiple video cameras. While those methods have certain benefits, their application to the field of SLR requires additional work. In recent years, some researchers have tried to use HMMs alongside other methodologies to obtain better results. One such attempt was made by [111] in 2011, where this method was deployed together with PCA to determine key features of hand signs. The authors in [112] added an HMM to an RNN model tasked with tracking hand contours during sign language communication, but were successful only when working with a limited number of already known signs. Yang et al. [113] developed a variation of HMM aimed at shortening the calculation time, but this method requires certain conditions to be met; for example, the length of each gesture must be limited. In the work of Belgacem et al. [114], the CRF method and HMM were used in combination to process training samples with scarce distribution, but with a large number of possible options the discrimination process is still very demanding.

Many continuous processing tasks experience temporal alignment challenges, which can often be resolved using hidden Markov models. In [63], an EM-based algorithm was incorporated into HMMs to facilitate weak supervision and overcome the challenges associated with video processing. Zhou et al. [81] used HMM techniques to develop a model framework that makes continuous sign language recognition possible. The use of HMM allows the resulting system to scale up to a larger vocabulary, allows modeling of signs and of transitions between signs, and permits decoding and training even with new deep learning algorithms. In another experiment, the authors in [41] evaluated the Gaussian mixture model-hidden Markov model (GMM-HMM) as a baseline method. Trajectory and hand-shape features were extracted and used to train the GMM-HMM for recognition; an average accuracy rate of 90.8% was achieved when using trajectory as well as hand-shape features. A similar experiment by [49] also used GMM-HMM to facilitate temporal pattern recognition (automatic speech recognition as well as sign language recognition). Alternatively, combining HMM and BLSTM-NN yielded an accuracy of 97.85% for single-hand signs and 94.55% for double-hand signs [115]. Another experiment by Cui et al. [3] examined the role of HMMs in continuous sign language recognition; HMMs are among the most popular temporal models for sign language recognition, but the framework developed in Cui et al.'s study, which used recurrent neural networks in its sequence learning module, performed better than HMMs.
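As a sketch of the GMM-HMM baseline idea referred to above, one small HMM with Gaussian-mixture emissions can be trained per sign and an unseen feature sequence assigned to the model with the highest log-likelihood. This assumes the third-party `hmmlearn` package and that per-frame feature sequences (e.g., trajectory or hand-shape descriptors) are already available; the state and mixture counts are illustrative.

```python
# Hedged GMM-HMM sketch for isolated-sign recognition (one model per sign).
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_sign_models(sequences_per_sign, n_states=5, n_mix=2):
    """sequences_per_sign: dict mapping sign -> list of (T_i, D) feature arrays."""
    models = {}
    for sign, seqs in sequences_per_sign.items():
        X = np.concatenate(seqs)              # stack the frames of all sequences
        lengths = [len(s) for s in seqs]      # per-sequence lengths for hmmlearn
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=50, random_state=0)
        m.fit(X, lengths)
        models[sign] = m
    return models

def classify(models, seq):
    # Score an unseen sequence under every sign model; return the best match.
    return max(models, key=lambda sign: models[sign].score(seq))
```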

3) DEEP LEARNING TECHNIQUES
Deep learning is an emerging field of machine learning that focuses on learning representations of data [38]. However, the ability of deep learning techniques to capture the semantics contained within data is limited by the complexity of the models and the underlying details of the input to the system [37], [38]. Advances in deep learning have strong implications and applications for sign language interpretation using neural networks. Key deep learning techniques applied in recent experiments include backpropagation, convolutional neural networks, recurrent neural networks, recurrent convolutional neural networks, attention-based approaches, deep belief networks, PCANets, SubUNets, logistic regression, transfer learning, and hybrid deep architectures.

a: BACKPROPAGATION
Backpropagation is a supervised learning algorithm used to train feedforward neural networks; the basic equations describing the learning process are given by (1) and (2). This classic multilayer perceptron (MLP) technique was used by Rioux-Maldague and Giguere [44] to train a translation layer. The output layer was trained using normal backpropagation to interpret the activations of various restricted Boltzmann machines (RBMs) into a 24-dimensional softmax vector covering 24 letters. Training was conducted over 200 epochs of backpropagation and used both weight decay and early stopping. A fine backpropagation phase was also conducted using the entire network but at a much lower learning rate. In addition, Wu et al. [38] adopted the standard backpropagation method to adjust the weight of each modality:

\theta^{t+1} = \theta^{t} - \alpha \frac{\partial E}{\partial \theta} \tag{1}

E = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \tag{2}

where θ is a weight, α is the learning rate, N is the number of training samples, and ŷ_i is the model prediction for the target y_i.
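A minimal numerical illustration of Eqs. (1) and (2) follows. A single linear layer is used purely to keep the gradient easy to verify; the data, learning rate, and iteration count are arbitrary assumptions, and a full MLP backpropagates the same update rule layer by layer.

```python
# Gradient descent on the mean-squared error of Eq. (2) using the update of Eq. (1).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # N = 100 samples, 8 input features
true_theta = rng.normal(size=8)
y = X @ true_theta                       # targets

theta = np.zeros(8)                      # weights to be learned
alpha = 0.05                             # learning rate

for step in range(500):
    y_hat = X @ theta                    # forward pass (prediction)
    E = ((y - y_hat) ** 2).mean() / 2.0  # Eq. (2): 1/(2N) * sum (y_i - y_hat_i)^2
    grad = -(X.T @ (y - y_hat)) / len(y) # dE/dtheta for the linear model
    theta = theta - alpha * grad         # Eq. (1): theta <- theta - alpha * dE/dtheta

print("final loss:", E)
```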
related experiment by Li et al. [76] used CNN to process
b: DEEP BELIEF NETWORK images of a large size. The CNN shares the weights of
Some sign language experiments have used a deep belief the images, thereby significantly reducing the number of
network (DBN) to classify learning representations. DBNs parameters that need to be learned. This also reduces the
are comparable to multilayer perceptrons (MLPs), but they risk of overfitting. CNN also finds invariant features that are
have many additional layers in their structure. The extra particularly useful during image processing. By combining
layers in DBNs provide enhanced learning potential, even CNN with various PCA layers, [76] developed a hierarchi-
though these layers are usually difficult to train. However, cal model that proved useful for recognizing fingerspelling
recent work has facilitated DBN training. For instance, in American Sign Language. Similarly, the authors in [73]
Rioux-Maldague et al. [44] used a DBN consisting of three developed a CNN focused on grouping fingerspelling images
restricted Boltzmann machines (RBMs) and a single extra using a mixture of image intensity and depth data. The CNN
translation layer. Tang et al. [39] used DBNs to imple- was evaluated by applying it to American Sign Language with
ment hand posture recognition. Based on the recogni- respect to fingerspelling recognition, and the developed CNN
tion results, the DBN attained a high recognition accuracy performed better than CNNs evaluated in previous studies.
of 98.12%, which was better than the baseline HOG+SVM Specifically, the CNN achieved a precision of 82% and a
approach. Similarly, Huang et al. [43] established a deep neu- recall of 80%.
ral network that can recognize various signs based on Real- Similar observations concerning American Sign Language
Sense. The technique uses 3D coordinates of finger joints were noted in an experiment by Taskiran et al. [116], where
because the model can learn key recognition features from a CNN structure was used to extract and classify features
the raw data. The average rate of recognition of this DNN obtained from the American Sign Language. The CNN
based on Real-Sense was 98.9%, while that of a DNN based model had the following features: an input layer, a pool-
on Kinect was 97.8%. An additional experiment that used ing layer, two 2D convolutional layers, two dense layers,
the deep belief network was conducted in [96]. An American and a flattening layer. The resulting system achieved high
Sign Language dataset was used to examine the structure accuracy, even when evaluating letters that had shared ges-
of a deep belief network and its performance in gesture tures. Daroya et al. [117] used a CNN model to examine
recognition. The experiment compared DBN with other clas- the performance of a framework they proposed. The exper-
sic methods for recognizing gestures (a convolutional neu- iment applied Alexnet (an effective CNN model) and altered
ral network and a stacked denoise auto encoder), and the a few parameters to adapt it to their dataset consisting
results demonstrated a much higher performance by the of 28 × 28 pixel images. In another trial, Shahriar et al. [118]
designed DBN. attempted to recognize American Sign Language using a real-
time approach. Images used in the experiment were cate-
c: CONVOLUTIONAL NEURAL NETWORK (CNN) gorized using CNN and deep learning. A CNN was used
A Convolutional Neural Network receives an input image, to obtain features from the images, while the deep learning
assigns significance to different aspects of the image, and method was used to train a classifier to identify sign lan-
differentiates one image from another. Figure 7 shows the guage. Specifically, the CNN model was trained to produce a
basic CNN architecture mode for sign language recognition. 4096-dimensional feature vector for the following classes:
CNNs require a much lower level of pre-processing com- face, A, palm, and v. Similar to [119], the authors in [118] also
pared to other deep learning algorithms [63]. While these used AlexNet, a built-in neural network consisting of 25 lay-
networks perform strongly in many tasks [65], they require ers that was pre-trained extensively. The output of the image
large amounts of labeled training data [67], [71]. Hand shape features indicated that the CNN model, together with the deep
recognition, a process influenced by the pose of the subject, learning method, managed to classify input images with a
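The following is a minimal PyTorch sketch of the kind of small CNN described in this subsection (convolution, ReLU, max pooling, dropout, and a fully connected softmax classifier) applied to fingerspelling images. The 64 x 64 grayscale input size, the 24-class output, and the layer widths are assumptions for illustration, not any cited architecture.

```python
# Illustrative small CNN classifier for static fingerspelling images.
import torch
import torch.nn as nn

class FingerspellingCNN(nn.Module):
    def __init__(self, num_classes=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),   # logits; softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = FingerspellingCNN()
logits = model(torch.randn(8, 1, 64, 64))               # dummy batch of 8 images
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 24, (8,)))
loss.backward()
```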

FIGURE 7. Basic CNN model used in sign language recognition.

In [63], a CNN was embedded within an iterative expectation-maximization (EM) algorithm, which allowed the CNN to be trained using a very large number of model images. The CNN achieved a recognition accuracy of 62.8% on over 3000 hand shape images that were labeled manually.

Some experiments focus on American Sign Language, such as [72], which applied an end-to-end CNN architecture to a training dataset for comparison purposes; additionally, a CNN and an SVM were combined to act as a feature descriptor, producing acceptable accuracy. Another related experiment by Li et al. [76] used a CNN to process large images. The CNN shares weights across the image, thereby significantly reducing the number of parameters that need to be learned and the risk of overfitting; CNNs also find invariant features that are particularly useful during image processing. By combining a CNN with various PCA layers, [76] developed a hierarchical model that proved useful for recognizing fingerspelling in American Sign Language. Similarly, the authors in [73] developed a CNN focused on grouping fingerspelling images using a mixture of image intensity and depth data. The CNN was evaluated on American Sign Language fingerspelling recognition and performed better than CNNs evaluated in previous studies, achieving a precision of 82% and a recall of 80%.

Similar observations concerning American Sign Language were noted in an experiment by Taskiran et al. [116], where a CNN structure was used to extract and classify features obtained from American Sign Language. The CNN model consisted of an input layer, a pooling layer, two 2D convolutional layers, two dense layers, and a flattening layer. The resulting system achieved high accuracy, even when evaluating letters that had shared gestures. Daroya et al. [117] used a CNN model to examine the performance of a framework they proposed; the experiment applied AlexNet (an effective CNN model) and altered a few parameters to adapt it to their dataset of 28 × 28 pixel images. In another trial, Shahriar et al. [118] attempted to recognize American Sign Language in real time. Images used in the experiment were categorized using a CNN and deep learning: the CNN was used to obtain features from the images, while the deep learning method was used to train a classifier to identify sign language. Specifically, the CNN model was trained to produce a 4096-dimensional feature vector for the following classes: face, A, palm, and V. Similar to [119], the authors in [118] also used AlexNet, a built-in neural network consisting of 25 layers that was pre-trained extensively. The output of the image features indicated that the CNN model, together with the deep learning method, managed to classify input images with a high level of accuracy. Similarly, Cayamcela and Lim [120] used a CNN model to translate American Sign Language in real time. The CNN model was trained using a dataset consisting of several instances obtained from the American Sign Language alphabet; it obtains features from every pixel and develops precise predictions based on a translator. In general, the CNN model achieved higher accuracy than its comparable counterparts.

In [65], a CNN was integrated into a novel video-based recognition method and used to obtain upper-body images precisely from videos. In the experiment, the CNN model was trained for recognition, thereby simplifying feature extraction; the CNN circumvented the complicated feature extraction process by allowing direct image input. In addition, a Chinese Sign Language recognition method using this CNN model achieved a high accuracy of 99%. Similarly, [45] used deep convolutional neural networks (DCNNs) to develop a new method that can facilitate Bengali Sign Language recognition. Hossen et al. [45] used a network consisting of a convolution layer, a ReLU layer, a max-pooling layer, a fully connected layer, a dropout layer, and a softmax layer, which achieved an accuracy of 84.68%. This accuracy is remarkably high, considering that a very small dataset was used to train and test their network. A similar experiment focusing on Chinese Sign Language used a CNN consisting of six layers to facilitate fingerspelling recognition [71]. The deep learning approach included components such as dropout, max pooling, and batch normalization, and the CNN achieved an overall accuracy of 88.10 ± 1.48% and a maximum accuracy of 90.87%, which was higher than other established approaches.

Recent experiments have also focused on Arabic Sign Language. Shahin and Almotairi [121] introduced a system that can recognize Arabic sign language using a vision-based approach. In the design of the system, deep learning methodologies relying on CNNs were used to attain a high level of accuracy without the need for sensors. The results of the experiment were very promising for the application of CNNs in the recognition of Arabic sign language. In addition, Yasir et al. [122] used the CNN approach to train a dataset obtained from Bangla Sign Language; this classification technique was chosen because CNNs require little pre-processing compared to other image classification algorithms. The resulting model had a validation accuracy of 94.88%.

Previous experiments have also focused on Indian Sign Language. Rao et al. [123] observed that it is exceedingly difficult to classify complex head and hand movements owing to their ever-changing shapes, and therefore proposed the use of a CNN to recognize gestures in Indian Sign Language. They trained the CNN using three varying sample sizes, each consisting of different sets of subjects and viewing angles. Different CNN architectures were designed and evaluated, from which much better recognition accuracy was achieved; specifically, Rao et al. achieved a recognition rate of 92.88% when using the CNN. Another test in this domain was conducted by Sajanraj and Beena [124], who developed a real-time system to convert Indian Sign Language into text; a deep learning method (CNN) was introduced to classify the sign language, and the accuracy of the resulting system was 99.56%. Additionally, the authors in [15] used a VGG-19 network to recognize sign language from video sequences. VGG-19 is a type of CNN that has been trained using more than 1 million images obtained from the ImageNet database; the network is 19 layers deep and can categorize images into 1,000 object categories. Konstantinidis et al. [70] used VGG-19 because it has learned rich feature representations for a wide range of images. Within the same scope, Koller et al. [77] demonstrated a scheme that can be used to train a CNN in a supervised manner. The experiment took the outputs of the CNN classifier and incorporated them into an HMM approach, thereby allowing iterative learning from video data. Through this approach, a significant improvement was reported in the classification performance of the deep learning technique.
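The transfer-learning pattern mentioned above, reusing an ImageNet-pretrained VGG-19, typically amounts to freezing the convolutional backbone and retraining only a new classification head. The sketch below uses torchvision for this purpose; the class count, the decision to freeze all backbone weights, and the specific pretrained-weights identifier are assumptions rather than details from the cited studies.

```python
# Hedged transfer-learning sketch: reuse pretrained VGG-19 features for sign classes.
import torch.nn as nn
from torchvision import models

backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in backbone.parameters():           # freeze the pretrained feature extractor
    p.requires_grad = False

num_sign_classes = 26                      # assumption: one class per letter
backbone.classifier[6] = nn.Linear(4096, num_sign_classes)  # new final layer
# Only the parameters of the new head are then passed to the optimizer.
```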

Huang et al. [41] proposed a 3D CNN approach designed to automatically obtain discriminative spatial-temporal features from raw video streams. In their CNN architecture, for every type of visual source, nine frames measuring 64 × 48 were considered and centered on the current frame as input. The 3D CNN achieved an accuracy of 88.5% when implemented on a gray channel, and 94.2% when implemented on multiple channels.

Some experiments have applied deep learning to classify RGB images. For instance, in [117], a CNN was used to classify RGB images of static hand poses (each representing a letter) associated with sign language. This method is based on DenseNet; the approach could classify sign language images in real time and performed well, with an accuracy of 90.3%. Similarly, Rastgoo et al. [125] used a CNN model called a faster region-based convolutional neural network (Faster-RCNN) to detect hands in an input image. The purpose was to examine how a generative deep model can be used to obtain data from modeled data distribution probabilities and whether it can enhance the recognition performance of state-of-the-art alternatives for recognizing sign language. The CNN detected input images as either original, cropped, or noisy cropped images [125]. CNNs have also been used in attempts to resolve the challenging task of gesture and sign language recognition in a continuous video stream. For instance, Pigou et al. [126] used a deep learning approach and temporal convolutions to address this problem. The CNN model featured certain improvements that made the classification process easier, and the use of temporal convolutions was important for coping with the spatiotemporal nature of the data. Upon evaluation, the CNN model achieved a top-10 frame-wise accuracy of 73.3% when trained on the Corpus NGT and 55.7% on the Corpus VGT.

A recent experiment by Gunawan et al. [47] modified a CNN model and used the outcome to recognize sign language. The modified CNN, referred to as the I3D Inception model, is based on the Inception v1 model; its architecture was chosen for its capability to improve on the outcomes of previous experiments that used ResNet-50 models, Two-Stream Fusion + IDT, and the C3D Ensemble. The I3D Inception model was composed of 67 convolutional layers, including the input and output layers, and contained nine inception modules. The outcome of the experiment indicated that the I3D Inception model achieved fair training accuracy but an extremely low validation rate. Correspondingly, Soodtoetong and Gedkhaw [48] used a 3D-CNN to assess its efficiency in sign language recognition; the 3D-CNN model was used to determine the predicted gestures, and the results demonstrated that the 3D-CNN algorithm could identify gesture motions accurately, with the highest recognition rate being 92.24%.

Another experiment by Nakjai et al. [127] used a CNN model as part of the base model of YOLO. YOLO was used in the experiment to detect objects in real time, with the CNN as its supporting model. The Darknet-19 architecture was used, which consisted of 19 convolution layers as well as 5 max-pooling layers with varying numbers of filters and filter sizes. Furthermore, Papadimitriou and Potamianos [95] used a CNN variant to introduce a hybrid, vision-based, two-stage system that could effectively extract the shape of the hand. The convolution operation was changed in the CNN to enhance the learning capacity of the model; the alteration focused on the convolution scheme, leading to nonlinear behavior of the network output. The AlexNet architecture was used as part of the CNN, and the developed model followed the normal CNN layer pipeline, which involves convolution, pooling, and the corresponding activation functions. The classification accuracy of the proposed method was tested and outperformed existing alternatives.

d: RECURRENT NEURAL NETWORK (RNN)
The RNN is an influential model used to facilitate sequential data modeling. This approach has been used extensively and has proven successful in a variety of important tasks, such as speech recognition, natural language processing, video recognition, and language translation. Figure 8 shows the basic RNN encoder-decoder architecture used for sign language recognition. Fang et al. [128] used a bidirectional RNN and long short-term memory (LSTM) in their experiment to facilitate universal and non-intrusive word- and sentence-level translation of sign language. The outcome of the experiment indicated that the RNN model could successfully capture the important features of American Sign Language words.

A feature of RNNs that has been applied in several experiments is LSTM. For instance, Kavarthapu and Mitra [129] applied a bidirectional LSTM as the encoder and a second LSTM within the embedding layer as the decoder. The use of a bidirectional LSTM in sign language recognition is significant because it allows information to be collected in an abstract manner. A standard LSTM was used to minimize the loss function. The results demonstrated that the bidirectional LSTM performed very well, which could be attributed to its capacity. Correspondingly, Rakun et al. [130] attempted to use LSTM to recognize Indonesian Sign Language; LSTM was used because the model can take full sequences as input and does not depend on pre-clustered per-frame data. The outcome of the experiment indicated that the 2-layer LSTM model achieved the best performance among the models compared and was 95.4% accurate in classifying root words. However, the LSTM model achieved a much lower accuracy of 77% when used on inflectional words, which can be attributed to the challenges involved in identifying prefixes and suffixes. The architecture used in [131] featured an RNN consisting of LSTM cells, in which the feature vector from every frame was provided as the input at every time step and the output layer was composed of a softmax classifier. LSTM was used to guarantee real-time translation of sign language; the resulting model could translate continuous sign language videos into comprehensive English sentences and was regarded as highly effective in facilitating communication through sign language.
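As a compact illustration of the recurrent models discussed in this subsection, the sketch below runs a two-layer bidirectional LSTM over per-frame feature vectors (for example, CNN embeddings or keypoints) and pools over time before a softmax classification. The feature size, hidden size, and vocabulary size are assumptions, not parameters of any cited system.

```python
# Illustrative bidirectional LSTM recognizer over per-frame feature sequences.
import torch
import torch.nn as nn

class BiLSTMRecognizer(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=100):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab)

    def forward(self, frames):              # frames: (batch, time, feat_dim)
        out, _ = self.rnn(frames)
        return self.head(out.mean(dim=1))   # pool over time, then classify

model = BiLSTMRecognizer()
logits = model(torch.randn(4, 60, 512))     # 4 clips of 60 frames each
```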

FIGURE 8. Basic RNN encoder–decoder model used in sign language recognition.

A few recent experiments have also used LSTM to recognize Indonesian sign language gestures. In [132], 2-layer LSTM neural networks were used to identify Sistem Isyarat Bahasa Indonesia (SIBI) gestures; the networks achieved very high accuracy rates of 91.74% for prefix, 98.94% for root, and 97.71% for suffix datasets [132]. Attempts to address the challenges associated with sign language translation have led to increased use of hierarchical deep recurrent fusion (HRF) networks. Guo et al. [50] developed a hierarchical recurrent architecture to encode visual semantics at varying visual granularities. The HRF decodes a sentence by using complementary RGB visemes as well as skeleton signemes. The steps used were as follows. The HRF encoded the entire visual content after translating the video into various neural languages. Next, Guo et al. explored the use of adaptive clip summarization (ACS) to delve into sign action patterns in sign language translation. They also proposed an adaptive temporal segmentation scheme that differed from past models, which obtain key frames or clips over a fixed time interval. In the next step, a hierarchical adaptive temporal encoding network was developed that condensed the time span. In addition to the HRF, LSTM was selected as the basic RNN unit: the top layer of the LSTM was responsible for learning the persistent characteristics of the original features, the middle layer for learning the recurrent features of compact visemes or signemes, and the bottom layer for transforming the visual information into textual semantics. As mentioned earlier, the main idea of the suggested model was to learn the descriptors of sub-visual words, such as visemes and signemes. Detailed experiments indicated that the HRF framework, built on LSTM, was highly effective.

e: RECURRENT CONVOLUTIONAL NEURAL NETWORKS (RCNNs)
Cui et al. [133] introduced a recurrent convolutional neural network to map video segments to glosses. They used an RCNN to extract features and facilitate sequence learning. By developing their architecture using an RCNN, the performance could be matched to state-of-the-art models without having to introduce additional information. In this sense, the RCNN assisted in the process of continuous sign language recognition.

f: PCANet
Another type of deep learning technique used in sign language experiments is PCANet. Although this deep learning method has been proposed only recently, it is highly effective. As evident in [40], PCANet is very successful in solving many problems associated with object recognition and can be used to learn features obtained from intensity and depth images. Fingerspelling was recognized using two PCANet models to cover every color and depth input present in the images. Empirically, the use of a two-stage PCANet is sufficient to achieve acceptable performance; as a result, developing a deeper architecture may not necessarily enhance the performance of this deep learning technique. Additionally, Aly et al. [13] used the PCANet deep learning architecture to recognize the alphabet in American Sign Language. Unlike [40], Aly et al. [13] proposed two approaches that could be used to train the PCANet models: the single PCANet and user-specific PCANet feature models. The single PCANet was trained using samples obtained from all users; in contrast, the user-specific approach trained various PCANet models, where individual models learned certain features from individual users. The extracted features were then identified using a linear SVM classifier. Inspired by the many achievements of the PCANet deep learning architecture, the model was used to autonomously learn depth features from segmented regions of the hand.

g: SubUNets
A few other experiments have used SubUNets to facilitate sequence-to-sequence tasks. In [64], the authors used SubUNets, a new deep learning architecture that produces a series of outputs from video. Unlike other video-to-text methods, the approach mimics the contextual subunits of a task while simultaneously training the network for the key task. When dealing with the challenges of sign language recognition, SubUNets detect and identify individual signs in a given video and generate a text translation. SubUNets feature three tiers of neural networks. The first tier includes CNNs, which take images as inputs and are responsible for extracting spatial features. The second tier uses bidirectional LSTM (BLSTM) layers, which model the spatial features obtained from the CNNs [64]. The final tier includes a connectionist temporal classification (CTC) loss layer, which allows the networks to be trained on videos of different lengths and label sequences. After being trained on the Deep Hand 1-million-hands dataset, SubUNets achieved a Top-1 accuracy of 80.3% and a Top-5 accuracy of 93.9%.
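The three-tier idea described above (per-frame CNN features, a bidirectional LSTM over time, and a CTC loss so that videos of different lengths can be trained against gloss label sequences) can be sketched in a few dozen lines of PyTorch. The tiny CNN, the dimensions, and the vocabulary size are illustrative assumptions, not the authors' exact network.

```python
# Hedged sketch of a CNN + BiLSTM + CTC pipeline for continuous gloss recognition.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(64, out_dim),
        )
    def forward(self, x):                      # x: (batch*time, 3, H, W)
        return self.net(x)

class CTCSignNet(nn.Module):
    def __init__(self, vocab=1000, feat=256, hidden=256):
        super().__init__()
        self.cnn = FrameCNN(feat)
        self.rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab + 1)      # +1 for the CTC blank token

    def forward(self, video):                  # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        f = self.cnn(video.flatten(0, 1)).view(b, t, -1) # per-frame spatial features
        h, _ = self.rnn(f)                               # temporal modeling
        return self.out(h).log_softmax(-1)               # (batch, time, vocab+1)

model = CTCSignNet()
video = torch.randn(2, 40, 3, 64, 64)
log_probs = model(video).transpose(0, 1)                 # CTC expects (time, batch, classes)
targets = torch.randint(1, 1001, (2, 5))                 # two gloss sequences of length 5
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 40, dtype=torch.long),
                           target_lengths=torch.full((2,), 5, dtype=torch.long))
```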
h: HYBRID DEEP ARCHITECTURES
In many instances, the use of a single deep learning technique is challenging; as a result, some experiments have combined deep learning techniques. For instance, [39] noted that the process of training DBNs is difficult to parallelize across different computers and evaluated this issue by using CNNs for comparison purposes. The recognition results indicated that the CNN achieved a high recognition accuracy rate of 94.17%, although this was lower than the accuracy of the hybrid DBN approach.

Wang et al. [66] proposed a hybrid deep architecture to address the continuous sign language translation (CSLT) problem. The hybrid model combined a temporal convolution (TCOV) module, a bidirectional gated recurrent unit (BGRU) module, and a fusion layer (FL) module. In the model, TCOV is responsible for capturing short-term temporal transitions, whereas the BGRU preserves the long-term context transitions that occur across temporal dimensions; the FL then links (fuses) the embedded features in the TCOV and BGRU outputs to learn their corresponding relationships. Experimental results demonstrated that this hybrid deep architecture improved accuracy by 6.1% in terms of the word error rate (WER) compared to single deep learning techniques.

A CNN has also been used in combination with a bidirectional recurrent neural network (Bi-RNN). Combining these techniques, the authors in [69] used a 3D CNN to obtain features from every video frame and a Bi-RNN to generate unique features from the sequential behavior present in individual video frames. On average, the hybrid approach exhibited a higher average word error rate and a similar character error rate when compared to the LipNet model. Comparably, Cui et al. [3] combined a deep CNN with a Bi-LSTM to extract features. The CNN model proved useful in learning spatiotemporal representations from the input video streams, and Bi-LSTMs were then used to learn more complicated dynamics. Bi-LSTMs iterate LSTM computations by calculating both forward and backward hidden sequences; the authors employed them because unidirectional RNNs can only calculate hidden states based on past time steps.

The authors in [49] employed attention-based 3D-CNNs to facilitate the recognition of large vocabularies in sign language. The attention-based framework has two primary advantages: first, the model can learn spatio-temporal features from raw video input without prior knowledge; second, attention mechanisms assist in selecting clues. In this case, attention-based 3D-CNNs were assessed using Chinese Sign Language data and the ChaLearn14 benchmark, and the outcome demonstrated the higher accuracy of the approach compared to other advanced algorithms. In [115], transfer learning was used to tune an ASLR model to detect Indian sign language; transfer learning helped in learning new classes even when new training sets were limited in size.

While focusing on American Sign Language, Oyedotun and Khashman [74] applied a mixture of deep learning-based networks to recognize hand gestures collected from a public database. The techniques applied were a CNN and a stacked denoising autoencoder (SDAE); the recognition rate of the CNN was 91.33%, while that of the SDAE was 92.83% when evaluated on test data that were not part of the training data. Another experiment by Bantupalli and Xie [134] examined American Sign Language using a mixture of CNN and RNN. In this case, Inception, a CNN model, was used to identify spatial features from a video stream designated for sign language recognition. Next, the experiment used LSTM, an RNN model, to obtain temporal features from sequences of videos using two approaches: outputs were generated from the softmax layer and from the pooling layer of the CNN. Despite the success of the experiment, the authors suggested that the use of capsule networks rather than Inception might have yielded better results in sign language recognition.

4) TRANSFORMER-BASED APPROACH
A range of different methodological approaches to sign language recognition can be found in the reviewed literature, but there are some basic principles shared by nearly all of them. In particular, the studies are focused on attention-based neural models with transformer architecture [135]. In this computing paradigm, encoder and decoder stacks are used to train the model for the classification of sign language samples, as shown in the diagram in Fig. 9. This approach has proven successful in other types of tasks and offers some unique advantages over earlier models. In this case, the models are expected to capture the relationship between temporal and spatial cues and deduce the intended sign based on them. A tokenization procedure is performed to break down the input and output into frames/key points and word embeddings [136].

One of the unique limitations of transformer models is that they lack positional information for the inspected sequences, necessitating the introduction of a temporal ordering step. Therefore, feature extraction is another necessary element of all transformer-based neural models, where the most relevant features derived from input tokens are selected and later used for model training [85], [136]. Some of the features delineate between signs (inter-cue features), while others are useful for differentiating a particular gloss from similar ones (intra-cue features) [89], [137].
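A minimal sketch of the general recipe just described is shown below: sign-feature tokens are projected and given learned positional embeddings (since the attention stack is itself order-agnostic), passed through a shallow two-layer transformer encoder, and projected to per-frame gloss logits. All sizes are illustrative assumptions, and the sketch assumes a recent PyTorch version.

```python
# Hedged transformer-encoder sketch for sequences of sign features.
import torch
import torch.nn as nn

class SignTransformerEncoder(nn.Module):
    def __init__(self, feat_dim=128, d_model=256, heads=4, layers=2,
                 max_len=300, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)            # token embedding
        self.pos = nn.Embedding(max_len, d_model)           # positional information
        enc_layer = nn.TransformerEncoderLayer(d_model, heads,
                                               dim_feedforward=512,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out = nn.Linear(d_model, vocab)                 # linear projection head

    def forward(self, x):                                    # x: (batch, time, feat_dim)
        t = torch.arange(x.size(1), device=x.device)
        h = self.proj(x) + self.pos(t)[None]                 # add positional embedding
        h = self.encoder(h)
        return self.out(h)                                   # per-frame gloss logits

model = SignTransformerEncoder()
logits = model(torch.randn(2, 120, 128))                     # two 120-frame clips
```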

FIGURE 9. Transformer-based model used in sign language recognition [136].

In one hybrid model, a separate neural network of the CNN type is used to extract the features from video input, greatly improving the efficiency of the process. The classification step is typically performed by a bidirectional long short-term memory (Bi-LSTM) module or an encoder-decoder stack, comprising several successive layers in both cases. The exact depth of the model and the number of deployed attention heads vary depending on the intended use of the model and other factors, and can be optimized for best performance based on empirical evaluations. For example, some studies propose using only two layers in transformer models (as opposed to the standard six used for natural language processing), while others introduce a linear projection layer and a softmax attention layer on top of the stack [83], [86]. A normalization procedure is used to improve the efficiency of model training, which is driven by maximizing conditional probabilities and minimizing cross-entropy loss, while a validation procedure fine-tunes the model for the particular purpose.

Networks of this type were tested in several roles, including isolated [138] and continuous SLR [139], as well as translation of sign language into spoken language. Video footage and skeletal data were used as input modalities, but this methodology could conceivably also be used with different modalities [88]. The versatility of the deep learning approach with transformer architecture is very welcome in this challenging field, since the output can be specialized through the selection of the training dataset and features, as well as the training hyperparameters. Several interesting ideas were presented in the reviewed literature that could additionally refine the ability of encoder models to understand sign language, for example gloss-level supervision or the use of specialized pose estimation tools. With those improvements, some of the long-standing difficulties in the SLR field could finally be permanently resolved [140].

All studies from this group include an experimental evaluation of the proposed deep learning model, typically comparing its results with those obtained with alternative SLR approaches. The methods based on transformer architecture tend to outperform simple sequence-to-sequence models and other benchmarks by a significant margin on most datasets. The best-performing versions of these algorithms typically produce correct predictions in up to 85% of cases for tasks such as pose estimation, around 70-75% for isolated SLR, and up to 45% for the more demanding translation task. In some cases, the gains over competing methods were small, but in certain instances the improvements were quite dramatic. In addition to the task, several other factors, such as the size of the vocabulary, the size of the training dataset, and the exact configuration of the network, could affect the quality of the output [141]. While the insights gained from those tests are extremely valuable, at this point it is hard to draw firm conclusions about an optimal setup that would guarantee high performance regardless of factors such as the identity of the sign performer, local variation of sign language, and environmental influences. From the data collected so far, it appears that deep neural networks of the transformer type have a role in this scientific field, but it remains to be seen exactly what that role should be and how it can be leveraged to expand the range of possible SLR applications [82].

While the methods based on transformer architecture bring tangible improvements over earlier deep learning SLR systems, their accuracy is still not near the level where they could be used in everyday practice without issues. Low accuracy is especially apparent with more complex tasks, and it tends to decrease as the complexity of the analyzed sign language samples grows [139]. It is possible that gaps in performance are due to training samples and selected features rather than the fundamental data processing approach, but this postulation needs to be ascertained by more comprehensive testing and the possible inclusion of additional input modalities and localized sign language variations [89]. Based on the presented results, universal autonomous tools capable of continuous SLR that is signer-independent and language-independent remain a distant goal. Evaluation of the proposed encoder models suggests that a slightly different architecture might be optimal for SLR than for linguistic tasks, so it would be very interesting to see innovative attempts to redefine transformer models and develop them with the explicit purpose of interpreting sign language [142].

D. USING HAND GESTURE FOR SLR
Because of the importance of hand gestures for SLR, and given the significant amount of scientific research that has been carried out in this field, we limit our review to the most important points and studies. Gesture interpretation has been a subject of scientific research for several decades; consequently, numerous reviews of this field have been conducted at various points in time. One of the earliest reviews was performed by Gavrila [143], who considered several 2D and 3D models for the analysis of human motion. Moeslund and Granum [144] provided a comprehensive recapitulation of two decades of research involving gesture tracking and recognition, while Ribeiro and Gonzaga [145] focused primarily on real-time approaches available at the time. More recent publications include an updated review of opportunities and challenges in this field undertaken by Rautaray and Agrawal [146]. Kumar and Bhatia [147] discussed various feature extraction techniques, while Mohandes et al. [148] presented a survey of sensor-based and direct measurement methods for sign language recognition. Because this field has experienced significant progress and undergone many reviews over the past two decades, we provide only a brief overview of the current state of research in hand gesture and sign language recognition by automated systems.

A majority of sign language characters and words can be expressed with simple hand gestures, which makes correct recognition of hand shapes a very practical feature for automated systems. However, recognizing hand gestures involves many difficulties, which may be related to different hand sizes and shapes among signers, as well as different skin shades. In addition, individuals may use unique styles to display certain elements when signing. Such difficulties can be resolved through the use of advanced analytic techniques aimed at identifying patterns independent of the signer's identity or the physical properties of their hands [149].

Because deep learning networks have the capability to identify latent connections among many different variables, they can be effectively used to analyze hand gestures in ASLR. Depending on the regional variations of sign languages, both one-handed and two-handed gestures can be used to express certain words or phrases, with single-handed signs usually assigned basic meanings, such as letters or numbers. Thus, hand gesture analysis alone has the potential to correctly recognize simple linguistic content from still images, videos, and other sources. In other applications, hand gesture analysis may be complemented by other techniques, such as tracking head movements [150]. Given that hand motion is the central building block of all sign language communication systems, this aspect of SLR is unlikely to lose relevance despite the increased focus on full-body tracking and continuous sign language interpretation. However, pure hand gesture analysis techniques are likely to be combined with other methods to obtain the best results, as exemplified by the increasing number of hybrid models that consider many different elements of a signer's behavior [151], [152].

E. USING POSE ESTIMATION FOR SLR
Because body configuration plays an important role in sign language recognition, pose estimation techniques are among the core tools in this area. The basic idea is to determine the exact pose of the entire body based on the positions of certain fixed points that can be ascertained through measurement. While this can be accomplished in various ways, deep learning algorithms have proven effective in this task given sufficient training on well-chosen samples. This is especially true when high-quality input is provided, preferably from more than one source or modality [153].

1) POSE ESTIMATION BASED ON RGB IMAGES
A method using a convolutional neural network was suggested by [154] for determining the pose of the human body by analyzing pictures of variable size, thus comparing how individual body parts are spatially organized. The final prediction required the operations of pooling and up-sampling to be repeated over several iterations. When this model was experimentally tested with two different datasets, it achieved excellent results that outperformed the baselines by approximately 1.7-2.4%.

Another model using the same type of neural network was presented by [155]; it leveraged mutually dependent variables to predict the body position. In addition to a CNN, this method includes already prepared knowledge maps, and it employs a procedure that does not have to create a graphical representation in order to produce accurate output. This was confirmed by empirical testing, where the so-called Convolutional Pose Machine method outperformed all alternatives by 9% on the MPII set, 6% on the LSP dataset, and 3% on FLIC.

A model with cascading architecture was constructed by [156] in 2014, using a DNN as the basic tool for estimating the positions of joints on the body and their mutual relations. This model frames the problem as a matter of regression, which turned out to be a very suitable paradigm, as evidenced by the performance of the model, which exceeded the marks set by earlier solutions by 2% and 17% on two commonly used datasets. In the work of [157], a new dataset for SLR research was presented along with a benchmark of body position predictions for comparing deep learning-based pose estimation techniques. Interestingly, they studied the possibilities for transfer learning and found evidence that this phenomenon applies in the field of SLR [158].
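CNN-based pose estimators of the kind discussed in this subsection typically output one heatmap per joint, and the 2D joint locations are read out as the coordinates of each heatmap's peak. The sketch below illustrates only that final read-out step; the tensor layout and the 17-joint example are assumptions.

```python
# Simple read-out of 2D joint locations from per-joint heatmaps.
import torch

def heatmaps_to_keypoints(heatmaps):
    """heatmaps: (batch, joints, H, W) -> integer (x, y) coordinates per joint."""
    b, j, h, w = heatmaps.shape
    flat = heatmaps.view(b, j, -1)
    idx = flat.argmax(dim=-1)                  # index of the peak response per joint
    ys, xs = idx // w, idx % w                 # convert the flat index to row/column
    return torch.stack([xs, ys], dim=-1)       # (batch, joints, 2)

kps = heatmaps_to_keypoints(torch.rand(1, 17, 64, 48))   # e.g., 17 body joints
```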

A similar model for pose estimation that uses RGB photos and deep learning was suggested by [159], starting from the linear SMPL model. In this work, three-dimensional representations of body joints were deployed as intermediaries and a regression of parameters was performed. The model relies on autoencoders to act as the link between the regressed SMPL and a convolutional neural network, guaranteeing that any structural imperfections are corrected. Those improvements yielded a tangible performance boost in comparison with the basic SMPL on the Surreal and Human3.6M sets.

Sign language communication relies on more than one channel; in addition to hand movements, facial expressions and body postures are commonly used to express meaning. While much work has been performed in the area of automatic recognition of hand gestures, the literature concerning the analysis of body positions is not as abundant. Jain et al. [160] attempted to address this problem by analyzing the relationships among various body parts using a CNN, whereas Yang and Ramanan [30] organized the data in tree-like fashion and used an SVM as the classifier. Another notable study in this area was conducted by Chen and Yuille [161], who used a graphical model to represent the spatial configuration of body joints. This approach was improved by Charles [162], who extracted temporal information from successive images to improve the capability of the system to interpret body positions, while Toshev and Szegedy [156] provided an alternative method for evaluating body joint positions. Although there are many competing principles and ideas, the latter two approaches were chosen by the authors of this study as the starting points for conducting experiments on a new dataset, with the objective of establishing the most suitable methodology for body posture detection. One more recent solution, based on a convolutional network for analyzing graphs, was proposed by [163], in which a human body is represented in three dimensions by multiple points and the links between them. This model deploys an attention mechanism to discriminate between data and contextualize this schematic representation. According to the results of experimental testing, this model can bring modest gains in the range of 0.7% to 3.4% compared to alternative methods on various SLR datasets.

2) POSE ESTIMATION BASED ON DEPTH IMAGING
A model formulated by [164] combines components of convolutional and recurrent neural networks with a self-correcting feature that can improve previous predictions. This model builds a 3D vector space constructed from local data and extracts partial body poses from it while accounting for noise. The authors tested the model on an original dataset and found that it is indeed superior to the existing alternatives. Depth imaging also plays a central role in the solution suggested by [165], named Depth Ranking Pose Estimation for 3D images. In this concept, a CNN is used to decide between candidate pairs in the initial phase, followed by another step in which 3D pose estimations are made, thus combining depth data with two-dimensional images to great effect. This model was evaluated using the standard Human3.6M dataset, yielding a significant accuracy improvement of more than 6 mm over competing 3D pose estimation methods. A model named DDP, or Deep Depth Pose, was proposed by [166], where body positions are approximated by constructing maps based on depth information. Such maps were created in advance and contained many different body positions along with all relevant joints. This approach was proven effective in practice, outperforming the benchmarks by more than 11%.

3) ANALYSIS OF VARIOUS POSE ESTIMATION MODELS
Since body position estimation plays an important role in many different fields of research, including SLR, there have been many attempts to formulate a successful model based on deep learning networks of convolutional and recurrent types. Introducing three-dimensional imaging and the construction of depth maps has greatly increased the recognition capacity of such models. Techniques aimed at achieving further gains in accuracy include cascade or tree-like structures, the imposition of certain constraints, and similar refinements. From experimental evaluations, it is clear that recent models are far more powerful and reliable than their predecessors, but even the most complex solutions are still far from universal applicability [82].

Research directed towards better interpretation of body positions remains a central topic for scientific research. In particular, researchers are working to ensure that the exact position of each joint can be determined even when ambient noise is present in the images or parts of the body are blocked from view. There has been a lot of progress with 3D mapping of body positions, but part of the complication arises from the fact that multiple 3D positions can correspond to a single 2D pose. An additional complication is caused by the difficulty of labeling 3D joint images, necessitating the use of technologically advanced input devices. On the other hand, effective regression of 3D information requires highly precise mapping of the spatial relations between key body points. Many existing models track multiple aspects, including the precise location of every joint in three dimensions, from various angles and with regard to specific body shapes. Such models represent the foundation that future SLR research can build upon. Technological advancement of capturing devices also contributes to the improved pose recognition and shape prediction abilities of new systems. Fusion of different types of data (e.g., thermal imaging or hybrid data) with vision-based indicators can make systems more reliable under real-life conditions and thus represents a promising direction of research.
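Once joints have been estimated, a common and simple way to turn them into signer-independent SLR features is to re-express the coordinates relative to a root joint and scale them by a reference distance such as the shoulder width, so that position in the frame and body size matter less. The sketch below illustrates this normalization step only; the joint indices and the 33-landmark example are assumptions.

```python
# Normalizing estimated keypoints into per-frame feature vectors for an SLR classifier.
import numpy as np

def normalize_pose(keypoints, root=0, left_shoulder=5, right_shoulder=6):
    """keypoints: (joints, 2) array of (x, y) positions for one frame."""
    centered = keypoints - keypoints[root]                        # translate to the root joint
    scale = np.linalg.norm(keypoints[left_shoulder] - keypoints[right_shoulder])
    return (centered / max(scale, 1e-6)).flatten()                # flat feature vector

frame_feats = normalize_pose(np.random.rand(33, 2))               # e.g., 33 landmarks
```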

position can be indicative of the intended meaning of the continuous videos containing gestures and signs. The use of
entire expression. deep residual networks can minimize the need for preprocess-
Many factors can affect the performance of pose estimation ing. In [17], a model was developed that can enhance existing
algorithms, from the choice of features that can include both sign language recognition methods by between 15% and
2D and 3D data points, to the classifier depth and architecture. 38% relatively, and by 13.3% absolutely. Cui et al. [133] also
Many of the latest tools for pose estimation achieve relatively suggested a weakly supervised approach that could recognize
high accuracy, but they are too prone to false recognition to sign language continuously with the help of deep neural
consider them suitable for immediate practical application on networks. This approach achieved an outcome comparable to
a mass level [168]. In recent studies, there has been a tendency state-of-the-art approaches.
to use advanced technology, including the Microsoft Kinect
device, to detect body poses based on multiple parameters, B. SLR ISOLATED MODELS
which is clearly a direction that will be exploited further Until recently, a majority of sign language recognition
in the near future as better sensors and tracking devices experiments have been carried out on isolated sign sam-
become available [151], [156]. Finally, high-quality resources ples. These models examine a sequence of images or
necessary for evaluating new methods are emerging. The signals based on hand movements obtained from sensor
existence of readily available large SLR datasets stimulates gloves [97]. Sensor gloves often represent a complete sign.
more meticulous testing and brings us a step closer to the For instance, Koller et al. [63] used a dataset that featured
commercialization stage for this technology. isolated signs from Danish and New Zealand sign languages.
V. SLR MODEL TYPES
There are two types of models related to the recognition process of sign languages: the isolated model and the continuous model. The following sections present the work that has been done on each.

A. SLR CONTINUOUS MODELS
As part of sign language recognition and modeling, some experiments have used continuous models. For example, Wu and Shao [38] proposed a new bimodal dynamic network suitable for continuous recognition of gestures. The model relied on the positions of the 3D joints, as well as audio utterances of the gesture tokens. Koller et al. [63] demonstrated the use of an EM-based algorithm for continuous sign language recognition. The EM-based algorithm was designed to address the temporal alignment problem associated with continuous video processing tasks. Similarly, Li et al. [42] proposed a framework that addresses some of the scalability challenges associated with continuous sign language recognition. Another experiment by Camgoz et al. [64] developed an end-to-end system designed for continuous sign language alignment and recognition. The model is based on explicit subunit modeling. Similarly, Wang et al. [66] suggested a connectionist temporal fusion method having the capability to translate continuous visual languages in videos into textual language sentences.
Additional studies on continuous SLR models have been conducted by Rao and Kishore [68]. A system was developed and evaluated at various times using continuous Indian Sign Language sentences developed from 282 words. Similarly, Koller et al. [77] used a database consisting of continuous signing in German Sign Language. In [46], animations were processed continuously. However, this approach proved to be extremely challenging because the animations were difficult to work with after processing. While exploring the challenges of continuous translation, Pigou et al. [126] observed that deep residual networks can be used to learn patterns in the entire expression. Deep residual networks can minimize the need for preprocessing. In [17], a model was developed that can enhance existing sign language recognition methods by between 15% and 38% relatively, and by 13.3% absolutely. Cui et al. [133] also suggested a weakly supervised approach that could recognize sign language continuously with the help of deep neural networks. This approach achieved an outcome comparable to state-of-the-art approaches.

B. SLR ISOLATED MODELS
Until recently, a majority of sign language recognition experiments have been carried out on isolated sign samples. These models examine a sequence of images or signals based on hand movements obtained from sensor gloves [97]. Sensor gloves often represent a complete sign. For instance, Koller et al. [63] used a dataset that featured isolated signs from Danish and New Zealand sign languages. Another experiment by [37] proposed an isolated SLR system designed to extract discriminative aspects from videos, where each signed video corresponded to one word. After evaluating the challenges of continuous translation, Escudeiro et al. [46] resorted to an isolated approach. In essence, every gesture was created separately, making it easier to work with the animations. Different observations by Fang et al. [128] suggested the use of a hierarchical model reliant on deep recurrent neural networks. The model successfully combined the isolated low-level American Sign Language characteristics into an organized high-level representation that could be used for translation.
Recent developments in sign language experiments have also suggested that the use of regions of interest (ROIs) to isolate hand gestures and sign language features can enhance the accuracy of recognition [134]. In [131], the authors used an isolated gloss recognition system to facilitate real-time sign language translation. The isolated gloss recognition system included video pre-processing as well as a time-series neural network module. Another experiment by Latif et al. [169] also considered video segments based on an estimated ‘‘gloss-level.’’ While making their observations, Cui et al. [3] set their receptive field to the estimated length of an isolated sign. A recent study by Huang et al. [49] focused on a basic isolated sign language recognition task. The use of an attention-based 3D-CNN was proposed to recognize a large vocabulary. The model was advantageous because it took advantage of the spatio-temporal feature learning capabilities of the 3D-CNN. Papadimitriou and Potamianos [95] used the American Sign Language Lexicon Video Dataset, which consists of video sequences of isolated American Sign Language signs.
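As a concrete illustration of the spatio-temporal feature learning that these isolated-SLR studies rely on, the following is a minimal sketch of a 3D-CNN classifier for short sign clips; the layer sizes, clip dimensions, and the 100-class vocabulary are illustrative assumptions and do not reproduce the attention-based architecture of [49].

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Minimal 3D-CNN that maps a short clip to one isolated sign class."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # spatio-temporal filters over (T, H, W)
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                       # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),                               # now pool over time as well
            nn.AdaptiveAvgPool3d(1),                       # global average over time and space
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        # clip: (batch, 3, frames, height, width)
        x = self.features(clip).flatten(1)
        return self.classifier(x)                          # one logit vector per clip, i.e., per isolated sign

model = Small3DCNN(num_classes=100)
dummy_clip = torch.randn(2, 3, 16, 112, 112)               # two 16-frame RGB clips
print(model(dummy_clip).shape)                             # torch.Size([2, 100])
```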
C. DELIBERATIONS ABOUT CONTINUOUS AND ISOLATED SLR
SLR comprises two distinct modes – isolated and continuous, each of which requires a different approach and is associated with very specific challenges. In particular, one key
distinction is that direct supervision is much more essential for continuous SLR. In isolated SLR, all the relevant content is concentrated in a limited area of a single image, but in continuous SLR it is necessary to carefully align the sections of the video in chronological order and ensure that each sentence is properly tagged. This is one example of the complexities associated with continuous sign language recognition, which is far more demanding in terms of computing efficiency. This must be taken into account during the evaluation of methodologies, as well as the feature selection process. If sequential labeling is done correctly and the most predictive features are selected, the resulting model has a higher chance of being accurate with continuous video analysis.
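One widely used way around the frame-level alignment problem described above is connectionist temporal classification (CTC), which lets a network be trained from sentence-level gloss labels alone. The sketch below is a minimal illustration using PyTorch's nn.CTCLoss; the feature size, gloss vocabulary, and sequence lengths are toy assumptions for the example rather than settings from any cited system.

```python
import torch
import torch.nn as nn

# Assumed toy dimensions: 80-dimensional frame features, 20-gloss vocabulary (+ CTC blank).
num_glosses = 20          # index 0 is reserved for the CTC blank symbol
feature_dim = 80

encoder = nn.LSTM(input_size=feature_dim, hidden_size=128, bidirectional=True, batch_first=True)
to_glosses = nn.Linear(2 * 128, num_glosses + 1)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# One video of 75 frames whose sentence contains 6 glosses with unknown frame boundaries.
frames = torch.randn(1, 75, feature_dim)
target = torch.randint(1, num_glosses + 1, (1, 6))

hidden, _ = encoder(frames)
log_probs = to_glosses(hidden).log_softmax(dim=-1)          # (batch, time, classes)
log_probs = log_probs.permute(1, 0, 2)                      # CTCLoss expects (time, batch, classes)

loss = ctc_loss(log_probs,
                target,
                input_lengths=torch.tensor([75]),
                target_lengths=torch.tensor([6]))
loss.backward()   # the network is trained without any frame-level gloss labels
print(float(loss))
```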
Over the last several years, smart applications of deep learning systems have removed many obstacles in this field as well as many other related automation tasks, but real breakthroughs that could lead to broad application by the general population are still ahead. The attention mechanism is an intriguing element that works well with different types of data, and can be used to describe complex interactions in space and time (for example, Graph Neural Networks applications). Further research will show whether this approach is the most optimal for resolving the issues complicating continuous SLR at the moment.

VI. SIGN LANGUAGE RECOGNITION BASED ON LOCALIZATION
Many basic concepts surround the use of sign language. First, sign languages are never international. Most, but not all, nations use different sign languages. Sign language is popular in American, British, Arabic, and Chinese settings, among many others. Table 3 provides an overview of various studies undertaken using different sign languages. For instance, American Sign Language (ASL), the most popular localization, includes independent grammar rules that are not a visual form of English. Application of this localization was evident in the experiment by Rioux-Maldague and Giguere [44], where the authors applied their proposed technique and classified ASL based on grammatical rules. Another experiment by Tang et al. [39] considered 36 hand postures obtained from American Sign Language to facilitate posture training and recognition. However, there are other systems that derive non-ASL signs and use them in English order. Such examples include experiments that have focused on Italian Sign Language. In [38], a dataset consisting of 20 Italian cultural or anthropological signs was used to evaluate a novel bimodal dynamic network designed to recognize gestures. The Italian dataset consisted of 393 labeled sequences and a total of 7,754 gestures.
Arabic Sign Language is also considered the preferred communication approach among many hearing-impaired people. In [40], depth and intensity images in the Arabic language were used to develop a system that can recognize associated signs. The proposed system was tested using a dataset obtained from three different users, resulting in an accuracy of 99.5%. The authors in [121], [169] also used an Arabic Sign Language dataset. In some cases, sign language experiments have focused on Chinese. In [81], vocabulary was adopted from 510 distinct words obtained from Chinese Sign Language. Among these words, 353 were single-sign words, while the remaining were multi-sign words. Yang and Zhu [65] also showed interest in Chinese Sign Language and used the instructional video We Learn Sign Language to meet the objective of their experiment. Another experiment by Jiang and Zhang [71] used Chinese Sign Language to facilitate the fingerspelling process. Furthermore, the authors in [49]–[51], [97] used Chinese Sign Language in their experiments.
A few other experiments evaluated Argentine Sign Language. An example would be [37], where an initial dataset was obtained from 10 subjects speaking Argentine Sign Language. Konstantinidis et al. [70] also used Argentine Sign Language featuring 10 subjects to explore hand and body skeletal recognition. Rather than focusing on a single language, some experiments use a mixture of sign languages. For example, Koller et al. [63] employed a mixture of Danish and New Zealand sign languages in their effort to examine how to train a CNN on 1 million hand images. The sign languages were obtained from two representative videos based on publicly available lexicons. The Danish data did not have any motion blur, while the New Zealand version had some motion blur. In another experiment, Camgoz et al. [64] focused on Danish, New Zealand, and German sign languages to evaluate the role of SubUNets in sign language recognition.

VII. RELATED STUDIES
The importance of hand gestures and sign recognition is indisputable; well-designed technologies of this type have the potential to impact millions of lives in a positive manner. This is reflected by the amount of new research in this area [143], [144], [146], [170], [171], which is growing rapidly as new technological platforms become available. By compiling a comprehensive list of SLR methods that are currently being discussed in research circles [6], [16], [172], we aim to provide the foundation for future researchers who are searching for references and inspiration. We discuss two major groups of solutions: vision-based (including both static and dynamic) methods and sensor-based SLR methods. Regarding the first group, various segmentation and feature extraction techniques are reviewed in [6], along with examples of successful neural classifiers. The latter group is discussed primarily in the context of a specific sensor device that enables data capture, while data processing is explained only briefly.
In terms of performance evaluation, the two metrics that are highly relevant to all of the discussed studies are classification accuracy and sample size. Classification accuracy is the percentage of correctly recognized signs and can be within the 0–100% range, with a higher percentage indicating better recognition results. Some of the methods in this field reached very high accuracy levels above 98% [173]–[178], but it is important to understand the conditions
TABLE 3. Categorization of sign language studies.

under which an algorithm can be expected to perform to its full potential as well as the scope of its possible applications. Most importantly, reviewing these studies can direct any interested researchers toward the most relevant research that may contain further information regarding a topic of interest.
The sample size represents a combination of the total number of gestures that were displayed during the experimental evaluation and the number of classes to which a particular sign could belong [179], [180]. A larger sample size indicates more reliable results, and is always preferable. The sample size used for model training was listed in cases where it was specified in the original study.
The majority of reviews in this field lack sufficient space to provide in-depth discussions of all methods, and instead provide only a general snapshot [145], [181]–[184]. Additionally, some aspects of the sign recognition algorithms were not significantly discussed; for example, data preprocessing methods have been omitted because of the uneven presence of such information in the studies that were reviewed. In addition, methods that rely on sensors or customized input devices were not given proper consideration. Individual applications of SLR technology were also presented in a very succinct form, despite their relevance for the large number of users. This topic definitely deserves more attention in order to create innovation space that would allow for addressing numerous practical and philosophical issues that were not adequately covered in the previous period [59], [82]. The completeness of a literature survey is also relative, as new studies are rapidly published such that the solutions listed in any survey will eventually become obsolete. Hence, the value of this scientific resource will gradually decrease, and it will have to be replaced with a more updated version at some point.

VIII. BENCHMARKS
As we have seen in the previous sections, advanced learning algorithms have been used in the context of SLR with various degrees of success. As new deep learning architectures are being devised, some could bring improvements to studies of sign language interpretation and push them a step closer to practical application. Improved accuracy of basic sign recognition would open the doors for more advanced linguistic operations, including translation into spoken language, and prediction of the following signs. Since there are many regional variations of sign language, it is preferable to develop methods with universal potential. Deep learning networks can be trained on specific language corpora, which illustrates why this approach is so promising. It is hoped that the challenging issue of continuous SLR could be decisively solved with further perfection of currently researched methods based on artificial intelligence and deep learning. In this part, we compare previous work from several aspects, such as datasets, features, and performance analysis.

A. DATASETS
This section presents some of the most important and available datasets containing hand gestures that can be used for the evaluation of SLR tools. The emphasis is on ensuring large enough dictionaries to facilitate more robust testing and more sophisticated applications. Currently, some high-quality sets can be used for this purpose, depending on the chosen geographic variation of sign language. For the UK version of sign language, researchers have multiple datasets at their disposal, including RWTH-Boston-1, RWTH-Boston-50, and RWTH-Boston-400, ranging in number of different signs from 10 to 400.
A high-quality data corpus is also available for German Sign Language, with DGS Kinect-40, SIGNUM, and RWTH-PHOENIX-Weather as the most prominent examples. Those sets contain between 35 and 1225 unique signs, have a large number of authentic sentences by up to 9 skilled signers, and are labeled with the first and last frame of each sign clearly defined in terms of facial and hand features. ASLLVD, proposed by Thangali et al. [198], is the most significant resource for studying American Sign Language, and contains over 30 thousand signs performed by 6 different persons. This is also a labeled set with designated frames marking the beginning and end of every sentence.
Studies of Polish Sign Language variation can use one of the 3 high-quality data sets, including PSL Kinect 30, PSL
ToF 84, and PSL 101. Those datasets contain only isolated words (totaling between 30 and 101 signs) and have the limitation of being performed by only one person. The sign corpus IITA-ROBITA ISL is available for Indian researchers, and it was developed collaboratively between 2010 and 2017 by several research teams. Unfortunately, the entire set is performed by a single signer and contains only 23 signs. From all the aforementioned datasets, two stand out for their universal usability – ASLLVD and RWTH-PHOENIX-Weather. Those publicly available sign language sets are suitable for interpretation of sign language in conditions that most resemble the real world, which is why they are often used as benchmarks in SLR studies for determining the effectiveness of proposed computing techniques.

TABLE 4. Datasets Arabic sign language recognition.

Access to specialized datasets is currently one of the limiting factors in SLR research, which is why almost all researchers focus on this issue. The problem is exacerbated by the fact that separate datasets are required for each regional variation of sign language and for each different type of linguistic task. In some studies, the authors constructed new datasets from scratch by making video recordings of sign language users and obtaining other measurements, while in other cases, well-known local datasets were used instead. A typical dataset contains multiple repetitions of the same sign by several signers, with the objective of facilitating signer-independent recognition capacity after training. Some datasets presented in the literature are considerably larger than others, and this aspect should be taken into consideration when assessing the reliability of results.
Our examination of the datasets in all the reviewed research papers was conducted based on firmly defined criteria, as we can see in Tables 4–12, and relied on the discussions in the literature. Given that all papers are primarily interested in decoding sign language elements of various complexity levels, the databases used have many common features and can effectively be classified based on these features. The criteria were selected with the idea of providing a framework for direct comparison between studies, although in some cases, certain categories may not be applicable or some data may not have been reported by the authors. In this manner, our overview attempts to illustrate both the commonalities and differences among datasets upon which the conclusions of each study were based. Owing to space constraints and the need for clarity, training, testing, and evaluation datasets were typically merged together, so in some of the studies, the actual structure of a particular dataset may be more complex than is apparent from the size listed in the table. A more focused examination of each particular example is recommended for those interested in the practical use of any SLR dataset belonging to this group.
A quick glance at Tables 4–12 is sufficient to note the large degree of diversity among the datasets with respect to the parameters used. This is a natural result of the fact that sign language studies employ a variety of methodological concepts and may explore mutually unrelated aspects of sign language recognition. It is important to understand the distinction between isolated and continuous SLR and the types of datasets suitable to each approach – for example, alphanumerical characters or words are typically used for recognition of isolated language elements, while sentences or even longer segments of speech are necessary for continuous SLR experiments. The datasets also greatly differ in terms of size and complexity, which is important to consider when attempting to evaluate the generalization potential of a given model. However, even the largest datasets are far from exhaustive and are typically limited by available resources and practical concerns.
One encouraging trend is that additional datasets documenting many different regional variations of sign language are becoming available. This is important because SLR research is universally relevant, so building automated tools capable of recognizing local versions of hand signs should be a priority. Multi-modal datasets are also becoming more common, which is a positive development signaling the next stage of SLR research and opening additional possibilities for innovative ideas. On the negative side, most datasets were created using a very small number of signers and feature a small number of classes, which brings into question their representative value. Consequently, the accuracy of any automated tools that rely on those datasets could be compromised when faced with slightly different presentations of sign language gestures. In any AI-related research field, the availability of high-quality datasets for model training and testing is a crucial factor that can affect the pace of progress. As a relatively new area of interest, SLR research initially suffered from this problem, but the studies reviewed offer evidence that the situation is steadily improving in this respect. There are several widely used datasets that can be considered 'standard' and can be used whenever broad compatibility of the experimental results is desired. On the other hand, new datasets focused on local sign language systems are quickly emerging and could potentially be re-used to fuel additional research in the same geographic location. Despite the optimistic outlook, it is necessary to recognize that currently available datasets differ greatly in terms of quality, size, and structure, potentially necessitating the compilation of new datasets to support specific directions of research.

B. PERFORMANCE EVALUATION
A vast majority of research papers are concerned with accurate recognition of sign language material, and the primary metrics they use attempt to measure this capability.


TABLE 5. Datasets for American sign language.

TABLE 6. Datasets for sign language recognition for EU countries languages.

Consequently, some studies use common percentage-based accuracy indicators, such as precision and recall, as well as their combination, known as the F1 score. Depending on the stage of an experiment, some authors differentiate between training accuracy, testing accuracy, and validation accuracy. The training time necessary to accomplish reasonable accuracy is another factor that was tracked in some studies, and it was most commonly expressed in epochs. Processing time and input video length were less frequently considered to be sufficiently relevant to warrant direct measurement, but could be expressed in seconds and/or the number of frames. Tables 13 and 14 provide overviews of two performance benchmarks used in ASLR studies.
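For readers less familiar with these indicators, the short sketch below shows how the accuracy, precision, recall, and F1 scores reported in such studies can be computed for a multi-class sign classifier; the tiny label lists are invented purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy example: ground-truth sign classes vs. predictions for ten test clips.
y_true = ["hello", "thanks", "yes", "no", "yes", "hello", "no", "thanks", "yes", "no"]
y_pred = ["hello", "thanks", "no",  "no", "yes", "hello", "no", "yes",    "yes", "no"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging treats every sign class equally, which matters when some
# signs appear far more often than others in the test set.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```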
Almost every research paper reviewed includes a quantitative evaluation of the proposed sign recognition algorithm. Testing varies greatly in scope and complexity, with particular tests being administered depending on the objectives of the study. In general, the tests were designed to estimate how effectively the algorithm could differentiate between sign language words or sentences, often in comparison with several benchmark methods. Because of the diverse nature of the tests, the results can generally be compared across different studies with only some reservations; in general, many methods performed reasonably well and recognized more than 90% of the displayed signs. In some cases, the reported effectiveness was above 97%, but this usually involved less complex tasks and often could not be maintained over multiple datasets. For continuous SLR tasks, recognition rates above 80% can be considered very strong, especially when they are consistent over multiple datasets.
A notable trend found was that almost all algorithms exhibited mixed performance from one sign to another, and in general, only a handful of confusing signs were typically responsible for a large portion of false recognitions. These frequent mistakes often persisted regardless of the classifier or training procedure, and were caused either by the similarity
TABLE 7. East Asian countries sign language recognition datasets.

TABLE 8. Other sign language recognition datasets.

between hand gestures or by other systemic factors. This finding implies that certain difficulties in the structure and form of sign language, rather than methodological deficiencies, are impeding the construction of more effective tools, and serves as a reminder that currently available SLR algorithms are still prone to error and must be constantly compared with human-created estimations to avoid miscommunication.
In general, the performance of the suggested models is typically evaluated in terms of capacity for correct execution of the primary task, i.e., sign language recognition or translation. Average accuracy for the entire dataset is given as the main indicator of model performance, with a higher percentage indicative of a more accurate system. In some cases, top-1, top-5, and top-10 accuracy were calculated, expressing the model's ability to identify 'most likely' candidates rather than one correct answer. A BLEU score was used to assess the quantitative output of translation models with values between 0 and 100, as depicted in Table 15, while qualitative analysis was based on comparison with ground truth as interpreted by human operators. A combination of accuracy and training sample size is used to construct the learning curve, which demonstrates how the performance changes as the volume of the training sample increases. Word error rate (WER) analysis is conducted in some studies, as shown in Table 16, to determine which glosses are most confused with each other.
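The sketch below illustrates, under toy assumptions, how two of these sequence-level metrics are typically computed: top-k accuracy from a score matrix and WER as a normalized edit distance over gloss sequences (BLEU, in contrast, is usually taken from an off-the-shelf NLP library). The example scores, labels, and gloss strings are invented for illustration.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance recurrence over gloss sequences."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution or match
    return d[len(r), len(h)] / max(len(r), 1)

scores = np.random.rand(4, 10)                      # 4 clips, 10 candidate signs
labels = [3, 7, 1, 9]
print("top-5 accuracy:", top_k_accuracy(scores, labels, k=5))
print("WER:", word_error_rate("IX-YOU HELP WANT", "IX-YOU WANT"))
```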
To be effective, the neural classifier must first be trained on data resembling the samples it will encounter during testing and/or practical use. The training data usually involve a basic group of sign language characters, words, or sentences presented in a format that the system was designed to

TABLE 9. Datasets for sign language based on alphabetic linguistic content.

TABLE 11. Datasets for sign language based on linguistic content in words and sentences.

TABLE 12. Datasets for sign language based on other linguistic content.

TABLE 10. Datasets for sign language based on hand gesture linguistic content.

decipher, which is typically annotated by human observers. After training is conducted, the model can be used to deduce the sign language elements in the same format with varying degrees of accuracy. In some studies, several classifiers were tested on the same tasks to evaluate their relative strengths and weaknesses, while in others the focus was on discovering the most suitable combinations of features. The capability of the neural model is generally limited to the signs learned from the training set, but some generalization regarding different people displaying the same sign can be achieved. Therefore, the optimization of training parameters is one of the most important elements of SLR research and can have tremendous impact on the utility value of the proposed solutions. More advanced systems aim to develop real-time translation capacity and to interpret more complex segments of continuous sign language speech. Such applications are vastly more complex than simple recognition of alphabetic characters or isolated words, and they frequently have to analyze multiple signs together to understand the meaning behind a given sequence. In response, researchers have to deploy hybrid architectures and sophisticated sequence-to-sequence models intended to capture semantic nuances and avoid confusing similar signs.

IX. OPEN ISSUES AND FUTURE DIRECTIONS
After reviewing numerous studies related to SLR, the most obvious weakness is the fragmented nature of research in this
TABLE 13. State-of-the-art sign language recognition accuracy results.

TABLE 14. State-of-the-art sign language recognition accuracy results.

field. Many research teams have achieved promising results using a wide variety of approaches, but there is little overlap among these studies, and the joint utilization of multiple effective tools is slow to emerge. The lack of a general consensus regarding the most valuable features and the optimal neural network architecture may be an obstacle to achieving better practical results. Recognition of continuous sign language speech remains a considerable challenge, and even the best automated systems struggle with linguistic nuances that can be expressed in sign language sentences. This may be partially a consequence of the fact that most available datasets contain only limited vocabularies and simple sentences, while training models for advanced linguistic tasks requires far more extensive libraries containing diverse examples.
Understanding sign language communication remains a formidable challenge for automated systems. On closer observation, the reasons for the continued inability of machines to consistently and accurately interpret sign language sequences are not as mysterious as they appear to be at first glance. Any natural language features a complex interplay of many rules and relationships, which are difficult to summarize in a mathematical format that can be programmed into computers. This explains why the current generation of sign language recognition (SLR) tools fares quite well with alphabetic characters and simple words and phrases, yet struggles to handle continuous streams of communications such as conversations and narratives.
These shortcomings will almost certainly be remedied in the future, as this field is regarded as socially relevant and consequently receives significant attention from some of the world's most accomplished research teams. It may be argued that the upcoming period is critical for overcoming some of the obstacles that stand in the way of more rapid progress. Some of the main areas where it would be reasonable to expect significant changes over the next few years include the following.

A. TYPE OF INPUT
A majority of reviewed models make use of depth imaging, although some are focused on the RGB images with a higher amount of details to facilitate efficient SLR. Sequential information has been useful as well, most commonly for tracking objects and scenes, along with information about the skeleton (i.e. joint positions). Thermal imaging is less frequently used for SLR, but can bring additional gains when combined with some of the basic types of information such as images. On the level of signs, there is a distinction between static
TABLE 15. Bilingual evaluation understudy score comparison.

TABLE 16. Word error rate comparison.

and dynamic signs, with the latter group having a subgroup used in continuous SLR. Based on the current trends, it can be assumed that continuous video and complex signs will become a key focus of study in the next period. It appears that all the preconditions are in place for this shift of focus to occur.
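As a small illustration of how two of these input modalities can be combined at the input level, the hedged sketch below stacks a depth map onto the RGB channels before a standard convolutional layer; the tensor sizes and the choice of early fusion are assumptions made for the example, not a recommendation drawn from any specific study.

```python
import torch
import torch.nn as nn

# Early fusion: stack the depth map as a fourth channel next to RGB, so a
# standard 2D CNN sees both modalities for every frame.
rgb_frame = torch.rand(1, 3, 112, 112)     # normalized color frame
depth_frame = torch.rand(1, 1, 112, 112)   # normalized depth map from a Kinect-style sensor
fused = torch.cat([rgb_frame, depth_frame], dim=1)   # shape (1, 4, 112, 112)

first_layer = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=3, padding=1)
print(first_layer(fused).shape)            # torch.Size([1, 32, 112, 112])
```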

B. GLOBAL RESOURCE BASE
One of the overarching themes in SLR research is the chronic lack of high-quality input databases. Large and diverse datasets are available only for American Sign Language and a few other variations, such as Chinese or Indian, but researchers in smaller countries lack the samples needed for model training and testing. This is slowly changing as the volume of study into SLR continues to grow, and the accumulated resources are becoming sufficient to support the next wave of research. While the situation is certainly improving, it remains difficult to test more advanced applications that require large vocabularies to demonstrate the full capacities of existing or future methods. On the other hand, direct collaboration among research teams and more proactive sharing of available resources could alleviate current issues to a considerable degree and provide a blueprint for more impactful networking. Sign language recognition is a worldwide problem, and the only way to resolve it requires a truly global effort.
On the other hand, there are numerous regional variations of sign language, relying on unique combinations of hand and facial gestures to express meaning. For this reason, there is a clear need for high-quality datasets including all relevant input modalities. With regard to hand signs, there is currently a lack of adequate labeled sets that would enable testing of SLR tools under natural conditions, but this has been changing recently. Hopefully, improving datasets will eventually facilitate development of practically applicable SLR models. For this, it is necessary to label longer parts of sign language speech, not just individual elements as is mostly the case right now. Basically, new datasets need to reflect the diversity of communication with sign language so that newly developed methods could be a step closer to reality. Real communications are continuous and without artificial limits, and modern SLR tools must be able to handle long sequences of signs without problems. With the use of deep learning networks entering a mature stage, this ambitious goal could be reached soon.

C. COMBINING DIFFERENT FEATURES
This issue has been addressed by many studies, but many difficulties still need to be worked out. It can be desirable to combine features when trying to describe multiple parts of the human body. This issue is typically complicated by the fact that data can be in different formats and include textual elements, images of different types, depth and skeleton data, etc. Fusing some of this data together can result in improved feature engineering and consequently a more accurate model. Three main areas of the body where such features are concentrated include the hands, the facial region, and the torso. Limiting the attention to hands only can result in imperfect models that fail to properly interpret some of the signs.
Specific areas relevant for success in this regard include detection of hand position, estimation of hand shapes and gestures, real-time following of hand movement, and similar tasks, and in many ways all of those tasks can present problems. For example, there is extremely high variability in the size and shapes of hands of different signers, while on the other hand different fingers can look very similar and sometimes block each other from view. Ambient conditions, including the amount of available light, also come into play. Those difficulties are magnified when input images are in low resolution or there are interfering objects, and when complicated gestures need to be analyzed. To alleviate some of those concerns, researchers resort to feature fusion and include facial characteristics in the mix.
On the other hand, rapid motion of the face and neck during sign language use presents its own challenges, including partial blocking of key areas. The third group of features – those related to the signer's body – can be added as well and bring an additional recognition improvement. Hence, versatile models capable of leveraging features from different parts of the body
have an edge and present a better starting point for future research.
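To make the idea of combining hand, face, and body features concrete, here is a minimal late-fusion sketch in which each region has its own small encoder and the embeddings are concatenated before classification; the feature dimensions, class count, and module names are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Each body region gets its own small encoder; the embeddings are then
    concatenated before the final sign classifier (late feature fusion)."""
    def __init__(self, hand_dim=42, face_dim=20, body_dim=16, num_classes=50):
        super().__init__()
        self.hand_net = nn.Sequential(nn.Linear(hand_dim, 64), nn.ReLU())
        self.face_net = nn.Sequential(nn.Linear(face_dim, 32), nn.ReLU())
        self.body_net = nn.Sequential(nn.Linear(body_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(64 + 32 + 32, num_classes)

    def forward(self, hands, face, body):
        fused = torch.cat([self.hand_net(hands),
                           self.face_net(face),
                           self.body_net(body)], dim=-1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.rand(8, 42), torch.rand(8, 20), torch.rand(8, 16))
print(logits.shape)   # torch.Size([8, 50])
```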
D. SEQUENCE SLR MODELS
While notable success has been attained in the realm of isolated SLR, where the algorithms merely have to recognize the single alphabetic sign or word, the same cannot be said for continuous SLR, which involves interpretation of longer segments of speech. Contextual relationships among signs have a strong impact on the meaning of the sentences, so this task cannot be reduced to the recognition of individual gestures.
Contemporary attempts to develop continuous SLR capacity have demonstrated only limited effectiveness and frequently make mistakes when sophisticated analysis of semantic details is required. This is obviously one of the hot topics of SLR research that will continue to be examined in many different ways, searching for a configuration that can overcome the difficulties preventing the emergence of highly effective tools. Based on current findings, we believe that research in this area will focus on deeper and more complex neural network models that employ additional layers and combine several types of layer compositions to gain additional processing power.
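The sequence-to-sequence direction mentioned above can be illustrated with the minimal encoder-decoder sketch below, which maps a sequence of per-frame sign features to a sequence of spoken-language tokens under teacher forcing; all dimensions, the GRU choice, and the vocabulary size are assumptions made for the example rather than a description of any cited system.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Minimal encoder-decoder: one GRU summarises the sign video features and a
    second GRU emits one spoken-language token per step."""
    def __init__(self, feature_dim=64, vocab_size=200, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feature_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_features, target_tokens):
        # frame_features: (batch, frames, feature_dim); target_tokens: (batch, words)
        _, context = self.encoder(frame_features)           # final hidden state summarises the video
        decoded, _ = self.decoder(self.embed(target_tokens), context)
        return self.out(decoded)                            # (batch, words, vocab) logits

model = TinySeq2Seq()
frames = torch.randn(2, 75, 64)                 # two videos of 75 frame descriptors each
targets = torch.randint(0, 200, (2, 9))         # teacher-forced target sentences
print(model(frames, targets).shape)             # torch.Size([2, 9, 200])
```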
E. IMPROVING RECOGNITION ACCURACY
To be used commercially and trusted by everyday users, any technology must be highly reliable (>99%) and highly consistent. This is not the case with the SLR applications available today, as they typically still report a small but persistent percentage of false positives and false negatives. The rate of incorrectly recognized items increases as the size of the vocabulary and the complexity of the tasks increase, which is why very few SLR tools are currently deployed in practice. Some of the proposed solutions are conceptually sound and appear suitable for further development, but they are often created by small teams that lack the resources to conduct large-scale testing and refine the training procedures. For the next phase, it is necessary to summon broader support and gather sufficient funds and resources to make high-level accuracy optimization possible. The systems will have to be tested under a wide variety of settings and deliver reasonably useful results even when the external conditions are less than ideal (for example, input images taken under poor lighting conditions).

F. IMPROVING THE EFFICIENCY OF SLR SOLUTIONS
In the past, the focus of scientific research has been on developing the fundamental capability to meaningfully connect observed hand and body gestures and fixed units of sign language. While this is understandable for an early stage of scientific examination, it will be necessary to increase attention on the usability dimension in the future. Some of the earliest SLR solutions required body-worn sensors and other bulky equipment, but newer systems are considerably less demanding and may include only a few miniature cameras. Interaction between the user and the system is another topic that will have to be considered more seriously in the future, with the idea of providing the user with a level of control over the software used by a system. It is equally important to create feedback mechanisms that allow for the instant discovery of common errors while ensuring that user opinions are valued. The previous period of discoveries has renewed interest in SLR research and has resulted in many competing theoretical postulations.
While deep learning networks are broadly accepted as the most suitable technology for tackling this difficult linguistic problem, there is much work needed before fully automated systems capable of understanding continuous streams of sign language communication can be created. In the upcoming decade, we can expect some of the already known solutions to mature to the point where their accuracy is nearly perfect, and it is possible that a major breakthrough will occur at any given point. As the SLR methods become more reliable, it is nearly a given that more creative and meaningful mainstream applications will emerge, delivering direct benefits to the entire population of hearing/speech-impaired people anywhere in the world.

X. CONCLUSION
There is little doubt that the current momentum in the field of sign language recognition will continue into the foreseeable future – the number of potential beneficiaries of such solutions is simply too great to ignore. Recent advances in this area have been largely fueled by the use of deep learning models, which are currently being perfected and will only become more widely accepted in the coming years. Over the past decade, many original and highly innovative suggestions have been used to build SLR tools by extracting features from sensor data or visual streams and feeding them into neural classifiers.
In this paper, we covered most of the currently known methods for SLR tasks based on deep neural architectures that were developed over the past several years, and divided them into clusters based on their chief traits. There is a multitude of options in this regard, as this area of research has been attracting a lot of attention lately. The most common design deploys a CNN network to derive discriminative features from raw data, since this type of network offers the best properties for this task. When information is collected in multimedia format, some of the architectures that can be used include Long Short-Term Memory, Recurrent Neural Networks, and GRU. In many cases, multiple types of networks were combined in order to improve final performance. Those models are capable of processing information from different sources and in different formats; still images, depth information, thermal scans, skeletal data, and sequential information have all been used with success.
Some of the proposed models were demonstrated to be very effective, albeit on tasks with limited scope. Studies come from all parts of the world and address many different variations of sign language, which is very important toward ensuring global coverage. Despite some remaining issues,
it is fair to conclude that the scientific community is making steady progress toward developing real-time, two-way translation systems that can eventually be deployed in the real world. Before this occurs, it will be necessary to achieve more consistent performance and eliminate some common confusion points (where most algorithms tend to misinterpret an intended sign).
Some hybrid models are emerging that combine the best characteristics of several types of neural networks, and solutions of this type may represent the most logical path forward with respect to advanced SLR applications. It is reasonable to expect breakthroughs in this field in the future, and many of the research studies may include key elements that will eventually become a part of the final solution to automated sign language recognition. Even at this stage, many SLR tools can be practically used to some extent, and can provide immediate relief to disabled people as well as point to future directions of research.

[15] S. Wei, X. Chen, X. Yang, S. Cao, and X. Zhang, ‘‘A component-based vocabulary-extensible sign language gesture recognition framework,’’ Sensors, vol. 16, no. 4, p. 556, Apr. 2016.
[16] N. B. Ibrahim, H. H. Zayed, and M. M. Selim, ‘‘Advances, challenges and opportunities in continuous sign language recognition,’’ J. Eng. Appl. Sci., vol. 15, no. 5, pp. 1205–1227, Dec. 2019.
[17] L. Zheng, B. Liang, and A. Jiang, ‘‘Recent advances of deep learning for sign language recognition,’’ in Proc. Int. Conf. Digit. Image Comput., Techn. Appl. (DICTA), Nov. 2017, pp. 1–7.
[18] T. Starner, J. Weaver, and A. Pentland, ‘‘Real-time American sign language recognition using desk and wearable computer based video,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, Dec. 1998.
[19] F.-S. Chen, C.-M. Fu, and C.-L. Huang, ‘‘Hand gesture recognition using a real-time tracking method and hidden Markov models,’’ Image Vis. Comput., vol. 21, no. 8, pp. 745–758, Aug. 2003.
[20] X. Wang, M. Xia, H. Cai, Y. Gao, and C. Cattani, ‘‘Hidden-Markov-models-based dynamic hand gesture recognition,’’ Math. Problems Eng.,
vol. 2012, pp. 1–11, Feb. 2012.
sign language recognition. Even at this stage, many SLR [21] G. Marin, F. Dominio, and P. Zanuttigh, ‘‘Hand gesture recognition with
tools can be practically used to some extent, and can provide leap motion and kinect devices,’’ in Proc. IEEE Int. Conf. Image Process.
immediate relief to disabled people as well as point to future (ICIP), Oct. 2014, pp. 1565–1569.
[22] X. Lu, B. Qi, H. Qian, Y. Gao, J. Sun, and J. Liu, ‘‘Kinect-based human
directions of research. finger tracking method for natural haptic rendering,’’ Entertainment Com-
put., vol. 33, Mar. 2020, Art. no. 100335.
[23] S. Stoll, N. C. Camgoz, S. Hadfield, and R. Bowden, ‘‘Text2sign: Towards
REFERENCES sign language production using neural machine translation and generative
[1] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, adversarial networks,’’ Int. J. Comput. Vis., vol. 128, no. 4, pp. 1–18,
D. Brentari, and K. Livescu, ‘‘Lexicon-free fingerspelling recognition 2020.
from video: Data, models, and signer adaptation,’’ Comput. Speech Lang., [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, ‘‘Gradient-based learn-
vol. 46, pp. 209–232, Nov. 2017. ing applied to document recognition,’’ Proc. IEEE, vol. 86, no. 11,
[2] M. A. Ahmed, B. B. Zaidan, A. A. Zaidan, M. M. Salih, and pp. 2278–2324, Nov. 1998.
M. M. B. Lakulu, ‘‘A review on systems-based sensory gloves for sign [25] E. Escobedo, L. Ramirez, and G. Camara, ‘‘Dynamic sign language
language recognition state of the art between 2007 and 2017,’’ Sensors, recognition based on convolutional neural networks and texture maps,’’
vol. 18, no. 7, p. 2208, 2018. in Proc. 32nd SIBGRAPI Conf. Graph., Patterns Images (SIBGRAPI),
[3] R. Cui, H. Liu, and C. Zhang, ‘‘A deep neural framework for continuous Oct. 2019, pp. 265–272.
sign language recognition by iterative training,’’ IEEE Trans. Multimedia, [26] S. Hayani, M. Benaddy, O. El Meslouhi, and M. Kardouchi, ‘‘Arab sign
vol. 21, no. 7, pp. 1880–1891, Jul. 2019. language recognition with convolutional neural networks,’’ in Proc. Int.
Conf. Comput. Sci. Renew. Energies (ICCSRE), Jul. 2019, pp. 1–4.
[4] P. S. Santhalingam, P. Pathak, J. Košecká, and H. Rangwala, ‘‘Sign
[27] Y. Liao, P. Xiong, W. Min, W. Min, and J. Lu, ‘‘Dynamic sign lan-
language recognition analysis using multimodal data,’’ in Proc. IEEE Int.
guage recognition based on video sequence with BLSTM-3D residual
Conf. Data Sci. Adv. Anal. (DSAA), Oct. 2019, pp. 203–210.
networks,’’ IEEE Access, vol. 7, pp. 38044–38054, 2019.
[5] A. C. Duarte, ‘‘Cross-modal neural sign language translation,’’ in Proc. [28] P. Witoonchart and P. Chongstitvatana, ‘‘Application of structured
27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1650–1654. support vector machine backpropagation to a convolutional neural
[6] M. J. Cheok, Z. Omar, and M. H. Jaward, ‘‘A review of hand gesture network for human pose estimation,’’ Neural Netw., vol. 92, pp. 39–46,
and sign language recognition techniques,’’ Int. J. Mach. Learn. Cybern., Aug. 2017. [Online]. Available: https://www.sciencedirect.com/
vol. 10, no. 1, pp. 131–153, Jan. 2019. science/article/pii/S0893608017300321
[7] Q. Xiao, Y. Zhao, and W. Huan, ‘‘Multi-sensor data fusion for sign [29] S. He, ‘‘Research of a sign language translation system based on deep
language recognition based on dynamic Bayesian network and con- learning,’’ in Proc. Int. Conf. Artif. Intell. Adv. Manuf. (AIAM), Oct. 2019,
volutional neural network,’’ Multimedia Tools Appl., vol. 78, no. 11, pp. 392–396.
pp. 15335–15352, Jun. 2019. [30] Y. Yang and D. Ramanan, ‘‘Articulated pose estimation with flexible
[8] E. K. Kumar, P. V. V. Kishore, M. T. K. Kumar, and D. A. Kumar, mixtures-of-parts,’’ in Proc. CVPR, Jun. 2011, pp. 1385–1392.
‘‘3D sign language recognition with joint distance and angular coded [31] P. Q. Thang, N. T. Thuy, and H. T. Lam, ‘‘The SVM, SimpSVM and RVM
color topographical descriptor on a 2–stream CNN,’’ Neurocomputing, on sign language recognition problem,’’ in Proc. 7th Int. Conf. Inf. Sci.
vol. 372, pp. 40–54, Jan. 2020. Technol. (ICIST), Apr. 2017, pp. 398–403.
[9] J. Wu and R. Jafari, ‘‘Wearable computers for sign language recognition,’’ [32] R. A. A. R. Agha, M. N. Sefer, and P. Fattah, ‘‘A comprehensive study
in Handbook of Large-Scale Distributed Computing in Smart Healthcare. on sign languages recognition systems using (SVM, KNN, CNN and
Cham, Switzerland: Springer, 2017, pp. 379–401. ANN),’’ in Proc. 1st Int. Conf. Data Sci., E-Learn. Inf. Syst., Oct. 2018,
pp. 1–6.
[10] J. Shang and J. Wu, ‘‘A robust sign language recognition system with mul-
[33] E. Alpaydin, Introduction to Machine Learning. Cambridge, MA, USA:
tiple wi-fi devices,’’ in Proc. Workshop Mobility Evolving Internet Archit.,
MIT Press, 2020.
New York, NY, USA, 2017, pp. 19–24, doi: 10.1145/3097620.3097624.
[34] A. Wadhawan and P. Kumar, ‘‘Sign language recognition systems:
[11] J. Pu, W. Zhou, and H. Li, ‘‘Iterative alignment network for continuous
A decade systematic literature review,’’ Arch. Comput. Methods Eng.,
sign language recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
vol. 28, no. 3, pp. 1–29, 2019.
Recognit., Jun. 2019, pp. 4165–4174.
[35] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From
[12] Q. Xiao, M. Qin, and Y. Yin, ‘‘Skeleton-based Chinese sign language Data, vol. 4. New York, NY, USA: AMLBook, 2012.
recognition and generation for bidirectional communication between deaf [36] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521,
and hearing people,’’ Neural Netw., vol. 125, pp. 41–55, May 2020. pp. 436–444, May 2015.
[13] W. Aly, S. Aly, and S. Almotairi, ‘‘User-independent American sign lan- [37] D. Konstantinidis, K. Dimitropoulos, and P. Daras, ‘‘A deep learning
guage alphabet recognition based on depth image and PCANet features,’’ approach for analyzing video and skeletal features in sign language
IEEE Access, vol. 7, pp. 123138–123150, 2019. recognition,’’ in Proc. IEEE Int. Conf. Imag. Syst. Techn. (IST), Oct. 2018,
[14] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, and P. Presti, ‘‘American pp. 1–6.
sign language recognition with the kinect,’’ in Proc. 13th Int. Conf. [38] D. Wu and L. Shao, ‘‘Multimodal dynamic networks for gesture recogni-
Multimodal Interfaces (ICMI), 2011, pp. 279–286. tion,’’ in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 945–948.


[39] A. Tang, K. Lu, Y. Wang, J. Huang, and H. Li, ‘‘A real-time hand posture [61] N. Rossol, I. Cheng, and A. Basu, ‘‘A multisensor technique for ges-
recognition system using deep neural networks,’’ ACM Trans. Intell. Syst. ture recognition through intelligent skeletal pose analysis,’’ IEEE Trans.
Technol., vol. 6, no. 2, pp. 1–23, May 2015. Hum.-Mach. Syst., vol. 46, no. 3, pp. 350–359, Jun. 2015.
[40] S. Aly, B. Osman, W. Aly, and M. Saber, ‘‘Arabic sign language finger- [62] L. Quesada, G. López, and L. Guerrero, ‘‘Automatic recognition of the
spelling recognition from depth and intensity images,’’ in Proc. 12th Int. American sign language fingerspelling alphabet to assist people living
Comput. Eng. Conf. (ICENCO), Dec. 2016, pp. 99–104. with speech or hearing impairments,’’ J. Ambient Intell. Humanized Com-
[41] J. Huang, W. Zhou, H. Li, and W. Li, ‘‘Sign language recognition using put., vol. 8, no. 4, pp. 625–635, 2017.
3D convolutional neural networks,’’ in Proc. IEEE Int. Conf. Multimedia [63] O. Koller, H. Ney, and R. Bowden, ‘‘Deep hand: How to train a CNN on 1
Expo (ICME), Jun. 2015, pp. 1–6. million hand images when your data is continuous and weakly labelled,’’
[42] K. Li, Z. Zhou, and C.-H. Lee, ‘‘Sign transition modeling and a scalable in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016,
solution to continuous sign language recognition for real-world applica- pp. 3793–3802.
tions,’’ ACM Trans. Accessible Comput., vol. 8, no. 2, pp. 1–23, Jan. 2016. [64] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, ‘‘Sub-
[43] J. Huang, W. Zhou, H. Li, and W. Li, ‘‘Sign language recognition using UNets: End-to-end hand shape and continuous sign language recog-
real-sense,’’ in Proc. IEEE China Summit Int. Conf. Signal Inf. Process. nition,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
(ChinaSIP), Jul. 2015, pp. 166–170. pp. 3075–3084.
[44] L. Rioux-Maldague and P. Giguere, ‘‘Sign language fingerspelling clas- [65] S. Yang and Q. Zhu, ‘‘Video-based Chinese sign language recognition
sification from depth and color images using a deep belief network,’’ in using convolutional neural network,’’ in Proc. IEEE 9th Int. Conf. Com-
Proc. Can. Conf. Comput. Robot Vis., May 2014, pp. 92–97. mun. Softw. Netw. (ICCSN), May 2017, pp. 929–934.
[45] M. A. Hossen, A. Govindaiah, S. Sultana, and A. Bhuiyan, ‘‘Bengali sign [66] S. Wang, D. Guo, W.-G. Zhou, Z.-J. Zha, and M. Wang, ‘‘Connectionist
language recognition using deep convolutional neural network,’’ in Proc. temporal fusion for sign language translation,’’ in Proc. 26th ACM Int.
Joint 7th Int. Conf. Informat., Electron. Vis. (ICIEV) 2nd Int. Conf. Imag., Conf. Multimedia, 2018, pp. 1483–1491.
Vis. Pattern Recognit. (icIVPR), Jun. 2018, pp. 369–373. [67] A. Balayn, H. Brock, and K. Nakadai, ‘‘Data-driven development of
[46] P. Escudeiro, N. Escudeiro, R. Reis, J. Lopes, M. Norberto, A. B. Baltasar, virtual sign language communication agents,’’ in Proc. 27th IEEE
M. Barbosa, and J. Bidarra, ‘‘Virtual sign—A real time bidirectional Int. Symp. Robot Hum. Interact. Commun. (RO-MAN), Aug. 2018,
translator of Portuguese sign language,’’ Proc. Comput. Sci., vol. 67, pp. 370–377.
pp. 252–262, Jan. 2015. [68] G. A. Rao and P. Kishore, ‘‘Selfie sign language recognition with mul-
[47] Suharjito, H. Gunawan, N. Thiracitta, and A. Nugroho, ‘‘Sign language tiple features on adaboost multilabel multiclass classifier,’’ J. Eng. Sci.
recognition using modified convolutional neural network model,’’ in Technol., vol. 13, no. 8, pp. 2352–2368, 2018.
Proc. Indonesian Assoc. Pattern Recognit. Int. Conf. (INAPR), Sep. 2018, [69] M. C. Ariesta, F. Wiryana, Suharjito, and A. Zahra, ‘‘Sentence level
pp. 1–5. Indonesian sign language recognition using 3D convolutional neural
[48] N. Soodtoetong and E. Gedkhaw, ‘‘The efficiency of sign language network and bidirectional recurrent neural network,’’ in Proc. Indonesian
recognition using 3D convolutional neural networks,’’ in Proc. 15th Int. Assoc. Pattern Recognit. Int. Conf. (INAPR), Sep. 2018, pp. 16–22.
Conf. Electr. Eng./Electron., Comput., Telecommun. Inf. Technol. (ECTI- [70] D. Konstantinidis, K. Dimitropoulos, and P. Daras, ‘‘Sign language recog-
CON), Jul. 2018, pp. 70–73. nition based on hand and body skeletal data,’’ in Proc. 3DTV-Conf.,
[49] J. Huang, W. Zhou, H. Li, and W. Li, ‘‘Attention-based 3D-CNNs for True Vis. Capture, Transmiss. Display 3D Video (3DTV-CON), Jun. 2018,
large-vocabulary sign language recognition,’’ IEEE Trans. Circuits Syst. pp. 1–4.
Video Technol., vol. 29, no. 9, pp. 2822–2832, Sep. 2019. [71] X. Jiang and Y.-D. Zhang, ‘‘Chinese sign language fingerspelling via
[50] D. Guo, W. Zhou, A. Li, H. Li, and M. Wang, ‘‘Hierarchical recurrent deep six-layer convolutional neural network with leaky rectified linear units
fusion using adaptive clip summarization for sign language translation,’’ for therapy and rehabilitation,’’ J. Med. Imag. Health Informat., vol. 9,
IEEE Trans. Image Process., vol. 29, pp. 1575–1590, 2019. no. 9, pp. 2031–2090, Dec. 2019.
[51] N. Wang, Z. Ma, Y. Tang, Y. Liu, Y. Li, and J. Niu, ‘‘An optimized [72] H. B. D. Nguyen and H. N. Do, ‘‘Deep learning for American sign
scheme of mel frequency cepstral coefficient for multi-sensor sign lan- language fingerspelling recognition system,’’ in Proc. 26th Int. Conf.
guage recognition,’’ in Proc. Int. Conf. Smart Comput. Commun. Cham, Telecommun. (ICT), Apr. 2019, pp. 314–318.
Switzerland: Springer, 2016, pp. 224–235. [73] S. Ameen and S. Vadera, ‘‘A convolutional neural network to classify
MUHAMMAD AL-QURISHI (Member, IEEE) received the Ph.D. degree from the College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh, Saudi Arabia, in 2017. He was a Postdoctoral Researcher with the Chair of Pervasive and Mobile Computing (CPMC), CCIS, KSU. He is one of the founding members of CPMC. He is currently working as a Data Scientist with the Research and Innovation Division, Elm Company. He has published several articles in refereed journals (IEEE, ACM, Springer, Elsevier, and Wiley). His research interests include natural language processing and understanding, big data analysis and mining, pervasive computing, and machine learning. He received the Innovation Award for a Mobile Cloud Serious Game from KSU, in 2013, and the Best Ph.D. Thesis Award from CCIS, KSU, in 2018.

THARIQ KHALID received the B.Tech. degree in computer science and engineering from the National Institute of Technology Calicut, in 2007. He has done extensive work in the field of machine learning, natural language processing, and computer vision at Samsung Research Institute Bangalore, Verse Innovations, and Target Corporation. He is currently working as a Data Scientist with the Research and Innovation Division, Elm Company. His research interests include computer vision, perception systems for autonomous driving, and deep learning.

RIAD SOUISSI received the M.Sc. degree in telecommunication and information systems from École Centrale Paris. He is the Vice President of Research and Innovation at Elm Company. His research interests include solving real life and challenging problems using cutting edge technology, such as AI (computer vision, NLP, and optimization), blockchain, the IoT, and sensor technology.