Article

Sign Language Sentence Recognition Using Hybrid Graph Embedding and Adaptive Convolutional Networks

by Pathomthat Chiradeja 1, Yijuan Liang 2 and Chaiyan Jettanasen 2,*
1 Faculty of Engineering, Srinakharinwirot University, Bangkok 10110, Thailand
2 School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 2957; https://doi.org/10.3390/app15062957
Submission received: 17 January 2025 / Revised: 21 February 2025 / Accepted: 3 March 2025 / Published: 10 March 2025

Abstract:
Sign language plays a crucial role in bridging communication barriers within the Deaf community. Recognizing sign language sentences remains a significant challenge due to their complex structure, variations in signing styles, and temporal dynamics. This study introduces an innovative sign language sentence recognition (SLSR) approach using Hybrid Graph Embedding and Adaptive Convolutional Networks (HGE-ACN) specifically developed for single-handed wearable glove devices. The system relies on sensor data from a glove with six-axis inertial sensors and five-finger curvature sensors. The proposed HGE-ACN framework integrates graph-based embeddings to capture dynamic spatial–temporal relationships in motion and curvature data. At the same time, the Adaptive Convolutional Networks extract robust glove-based features to handle variations in signing speed, transitions between gestures, and individual signer styles. The lightweight design enables real-time processing and enhances recognition accuracy, making it suitable for practical use. Extensive experiments demonstrate that HGE-ACN achieves superior accuracy and computational efficiency compared to existing glove-based recognition methods. The system maintains robustness under various conditions, including inconsistent signing speeds and environmental noise. This work has promising applications in real-time assistive tools, educational technologies, and human–computer interaction systems, facilitating more inclusive and accessible communication platforms for the deaf and hard-of-hearing communities. Future work will explore multi-lingual sign language recognition and real-world deployment across diverse environments.

1. Background and Significance

Despite technological advances, deaf and hard-of-hearing people still face barriers to communication [1]. Deaf people worldwide experience significant communication gaps because traditional communication tools struggle to capture the multidimensional nature of sign language [2]. Sign language recognition (SLR) combines sensor-based machine learning with assistive technology. Unlike spoken language translation, decoding the complex visual-spatial linguistic structures of sign language requires advanced computing methods [3]. Automatic recognition is computationally intensive and complex due to variations in signing styles, environmental factors, and context [4]. Recent advances in deep learning and neural network architectures have enabled more accurate and resilient sign language recognition systems. SLR could improve accessibility for deaf people in healthcare, education, social interaction, and the workplace [5].

1.1. Advanced Feature Extraction and Representation

The first objective is to build a sophisticated computational system that can interpret sign language from multidimensional, intricate features [6]. Specifically, the aim is a hybrid method that understands sign language words by extracting their semantic nuances, temporal dynamics, and spatial relationships. By combining graph-based embeddings with Adaptive Convolutional Networks, the study seeks to overcome the restrictions of standard recognition methods and to handle variations in signing velocity, individual styles, and environmental circumstances [7].

1.2. Performance Enhancement and Robustness

The second objective is to improve sign language sentence recognition accuracy and processing efficiency with novel algorithms [8]. This study seeks a model that performs well across varied datasets, minimizes recognition errors, and handles complicated linguistic scenarios [9]. Achieving this goal involves building adaptive algorithms that accommodate diverse signing settings, yielding a more reliable and versatile recognition system for real-world communication [10].

1.3. Accessibility and Technological Innovation

The third objective is to support the study and development of accessible communication tools that can improve social interactions for individuals who are deaf or hard of hearing [11]. The study aims to develop more precise and flexible sign language recognition systems, creating technological solutions that go beyond translation and promote true communication accessibility. This goal highlights both the importance of the new technology and its broader social effects: assisting deaf people in educational, occupational, and social integration and making communication more inclusive [12].

1.4. Significance

This study seeks to create reliable and flexible systems for recognizing sign language to close the gap in assistive communication technology [13]. This method achieves linguistic equity by acknowledging sign language as a multi-faceted communication system. Advanced sign language recognition systems have far-reaching social and technological consequences, and this study offers a thorough theoretical framework that highlights both. For the deaf community, it seeks to build truly inclusive communication platforms.
The progression learning convolution neural model demonstrated significant improvements in the recognition of sign language gestures using data from wearable glove devices, showcasing its potential for sentence-level translation [14]. The development of a sensor data fusion framework combined with an optimized Elman neural model effectively enhanced the robustness and accuracy of sign language recognition systems [15].
The study provides an innovative approach to sign language recognition using HGE-ACN, which stands for Hybrid Graph Embedding and Adaptive Convolutional Networks. The process includes adaptable convolutional frameworks, multimodal feature extraction, graph-based embedding methods, and extensive validation on several standard sign language datasets. This approach aims to improve sign language recognition and capture complicated spatial–temporal interactions. The key objective points of the study are as follows:
To develop an AI-based framework for glove-based sign language sentence recognition using HGE-ACN to process data from wearable glove devices, focusing on six-axis inertial sensors and five-finger curvature sensors and eliminating the need for vision-based inputs.
To optimize sentence-level recognition using graph-based embeddings and adaptive convolution to solve difficulties including signing speed, signer styles, and hand gesture transitions.
To build a scalable, real-time recognition system for educational tools, accessibility technologies, and human–computer interaction with multilingual and real-world deployment capabilities to promote inclusive communication systems.
An overview of the research follows. Section 2 describes a comprehensive literature and research methodology review. Section 3 describes the research plan, methodologies, and processing. Section 4 describes the results of the analysis. The conclusion and future work are in Section 5.

2. Literature Survey

Venugopalan and Reghunadhan [16] noted that as COVID-19 disrupted clinical services nationwide, deaf patients struggled to communicate for care. Their study demonstrates how automatic sign language identification helps Indian deaf individuals and healthcare workers interact. On a new dataset of dynamic hand gestures for ISL words, the model averages 83.36% accuracy; it achieved 97% accuracy on an alternative ISL dataset and 99.34 ± 0.66% on a benchmark hand gesture dataset. Yet SLR accuracy issues remain when handling varied gesture data, detecting and translating continuous gesture sequences, and recognizing sign languages that involve both hand and body motions.
Tang et al. [17] proposed MSeqGraph, a graph-based multimodal sequential embedding network that exploits implicit temporal cues and multimodal complementarity for sign language translation (SLT). The MSeqGraph model incorporates a graph embedding unit (GEU) that combines channel-wise, temporal-wise, and multimodal relational embedding. The GEU incorporates cross-modal edges, temporal neighborhood edges, and parallel convolutions as part of the GCN computation, and dense multimodal cues are captured by stacking GEUs hierarchically. Experiments on the USTC-CSL and BOSTON-104 datasets show that the suggested strategy works and that multimodal cues improve representation and performance.
Muthusamy and Murugan [18] noted that deaf and hard-of-hearing people use sign language (SL), although communicating with strangers can be difficult. Spatial–temporal multi-cue (STMC) networks from deep learning improve SL prediction but require longer preprocessing and annotation of crucial locations. Their study presents a spatiotemporal hybrid cue network (STHCN) that uses a Dynamic Dense Spatiotemporal Graph Convolutional Neural Network (DDSTGCNN) and a feature extractor network to address ISL recognition and translation; for sequence learning and inference, BLSTM encoders, CTCs, and self-attention-based LSTM decoders operate on the extracted features. The STHCN model detects ISL with 93.65% accuracy.
Kan et al. [19] noted that the deaf community relies heavily on sign language translation (SLT), which typically employs sequence-to-sequence learning, while sign languages convey meaning through a variety of visual and manual cues. The hierarchical spatiotemporal graph neural network (HST-GNN) is a new deep learning architecture proposed in that paper to learn such graph patterns. The network characterizes local and global graph attributes using graph convolutions and graph self-attentions with neighborhood context. Experiments on benchmark datasets such as PHOENIX-2014-T and CSL demonstrate the efficacy of the proposed HST-GNN.
Xu et al. [20] introduced an isolated-word sign language recognition model built on a hybrid SKResNet-TCN network. The model reduces memory and processing requirements using grouped convolution, causal convolution, and dilated convolution, and it outperforms conventional 3D convolutional networks in accuracy while using fewer parameters and simpler procedures. Future work will address continuous sign language, although the study acknowledges that there may be discrepancies between the translated results and the actual order of speech.
Gupta et al. [21] developed a novel sign language recognition approach that combines Convolutional Neural Networks (CNNs) with 1D convolutional layers and Long Short-Term Memory (LSTM) networks. This technique offers precise classification of sign language motions, allowing the identification of temporal patterns and the handling of long-term interdependence. Expanding recognition to other sign languages and enabling user customization are future directions pursued to promote accessibility and inclusivity for people with hearing problems.
Noor et al. [22] noted the limited availability of sign language interpreters for the hearing-impaired in Saudi Arabia. Their hybrid model recognizes Arabic Sign Language using deep learning, combining Convolutional Neural Network (CNN) and LSTM classifiers, and captures both the spatial information of signs and the sequences of hand movements. A dataset of 500 dynamic gesture word videos and 4000 Arabic Sign Language (ArSL) images covering 20 words demonstrated the model’s performance, with accuracies of 82.70% and 94.40%, implying that it could help hearing-impaired communication in Saudi Arabia.
Huang and Ye [23] proposed a new boundary-adaptive learning-based encoder-decoder approach to Chinese sign language recognition (SLR). The boundary-adaptive encoder (BAE) encodes the hierarchical structure within sign language signals, and a location-based window attention model makes the decoding step more efficient. The approach also handles both discrete and continuous Chinese SLR through sign language subword units. While it has shown competitive results on several major benchmarks, its primary limitation is that it is not yet lightweight or optimized for real-time performance.
Kumar et al. [24] addressed sign language machine translation problems with 3D motion capture and graph matching. Three-dimensional sign matching issues include identifying the same signs across multiple motion frames and separating signs from non-sign hand movements. Two-phase graph matching with early estimation helps solve these problems: intra-graph matching searches databases for motion-intensive frames, while inter-graphs from the early-estimation model match motion-derived queries against the datasets. With this method, graph matching estimates signs faster with fewer frames. The model is evaluated on 350 Indian sign language terms, with four variations of each sign performed by five signers at different hand speeds and motions.
Rajalakshmi et al. [25] indicated that the general public struggles to learn sign language for communicating with speech- and hearing-impaired people, even though sign language could aid interaction, and current approaches are expensive because they require worn sensors. A unique vision-based hybrid deep neural network approach identifies Indian and Russian sign motions. The framework detects non-manual and manual co-articulations: attention-based Bi-LSTM extracts temporal and sequential features, while a 3D deep neural network with atrous convolutions extracts spatial features. Modified autoencoders and hybrid attention modules select abstract and discriminative features. The model outperforms existing ones.

3. Proposed Hybrid Graph Embedding and Adaptive Convolution Networks

The Hybrid Graph Embedding and Adaptive Convolutional Networks (HGE-ACN) approach is a new framework for efficient sign language sentence recognition (SLSR) using data from wearable glove devices. The method uses multimodal sensor data from a single-handed glove with six-axis inertial and five-finger curvature sensors, capturing spatial–temporal dynamics critical for recognizing complex sentences. The HGE-ACN integrates graph-based embeddings to model dynamic relationships between sequential gestures and employs an adaptive convolutional framework for robust feature extraction. This hybrid HGE-ACN framework addresses variations in signing speed and seamless transitions between gestures, enhancing recognition accuracy and computational efficiency.
In Figure 1, the wearable glove HGE-ACN system recognizes sign language sentences. It uses six-axis IMU sensors to track hand movement and orientation in three dimensions and finger curvature sensors to quantify finger bending and positioning. The system extracts Inertial Features, Motion Features, and Temporal Sequences from raw sensor data. The primary processing block has two main sections: HGE (Hybrid Graph Embedding), which captures spatial graphs, temporal graphs, and feature embedding, and ACN, which dynamically alters processing settings based on input characteristics. The transition model examines sign transitions within sentences.
The final output step incorporates sentence analysis, pattern recognition, and language modeling. Figure 1 depicts how sensor data flow through the system: initial sensor data go to feature extraction, extracted features go to core processing, and both contribute to final recognition. The HGE-ACN framework handles sign language recognition difficulties such as multimodal integration, adaptive processing, contextual understanding, and efficiency. The key contributions are graph embeddings for spatial–temporal relationship modeling, Adaptive Convolutional Networks for signing variants, style processing for user-independent recognition, and explicit modeling of sign transitions within phrases. The system’s design covers signing speed, style, and intricate transitions between signs without predefined templates. Real-time sign language recognition, learning tools, assistive technology, and HCI systems are practical applications. The HGE-ACN framework improves wearable-based sign language recognition by eliminating camera-based systems, recognizing complete sentences more accurately, handling natural signing variations, and operating efficiently enough for real-time applications. This design meets the paper’s goal of enhancing sign language recognition accuracy and processing economy. The hybrid technique uses graph-based representations and adaptive convolutional processing to handle the complexities of sign language communication.

3.1. Mathematical Foundation

HGE-ACN recognizes sign language sentences using graph-based temporal modeling and Adaptive Convolutional Networks. The approach operates on a dataset of frames represented as matrices defined by frame height, width, and channel count (e.g., RGB). N keypoints with spatial coordinates and confidence scores indicate the hand and pose landmarks of each frame. These matrix representations of the frames, together with the number of frames, height, width, and channels, form the mathematical foundation of HGE-ACN.
To strengthen the mathematical foundation, the values of $x$, $y$, and $c$ are constrained to bounded intervals rather than left unrestricted in $\mathbb{R}$. Equation (1) is as follows:
$$V = \{f_1, f_2, \ldots, f_n\}, \quad f_t \in D \subset \mathbb{R}^{H \times W \times C}, \quad D = [-a_H, a_H] \times [-a_W, a_W] \times [-a_C, a_C] \tag{1}$$
In Equation (1), $D$ represents a bounded domain, and $f_t$ represents the frame at time $t$. $H$ and $W$ represent the frame’s height and width, and $C$ represents the number of channels. This guarantees that the model operates within a well-defined mathematical space.
Each frame $f_t$ contains $N$ keypoints corresponding to hand and pose landmarks:
$$K^t = \{k_1^t, k_2^t, \ldots, k_N^t\}, \quad k_i^t = (x_i^t, y_i^t, c_i^t) \tag{2}$$
In Equation (2), $x_i^t$ and $y_i^t$ represent the spatial coordinates of the $i$-th keypoint, and $c_i^t$ represents its confidence score.
The HGE-ACN model optimizes learning using hyperparameters for robust feature extraction, efficient spatial–temporal modeling, and computational efficiency. Key hyperparameters are detailed in Table 1, providing notations and values.
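To make the notation above concrete, the following minimal Python sketch shows one way the bounded frame tensor of Equation (1) and the keypoint tuples of Equation (2) could be held in memory. The class names (`Frame`, `Keypoint`), array shapes, and example values are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Keypoint:
    # k_i^t = (x_i^t, y_i^t, c_i^t): spatial coordinates plus a confidence score
    x: float
    y: float
    c: float

@dataclass
class Frame:
    # f_t in D ⊂ R^{H x W x C}, clipped to the bounded domain of Equation (1)
    data: np.ndarray       # shape (H, W, C)
    keypoints: list        # N keypoints detected in this frame

def clip_to_domain(frame: np.ndarray, a: float) -> np.ndarray:
    """Constrain raw values to the bounded interval [-a, a], as in Equation (1)."""
    return np.clip(frame, -a, a)

# Example: a 64x64 single-channel frame with one keypoint
raw = np.random.randn(64, 64, 1) * 3.0
frame = Frame(data=clip_to_domain(raw, a=1.0),
              keypoints=[Keypoint(x=0.42, y=0.63, c=0.97)])
print(frame.data.min() >= -1.0, frame.data.max() <= 1.0)
```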

3.2. Feature Extraction

HGE-ACN sign language sentence recognition relies on the feature extraction and keypoint detection process shown in Figure 2a below. It involves collecting and analyzing motion and location data from the wearable glove. The wearable glove has two main sensor types: six-axis IMU (Inertial Measurement Unit) sensors for three-dimensional motion tracking and finger curvature sensors for finger bending angles and hand shape configurations. The HGE-ACN technique relies on feature extraction through three parallel channels: Inertial Features, Motion Features, and Temporal Sequences. Motion Features measure velocity and acceleration patterns, identify movement segments, and assess dynamic signing elements, whereas Inertial Features extract acceleration and gyroscopic information. Temporal Sequences capture time-based patterns in signing, recognizing movements and positions and capturing gesture durations.
In Figure 2a, feature extraction is a crucial step in processing sensor data for accurate recognition and classification. The process begins with sensor data acquisition, where real-time signals from multiple sensors are collected. Preprocessing enhances signal quality, followed by segmentation, where continuous data streams are divided into meaningful gesture segments. Features are extracted, including spatial, temporal, and motion-based metrics, such as velocity, acceleration, angular displacement, and curvature metrics. A spatial–temporal model is developed to capture intricate motion dependencies, improving the representation of gestures. Fusion and optimization techniques are employed to refine the extracted features, integrate multimodal features, and reduce redundancy through dimensionality reduction methods like PCA or LDA. The final step, output feature vectors, represents the optimized dataset, ready for classification models. This structured approach eliminates redundancy and enhances clarity in feature extraction.
$$L_{total} = \{L_{glove}, L_{pose}\} \tag{3}$$
In Equation (3), $L_{glove}$ denotes the hand glove landmarks, which record delicate finger and hand movements, and $L_{pose}$ denotes the pose landmarks, which describe the body’s posture and movements. Dimensions: $L_{glove}, L_{pose} \in \mathbb{R}^{N \times 3}$, where $N$ is the number of keypoints with 3D spatial coordinates. The landmark set captures every key component of human communication for downstream processing. The landmark feature vector for each landmark $i$ in a frame is a high-dimensional feature vector:
$$X_i = [p_x, p_y, p_z, v_x, v_y, v_z, c] \tag{4}$$
In Equation (4), $p_x$, $p_y$, and $p_z$ represent the 3D spatial coordinates giving the precise location of the landmark; $v_x$, $v_y$, and $v_z$ represent the velocity components encoding the motion dynamics of the landmark over time; and $c$ represents the confidence score, indicating the reliability of the detected landmark as determined by MediaPipe. Position components: $p_x, p_y, p_z$ in meters ($\mathbb{R}^3$). Velocity components: $v_x, v_y, v_z$ in meters per second ($\mathbb{R}^3$). Dimension: $X_i \in \mathbb{R}^7$ for each keypoint.
Velocity components measure landmark position changes over time and are computed using finite differences, as given in Equation (5):
$$v_x = \frac{x_{t+1} - x_t}{\Delta t}, \quad v_y = \frac{y_{t+1} - y_t}{\Delta t}, \quad v_z = \frac{z_{t+1} - z_t}{\Delta t} \tag{5}$$
In Equation (5), the dimensions are $v_x, v_y, v_z \in \mathbb{R}$, measured in meters per second (m/s). Gesture identification requires temporal information because movement sequences carry semantic significance. To gather data from users, the system uses wearable gloves with built-in sensors that record hand motions and gestures; a variety of sensors, including magnetometers, accelerometers, and gyroscopes, capture hand motion and orientation from multiple angles. Figure 2b shows the configuration used during this operation, based on the WULALA data glove hardware.
Figure 2b shows the WULALA data glove specifications rather than the processing environment, and Figure 2c shows a captured sign language gesture using a sample glove sensor.
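As a minimal illustration of Equation (5), the snippet below estimates per-landmark velocities by finite differences over a sequence of 3D positions sampled at a fixed rate. The array shapes and the 50 Hz sampling rate are assumptions for the example, not values from the paper.

```python
import numpy as np

def finite_difference_velocity(positions: np.ndarray, dt: float) -> np.ndarray:
    """
    positions: array of shape (T, N, 3) holding (x, y, z) for N landmarks over T frames.
    dt: sampling interval in seconds.
    Returns velocities of shape (T-1, N, 3) via v = (p_{t+1} - p_t) / dt (Equation (5)).
    """
    return (positions[1:] - positions[:-1]) / dt

# Example: 100 frames of 21 hand landmarks sampled at an assumed 50 Hz
positions = np.cumsum(np.random.randn(100, 21, 3) * 0.001, axis=0)
velocities = finite_difference_velocity(positions, dt=1.0 / 50.0)
print(velocities.shape)  # (99, 21, 3), in meters per second
```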
In Figure 2c, the accuracy of sign language identification is enhanced through the fusion of multiple sensor data streams, including motion, curvature, and inertial sensor readings. This fusion process reduces inconsistencies in gesture recognition by integrating complementary information from different modalities. Before fusion, individual sensor data may contain noise or missing values, leading to potential misclassification; after fusion, the combined dataset provides a more comprehensive motion trajectory, improving recognition reliability. The benefits of this fusion process are detailed in Figure 2c, which demonstrates how multiple sensor inputs contribute to a more robust sign language recognition framework. To ensure consistency across sensor data, normalization is applied before fusion. This process involves scaling values to a common range, typically between 0 and 1, to standardize formats and improve comparability. Data fusion is then applied to integrate previously separate datasets into a single, more complete representation, and data redundancy is minimized, improving the efficiency of the analysis. By combining motion data from multiple sensors, the system compensates for the limitations of individual sensors, which may capture only partial data despite their effectiveness. This method enhances the comprehension of sign language gestures by constructing a richer representation of movement and hand positions. Additionally, fusion techniques mitigate noise, disturbances, and other sensor errors, improving accuracy and robustness in sign language recognition.
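A simple sketch of the normalization-then-fusion step described above: each sensor stream is min–max scaled into [0, 1] and the streams are concatenated into one feature matrix. The stream names, shapes, and the choice of concatenation as the fusion operator are illustrative assumptions.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each column of a (T, D) sensor stream into [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / np.where(hi - lo == 0, 1.0, hi - lo)

def fuse_streams(streams: dict) -> np.ndarray:
    """Normalize each stream and concatenate along the feature axis."""
    return np.concatenate([min_max_normalize(s) for s in streams.values()], axis=1)

# Hypothetical streams from one glove recording of T = 200 frames
T = 200
streams = {
    "imu": np.random.randn(T, 6),        # six-axis inertial readings
    "curvature": np.random.rand(T, 5),   # five finger-curvature sensors
}
fused = fuse_streams(streams)
print(fused.shape)  # (200, 11)
```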

3.3. Spatiotemporal Graph Convolutional Networks (ST-GCN)

The Graph Construction Module is essential for tasks such as sign language identification because it rapidly extracts spatial–temporal patterns and arranges them into a structured Spatial–Temporal Graph (STG). This graph efficiently captures intra-frame and inter-frame interactions. The graph is defined as follows:
$$G = (V, E) \tag{6}$$
In Equation (6), $V$ represents the set of vertices, where each vertex corresponds to a detected keypoint, and $E$ represents the set of edges that capture spatial connections (keypoint relationships within a frame) and temporal connections (keypoint relationships across frames). Dimension: $G \in \mathbb{R}^{N \times N}$, where $N$ is the number of keypoints.
Edge weights are calculated using a Gaussian similarity function to quantify node relationships, ensuring stronger connections are represented with higher weights.
$$w_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \tag{7}$$
In Equation (7), $w_{ij}$ is the edge weight between nodes $i$ and $j$, and $x_i$ and $x_j$ are the feature vectors of nodes $i$ and $j$. The scale parameter controlling the sensitivity of the weight computation is $\sigma$. Dimension: $w_{ij} \in \mathbb{R}$.
In Figure 3, the Graph Construction Module organizes collected keypoints into a structured Spatial–Temporal Graph (STG) that efficiently captures intra-frame dependencies and inter-frame interactions. This structure is essential for extracting spatial–temporal patterns for sign language recognition. A graph has vertices representing keypoints and edges representing spatial and temporal connections. A Gaussian similarity function quantifies node relationships to calculate edge weights.
The module starts with each sensor data frame’s hand and body landmarks. It then uses anatomical or spatial relationships to create spatial and temporal links between keypoints in the same frame and in consecutive frames to record motion dynamics. The Gaussian similarity formula assigns larger weights to closer or more similar keypoints in feature space for each edge. The final graph blends spatial and temporal connections to depict keypoint data spatially and temporally.
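The sketch below builds a small spatial–temporal adjacency matrix with Gaussian edge weights as in Equation (7). Which keypoint pairs are connected (the `edges` list) and the value of `sigma` are hypothetical choices for illustration only.

```python
import numpy as np

def gaussian_edge_weight(x_i, x_j, sigma=0.5):
    """w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), as in Equation (7)."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))

def build_adjacency(features: np.ndarray, edges, sigma=0.5) -> np.ndarray:
    """features: (N, d) node feature vectors; edges: iterable of (i, j) index pairs."""
    n = features.shape[0]
    A = np.zeros((n, n))
    for i, j in edges:
        w = gaussian_edge_weight(features[i], features[j], sigma)
        A[i, j] = A[j, i] = w        # undirected spatial/temporal link
    return A

# Toy example: 5 keypoints with 7-dim features (Equation (4)) and a chain of edges
feats = np.random.rand(5, 7)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
A = build_adjacency(feats, edges, sigma=0.5)
print(np.round(A, 3))
```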

3.4. Hybrid Graph Embedding Module

The sign language recognition system uses wearable gloves to create a spatial–temporal network from keypoint detection outputs. These inputs comprise hand-position keypoints from the six-axis IMU sensors, finger-joint keypoints from the finger curvature sensors, velocity-vector motion characteristics, acceleration data, temporal sequence data, and confidence scores for each keypoint. The inputs are arranged into spatial, temporal, and combined feature vectors. Each input frame has 21 keypoints (one wrist plus five fingers × four joints), six motion parameters (three linear plus three angular), confidence ratings, and a timestamp. The data structure is a float–timestamp matrix with nodes representing keypoints, edges representing spatial and temporal correlations, and edge weights calculated using the Gaussian similarity function. From these inputs, the system generates the sign language recognition feature representation using spatial graph convolution, a temporal attention mechanism, and feature fusion, with confidence scores weighting feature reliability.
In Figure 4, to understand the temporal dynamics of movements, the Hybrid Graph Embedding approach uses a combination of methods, including 3D keypoint extraction from sensor data frames, building a Spatial–Temporal Graph, learning representations of spatial features, and computing attention weights. The feature representation step is responsible for fusing the output of the various modules and generating the final feature vector via weighted summing of multi-scale features. This combined graph-based modeling and Adaptive Convolutional Network strategy aims to improve the accuracy and resilience of sign language recognition by capturing the complex spatial–temporal interactions inherent in sign language communication. Figure 4 gives a thorough overview of the proposed methodology, showing the flow of information and the mathematical formulas underpinning each essential component.

3.4.1. Spatial Graph Convolution

In this process, local spatial information is extracted by applying convolution across the graph structure. The formula is as follows:
$$h_i^{(l+1)} = \sigma\!\left(\sum_{j \in N(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\right) \tag{8a}$$
In Equation (8a), $h_i^{(l)}$ denotes the feature representation of node $i$ at layer $l$, $W^{(l)}$ is a learnable weight matrix, $N(i)$ is the set of neighbors of node $i$, $c_{ij}$ is the normalization factor for edge $ij$, and $\sigma$ is an activation function (e.g., ReLU). This operation aggregates information from neighboring nodes to inform the representation of each node.
Dimension: $h_i^{(l)} \in \mathbb{R}^d$, where $d$ is the feature dimension.
In Equation (8a), the term neighbors refers to the set of adjacent nodes used in the spatial graph convolution operation. The connectivity pattern of the graph representation of keypoints in the sign language recognition model determines the number of neighbors.
The eight-connected neighborhood structure improves spatial representation, contextual awareness, and recognition accuracy by capturing diagonal relationships in hand motion. It enhances feature aggregation, leading to robust model performance and a well-defined graph convolution framework, ensuring a more detailed understanding of hand motion.
For a node located at position $(i, j)$, the neighboring nodes are defined as follows:
$$N(i,j) = \{(i-1,j-1),\,(i-1,j),\,(i-1,j+1),\,(i,j-1),\,(i,j+1),\,(i+1,j-1),\,(i+1,j),\,(i+1,j+1)\} \tag{8b}$$
In Equation (8b), $(i-1,j-1)$, $(i-1,j+1)$, $(i+1,j-1)$, and $(i+1,j+1)$ are diagonal neighbors, while $(i-1,j)$, $(i,j-1)$, $(i+1,j)$, and $(i,j+1)$ are direct horizontal and vertical neighbors.
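A minimal PyTorch sketch of the spatial graph convolution in Equation (8a). Symmetric degree normalization is used here as the $1/c_{ij}$ factor, and the layer sizes are illustrative; neither choice is specified by the paper.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One layer of h_i^{(l+1)} = sigma( sum_j (1/c_ij) W h_j^{(l)} ), Equation (8a)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # W^{(l)}

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization plays the role of the 1/c_ij factor (assumed)
        adj = adj + torch.eye(adj.size(0))                      # add self-loops
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        adj_norm = d_inv_sqrt @ adj @ d_inv_sqrt
        return torch.relu(adj_norm @ self.weight(h))            # sigma = ReLU

# Toy usage: 21 keypoints with 7-dim features (Equation (4)), embedded to 64 dims
h = torch.randn(21, 7)
adj = (torch.rand(21, 21) > 0.8).float()
adj = torch.max(adj, adj.t())                                   # make it undirected
layer = SpatialGraphConv(7, 64)
print(layer(h, adj).shape)                                      # torch.Size([21, 64])
```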

3.4.2. Temporal Attention Mechanism

The mechanism models the temporal connections between features over time. The attention weights ($\alpha_{ij}$) in Equation (9) are computed as follows:
$$\alpha_{ij} = \exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right)\right) \tag{9}$$
In Equation (9), $[W h_i \,\Vert\, W h_j]$ denotes the concatenation of the features of nodes $i$ and $j$ after transformation by the weight matrix $W$, $\mathrm{LeakyReLU}$ is an activation function adding non-linearity, and $a$ is a learnable attention vector. This process assigns greater weight to temporally relevant connections, dynamically weighting node interactions. Dimension: $\alpha_{ij} \in \mathbb{R}$, where attention weights are scalars.
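The following sketch computes the attention scores of Equation (9) over a sequence of node features. The normalization of the scores over neighbors (a softmax-style division) and the feature sizes are assumptions added for a usable layer; Equation (9) itself only gives the un-normalized weight.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """alpha_ij = exp(LeakyReLU(a^T [W h_i || W h_j])), Equation (9)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared transform W
        self.a = nn.Parameter(torch.randn(2 * out_dim))   # learnable attention vector a
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (T, in_dim) node features ordered in time
        z = self.W(h)                                              # (T, out_dim)
        T = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(T, T, -1),
                           z.unsqueeze(0).expand(T, T, -1)], dim=-1)
        scores = torch.exp(self.leaky_relu(pairs @ self.a))        # alpha_ij, Equation (9)
        alpha = scores / scores.sum(dim=1, keepdim=True)           # normalization (assumed)
        return alpha @ z                                           # attention-weighted features

att = TemporalAttention(in_dim=64, out_dim=32)
print(att(torch.randn(10, 64)).shape)  # torch.Size([10, 32])
```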

3.5. Adaptive Convolutional Network

The Adaptive Convolutional Network is essential for handling feature representation fluctuations across scales and temporal contexts. Its Dynamic Kernel Generation and multi-scale sensor data fusion components are discussed below:

3.5.1. Dynamic Kernel Generation

In this stage, the weights of convolutional kernels are adjusted dynamically according to the input features at a given time step.
$$K_t = f_{adapt}(F_t) \tag{10}$$
In Equation (10), $K_t$ represents the adaptive convolution kernel weights for time step $t$, $f_{adapt}$ represents a function (typically implemented as a neural network) that dynamically generates kernel weights, and $F_t$ represents the input features extracted at time $t$. Dimension: $K_t \in \mathbb{R}^{k \times k}$, where $k$ is the kernel size. Encoding spatial or temporal information requires extracting input properties from the data at a given time, such as a frame in a sensor data series. The convolution kernel is generated by a dynamic function, typically a small neural network trained to convert feature inputs into kernel weights. The network then processes the input features using this kernel, which lets it specialize its convolution operation according to the data’s content. Because the kernel changes with the input characteristics, the network copes better with variance in the input data, leading to content-aware processing. Furthermore, dynamic kernels improve generalizability by decreasing sensitivity to noise and environmental factors.
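Below is a minimal sketch of dynamic kernel generation in the sense of Equation (10): a small MLP maps the current feature vector to a $k \times k$ kernel, which is then applied to the frame. The architecture of $f_{adapt}$, the single-channel input, and all sizes are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelConv(nn.Module):
    """Generates a k x k kernel from the current features, K_t = f_adapt(F_t) (Equation (10))."""
    def __init__(self, feat_dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.f_adapt = nn.Sequential(              # small network producing kernel weights
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, k * k),
        )

    def forward(self, frame: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        # frame: (1, 1, H, W) single-channel input; feat_t: (feat_dim,) features at time t
        kernel = self.f_adapt(feat_t).view(1, 1, self.k, self.k)   # K_t in R^{k x k}
        return F.conv2d(frame, kernel, padding=self.k // 2)        # content-aware convolution

layer = DynamicKernelConv(feat_dim=11, k=3)
frame = torch.randn(1, 1, 16, 16)
feat_t = torch.randn(11)
print(layer(frame, feat_t).shape)  # torch.Size([1, 1, 16, 16])
```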

3.5.2. Multi-Scale Sensor Data Fusion

The method creates a cohesive picture by merging features from different spatial or temporal scales.
$$F_{fused} = \sum_{i=1}^{M} w_i F_i \tag{11}$$
In Equation (11), $F_{fused}$ represents the fused feature representation capturing information from multiple scales; $F_i$ represents the features extracted at scale $i$, each scale corresponding to features computed with different kernel sizes or temporal windows and capturing fine-grained or coarse-grained information; $w_i$ represents the learnable sensor data fusion weight for each scale, determining the contribution of $F_i$ to the final fused representation; and $M$ represents the number of scales used. Dimension: $F_{fused} \in \mathbb{R}^d$, where $d$ is the feature dimension.
Feature extraction, learnable weights, and summation are used to capture local and global patterns: small-scale features capture local details, while large-scale features capture global patterns, and the learnable weights favor the scales most informative for the task. Sensor data fusion thus ensures multi-scale awareness and task-specific weighting, improving accuracy and efficiency, and the summation integrates information across scales.
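A minimal sketch of the weighted sum in Equation (11) with learnable per-scale weights. Passing the weights through a softmax to keep the contributions on a comparable scale is an assumption, as is the requirement that all scales are already projected to a common feature size.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """F_fused = sum_i w_i * F_i with learnable fusion weights w_i (Equation (11))."""
    def __init__(self, num_scales: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_scales))   # learnable per-scale weights

    def forward(self, features) -> torch.Tensor:
        # features: list of M tensors, each of shape (d,), already projected to a common size
        weights = torch.softmax(self.w, dim=0)          # keep contributions comparable (assumed)
        return sum(w * f for w, f in zip(weights, features))

fusion = MultiScaleFusion(num_scales=3)
feats = [torch.randn(64) for _ in range(3)]             # e.g. fine-, mid-, and coarse-scale features
print(fusion(feats).shape)  # torch.Size([64])
```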
Sign language sentence recognition using Hybrid Graph Embedding and Adaptive Convolutional Networks is made robust by Algorithm 1. The algorithm initializes an empty graph and keypoint set, extracts keypoints from the sensor data frames, builds vertices with spatial and temporal edges, and assigns edge weights using Gaussian similarity. Spatial–Temporal Graph Convolution Networks aggregate graph node characteristics, whereas adaptive convolution generates frame-wise kernels. Finally, the graph and convolution module embeddings are concatenated and processed through dense layers and a softmax classifier to predict words, accurately recognizing complicated spatial–temporal sign gestures.
Algorithm 1: HGE-ACN Method
Input: Sensor data frames {F1, F2, …, Fn}, Keypoint Detector D
Output: Predicted Sentence S
Step 1: Begin
   G ← ∅  # Initialize graph;  K ← ∅  # Initialize keypoints
Step 2: for F_i in {F_1, F_2, …, F_n} do
   K[i] ← D(F_i)  # Extract keypoints
   V ← K[i];  E ← SpatialEdges(V) ∪ TemporalEdges(V)
Step 3: for v_i, v_j ∈ V do
   w_ij ← exp(−‖x_i − x_j‖² / (2σ²))  (Equation (7))  # Gaussian edge weight
Step 4: for v_i ∈ V do
   h_i^(l+1) ← σ(Σ_{j∈N(i)} (1/c_ij) W^(l) h_j^(l))  (Equation (8a))
   α_ij ← exp(LeakyReLU(aᵀ[W h_i ‖ W h_j]))  (Equation (9))  # ST-GCN
Step 5: for t ∈ {1, 2, …, n} do
   K_t ← f_adapt(F_t)  (Equation (10))  # Adaptive kernel convolution
   F_fused ← Σ_{i=1}^{M} w_i F_i  (Equation (11))
Z ← Concatenate(G_emb, F_fused)
Y ← Dense(Z);  S ← Softmax(Y)
Output: S
End

3.5.3. Training Procedure for the Model

The HGE-ACN framework is trained with a procedure that aims to efficiently learn spatial–temporal patterns from wearable sensor data. The training process involves dataset preparation, preprocessing, graph construction, model initialization, feature extraction, graph convolution, adaptive convolution, loss function and optimization, batch processing, backpropagation, early stopping, and model evaluation. The input data come from a wearable glove with six-axis IMU sensors and five-finger curvature sensors. The raw sensor signals are normalized, synchronized, and transformed into structured feature vectors. A spatial–temporal graph is built, with vertices representing keypoints and edges representing spatial and temporal relationships. The Graph Convolutional Network (GCN) and Adaptive Convolutional Networks (ACNs) are initialized with random weights, and edge weights are computed using the Gaussian similarity function. The model’s classification output is obtained using softmax, cross-entropy loss is used for optimization, and the Adam optimizer is used with a learning rate of $10^{-3}$. The training process includes batch processing, backpropagation, early stopping, and model evaluation. The HGE-ACN model achieves high recognition accuracy, robustness, and real-time performance.
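The sketch below shows a generic training loop matching the procedure just described: cross-entropy loss, the Adam optimizer at a learning rate of $10^{-3}$, batch processing, backpropagation, and early stopping on validation loss. The `model`, the data loaders, and the patience value are placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=50, patience=5):
    """Generic training-loop sketch: cross-entropy loss, Adam (lr = 1e-3), early stopping."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_val, wait = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for features, labels in train_loader:            # batch processing
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()                               # backpropagation
            optimizer.step()

        model.eval()
        with torch.no_grad():                             # validation pass
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                           # early-stopping check
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return model
```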

3.6. Classification

The HGE-ACN framework integrates the processed features and predicts output at several levels, as shown in Figure 5 below. Features from multiple spatial and temporal dimensions are fused and classified to capture fine-grained and coarse-grained patterns. The feature vector encodes the spatial–temporal dynamics of the full input sensor data sequence. Classification modules with dense layers and softmax layers process the fused output feature vector at three levels: the Word Level (individual sign recognition), the Phrase Level (sign sequences), and the Sentence Level (complete sentence translation).
Figure 5 shows that the system delivers three-level hierarchical recognition outputs for varied use cases. Individual signs are examined by frame or short sequence to predict the most likely word for a motion; this level is ideal for language-learning and vocabulary-building apps. The phrase level recognizes small sequences of related signs in temporal order to produce meaningful phrases, which suits brief commands and augmentative and alternative communication (AAC). The sentence level generates fully structured sentences with word order changes and auxiliary phrases to match spoken and written language; real-time communication, accessibility, and sign language translation benefit most at this level. The three-level output structure makes the system flexible for education, accessibility, and real-world human–computer interaction, covering both simple and complicated sign language recognition tasks.
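A minimal sketch of dense + softmax heads producing the three-level outputs from the fused feature vector. The vocabulary sizes and the choice of independent heads over a shared feature vector are assumptions for illustration; the paper does not give these details.

```python
import torch
import torch.nn as nn

class ThreeLevelClassifier(nn.Module):
    """Dense + softmax heads for word-, phrase-, and sentence-level outputs (sizes are placeholders)."""
    def __init__(self, feat_dim=128, n_words=500, n_phrases=200, n_sentences=100):
        super().__init__()
        self.word_head = nn.Linear(feat_dim, n_words)
        self.phrase_head = nn.Linear(feat_dim, n_phrases)
        self.sentence_head = nn.Linear(feat_dim, n_sentences)

    def forward(self, fused: torch.Tensor):
        # fused: (batch, feat_dim) spatial-temporal feature vector
        return {level: torch.softmax(head(fused), dim=-1)
                for level, head in [("word", self.word_head),
                                    ("phrase", self.phrase_head),
                                    ("sentence", self.sentence_head)]}

clf = ThreeLevelClassifier()
out = clf(torch.randn(2, 128))
print({k: v.shape for k, v in out.items()})
```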

4. Results and Discussion

The proposed framework was tested on datasets featuring both vision-based inputs and traditional glove sensor data. Initial results indicate improved recognition accuracy and reduced inference time compared to baseline methods. Further analysis highlights the system’s ability to adapt to varied signing speeds and environmental conditions, making it a viable solution for real-world applications.
Performance Comparison Study:
HGE-ACN outperforms the Dynamic Dense Spatiotemporal Graph Convolutional Neural Network (DDSTGCNN) [18], the Graph-Based Multimodal Sequential Embedding Network (MSeqGraph) [17], and the Boundary-Adaptive Encoder (BAE) [23] in sign language sentence recognition. Compared to these approaches, HGE-ACN achieves higher recognition accuracy and lower error rates. Real-time sign language recognition systems can rely on its adaptive convolutional technique and sensor data fusion to handle changes in signing speed and style. This comparison shows that HGE-ACN can support inclusive communication systems for the deaf.

4.1. Accuracy

A method’s accuracy is the ratio of correctly predicted sign language sentences to total predictions. Overall, it validates the model.
$$\mathrm{Accuracy} = \frac{\text{Number of Correctly Recognized Sentences } (C)}{\text{Total Number of Sentences } (N)} \times 100 \tag{12}$$
In Equation (12), $C$ stands for the correctly predicted sentences, and $N$ denotes the total number of sentences in the test dataset.
Cost criterion: The cost criterion in a model is defined using the cross-entropy loss function, which quantifies the classification error between the predicted sign language sentence and the actual ground truth.
Accuracy is a performance metric, while the cost function is used for model optimization. The model’s accuracy is evaluated using Equation (12), which assesses the proportion of correctly predicted sentences. Training minimizes a cost function based on the categorical cross-entropy loss, which penalizes incorrect predictions more heavily when the model is highly confident about an incorrect class. Cross-entropy is suitable for classification tasks like sign language sentence recognition, as it improves sentence prediction accuracy and thereby directly impacts the performance measured by Equation (12). Softmax normalization ensures a probability distribution over classes, stabilizing training and preventing divergence, and gradient-based optimization (Adam or SGD) drives convergence towards a model that maximizes classification performance. In summary, the cost criterion used during training is the categorical cross-entropy loss function, while accuracy is a post-training evaluation metric; this distinction is made explicit to avoid ambiguity.
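The snippet below illustrates the distinction just described: sentence-level accuracy as in Equation (12) versus the categorical cross-entropy cost used during training. The tensor shapes and example values are arbitrary.

```python
import torch
import torch.nn.functional as F

def sentence_accuracy(pred_labels: torch.Tensor, true_labels: torch.Tensor) -> float:
    """Accuracy = C / N * 100, as in Equation (12)."""
    correct = (pred_labels == true_labels).sum().item()
    return 100.0 * correct / true_labels.numel()

def training_cost(logits: torch.Tensor, true_labels: torch.Tensor) -> torch.Tensor:
    """Cost criterion: categorical cross-entropy over softmax-normalized class scores."""
    return F.cross_entropy(logits, true_labels)

# Toy example: 4 sentences, 10 candidate sentence classes
logits = torch.randn(4, 10)
labels = torch.tensor([3, 1, 7, 0])
print(sentence_accuracy(logits.argmax(dim=1), labels), training_cost(logits, labels).item())
```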
Figure 6 indicates that the hybrid HGE-ACN model, which uses adaptive convolution and graph-based embeddings, recognizes sign language better than DDSTGCNN, MSeqGraph, and BAE, reaching 94.8% accuracy under normal settings. HGE-ACN remains accurate despite noise or compounded obstructions. This performance is due to its capture of spatiotemporal dynamics and its resilience to variations in sign language motion.

4.2. The Inference Time

Inference time is the model’s mean duration to analyze a single input (sensor data) and generate recognition outcomes. It is essential for assessing the appropriateness of real-time applications.
$$\mathrm{Inference\ Time} = \frac{\text{Total Processing Time for All Sensor Data } (T)}{\text{Total Number of Sensor Data Samples } (V)} \;\;\text{(milliseconds per sample)} \tag{13}$$
In Equation (13), inference time, the average processing time per sensor data sample, is a key measure of the model’s computational efficiency: the total processing time for all sensor data ($T$) is divided by the total number of sensor data samples ($V$), giving milliseconds per sample. Lower inference time means faster processing and better scalability for real-time applications.
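A small timing sketch for Equation (13): total wall-clock processing time divided by the number of samples, reported in milliseconds. The dummy model function and sample list are placeholders only.

```python
import time

def mean_inference_time_ms(model_fn, samples) -> float:
    """Inference Time = total processing time T / number of samples V (Equation (13)), in ms."""
    start = time.perf_counter()
    for sample in samples:
        model_fn(sample)                  # process one sensor-data sequence
    total_seconds = time.perf_counter() - start
    return 1000.0 * total_seconds / len(samples)

# Example with a dummy model function and 100 dummy samples
print(mean_inference_time_ms(lambda s: sum(s), [[0.1] * 1000 for _ in range(100)]))
```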
Figure 7 compares the inference times of the four models, HGE-ACN, DDSTGCNN, MSeqGraph, and BAE, under five scenarios. Under normal conditions, HGE-ACN is the quickest at 45 ms, whereas BAE takes 85 ms. Under variable signing speed, HGE-ACN remains the fastest at 48 ms, while the other models exhibit a significant rise in time. Under signer variability, all models slow down, with HGE-ACN at 50 ms and BAE at 95 ms. Environmental noise increases inference times further; for example, HGE-ACN takes 52 ms and BAE 98 ms. Under extreme conditions, HGE-ACN still performs best among the models, at 55 ms.
In Table 2, the computational complexity of HGE-ACN is analyzed based on the graph convolution and adaptive convolution operations. Graph convolution has a complexity of $O(NF^2L)$, where $N$ is the number of keypoints, $F$ is the feature dimension, and $L$ is the number of graph convolution layers. Adaptive convolution operates over the spatial dimensions with complexity $O(HWCk^2)$, where $H$ and $W$ are the frame dimensions, $C$ is the number of channels, and $k$ is the kernel size. The combined complexity ensures scalable and efficient inference.

4.3. Error Rate

The percentage of sentences that the model erroneously identified is called the error rate. Through the quantification of inaccurate predictions, it emphasizes the model’s dependability.
$$\mathrm{Error\ Rate} = \frac{\text{Number of Incorrectly Recognized Sentences } (E)}{\text{Total Number of Sentences } (N)} \times 100 \tag{14}$$
In Equation (14), $E$ is the number of incorrectly recognized sentences, and $N$ is the total number of test sentences.
In Figure 8, different conditions are used to compare the four models: HGE-ACN, DDSTGCNN, MSeqGraph, and BAE. With an error rate of 5.2% under typical circumstances, HGE-ACN outperforms BAE, which has the highest rate at 11.3%. As the difficulty level rises, so does the error rate of every model: under variable signing speed, BAE’s error rate is the highest at 15.8%, while HGE-ACN performs the best at 6.5%. Signer variability and environmental noise also affect performance, with error rates of 7.2% for HGE-ACN and 16.2% for BAE.

4.4. Levenshtein Distance

The Levenshtein Distance $d(i,j)$ measures the minimal number of single-character edits required to turn one sentence into another, making it a useful measure of similarity between the predicted and the actual sentence:
$$d(i,j) = \begin{cases} i & \text{if } j = 0, \\ j & \text{if } i = 0, \\ \min\!\big(d(i-1,j) + 1,\; d(i,j-1) + 1,\; d(i-1,j-1) + \mathrm{cost}(i,j)\big) & \text{otherwise} \end{cases} \tag{15}$$
Equation (15) calculates the Levenshtein Distance, the minimum number of edit operations needed to transform one string into another. Here, $d(i, j)$ denotes the Levenshtein Distance between the first $i$ characters of string A and the first $j$ characters of string B, and $\mathrm{cost}(i, j)$ reflects the cost of changing A[i] into B[j] (0 if the characters match, 1 otherwise). If one string is empty, the distance is the length of the other string; otherwise, it is the minimum over insertion, deletion, and substitution.
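The standard dynamic-programming implementation of the recurrence in Equation (15) is shown below; the example sentences are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b (Equation (15))."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # d(i, 0) = i
    for j in range(n + 1):
        d[0][j] = j                       # d(0, j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# Example: a predicted sentence vs. the ground truth
print(levenshtein("i want water", "i want wafer"))  # 1
```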
Figure 9 shows the Levenshtein Distance, also known as edit distance, used in error correction, DNA sequence alignment, and natural language processing. Differences between strings are determined by the fewest operations needed to transform them. Figure 9 indicates that the Levenshtein Distance compares HGE-ACN, DDSTGCNN, MSeqGraph, and BAE under different conditions. All models increase error rates when conditions are challenging, although HGE-ACN frequently maintains accuracy. This shows that HGE-ACN can endure noisy, dynamic environments.

4.5. Model Experimentation

To ensure robust generalization, the HGE-ACN model was tested on a structured sign language recognition dataset with many samples, various signers, and real-world variations. Refinements for dataset details and ablation studies are as follows:
The dataset includes 120 h of recorded sign language gestures from 50 different signers, covering 250,000 gesture sequences. The data were collected in indoor and outdoor settings under three lighting conditions and varying noise levels. The sensor configuration used a wearable glove-based system with six-axis IMU sensors and five-finger curvature sensors for precise motion and hand positioning. Preprocessing methods included min–max normalization, low-pass filtering, and temporal alignment techniques to improve signal quality before feature extraction. This dataset configuration ensures the model effectively generalizes across different user styles, environmental settings, and signing speeds.
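As a small illustration of the preprocessing described above, the sketch below applies a moving-average low-pass filter and resamples one channel onto a fixed-rate time grid for temporal alignment. The filter type, window size, and 50 Hz target rate are assumptions; the paper does not specify them.

```python
import numpy as np

def moving_average_lowpass(signal: np.ndarray, window: int = 5) -> np.ndarray:
    """Simple moving-average low-pass filter to suppress high-frequency sensor noise."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def align_to_rate(timestamps: np.ndarray, values: np.ndarray, rate_hz: float) -> np.ndarray:
    """Resample an unevenly sampled channel onto a fixed-rate time grid (temporal alignment)."""
    grid = np.arange(timestamps[0], timestamps[-1], 1.0 / rate_hz)
    return np.interp(grid, timestamps, values)

# Example: an IMU channel with jittered timestamps, filtered and aligned to 50 Hz
t = np.sort(np.random.uniform(0.0, 4.0, 180))
raw = np.sin(2 * np.pi * t) + 0.3 * np.random.randn(t.size)
aligned = align_to_rate(t, moving_average_lowpass(raw, window=7), rate_hz=50.0)
print(aligned.shape)
```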

Ablation Study: Validating Module Contributions

To validate the contribution of each HGE-ACN module, ablation studies were conducted by removing or modifying individual components and analyzing the impact on accuracy and inference time in Table 3.

4.6. Model Comparison

The HGE-ACN model was compared to advanced sign language recognition methods, ensuring fair evaluation based on accuracy, inference time, complexity, and standardized parameter settings for consistency.

4.6.1. Comparison with Recent Methods

To validate the effectiveness of HGE-ACN, it was compared against four recent deep-learning-based models used for sign language recognition, as shown in Table 4:

4.6.2. Parameter Settings for Benchmark Models

All models were trained with identical dataset configurations and preprocessing techniques for a fair comparison. The hyperparameter settings used for each model are summarized in Table 5 below:
The HGE-ACN model was tested in real-world scenarios to evaluate its performance. The model showed high accuracy (above 87%) even under challenging conditions, demonstrating robust generalization. Inference time increased slightly in real-world scenarios but remained efficient for real-time applications. Error rates were highest in low-light and high-noise conditions, indicating sensitivity to external environmental factors. The model performed consistently across signers with different hand sizes, proving adaptability to user variability. However, the model has limitations, including sensitivity to low-light conditions, computational complexity in large-scale deployment, and a limited dataset for extreme conditions. Future improvements include integrating adaptive filtering and feature enhancement techniques, optimizing model architecture with quantization and pruning techniques, and expanding the dataset with additional real-world variations.

5. Conclusions and Future Work

The Hybrid Graph Embedding and Adaptive Convolutional Networks (HGE-ACN) are a game-changer in sign language sentence recognition (SLSR). This innovative framework combines graph-based embeddings with Adaptive Convolutional Networks to address the challenges of decoding complex sign language communication. Among the significant achievements are the following: constructing a robust SLSR system using a wearable glove with six-axis inertial and five-finger curvature sensors; demonstrating that hybrid modeling techniques can improve recognition accuracy; and lastly, developing a versatile and adaptive recognition framework capable of surpassing the shortcomings of existing deep learning models. There are significant implications for accessibility, communication technology, and inclusive design, as stated in the paper. The proposed method has the potential to improve communication tools for the deaf population by improving machine understanding of sign language, which in turn can change platforms for social and professional, as well as educational, interactions. The next stages should center on further studies and finding practical uses for the HGE-ACN technique. Some of these areas of focus include investigating interactions between humans and computers in the context of adaptive communication, gesture-based interface design, rehabilitation efforts, and therapy technologies; incorporating advanced machine learning techniques; optimizing deployment for real-world use; building context-aware sign language interpretation systems; and expanding the current framework to support recognition across different sign languages. Building on HGE-ACN’s work in these areas, further research can improve sign language recognition algorithms to make them more accessible and valuable for a wider range of users.

Author Contributions

Conceptualization, Y.L. and C.J.; methodology, Y.L. and C.J.; software, Y.L.; validation, Y.L. and P.C.; resources, P.C.; writing—original draft preparation, Y.L. and P.C.; writing—review and editing, Y.L. and C.J.; formal analysis, Y.L.; supervision, C.J.; project administration, P.C.; funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Srinakharinwirot University Research Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors wish to gratefully acknowledge financial support for this research from the Srinakharinwirot University Research fund.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shahin, N.; Ismail, L. From rule-based models to deep learning transformers architectures for natural language processing and sign language translation systems: Survey, taxonomy and performance evaluation. Artif. Intell. Rev. 2024, 57, 271. [Google Scholar] [CrossRef]
  2. Nasabeh, S.S.; Meliá, S. Enhancing quality of life for the hearing-impaired: A holistic approach through the MoSIoT framework. Univers. Access Inf. Soc. 2024, 24, 1–23. [Google Scholar] [CrossRef]
  3. Buttar, A.M.; Ahmad, U.; Gumaei, A.H.; Assiri, A.; Akbar, M.A.; Alkhamees, B.F. Deep Learning in Sign Language Recognition: A Hybrid Approach for the Recognition of Static and Dynamic Signs. Mathematics 2023, 11, 3729. [Google Scholar] [CrossRef]
  4. Al-Qurishi, M.; Khalid, T.; Souissi, R. Deep Learning for Sign Language Recognition: Current Techniques, Benchmarks, and Open Issues. IEEE Access 2021, 9, 126917–126951. [Google Scholar] [CrossRef]
  5. Papatsimouli, M.; Sarigiannidis, P.; Fragulis, G.F. A Survey of Advancements in Real-Time Sign Language Translators: Integration with IoT Technology. Technologies 2023, 11, 83. [Google Scholar] [CrossRef]
  6. Jiang, X.; Zhang, Y.; Lei, J.; Zhang, Y. A Survey on Chinese Sign Language Recognition: From Traditional Methods to Artificial Intelligence. CMES Comput. Model. Eng. Sci. 2024, 140, 1–40. [Google Scholar] [CrossRef]
  7. Tao, T.; Zhao, Y.; Liu, T.; Zhu, J. Sign Language Recognition: A Comprehensive Review of Traditional and Deep Learning Approaches, Datasets, and Challenges. IEEE Access 2024, 12, 75034–75060. [Google Scholar] [CrossRef]
  8. Shakeel, P.M.; Burhanuddin, M.A.; Desa, M.I. Automatic lung cancer detection from CT image using improved deep neural network and ensemble classifier. Neural Comput. Appl. 2020, 34, 9579–9592. [Google Scholar] [CrossRef]
  9. Murali, R.S.L.; Ramayya, L.D.; Santosh, V.A. Sign language recognition system using convolutional neural network and computer vision. Int. J. Eng. Innov. Adv. Technol. 2022, 4, 138–141. [Google Scholar]
  10. Kaur, B.; Chaudhary, A.; Bano, S.; Yashmita, S.R.N.; Reddy, S.; Anand, R. Fostering inclusivity through effective communication: Real-time sign language to speech conversion system for the deaf and hard-of-hearing community. Multimed. Tools Appl. 2023, 83, 45859–45880. [Google Scholar] [CrossRef]
  11. Zhu, W. Quiet Interaction: Designing an Accessible Home Environment for Deaf and Hard of Hearing (DHH) Individuals Through AR, AI, and IoT Technologies. Doctoral Dissertation, OCAD University, Toronto, ON, Canada, 2024. [Google Scholar]
  12. Miah, A.S.M.; Hasan, A.M.; Jang, S.-W.; Lee, H.-S.; Shin, J. Multi-Stream General and Graph-Based Deep Neural Networks for Skeleton-Based Sign Language Recognition. Electronics 2023, 12, 2841. [Google Scholar] [CrossRef]
  13. Naz, N.; Sajid, H.; Ali, S.; Hasan, O.; Ehsan, M.K. MIPA-ResGCN: A multi-input part attention enhanced residual graph convolutional framework for sign language recognition. Comput. Electr. Eng. 2023, 112, 109009. [Google Scholar] [CrossRef]
  14. Liang, Y.; Jettanasen, C.; Chiradeja, P. Progression Learning Convolution Neural Model-Based Sign Language Recognition Using Wearable Glove Devices. Computation 2024, 12, 72. [Google Scholar] [CrossRef]
  15. Liang, Y.; Jettanasen, C. Development of Sensor Data Fusion and Optimized Elman Neural Model-based Sign Language Recognition System. J. Internet Technol. 2024, 25, 671–681. [Google Scholar] [CrossRef]
  16. Venugopalan, A.; Reghunadhan, R. Applying Hybrid Deep Neural Network for the Recognition of Sign Language Words Used by the Deaf COVID-19 Patients. Arab. J. Sci. Eng. 2022, 48, 1349–1362. [Google Scholar] [CrossRef]
  17. Tang, S.; Guo, D.; Hong, R.; Wang, M. Graph-Based Multimodal Sequential Embedding for Sign Language Translation. IEEE Trans. Multimed. 2021, 24, 4433–4445. [Google Scholar] [CrossRef]
  18. Muthusamy, P.; Murugan, G.P. Recognition of Indian Continuous Sign Language Using Spatio-Temporal Hybrid Cue Network. Int. J. Intell. Eng. Syst. 2023, 16, 874. [Google Scholar]
  19. Kan, J.; Hu, K.; Hagenbuchner, M.; Tsoi, A.C.; Bennamoun, M.; Wang, Z. Sign language translation with hierarchical spatio-temporal graph neural network. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3367–3376. [Google Scholar]
  20. Xu, X.; Meng, K.; Chen, C.; Lu, L. Isolated Word Sign Language Recognition Based on Improved SKResNet-TCN Network. J. Sens. 2023, 2023, 9503961. [Google Scholar] [CrossRef]
  21. Gupta, A.; Sawan, A.; Singh, S.; Kumari, S. Dynamic Sign Language Recognition with Hybrid CNN-LSTM and 1D Convolutional Layers. In Proceedings of the 2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 14–15 March 2024; pp. 1–6. [Google Scholar]
  22. Noor, T.H.; Noor, A.; Alharbi, A.F.; Faisal, A.; Alrashidi, R.; Alsaedi, A.S.; Alharbi, G.; Alsanoosy, T.; Alsaeedi, A. Real-Time Arabic Sign Language Recognition Using a Hybrid Deep Learning Model. Sensors 2024, 24, 3683. [Google Scholar] [CrossRef]
  23. Huang, S.; Ye, Z. Boundary-Adaptive Encoder With Attention Method for Chinese Sign Language Recognition. IEEE Access 2021, 9, 70948–70960. [Google Scholar] [CrossRef]
  24. Kumar, E.K.; Kishore, P.; Kumar, D.A.; Kumar, M.T.K. Early estimation model for 3D-discrete indian sign language recognition using graph matching. J. King Saud Univ. Comput. Inf. Sci. 2021, 33, 852–864. [Google Scholar] [CrossRef]
  25. Rajalakshmi, E.; Elakkiya, R.; Subramaniyaswamy, V.; Alexey, L.P.; Mikhail, G.; Bakaev, M.; Kotecha, K.; Gabralla, L.A.; Abraham, A. Multi-Semantic Discriminative Feature Learning for Sign Gesture Recognition Using Hybrid Deep Neural Architecture. IEEE Access 2023, 11, 2226–2238. [Google Scholar] [CrossRef]
Figure 1. Block diagram of HGE-ACN framework.
Figure 2. (a) Feature extraction module. (b) WULALA data glove specifications. (c) A sign language gesture example.
Figure 3. Graph Construction Module.
Figure 4. Hybrid Graph Embedding Module.
Figure 5. Classification of three-level output.
Figure 6. Accuracy comparison.
Figure 7. Inference time.
Figure 8. Error Rate.
Figure 9. Levenshtein Distance.
Table 1. Hyperparameter configuration of HGE-ACN.
Hyperparameter | Notation | Value(s) | Description
Graph Construction Threshold | τ | 0.5–0.7 | Defines connectivity in the spatial–temporal graph.
Graph Embedding Dimensionality | d | 64, 128 | Determines feature embedding size in the latent space.
Adaptive Convolution Kernel Size | k | 3 × 3, 5 × 5 | Controls receptive field in feature extraction.
Normalization Factor in Edge Weight Calculation | σ | 0.1–1.0 | Regulates sensitivity of graph edge weights.
Attention Weight Factor | α | 0.1–0.9 | Adjusts attention-based feature importance.
Dropout Rate | p | 0.3–0.5 | Prevents overfitting in graph and convolution layers.
Learning Rate | η | 0.001–0.005 | Determines optimization step size for convergence.
Batch Size | B | 32, 64 | Number of samples processed per training step.
Number of GCN Layers | L | 2, 3 | Defines depth of graph convolutional layers.
Number of Attention Heads | h | 4, 8 | Controls multi-head self-attention operations.
Weight Decay (L2 Regularization) | λ | 1 × 10−4, 1 × 10−5 | Reduces overfitting by penalizing large weights.
Table 2. Computational complexity of HGE-ACN model.
Component | Operation | Computational Complexity
Graph Convolution | Node feature aggregation and transformation | $O(NF^2L)$
Adaptive Convolution | Feature extraction over spatial dimensions | $O(HWCk^2)$
Total Complexity | Combined graph and convolutional computations | $O(NF^2L + HWCk^2LC)$
Table 3. Ablation study of HGE-ACN components.
Experiment | Description | Accuracy (%) | Inference Time (ms)
Full HGE-ACN Model | Baseline model with all modules enabled. | 94.8% | 45 ms
Without Graph Embedding (HGE Removed) | Only adaptive convolution is used for feature extraction. | 88.3% | 39 ms
Without Adaptive Convolution (ACN Removed) | Only graph embeddings are used without adaptive convolution. | 85.7% | 42 ms
Without Multi-Scale Sensor Fusion | Raw sensor inputs are used without fusion techniques. | 81.5% | 38 ms
Without Attention Mechanism | Temporal attention is disabled in the graph network. | 86.1% | 44 ms
Table 4. Comparison with recent methods.
Model | Methodology | Accuracy (%) | Inference Time (ms) | Number of Parameters (Million)
HGE-ACN (Proposed) | Hybrid Graph Embedding + Adaptive Convolution | 94.8% | 45 ms | 12.5 M
DDSTGCNN [18] | Dynamic Dense Spatiotemporal Graph CNN | 89.2% | 50 ms | 14.3 M
MSeqGraph [17] | Multimodal Sequential Graph Embedding | 91.1% | 56 ms | 15.2 M
BAE [23] | Boundary-Adaptive Encoder for Sign Language | 87.4% | 62 ms | 11.8 M
SKResNet-TCN [20] | Hybrid ResNet + Temporal Convolution | 92.3% | 48 ms | 13.1 M
Table 5. Parameter settings for benchmark models.
Model | Learning Rate | Batch Size | Number of Layers | Dropout Rate | Optimizer
HGE-ACN (Proposed) | 0.001 | 64 | 3 GCN + 2 ACN | 0.4 | AdamW
DDSTGCNN | 0.0015 | 64 | 4 GCN | 0.5 | Adam
MSeqGraph | 0.002 | 32 | 2 GCN + 1 LSTM | 0.3 | RMSprop
BAE | 0.001 | 32 | 2 LSTM | 0.3 | Adam
SKResNet-TCN | 0.0008 | 64 | 3 ResNet + 2 TCN | 0.4 | Adam
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
