Article

RS-Xception: A Lightweight Network for Facial Expression Recognition

1 School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330000, China
2 Jiangxi Modern Polytechnic College, Nanchang 330000, China
3 Information Engineering College, Hebei University of Architecture, Zhangjiakou 075000, China
4 Big Data Technology Innovation Center of Zhangjiakou, Zhangjiakou 075000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3217; https://doi.org/10.3390/electronics13163217
Submission received: 11 July 2024 / Revised: 5 August 2024 / Accepted: 13 August 2024 / Published: 14 August 2024
(This article belongs to the Special Issue Advances in Computer Vision and Deep Learning and Its Applications)

Abstract
Facial expression recognition (FER) utilizes artificial intelligence for the detection and analysis of human faces, with significant applications across various scenarios. Our objective is to deploy the facial emotion recognition network on mobile devices and extend its application to diverse areas, including classroom effect monitoring, human–computer interaction, specialized training for athletes (such as in figure skating and rhythmic gymnastics), and actor emotion training. Recent studies have employed advanced deep learning models to address this task, though these models often encounter challenges such as subpar performance and an excessive number of parameters that do not align with the requirements of FER on embedded devices. To tackle this issue, we have devised a lightweight network structure named RS-Xception, which is straightforward yet highly effective. Drawing on the strengths of ResNet and SENet, this network integrates elements of the Xception architecture. Our model was trained on the FER2013 dataset and demonstrates superior efficiency compared to conventional network models. Furthermore, we assessed the model's performance on the CK+, FER2013, and Bigfer2013 datasets, achieving accuracy rates of 97.13%, 69.02%, and 72.06%, respectively. Evaluation on the more complex RAF-DB dataset yielded an accuracy of 82.98%. The incorporation of transfer learning notably enhanced the model's accuracy, reaching 75.38% on the Bigfer2013 dataset and underscoring its significance in our research. In conclusion, our proposed model proves to be a viable solution for precise sentiment detection and estimation. In the future, our lightweight model may be deployed on embedded devices for research purposes.

1. Introduction

Facial expression recognition is a multifaceted cognitive process that requires the integration of visual and auditory stimuli, prior knowledge, and social context. It plays a crucial role in social interactions, emotional understanding, and empathy. This technology is extensively utilized in human–computer interaction, virtual assistants, and the diagnosis and treatment of mental health conditions. Therefore, the development of precise and effective facial emotion recognition models is essential. These models not only enhance the functionality of various real-world applications but also have significant implications across multiple research domains. Currently, numerous researchers have conducted extensive studies on facial emotion recognition, proposing a variety of models. While existing research has yielded significant improvements in overall recognition accuracy, the number of parameters and computational demands of these models are often too large for deployment on mobile devices. We intend to propose an advanced model that integrates the current state-of-the-art attention mechanism to enhance overall recognition performance while maintaining a minimal number of parameters, thereby facilitating its application to future mobile devices.
Previous research in the field of facial expression recognition (FER) has primarily focused on extracting artificial or superficial facial characteristics [1,2,3,4]. Shallow networks have been shown to be highly effective in various tasks, with Nassif A B et al. [5] enhancing facial expression classification accuracy through the use of skip connections. Attention mechanisms have also been successfully integrated into these networks, emphasizing emotionally salient regions for improved recognition [6]. Some studies [7,8,9,10,11,12,13,14] have aimed to enhance detection accuracy and effectiveness by employing CNNs or machine learning algorithms, leveraging deep learning techniques to automatically extract facial features and optimize models with training data to enhance the capabilities of FER [15]. Furthermore, other studies [16,17,18,19,20,21,22] have combined deep feature learning methods with traditional manual feature learning techniques, such as fusing multimodal and multi-temporal features through deep learning and manual methods to generate feature maps [19,20]. Malika et al. [23] proposed an intelligent framework to reduce the dimensionality of facial images and optimize classifier parameters for more accurate emotion recognition. Tanoy Debnath suggested a fusion of features extracted from facial expression images using a local binary pattern (LBP) to enable rapid convergence of the classification model [24].
With the growing demand for data storage and processing in large-scale CNN networks, researchers have proposed lightweight CNNs as a solution for face emotion recognition [25]. In the realm of expression recognition, Helaly et al. [26] developed a comprehensive framework to identify six primary emotions. It has been observed that convolutional neural networks tend to focus most of the extracted features on the central region of the face (nose, mouth, eyes), potentially leading to recognition errors if the feature extraction is biased towards the side of the face [27]. The integration of transfer learning [28,29] has marked a significant advancement in facial expression recognition (FER), enabling the utilization of a single type of data and function without constraints, thus showcasing the generalization capability of artificial intelligence. Presently, numerous researchers are adopting pre-trained models for face emotion recognition, following transfer learning fine-tuning [9]. This strategy diminishes the reliance on the machine’s memory and processor. With the escalating complexity of models and computational requirements, there is a pressing need to enhance the current lightweight face recognition models in terms of FLOPs, parameters, and model size. Seng Chun Hoo et al. [30] introduced an enhanced ConvNeXt (ECN) module within ConvFaceNeXt, which notably reduces FLOP counts while maintaining high accuracy. Taking cues from FaceNet, Zong-Yue Deng et al. [31] devised a deep learning model with a memory size of only 3.5 M, achieving remarkable accuracy in real-time scenarios. Factors like changes in posture, age, and variations in lighting conditions can all influence the efficacy of face recognition. Xie S. et al. [32] developed a framework consisting of two independent branches for processing facial and expression information. Utilizing adversarial learning, the TDGAN network effectively separates other facial attributes from each expression image and subsequently transfers the expression to a specified face. However, the absence of mutual integration and compensation negatively affects the recognition accuracy of the network. Chenqi K. et al. [33] proposed a method that combines semantic and noise levels to infer human editing in images by analyzing visual features and noise. This method not only detects tampering in facial images but also pinpoints the specific area of manipulation, aiding in image authenticity and integrity verification. On the other hand, Hardjadinata H. et al. [34] utilized Xception and DenseNet deep learning architectures to enhance accuracy and efficiency in facial expression recognition systems. Xception’s deep separable convolution efficiently captures spatial dependencies, making it suitable for facial feature extraction and recognition. DenseNet’s densely connected patterns between layers promote feature reuse and gradient flow, potentially enhancing the model’s ability to capture facial expression details. Xunru L. et al. [35] proposed a lightweight and high-precision improved MobileNetV3 network for facial expression recognition, but due to its large flops and model size, it has a certain impact on the storage and model calculation process of small mobile devices. Our lightweight network effectively improves the accuracy of facial expression recognition across diverse datasets by reducing parameters.
Most current research focuses on improving model accuracy, often neglecting the computational cost and model size, which can impose significant burdens on computing systems. The model we propose integrates existing deep separable convolution with a custom attention layer, building on prior research. This network model enhances accuracy while simultaneously reducing model size. Our approach offers novel insights and advancements for future model enhancements. It not only retains the benefits of classic models but also explores feature enhancement and fusion, demonstrating considerable application potential and research value.
In this study, a lightweight network named ‘RS-Xception’ is proposed, consisting of a total of 1.92 M parameters, positioning it as a valuable network architecture in the domain of facial expression recognition. The key contributions of this research are outlined as follows:
  • Development of a lightweight model: The model integrates deep separable convolution and the SE module, which leads to a reduced number of parameters and computational load, making it suitable for resource-constrained environments while maintaining high performance.
  • Model adaptability and scalability: RS-Xception demonstrates strong performance across three standard datasets and exhibits adaptability and generalization capabilities across a more complex dataset (RAF-DB).
  • Technical validation: Transfer learning is employed to compare the model with other architectures on the same dataset, showcasing its superior performance. Furthermore, transfer learning is leveraged to enhance the accuracy of the model, highlighting its potential to enhance generalization capabilities.

2. Materials and Methods

For the FER, we have designed a lightweight model with a simple architecture but excellent practical effect. The Squeeze and Excitation (SE) module enhances the attention mechanism of the model by adaptively reweighting channel features to improve focus on facial expression details. This adaptive feature enhancement boosts important features for better accuracy in facial expression recognition, while suppressing irrelevant features to reduce computational complexity and overfitting risks. The residual connection enhances the stability and efficiency of training deep networks by providing a shortcut path, alleviating gradient vanishing issues in facial expression recognition. Additionally, the residual connection facilitates feature transmission across different levels, enabling the model to capture more expression details effectively. The modular network architecture allows flexible adjustments and extensions for various facial expression recognition tasks, controlling model complexity by adding or removing modules to handle tasks ranging from simple expression recognition to a complex sentiment analysis. The global average pooling layer reduces parameters in the fully connected layer, preventing overfitting and enhancing generalization to unseen data.
The main structure of Xception consists of a residual convolutional network and deep separable convolution, which replaces the traditional convolution method with deep separability. This network, inspired by the Xception network’s residual convolution network combined with deep separable convolution, utilizes fewer parameters and computational resources, making it more efficient for feature extraction and classification in facial tasks, particularly in resource-limited environments. With fewer parameters, the network can be trained faster than the original Xception model, which is beneficial for tasks like facial recognition that involve large datasets or require quick iteration. Despite maintaining high classification performance, the network may exhibit better generalization capabilities on small-scale facial datasets, as it is easier to train and fine-tune with limited data. Additionally, the incorporation of SE blocks in the network enhances the recalibration of channel feature responses, suppresses unnecessary noise, accelerates facial feature detection and localization, improves feature learning, and aids in the refined learning and classification of facial features. Thanks to its modular design and minimal parameters, the proposed network offers high flexibility and scalability when adapting to various datasets and tasks, allowing for easier customization and optimization.
The model utilizes convolutional layers, separable convolutional layers, and SE blocks to extract useful features from the input image. It autonomously learns from basic features like edges and textures to more complex features such as shapes and object patterns. Through multiple convolutions and activation functions, the model refines the feature map to incorporate advanced semantic information. Each layer’s feature representation builds upon the previous layer, forming a hierarchical structure that effectively captures intricate patterns and relationships.

2.1. Depthwise Separable Convolution

We use Depthwise Separable Convolutions (DSCs) to reduce the number of parameters. A standard convolution applies N kernels of size D × D × M, producing an output feature map with N channels. Depthwise convolution [36] is much simpler: each convolutional kernel has a single channel and processes one feature map in the depth direction. A DSC consists of two layers, a depthwise convolution and a pointwise convolution, which separate spatial cross-correlation from channel cross-correlation. In Figure 1, a D × D filter is applied to each of the M input channels, followed by N convolutional filters (1 × 1 × M) that combine the M input channels into N output channels. The 1 × 1 × M pointwise filters are applied independently of spatial position within each channel. Compared with a standard convolution, a depthwise separable convolution reduces the computation to roughly $1/N + 1/D^2$ of the original cost.
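To make the parameter saving concrete, the following is a minimal sketch, assuming a tf.keras setup similar to the framework used in our experiments, that contrasts a standard convolution with a depthwise separable convolution; the layer sizes are illustrative and not taken from the final architecture.

```python
# Minimal sketch (not the exact training code): comparing a standard convolution
# with a depthwise separable convolution in tf.keras to illustrate the
# parameter reduction discussed above. Layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(48, 48, 64))          # M = 64 input channels

# Standard convolution: D x D x M x N weights (plus biases).
standard = layers.Conv2D(128, kernel_size=3, padding="same")(inputs)

# Depthwise separable convolution: one D x D depthwise filter per input channel,
# followed by a 1 x 1 x M x N pointwise convolution.
separable = layers.SeparableConv2D(128, kernel_size=3, padding="same")(inputs)

std_params = tf.keras.Model(inputs, standard).count_params()
sep_params = tf.keras.Model(inputs, separable).count_params()
print(std_params, sep_params)   # the separable layer needs roughly 1/N + 1/D^2 of the weights
```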

2.2. SE-ResNet

SENet is an image recognition network model that was proposed in 2017. The network aims to improve classification accuracy by enhancing key features by comparing the correlation between feature channels. There are three main actions involved in SENet [37].
The global feature information is extracted from the previous convolutional layer through a squeeze operation, namely global average pooling on the feature map. The result is a descriptor $Z$ with dimensions of 1 × 1 × C, where each element $Z_c$ is calculated using Equation (1).
$Z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)$
The global average pooling operation, denoted as $F_{sq}$ in Equation (1), pools each channel's feature map into a descriptor of size 1 × 1 × C, where C represents the number of channels, thereby collapsing the spatial dimensions. It is followed by two fully connected layers. The first layer decreases the number of channels from C to C/r, where r is the compression ratio; this 'compression layer' reduces computational requirements by lowering feature dimensionality and overall model complexity.
The excitation action is defined as in Equation (2).
$S_c = F_{ex}(Z, W) = \sigma(W_2 \, \delta(W_1 Z))$
The sigmoid activation function is denoted as $\sigma$, and the Rectified Linear Unit (ReLU) activation is represented by $\delta$. The weights $W_1$ and $W_2$ are employed to reduce and then restore the channel dimensionality, respectively.
The scale operation multiplies the feature tensor by the excitation, which captures the significance of every channel through comprehensive feature learning. The learned weights are then used to rescale the corresponding channels, distinguishing the primary and secondary details of the feature map. This recalibration, as given in Equation (3), produces the final output of the block.
$X_c = F_{scale}(u_c, S_c) = u_c \cdot S_c$
The SENet model effectively captures and utilizes the global feature information of the image while reducing the computational burden. This structure is particularly beneficial when dealing with large images or when there is a need for extensive computational resources, as it can significantly simplify the model’s complexity and computational requirements.
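For concreteness, the following is a minimal Keras sketch of an SE block implementing Equations (1)–(3); the compression ratio r = 16 and the layer choices are assumptions for illustration rather than the exact configuration used in RS-Xception.

```python
# A minimal Keras sketch of the Squeeze-and-Excitation block described by
# Equations (1)-(3); the compression ratio r and layer choices are assumptions,
# not the paper's released code.
import tensorflow as tf
from tensorflow.keras import layers

def se_block(u, r=16):
    """Squeeze (global average pooling), excite (two FC layers), then scale."""
    c = u.shape[-1]                                   # number of channels C
    z = layers.GlobalAveragePooling2D()(u)            # Eq. (1): squeeze to 1 x 1 x C
    s = layers.Dense(c // r, activation="relu")(z)    # Eq. (2): W1 followed by ReLU (delta)
    s = layers.Dense(c, activation="sigmoid")(s)      # Eq. (2): W2 followed by sigmoid (sigma)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([u, s])                  # Eq. (3): channel-wise rescaling
```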
The problem of gradient vanishing in deep networks is effectively addressed by ResNet [38], proposed by He et al., which introduces residual blocks into traditional CNNs. The ResNet architecture contains several residual blocks. The principle of the residual block is to feed the output of earlier layers directly into later layers through a skip connection. This structure allows the network to learn the residual between the input and output, which can enhance the performance of the model. Various studies have validated the effectiveness of residual blocks in alleviating the vanishing gradient problem in deep networks, and multiple architectures have consequently adopted them.
SE-ResNet, a network framework proposed in this study, combines the established SENet and ResNet architectures: SE blocks from SENet are incorporated into ResNet, as depicted in Figure 2. The role of the SE block is to enhance the informative channels and suppress the less useful ones. By merging the feature information of the preceding convolutional layer with the subsequent one through residual blocks, this approach effectively tackles the accuracy decline caused by feature degradation and gradient vanishing, which typically arises as the number of network layers increases.

2.3. RS-Xception

The RS-Xception is a newly designed convolutional neural network specifically tailored for image classification tasks, as depicted in Figure 3. Table 1 presents a comparison between different deep learning models and ours. The metrics considered for comparison are the number of parameters, depth (number of layers), floating-point operations (FLOPs), and inference time on the CPU. Our model outperforms others in terms of efficiency, with fewer parameters, lower depth, fewer FLOPs, and faster inference times. This makes our model a compelling option for scenarios with constrained computing resources. Despite having a lower parameter count, RS-Xception displays an exceptional level of accuracy, rendering it suitable for lightweight tasks. The first convolutional layer uses a 7 × 7 kernel to initiate the convolution process, followed by batch normalization and the ReLU activation function. Additional feature extraction is then conducted using a 3 × 3 convolution kernel.
Module 1, the initial module, is comprised of two separable layers for convolution. It also includes a 1 × 1 convolutional layer that adjusts the channel count and a residual connection that contains Squeeze and Excitation (SE) blocks. Batch normalization is applied, as well as ReLU activation functions. Maximum pooling is employed to reduce spatial dimensions. Modules 2 to 5 are exact replicas of Module 1, but with an increasing number of filters for the convolutional layer in each module. Residual connections and SE blocks are present in all modules for the purpose of transferring information and recalibrating channels. The final classification layer consists of a 3 × 3 convolutional layer, responsible for generating the ultimate class probabilities. Global average pooling (GAP) is utilized to reduce the spatial dimension to a single dimension. The softmax activation function is then applied to produce the final classification probability. The SE block is employed to capture channel interconnectedness and recalibrate different channels. The RS-Xception architecture takes inspiration from the Xception model but is adapted to be more compact and suitable for resource-constrained environments. The model utilizes the depth-separable convolution of Xception with fewer layers and a smaller convolutional kernel size. In a resource-constrained environment, residual joining, SE blocks, and global average pooling are combined to reduce the number of parameters, computational requirements, and risk of overfitting. By combining separable convolutions, residual connections, and SE blocks, RS-Xception achieves commendable performance in lightweight models. In the last output layer, the softmax activation function is used to detect seven emotions.
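As a concrete illustration of one such module, the sketch below assembles two separable convolutions, an SE block, max pooling, and a 1 × 1 residual projection in Keras, reusing the se_block sketch from Section 2.2; the exact strides, ordering, and filter counts are assumptions inferred from the description above rather than the released implementation.

```python
# A hedged sketch of one RS-Xception module: two separable convolutions with
# batch normalization and ReLU, SE recalibration of the module output, max
# pooling for downsampling, and a strided 1 x 1 convolution on the residual
# branch so that the two paths can be added. Details are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def rs_xception_module(x, filters, r=16):
    residual = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    residual = layers.BatchNormalization()(residual)

    y = layers.SeparableConv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = se_block(y, r)                      # SE block as sketched in Section 2.2
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
    return layers.Add()([y, residual])
```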
The proposed model follows the VGG model's design of 3 × 3 convolutional layers [40]. The residual block consists of two convolutional layers with the same number of output channels, each followed by a batch normalization layer and a ReLU activation function. The input is then added to the output of the residual block before the final ReLU activation, bypassing the intermediate convolutions. Residual connections are incorporated after each depthwise separable convolution module to facilitate the efficient flow of information and gradients in deep networks. Additionally, an SE block is integrated into the depthwise separable convolution, allowing for channel recalibration at the output of each module. This recalibration enables the model to prioritize important channels by assigning them higher weights, thereby focusing on the features crucial for the final task. By supporting the effective training of deep networks, the residual connection aids in capturing complex features and improving model performance without introducing extra computational complexity.
To keep the output shape of the convolutional layers consistent with the input, an extra 1 × 1 convolution is employed to adjust the channels. Each convolutional layer is followed by a batch normalization layer. Because the max pooling layer uses a stride of 2, the convolutional layers themselves do not need to reduce the width and height of the feature map. In each consecutive module, the number of channels doubles compared to the previous module, while the height and width are halved. The grayscale image in the structural flow chart (Figure 3) represents the feature map extracted by the model from the input image. The model extracts key facial features from the original image, which are then fed into the attention module; this module emphasizes the critical information within the feature map before passing it to the next step. To improve training efficiency, the categorical cross-entropy loss function (Equation (4)) is used. It is derived from cross-entropy, a metric that quantifies the disparity between two probability distributions; in classification tasks, it compares the model's predicted probability distribution for each class against the actual target distribution. A lower predicted probability for the correct class incurs a higher penalty, encouraging the model to increase the likelihood of correct classification. When paired with the softmax activation function at the output layer, categorical cross-entropy enables the model not only to identify the most probable classes but also to express the confidence of each prediction as a probability, furnishing additional information for subsequent decision-making.
$Loss = -\sum_{i=1}^{\text{output size}} y_i \cdot \log \hat{y}_i$
In Equation (4), $y_i$ can only be 0 or 1. When $y_i$ is 0, the corresponding term vanishes, so only the term for which $y_i$ is 1 contributes to the loss. In essence, the categorical cross-entropy concentrates on a single outcome, making it suitable for use with softmax in single-label classification tasks.
In this study, we assessed the efficiency of RS-Xception on three datasets: CK+, FER2013, and Bigfer2013. First, we describe the datasets and experimental conditions. Next, we employ several existing methods to evaluate the performance of the proposed model on the test sets, confirming its effectiveness. The performance evaluation indicators include the four metrics given in Equations (5)–(8), the ROC curve, and the confusion matrix. Finally, we incorporate transfer learning into our approach and observe an improvement in the model's accuracy.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = (2 × Precision × Recall) / (Precision + Recall)
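As a quick illustration of how these metrics are computed in practice, the following is a minimal sketch using scikit-learn; the label arrays are placeholders, and macro averaging over the seven classes is an assumption about how per-class scores are aggregated.

```python
# Illustrative computation of Equations (5)-(8) with scikit-learn; y_true and
# y_pred are hypothetical label arrays, and macro averaging is an assumed
# aggregation over the seven emotion classes.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]        # hypothetical ground-truth emotion labels
y_pred = [0, 2, 2, 2, 1, 0]        # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
```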

3. Results

3.1. Dataset Details

This section describes three datasets: CK+ [41], FER 2013, and Bigfer 2013. The CK+ dataset, also known as Cohn–Kanade, is frequently utilized as a controlled dataset in laboratory settings to evaluate FER systems, as depicted in Figure 4. It is specifically developed to overcome the limitations present in the CK dataset, the most obvious of which is the lack of validated sentiment labels. The dataset consists of 45 samples for anger, 59 samples for disgust, 25 samples for fear, 69 samples for happiness, 28 samples for sadness, 83 samples for surprise, 593 samples for neutral, and 18 samples for contempt. The data are split into training (80%), PublicTest (10%), and PrivateTest (10%) sets. Each image in the dataset has been resized to 48 × 48 pixels in grayscale format. FER2013, on the other hand, is an extensive and unrestricted dataset, which consists of grayscale face images with dimensions of 48 × 48 pixels, ensuring consistent face positioning across all images. FER2013 contains a total of 35.9 K images and is annotated with seven distinct expression labels, namely anger, disgust, fear, happiness, neutral, sadness, and surprise. The dataset is divided into three subsets: the training set, the public test set, and the final test set. The training set consists of approximately 28,709 images, while the public test set and final test set each contain around 3589 images. This dataset serves as a valuable resource for researchers and developers in the fields of expression recognition and sentiment analyses due to its diverse range of expressions and challenging real-world conditions. The Bigfer2013 dataset combines 35.9 K records from FER2013 and 13.7 K records from the ‘Muxspace’ dataset. It contains 14,685 happy images (29.63%), 13,066 neutral images (26.36%), 6345 sad images (12.8%), 5205 angry images (10.5%), 5142 fearful images (10.37%), 4379 images of surprise (8.82%), and 755 images of disgust (1.52%). Figure 5 compares the FER 2013 and Bigfer 2013 datasets.

3.2. Experimental Results

The deep learning model is executed on a computer equipped with an Intel(R) Xeon(R) Silver 4214R CPU (Intel Corporation, Santa Clara, CA, USA) operating at 2.40 GHz (with two processors), 128 GB of RAM, and an NVIDIA GeForce RTX 3090 graphics card. For the experiments described in this article, Python 3.9 is utilized, and the model experiments are conducted using the TensorFlow framework along with the cross-platform computer vision and machine learning software library OpenCV. Finally, the Adam optimizer is employed for optimization.
During data preprocessing, we randomly rotate the images by angles of ±15 degrees to enhance the model’s ability to recognize expressions from various angles and orientations. Additionally, we normalize the pixel values of the images to a range between 0 and 1, which mitigates the effects of lighting variations on the model. This process significantly enhances and normalizes the facial expression recognition dataset, thereby improving the model’s performance and robustness and providing a solid data foundation for subsequent research and applications. By applying various transformations to the original images, we generate a more diverse set of training samples, which further enhances the model’s generalization capability.
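A minimal preprocessing sketch along these lines, assuming a tf.keras ImageDataGenerator pipeline and a hypothetical directory layout, is shown below; only the ±15° rotation and the rescaling of pixel values to [0, 1] are taken from the description above.

```python
# Minimal augmentation sketch matching the preprocessing described above:
# random rotations of up to +/-15 degrees and pixel rescaling to [0, 1].
# The directory path and batch size are placeholders.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(rotation_range=15, rescale=1.0 / 255.0)
train_flow = train_gen.flow_from_directory(
    "data/fer2013/train",            # hypothetical dataset path
    target_size=(48, 48),
    color_mode="grayscale",
    batch_size=16,
    class_mode="categorical",
)
```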

3.2.1. RS-Xception Performance on CK+

This model was trained for a total of 100 epochs with an initial learning rate of 0.001, processing samples in batches of 16. The validation and test samples provided within the dataset were used to evaluate RS-Xception's performance; unlike FER2013, this database does not provide a dedicated test split. The model obtained a peak recognition accuracy of 97.13%, and the loss of the multi-class classification task approached zero. In terms of recognition accuracy, the proposed model surpasses comparable FER systems. The performance evaluation results are displayed in Figure 6, with a precision, recall, and F1 score of 96.30%, 96.20%, and 96.06%, respectively (Table 2). Figure 7 presents the confusion matrix and ROC curve generated by the model on the CK+ test set. These results demonstrate the efficacy of the trained model in accurately recognizing the majority of facial images across affective classes and further validate the effectiveness of the improved FER model.

3.2.2. RS-Xception Performance on FER2013

We next describe the performance of the RS-Xception model on the FER2013 dataset. The model achieved a recognition accuracy of 69.02% and a low loss of 0.94 on the multi-class classification task. Moreover, the proposed technique produced a precision, recall, and F1 score of 67.51%, 67.55%, and 67.46%, respectively, on the test set (Table 2), which comprises the seven classes of the FER2013 dataset. For a comprehensive overview, the confusion matrix and ROC curve of the RS-Xception model on the FER2013 test samples are shown in Figure 8. These results showcase the proficiency of the proposed model in recognizing facial images encompassing various emotions. The confusion matrix further validates the model's predictive capability for the seven categories, with the happiness class outperforming the rest. It should be noted that while the improved model demonstrates commendable overall performance, classification performance may vary across classes.

3.2.3. RS-Xception Performance on Bigfer2013

In our study, we utilized the Bigfer2013 dataset to train a model. The training process involved 100 epochs, using the Adam optimizer and a multi-class classification loss function. The initial learning rate was set to 0.001. During training, we processed samples in batches of 16. The Fer2013 dataset served as a basis for our work, with the Bigfer2013 dataset being an extension that included additional annotated images. All images in the dataset were 48 × 48 in size and featured diverse characters, which enhanced the model’s generalization capability. The training set and validation set are 80% and 20% of the dataset, respectively. When evaluating the improved model on the Bigfer2013 test samples, we achieved a validation accuracy, precision, recall, and F1 score of 72.06%, 71.86%, 71.21%, and 71.38%, respectively, shown in Table 2. The confusion matrix and ROC curve of the model on the Bigfer2013 dataset are presented in Figure 9. Notably, within a certain range, as the data increase, the accuracy also increases, thus confirming the effectiveness of our model on diverse datasets.
Although the CK+ dataset has a small sample size, it offers detailed expression information and high annotation accuracy, making it a valuable supplementary dataset for enhancing model performance. On the other hand, the FER2013 dataset, despite its uneven distribution of expression classes and significant variability, provides a large and diverse set of data that can aid in training a model with strong generalization capabilities. However, FER2013 does suffer from collection errors, and human accuracy on it is limited to around 65 ± 5%. To address these limitations, we introduced the Bigfer2013 dataset, which includes a substantial number of web images to augment the existing data and improve the model's generalization abilities. The benefit can be observed through the increased data volume, improved generalization, and the practicality of the model when dealing with large datasets. To evaluate the applicability and generalization ability of the model, we conducted validation tests on a more complex dataset, RAF-DB. This dataset presents additional challenges owing to its unique characteristics, including image quality, background complexity, the naturalness of expressions, and the quality of annotations. The model performed well on the RAF-DB dataset, achieving an accuracy of 82.98%, a recall of 81.98%, and an F1 score of 81.93%. These findings also highlight the difficulties the model encounters when dealing with diverse datasets, emphasizing the need for further optimization and adjustment to improve its generalization ability. To assess the performance of our model thoroughly, we conducted a comparative study against prevailing classification networks. To maintain fairness, none of the network models used pre-trained weights in this experiment. The detailed experimental findings can be found in Table 3. In the comparison experiments, the improved MobileNetV2 model is highlighted because its number of layers is similar to that of our proposed network. However, our model has 1.35 million fewer parameters than the improved MobileNetV2, and its accuracy is 1.17% and 0.4% higher on the CK+ and FER2013 datasets, respectively, showcasing the efficiency of our method. The comparison of accuracy between the proposed model and existing models is shown in Figure 10.
Lightweight face recognition models, such as CBAM, improved MobileNetV2, and IE-DBN, demonstrate enhanced performance on the CK+ dataset by emphasizing local modules. However, our proposed network consistently outperforms these models. The CK+ dataset, collected in a controlled environment, minimizes noise, thereby presenting a unique challenge. While the results of current methods presented in the table approach 100%, our proposed network still achieves outstanding results. In contrast, the FER2013 dataset, characterized by a larger volume of data and more complex environmental conditions, poses a greater challenge for model evaluation. In this context, the model by Sidhom O. et al. employs a three-stage hybrid feature extraction method to enhance efficiency. Meanwhile, improved MobileNetV2 and E-FCNN improve accuracy on this dataset by extracting texture features. However, these networks overlook the interaction between the overall context and finer details, and local attention can adversely affect the recognition of similar emotions. Our proposed network addresses this by focusing on local details while the pooling operation accounts for the interaction between details and the overall context, resulting in excellent performance on the FER2013 dataset. To validate the efficiency of our model, we compare its training results on the more complex RAF-DB dataset against other advanced models. The RAF-DB dataset contains diverse images, leading networks such as TDGAN and E-FCNN to prioritize texture information, expression data, and other unrelated facial features. However, these different branches often lack integrated communication. Furthermore, the highly imbalanced distribution of various expression images in the RAF-DB dataset can significantly hinder network performance. Nevertheless, our proposed network achieves state-of-the-art performance on the RAF-DB dataset.
The high accuracy, recall, and F1 score in the CK+ dataset demonstrate the superior performance of the model on this dataset. The evaluation results from the Fer2013 dataset also show good performance in predicting positive classes, with a high proportion of correct predictions in this category. The model’s reliability is further supported by a comprehensive evaluation of the accuracy, recall, and F1 score. Additionally, the comparison between the accuracy and recall of the model on the BigFer2013 dataset indicates an overall improvement in performance with an increase in data, highlighting the model’s high generalization and robustness. The AUC of the ROC curve in the CK+ dataset was 1.00, indicating excellent overall performance, with AUC values for each category close to or equal to 1.00, demonstrating high classification performance and minimal misclassifications. Similarly, the AUC of the Fer2013 dataset was 0.93, with AUC values for each category around 0.90, showcasing good classification effects and model efficiency. The AUC in the ROC curve of BigFer2013 was 0.95, surpassing the AUC of the Fer2013 dataset, with improved classification effects on each category, suggesting that increasing the amount of data can enhance the model’s performance.

3.3. Ablation Experiments

The experiment examined six distinct pre-trained models, namely MobileNet [48], DenseNet121 [49], ResNet18, ResNet50, ResNet101, and SENet18. The numerical value adjacent to each model's name denotes its depth. From a practical perspective, when engaging in transfer learning (TL) [28], it is crucial to carefully choose a pre-trained model and align its input and output dimensions before fine-tuning. There are three common approaches to fine-tuning a network: (1) training the whole model, (2) training selected layers while keeping others frozen, and (3) training only the classifier while the convolutional base remains frozen. For similar tasks, fine-tuning only the classifier and/or a few layers is adequate to acquire the new skill; for divergent tasks, training the full model becomes necessary. In the ablation experiment, we used an additional classifier and a fully connected layer.
The proposed system for FER utilizes a CNN to capture important information from images. The CNN is composed of multiple layers, each progressively learning more intricate features: shallow layers detect basic properties such as edges and corners, while deeper layers capture more complex patterns. The FER task, which entails identifying emotions from facial expressions, is akin to other image-based classification tasks. To build the FER model, we compared the accuracy of various pre-trained CNN models (ResNet50, ResNet101, MobileNet, ResNet18, DenseNet121, and SENet18). These models were fine-tuned on emotion data by redefining the classification layer. Specifically, the last dense layer of the pre-trained model is replaced with a new dense layer responsible for classifying the facial image into one of seven emotion types: anger, disgust, fear, happiness, neutral, sadness, and surprise. A dense layer receives inputs and produces vectors of the desired dimension. Pre-trained models such as ResNet50 and MobileNet streamline the training process: all pre-trained parameters are frozen, and the original output module is replaced with a fully connected layer for classification. Feature extraction in these models relies on the pre-trained parameters, and the number of layers in the model significantly impacts transfer learning accuracy. For tasks like FER, the final layer must be substituted with a new dense layer tailored to the number of emotion categories. Because these models are heavily parameterized, training them from scratch is time-consuming; replacing only the last dense layer means training the parameters of a small number of layers, which greatly reduces training time and motivates the use of pre-training and transfer learning to evaluate their performance. As such, the output layer of the FER model comprises seven elements, representing the classification produced by the convolutional base and the additional dense layers of the pre-trained model.
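As a hedged illustration of this head-replacement strategy (not the exact code used in our experiments), the sketch below freezes a Keras ResNet50 base and trains only a new seven-way dense classifier; the input size, the use of ImageNet weights, and the three-channel input (grayscale frames replicated to RGB) are assumptions.

```python
# Minimal sketch of head replacement for transfer learning: freeze a
# pre-trained convolutional base and train only a new 7-class dense layer.
# ResNet50/ImageNet weights and the 48 x 48 x 3 input are illustrative choices.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(48, 48, 3))
base.trainable = False                                  # freeze pre-trained parameters

outputs = layers.Dense(7, activation="softmax")(base.output)
tl_model = tf.keras.Model(base.input, outputs)
tl_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                 loss="categorical_crossentropy",
                 metrics=["accuracy"])
```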
We evaluated six classical models using the FER2013 dataset. Afterward, the pre-trained model underwent fine-tuning using sentiment data and was updated for FER through the redefinition of the dense layer. The transfer learning process is depicted in Figure 11. To mitigate the potential issue of overfitting due to limited data in the FER database, we employed various dynamic data augmenters throughout the learning process. For each model, training and validation were conducted over 200 epochs using the Adam optimizer and a batch size of 16. The loss function employed was cross-entropy. We initiated the learning rate at 0.001. The DTL framework implemented in advanced mainstream models exhibits recognition accuracy ranging from 58% to 70% on the FER2013 dataset. Our proposed model achieves comparable or even higher accuracy, reaching 69.02% on the FER2013 dataset, without utilizing transfer learning, as evidenced in Figure 12. Moreover, our model demonstrates superiority over the other six pre-trained CNN models in terms of its performance on the FER2013 dataset. Our model also exhibits the fewest parameters compared to the pre-trained models examined. Furthermore, it outperforms the other models in training and validation accuracy. These experimental findings effectively showcase the remarkable efficiency of the proposed FER framework.
Although RS-Xception already matches the accuracy these six advanced models obtain with transfer learning, we aim to further improve its accuracy by applying transfer learning to our own model [32]. To accomplish this, we employed an emotion dataset preprocessed through operations such as scaling and cropping. We used the model pre-trained on FER2013 and kept its parameters intact; after the model's final GAP layer, we added two additional dense layers and froze all parameters of the original layers. These two fully connected layers output the seven expression classes through softmax. When transferring the frozen, modified model to Bigfer2013, we trained only the parameters of the last two fully connected layers, significantly reducing the training time. Bigfer2013 introduces styles (brightness, orientation, region) that differ from FER2013, allowing us to test the model's generalization ability, and its larger size ensures that accuracy is not compromised by insufficient data. Our transfer learning technique achieves a precision, recall, and F1 score of 75.86%, 75.22%, and 74.88%, respectively, on the seven-category test set (Table 2). The transfer learning confusion matrix and ROC curves on the Bigfer2013 dataset are shown in Figure 13. As shown in Figure 14, the transfer learning model achieves a substantial improvement in accuracy, reaching 75.38%. This result shows that transfer learning plays an important role in improving the model's recognition accuracy and generalization capability.
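A minimal sketch of this transfer step, assuming a saved Keras checkpoint of the FER2013-trained model and hypothetical layer names and sizes, might look as follows.

```python
# Hedged sketch of the transfer step described above: load the model trained on
# FER2013, freeze its layers, and train two new fully connected layers after the
# global average pooling output on Bigfer2013. The checkpoint file name, GAP
# layer name, and hidden size are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.models.load_model("rs_xception_fer2013.h5")    # hypothetical checkpoint
base.trainable = False                                          # freeze all original layers

gap_output = base.get_layer("global_average_pooling2d").output  # GAP layer name assumed
x = layers.Dense(128, activation="relu")(gap_output)            # first added dense layer
outputs = layers.Dense(7, activation="softmax")(x)              # seven expression classes

transfer_model = tf.keras.Model(base.input, outputs)
transfer_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
# transfer_model.fit(bigfer2013_train, validation_data=bigfer2013_val, epochs=...)
```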

4. Discussion and Conclusions

In this paper, we propose a lightweight network structure based on ResNet and SENet, incorporating the concept of the Xception network. The proposed model demonstrates good accuracy when evaluated on commonly used FER datasets such as CK+, FER2013, and Bigfer2013. We evaluate the performance of the improved model using criteria such as accuracy, F1 score, recall, and precision. Additionally, we compare our model with advanced models by replacing the dense layer of six different pre-trained models for transfer learning; we fine-tuned the parameters of these models and compared their performance on the same dataset. We find that the proposed model, without transfer learning, matches the accuracy these six advanced models achieve with transfer learning. Furthermore, we perform transfer learning on our model and observe a significant improvement in accuracy, highlighting the importance of transfer learning in enhancing model accuracy and generalization ability. Compared to advanced deep transfer learning frameworks, our model performs best in recognizing facial emotions in most samples. Figure 15 shows the recognition results of the proposed model on some images. Many current studies emphasize local modules or texture features through attention mechanisms to highlight key areas of the face, or adopt multi-stage hybrid feature extraction methods to enhance model efficiency. While these approaches do improve recognition performance, they also increase computational complexity and often overlook the interaction between the overall background and fine details. Our method adeptly addresses the relationship between global and local features by introducing pooling operations. This not only reduces the number of parameters but also significantly enhances the computational efficiency and accuracy of the model. Consequently, our method preserves global background information while capturing finer local features, leading to more accurate and efficient facial expression recognition.
In this experiment, we limited the preprocessing of the images, focusing solely on the different image styles and quantities across datasets. This approach may introduce noise that could interfere with the model’s judgment and subsequently reduce its accuracy. In future work, we will emphasize image preprocessing and incorporate techniques such as MF.
The application of deep learning models in various fields presents the challenge of enhancing their adaptability to different tasks and environments. The RS-Xception model’s design focuses on improving adaptability by incorporating flexible structures and adaptive feature learning mechanisms like SE blocks. While the fundamental concepts of RS-Xception, such as deep separable convolution and SE blocks, are not entirely novel, their integration and application in a lightweight design, strong adaptability, and optimization for real-time applications continue to offer potential for innovation and practicality in technology development. The exploration and utilization of its feature recalibration mechanism also hold both novelty and practical value in current and future technology scenarios. The model we proposed has a small number of parameters, which addresses the issue of insufficient processing power in lightweight chips. By analyzing the confusion matrix and ROC curves of different datasets, we observe that the number of expressions in each dataset affects the recognition accuracy of the model’s categories. Our study highlights the effectiveness of transfer learning in improving model accuracy and reducing training costs, showcasing its potential in future network structure development. Currently, our model only utilizes simple cropping and rotation techniques. While RS-Xception has made progress in lightweight design with deep separable convolutions and SE blocks, there is potential for further optimization. Advanced techniques like brightening, zooming in, and zooming out could enhance the model’s robustness in varying lighting conditions, facial orientations, and expressions.
We propose an innovative lightweight network that integrates the SE attention mechanism with a Depthwise Separable Convolutional residual structure to address the computational and parameter limitations of current deep learning models. Looking ahead, our objective is to optimize and deploy this model on embedded devices to facilitate multimodal deep learning of facial expressions and speech information, thereby enhancing recognition accuracy. We will investigate intonation, speech rate, and linguistic content in audio data, combining these elements with visual features to more precisely infer emotional states. Furthermore, we plan to employ pruning techniques to streamline the model and reduce computational demands while preserving efficient performance. Ultimately, we aim to develop a facial expression recognition system capable of processing multimodal inputs in real time and delivering rapid emotional feedback.

Author Contributions

S.W. and J.F. wrote the main manuscript text and the main experiments. L.L. and C.S. prepared the translation and edited this paper. All authors wrote and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Science and Technology Research Project of Hebei Provincial Sports Bureau (2024QT01), the basic scientific research business project of colleges and universities in Hebei Province (Hebei Provincial Department of Education, 2023JCT008), National Natural Science Foundation of China project “Research on the Relationship between Heterogeneity of Innovation Networks and Enterprise Innovation Performance—Taking the Undertaking Industrial Transfer Demonstration Zone as an Example” (71462018), National Natural Science Foundation of China project “Research on the Matching of Digital Strategy and Business Model in Digital Disruption” (71761018).

Data Availability Statement

Our datasets (CK+, FER2013, and Bigfer2013) are all public datasets. The datasets (CK+, FER2013, and Bigfer2013) used in this study are publicly available from https://www.kaggle.com/datasets/davilsena/ckdataset (accessed on 12 December 2023), https://www.kaggle.com/datasets/deadskull7/fer2013 (accessed on 13 December 2023), and https://www.kaggle.com/datasets/uldisvalainis/fergit (accessed on 13 December 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Belmonte, R.; Allaert, B.; Tirilly, P.; Bilasco, I.M.; Djeraba, C.; Sebe, N. Impact of facial landmark localization on facial expression recognition. IEEE Trans. Affect. Comput. 2021, 14, 1267–1279. [Google Scholar] [CrossRef]
  2. Liang, L.; Lang, C.; Li, Y.; Feng, S.; Zhao, J. Fine-grained facial expression recognition in the wild. IEEE Trans. Inf. Forensics Secur. 2020, 16, 482–494. [Google Scholar] [CrossRef]
  3. Lim, C.; Inagaki, M.; Shinozaki, T.; Fujita, I. Analysis of convolutional neural networks reveals the computational properties essential for subcortical processing of facial expression. Sci. Rep. 2023, 13, 10908. [Google Scholar] [CrossRef] [PubMed]
  4. Shao, J.; Cheng, Q. E-FCNN for tiny facial expression recognition. Appl. Intell. 2021, 51, 549–559. [Google Scholar] [CrossRef]
  5. Nassif, A.B.; Darya, A.M.; Elnagar, A. Empirical evaluation of shallow and deep learning classifiers for Arabic sentiment analysis. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 21, 1–25. [Google Scholar]
  6. Kardakis, S.; Perikos, I.; Grivokostopoulou, F.; Hatzilygeroudis, I. Examining attention mechanisms in deep learning models for sentiment analysis. Appl. Sci. 2021, 11, 3883. [Google Scholar] [CrossRef]
  7. Saeed, S.; Shah, A.A.; Ehsan, M.K.; Amirzada, M.R.; Mahmood, A.; Mezgebo, T. Automated facial expression recognition framework using deep learning. J. Healthc. Eng. 2022, 2022, 5707930. [Google Scholar] [CrossRef] [PubMed]
  8. Talaat, F.M. Real-time facial emotion recognition system among children with autism based on deep learning and IoT. Neural Comput. Appl. 2023, 35, 12717–12728. [Google Scholar] [CrossRef]
  9. Helaly, R.; Messaoud, S.; Bouaafia, S.; Hajjaji, M.A.; Mtibaa, A. DTL-I-ResNet18: Facial emotion recognition based on deep transfer learning and improved ResNet18. Signal Image Video Process. 2023, 17, 2731–2744. [Google Scholar] [CrossRef]
  10. Bansal, M.M.; Sachdeva, M.; Mittal, A. Transfer learning for image classification using VGG19: Caltech-101 image data set. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 3609–3620. [Google Scholar] [CrossRef]
  11. Wen, G.; Hou, Z.; Li, H.; Li, D.; Jiang, L.; Xun, E. Ensemble of deep neural networks with probability-based fusion for facial expression recognition. Cogn. Comput. 2017, 9, 597–610. [Google Scholar] [CrossRef]
  12. Ge, H.; Zhu, Z.; Dai, Y.; Wang, B.; Wu, X. Facial expression recognition based on deep learning. Comput. Methods Programs Biomed. 2022, 215, 106621. [Google Scholar] [CrossRef] [PubMed]
  13. Li, D.; Wen, G. MRMR-based ensemble pruning for facial expression recognition. Multimed. Tools Appl. 2018, 77, 15251–15272. [Google Scholar] [CrossRef]
  14. Hua, W.; Dai, F.; Huang, L.; Xiong, J.; Gui, G. HERO: Human emotions recognition for realizing intelligent Internet of Things. IEEE Access 2019, 7, 24321–24332. [Google Scholar] [CrossRef]
  15. Alonazi, M.; Alshahrani, H.J.; Alotaibi, F.A.; Maray, M.; Alghamdi, M.; Sayed, A. Automated Facial Emotion Recognition Using the Pelican Optimization Algorithm with a Deep Convolutional Neural Network. Electronics 2023, 12, 4608. [Google Scholar] [CrossRef]
  16. Arora, M.; Kumar, M.; Garg, N.K. Facial emotion recognition system based on PCA and gradient features. Natl. Acad. Sci. Lett. 2018, 41, 365–368. [Google Scholar] [CrossRef]
  17. Connie, T.; Al-Shabi, M.; Cheah, W.P.; Goh, M. Facial expression recognition using a hybrid CNN–SIFT aggregator. In Proceedings of the International Workshop on Multi-Disciplinary Trends in Artificial Intelligence, Gadong, Brunei Darussalam, 20–22 November 2017; pp. 139–149. [Google Scholar]
  18. Kaya, H.; Gürpınar, F.; Salah, A.A. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 2017, 65, 66–75. [Google Scholar] [CrossRef]
  19. Zhao, L.; Niu, X.; Wang, L.; Niu, J.; Zhu, X.; Dai, Z. Stress detection via multimodal multi-temporal-scale fusion: A hybrid of deep learning and handcrafted feature approach. IEEE Sens. J. 2023, 23, 27817–27827. [Google Scholar] [CrossRef]
  20. Fan, X.; Tjahjadi, T. Fusing dynamic deep learned features and handcrafted features for facial expression recognition. J. Vis. Commun. Image Represent. 2019, 65, 102659. [Google Scholar] [CrossRef]
  21. Mehendale, N. Facial emotion recognition using convolutional neural networks (FERC). SN Appl. Sci. 2020, 2, 446. [Google Scholar] [CrossRef]
  22. Zeng, J.; Shan, S.; Chen, X. Facial expression recognition with inconsistently annotated datasets. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 222–237. [Google Scholar]
  23. Arora, M.; Kumar, M. AutoFER: PCA and PSO based automatic facial emotion recognition. Multimed. Tools Appl. 2021, 80, 3039–3049. [Google Scholar] [CrossRef]
  24. Debnath, T.; Reza, M.M.; Rahman, A.; Beheshti, A.; Band, S.S.; Alinejad-Rokny, H. Four-layer ConvNet to facial emotion recognition with minimal epochs and the significance of data diversity. Sci. Rep. 2022, 12, 6991. [Google Scholar] [CrossRef] [PubMed]
  25. He, L.; He, L.; Peng, L. CFormerFaceNet: Efficient lightweight network merging a CNN and transformer for face recognition. Appl. Sci. 2023, 13, 6506. [Google Scholar] [CrossRef]
  26. Helaly, R.; Hajjaji, M.A.; M’Sahli, F.; Mtibaa, A. Deep convolution neural network implementation for emotion recognition system. In Proceedings of the 2020 20th International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA), Monastir, Tunisia, 20–22 December 2020; pp. 261–265. [Google Scholar]
  27. Huang, Z.Y.; Chiang, C.C.; Chen, J.H.; Chen, Y.C.; Chung, H.L.; Cai, Y.P.; Hsu, H.C. A study on computer vision for facial emotion recognition. Sci. Rep. 2023, 13, 8425. [Google Scholar] [CrossRef] [PubMed]
  28. Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning—ICANN 2018: Proceedings of the 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 270–279. [Google Scholar]
  29. Sarkar, A.; Behera, P.R.; Shukla, J. Multi-source transfer learning for facial emotion recognition using multivariate correlation analysis. Sci. Rep. 2023, 13, 21004. [Google Scholar]
  30. Hoo, S.C.; Ibrahim, H.; Suandi, S.A. ConvFaceNeXt: Lightweight networks for face recognition. Mathematics 2022, 10, 3592. [Google Scholar] [CrossRef]
  31. Deng, Z.Y.; Chiang, H.H.; Kang, L.W.; Li, H.C. A lightweight deep learning model for real-time face recognition. IET Image Process. 2023, 17, 3869–3883. [Google Scholar] [CrossRef]
  32. Xie, S.; Hu, H.; Chen, Y. Facial expression recognition with two-branch disentangled generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2359–2371. [Google Scholar] [CrossRef]
  33. Kong, C.; Chen, B.; Li, H.; Wang, S.; Rocha, A.; Kwong, S. Detect and locate: Exposing face manipulation by semantic-and noise-level telltales. IEEE Trans. Inf. Forensics Secur. 2022, 17, 1741–1756. [Google Scholar] [CrossRef]
  34. Hardjadinata, H.; Oetama, R.S.; Prasetiawan, I. Facial expression recognition using xception and densenet architecture. In Proceedings of the 2021 6th International Conference on New Media Studies (CONMEDIA), Tangerang, Indonesia, 12–13 October 2021; pp. 60–65. [Google Scholar]
  35. Liang, X.; Liang, J.; Yin, T.; Tang, X. A lightweight method for face expression recognition based on improved MobileNetV3. IET Image Process. 2023, 17, 2375–2384. [Google Scholar] [CrossRef]
  36. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Zhu, Q.; Zhuang, H.; Zhao, M.; Xu, S.; Meng, R. A study on expression recognition based on improved mobilenetV2 network. Sci. Rep. 2024, 14, 8121. [Google Scholar] [CrossRef]
  40. Rabea, M.; Ahmed, H.; Mahmoud, S.; Sayed, N. IdentiFace: A VGG Based Multimodal Facial Biometric System. arXiv 2024, arXiv:2401.01227. [Google Scholar]
  41. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  42. Zhang, X.; Chen, Z.; Wei, Q. Research and application of facial expression recognition based on attention mechanism. In Proceedings of the 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2021; pp. 282–285. [Google Scholar]
  43. Zhang, H.; Su, W.; Yu, J.; Wang, Z. Identity–expression dual branch network for facial expression recognition. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 898–911. [Google Scholar] [CrossRef]
  44. Sidhom, O.; Ghazouani, H.; Barhoumi, W. Three-phases hybrid feature selection for facial expression recognition. J. Supercomput. 2024, 80, 8094–8128. [Google Scholar] [CrossRef]
  45. Mukhopadhyay, M.; Dey, A.; Kahali, S. A deep-learning-based facial expression recognition method using textural features. Neural Comput. Appl. 2023, 35, 6499–6514. [Google Scholar] [CrossRef]
  46. Jiang, B.; Li, N.; Cui, X.; Liu, W.; Yu, Z.; Xie, Y. Research on Facial Expression Recognition Algorithm Based on Lightweight Transformer. Information 2024, 15, 321. [Google Scholar] [CrossRef]
  47. Khan, S.; Chen, L.; Yan, H. Co-clustering to reveal salient facial features for expression recognition. IEEE Trans. Affect. Comput. 2017, 11, 348–360. [Google Scholar] [CrossRef]
  48. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  49. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Figure 1. Depthwise Separable Convolution.
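For a concrete reference point, the block below is a minimal Keras sketch of a depthwise separable convolution: a per-channel 3×3 depthwise filter followed by a 1×1 pointwise convolution. The 48×48 input, the 32 incoming channels, and the filter count of 64 are illustrative assumptions, not the exact configuration of RS-Xception.

```python
# Minimal sketch of a depthwise separable convolution (Keras).
# Input size and filter counts are illustrative assumptions only.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(48, 48, 32))                          # an intermediate feature map (channels assumed)
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)    # spatial filtering, one filter per channel
x = layers.Conv2D(filters=64, kernel_size=1, padding="same")(x)      # 1x1 pointwise convolution mixes channels
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
block = tf.keras.Model(inputs, x, name="depthwise_separable_block")
block.summary()
```

Keras also provides layers.SeparableConv2D, which fuses the depthwise and pointwise steps into a single layer.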
Figure 2. Squeeze and Excitation block.
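The sketch below shows one common way to implement the squeeze-and-excitation operation in Keras. The reduction ratio of 16 follows the original SENet paper [37] and is an assumption here, not necessarily the value used in RS-Xception.

```python
# Minimal sketch of a Squeeze-and-Excitation (SE) block (Keras).
# The reduction ratio r=16 follows the original SENet paper; it is an
# assumption, not necessarily the value used in RS-Xception.
from tensorflow.keras import layers

def se_block(x, ratio=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                        # squeeze: one descriptor per channel
    s = layers.Dense(channels // ratio, activation="relu")(s)     # excitation: bottleneck FC layer
    s = layers.Dense(channels, activation="sigmoid")(s)           # channel weights in (0, 1)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                              # rescale the input feature maps
```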
Figure 3. Model structure.
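As a rough illustration only, the unit below combines the three ingredients the paper names (a depthwise separable convolution, an SE block, and a residual shortcut) into one Keras function. The actual layer arrangement, filter counts, and number of such units in RS-Xception may differ; se_block is the sketch given under Figure 2.

```python
# Hypothetical RS-Xception-style unit: depthwise separable convolution +
# SE recalibration + residual shortcut. A sketch of the idea, not the
# paper's exact architecture.
from tensorflow.keras import layers

def residual_se_unit(x, filters):
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)                     # 1x1 projection so shapes match
    y = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)   # depthwise + pointwise convolution
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = se_block(y)                                                             # channel attention (Figure 2 sketch)
    return layers.Add()([shortcut, y])                                          # residual connection as in ResNet
```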
Figure 4. Example images from the CK+ dataset.
Figure 5. Comparison of the expression class distributions of the FER2013 and Bigfer2013 datasets.
Figure 6. (a) Training accuracy curves for the CK+ (blue), FER2013 (scarlet), Bigfer2013 (green), and RAF-DB (brown) datasets. (b) Training loss curves for the four datasets. (c) Validation accuracy curves for the four datasets. (d) Validation loss curves for the four datasets.
Figure 7. Confusion matrix (a) and ROC curve (b) for the CK+ dataset (classes 0–7 represent neutral, anger, contempt, disgust, fear, happiness, sadness, and surprise).
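Figures 7–9 can be reproduced with standard tooling; the snippet below is one possible scikit-learn recipe, where model, x_test, and y_test are placeholders for a trained classifier and a held-out test split rather than the paper's actual evaluation code.

```python
# One way to compute the confusion matrix and one-vs-rest ROC curves of
# Figures 7-9 (scikit-learn). `model`, `x_test`, and `y_test` are placeholders.
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize

probs = model.predict(x_test)                     # softmax outputs, shape (N, num_classes)
preds = np.argmax(probs, axis=1)
cm = confusion_matrix(y_test, preds)              # rows = true classes, columns = predictions
print(cm)

y_bin = label_binarize(y_test, classes=list(range(probs.shape[1])))
for c in range(probs.shape[1]):                   # one-vs-rest ROC per expression class
    fpr, tpr, _ = roc_curve(y_bin[:, c], probs[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```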
Figure 8. Confusion matrix (a) and ROC curve (b) for the FER2013 dataset (classes 0–6 represent angry, disgusted, scared, happy, sad, surprised, and neutral).
Figure 9. Confusion matrix (a) and ROC curve (b) for the Bigfer2013 dataset (classes 0–6 represent angry, disgusted, scared, happy, sad, surprised, and neutral).
Figure 10. The comparison of accuracy between the proposed model and existing models.
Figure 11. The process of model transfer learning (ResNet50, ResNet101, MobileNet, ResNet18, DenseNet121, SENet18).
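In Keras, the transfer-learning procedure sketched in Figure 11 typically looks like the following: load ImageNet weights for a backbone such as ResNet50, freeze the convolutional base, train a new seven-class expression head, then optionally unfreeze the top layers for fine-tuning. The input size, optimizer, and learning rates below are illustrative assumptions, not the paper's settings.

```python
# Illustrative Keras transfer-learning setup in the spirit of Figure 11.
# Backbone choice, input size, and hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                                            # freeze pretrained features first
outputs = layers.Dense(7, activation="softmax")(base.output)      # seven expression classes
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# After the new head converges, a few top blocks of `base` can be unfrozen
# and fine-tuned with a much smaller learning rate (e.g., 1e-5).
```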
Figure 12. (a) Training accuracy of RS-Xception on the FER2013 dataset compared with the transfer-learning training accuracy of the other six models on FER2013. (b) Validation accuracy of RS-Xception on the FER2013 dataset compared with the transfer-learning validation accuracy of the other six models on FER2013.
Figure 13. (a) Confusion matrix of the model after transfer learning on Bigfer2013. (b) ROC curve of the model after transfer learning on Bigfer2013 (classes 0–6 represent angry, disgusted, scared, happy, sad, surprised, and neutral).
Figure 14. (a) Comparison of the validation accuracy of the model on the Bigfer2013 dataset with and without transfer learning. (b) Comparison of the corresponding loss function values on the Bigfer2013 dataset with and without transfer learning.
Figure 15. Results of expression classification using the proposed model.
Table 1. Comparison of the parameter counts, depth, FLOPs, and CPU inference time of each model.
Model | Parameters | Depth | FLOPs | Time (ms) per Inference Step (CPU)
Xception | 22.9 M | 81 | 8900 M | 109.4
VGG16 | 138.4 M | 16 | 15,517 M | 69.5
VGG19 | 143.7 M | 19 | 19,682 M | 84.8
ResNet50 | 25.6 M | 107 | 4100 M | 58.2
ResNet101 | 44.7 M | 209 | 7900 M | 89.6
ResNet152 | 60.4 M | 311 | 11,000 M | 127.4
InceptionV3 | 23.9 M | 189 | 6000 M | 42.2
InceptionResNetV2 | 55.9 M | 449 | 17,000 M | 130.2
MobileNet | 4.3 M | 55 | 600 M | 22.6
MobileNetV2 | 3.5 M | 105 | 312.86 M | 25.9
DenseNet121 | 8.1 M | 242 | 5690 M | 77.1
Improved MobilenetV2 [39] | 3.26 M | 25 | \ | \
Ours (SE block) | 1.91 M | 28 | 70 M | 15.9
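Numbers of the kind reported in Table 1 can be obtained with a procedure like the one below: Keras reports the parameter count, and a simple timing loop estimates CPU latency per inference step. The paper does not describe its exact profiling setup, so this is an illustrative sketch only, and MobileNetV2 is used merely as an example model.

```python
# Illustrative profiling of parameter count and per-step CPU inference time,
# in the spirit of Table 1. The paper's exact measurement setup is not known.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)      # example model to profile
print("parameters:", model.count_params())

batch = np.random.rand(1, 224, 224, 3).astype("float32")
model.predict(batch, verbose=0)                               # warm-up run
start = time.perf_counter()
for _ in range(50):
    model.predict(batch, verbose=0)
elapsed_ms = (time.perf_counter() - start) / 50 * 1000.0
print(f"time per inference step: {elapsed_ms:.1f} ms")
```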
Table 2. Performance evaluation of the experiments.
Experiments | Accuracy | Precision | Recall | F1 Score
On CK+ | 97.13% | 96.30% | 96.20% | 96.06%
On FER2013 | 69.02% | 67.51% | 67.55% | 67.46%
On Bigfer2013 | 72.06% | 71.86% | 71.21% | 71.38%
DTL on Bigfer2013 | 75.38% | 75.86% | 75.22% | 74.88%
On RAF-DB | 82.98% | 82.06% | 81.98% | 81.93%
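The accuracy, precision, recall, and F1 score columns in Table 2 are standard multi-class metrics; a hedged scikit-learn sketch is shown below, reusing the y_test and preds placeholders from the evaluation snippet under Figure 7 and assuming macro averaging, since the paper does not state which averaging mode it used.

```python
# Multi-class metrics of the kind listed in Table 2 (scikit-learn).
# Macro averaging is an assumption; the paper does not specify the averaging mode.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

acc = accuracy_score(y_test, preds)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, preds, average="macro")
print(f"accuracy {acc:.4f}  precision {prec:.4f}  recall {rec:.4f}  F1 {f1:.4f}")
```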
Table 3. Comparison of the accuracy of some of the latest models.
Approach | Dataset | Accuracy (%)
CBAM [42] | CK+ | 95.1
IE-DBN [43] | CK+ | 96.02
CCFS + SVM [1] | CK+ | 96.05
Improved MobilenetV2 [39] | CK+ | 95.96
Model by Sidhom O et al. [44] | Fer2013 | 66.1
Self-Cure Net [45] | Fer2013 | 66.17
Improved MobileViT [46] | Fer2013 | 62.2
Improved MobilenetV2 [39] | Fer2013 | 68.62
PSR [47] | RAF-DB | 80.78
E-FCNN [4] | RAF-DB | 78.31
TDGAN [32] | RAF-DB | 81.91
Ours | CK+ | 97.13
Ours | Fer2013 | 69.02
Ours | RAF-DB | 82.98
Back to TopTop