1. Introduction
Facial expression is one of the most potent and prompt means for humans to communicate their emotions, intentions, cognitive states, and opinions to each other [1]. Facial expression also plays an important role in the evolution of complex societies, helping to coordinate social interaction, promote group cohesion, and maintain social affiliations [2]. Potential applications of expression recognition technology include tutoring systems that are sensitive to students’ expressions, computer-assisted deceit detection, clinical disorder diagnosis and monitoring, behavioral and pharmacological treatment assessment, new entertainment system interfaces, smart digital cameras, and social robots. The smile is the most typical human facial expression for conveying joy, happiness, and satisfaction [3]. Modern longitudinal studies have used smile data from images to predict future social and health outcomes [4].
Researchers have made substantial progress in developing automatic facial expression detection systems [5]. Anger, surprise, disgust, sadness, happiness, and fear are the six basic facial expressions most commonly referred to. Among these, happiness, usually demonstrated by a smile, occurs most often in a person’s daily life. A smile involves two components of facial muscle movement, namely the Cheek Raiser (AU6) and the Lip Corner Puller (AU12), as shown in Figure 1 [6]. Automated facial expression recognition has become a very interesting and challenging problem in computer vision and its fields of application [7].
In controlled settings, current facial expression recognition achieves promising results, but performance on real-world data sets is still unsatisfactory [8]. This is because facial appearance varies broadly with skin color, lighting, pose, expression, orientation, head location, and so on. To deal with these difficulties, automatic methods for facial expression identification are suggested that incorporate deep learning [9], optimization [10], and ensemble classification. The planned work comprises the following key steps: preprocessing, feature extraction using a deep evolutionary neural network, feature selection utilizing swarm optimization, and facial expression classification employing a support vector machine, ensemble classifiers [11], and a neural network [12].
Motivation and Contribution
As an important way to express emotion, facial expression has a vital influence on communication between people. Facial expression recognition has become an active research area in computer vision and pattern recognition, and real-time, effective smile detection can significantly advance its development. Classifying smiles in an unconstrained environment is difficult because of the wide variability of facial images. Extensive visual variations of faces, such as occlusions, pose transitions, and drastic lighting, make the task very difficult in real-world implementations. The majority of current studies deal with smile detection rather than smile classification. Moreover, the models used in current smile classification approaches are not smile-attribute specific; hence, their performance may be limited.
The main goal of this paper is to build an adaptive smile classification model that incorporates both row-transformation-based feature extraction and a cascade classifier to increase classification accuracy. In contrast to current smile classification methods, which rely on deep neural networks for feature extraction and therefore require a large number of samples and more computation, the suggested model relies on row transformation to reduce the extracted features and improve their discriminatory capability. Furthermore, the suggested model utilizes the cascade classification concept to build an accurate classifier: cascading allows only the most likely smile pictures to be evaluated against the full set of discriminative features, and the accuracy of each stage classifier can be varied. A series of experiments shows that the suggested model is substantially more reliable and quicker than other widespread models.
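As an illustration of the cascade idea only (not the paper’s exact configuration), the following minimal Python sketch chains a cheap first-stage classifier with a stronger second stage, so that only samples passing the first stage incur the cost of the second; the choice of a shallow decision tree and an RBF SVM as the two stages is an assumption made for this sketch.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

class TwoStageCascade:
    """Illustrative two-stage cascade: a cheap stage rejects obvious
    non-smiles (label 0); only samples it accepts reach the stronger,
    more expensive second stage."""

    def __init__(self):
        self.stage1 = DecisionTreeClassifier(max_depth=3)  # fast, coarse filter
        self.stage2 = SVC(kernel="rbf")                    # slower, more accurate

    def fit(self, X, y):
        # For simplicity both stages see all data; a tuned cascade would
        # train later stages only on samples earlier stages pass through.
        self.stage1.fit(X, y)
        self.stage2.fit(X, y)
        return self

    def predict(self, X):
        pred = np.zeros(len(X), dtype=int)
        passed = self.stage1.predict(X) == 1            # stage 1: cheap rejection
        if passed.any():
            pred[passed] = self.stage2.predict(X[passed])  # stage 2: refinement
        return pred
```

Here X would be the row-transformed feature matrix and y the binary smile labels; tightening or loosening the first stage trades speed against the number of samples that reach the expensive stage.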
The remainder of the article is organized as follows. Section 2 discusses the current related work, Section 3 presents the proposed model steps in detail, Section 4 explains the experimental designs, and Section 5 includes the conclusion and future work.
2. Related Work
Several scientific studies have been performed in the field of facial expression identification, with applications in a range of technologies such as computer vision, image recognition, the bioindustry, forensics, document authentication, etc. [13,14,15]. In many studies, Principal Component Analysis (PCA) was used to provide a facial action coding framework that models and recognizes various forms of facial action [16,17]. However, PCA-based solutions suffer from the dilemma that the projection maximizes variance across all images, which negatively affects recognition performance. Independent Component Analysis (ICA) is a statistical method adapted to expression recognition that elicits statistically independent local face characteristics and performs better than PCA [18].
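The contrast between the two projections can be sketched in a few lines of scikit-learn; the face matrix below is a random stand-in for a real data set, and the component count of 40 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Stand-in data: each row is a flattened 48 x 48 grayscale face image.
faces = np.random.rand(200, 48 * 48)

# PCA projects onto directions of maximal variance across all images;
# variance unrelated to expression (e.g., lighting or identity) can
# dominate, which is the drawback noted above.
pca_features = PCA(n_components=40).fit_transform(faces)

# ICA instead seeks statistically independent components, which tend
# to capture local facial characteristics.
ica_features = FastICA(n_components=40, max_iter=1000).fit_transform(faces)

print(pca_features.shape, ica_features.shape)  # (200, 40) (200, 40)
```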
Recently, deep learning has attracted substantial interest from the scientific community in the field of smile detection, since training and feature extraction are carried out simultaneously. The Deep Neural Network (DNN) was the first deep learning method used for the training and classification of models on high-dimensional data [19]. The DNN has one problem: it takes too long at the training stage. The Convolutional Neural Network (CNN) is a deep learning technique that alleviates DNN problems by reducing preprocessing, thereby enhancing image, audio, and video processing [20,21]. CNNs perform very well when classifying smile images that closely resemble the training data, but at a huge computational cost. Moreover, CNNs usually have difficulty classifying an image that includes some degree of tilt or rotation.
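A minimal Keras sketch of such a CNN for binary smile classification is given below; the layer sizes are illustrative rather than those of the cited networks, and the 48 × 48 grayscale input matches the CK+48 resolution used later in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Small CNN for smile / non-smile classification of 48x48 grayscale images.
model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),  # pooling tolerates small shifts, but not large tilt or rotation
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "smile"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```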
As the feature extraction module represents the core module of facial classification, many nature-inspired algorithms have been suggested to select image characteristics [22], primarily in medicinal applications [23]. To choose the optimal facial features, a feature selection strategy is used to classify a human smile by excluding unwanted or redundant features. However, traditional optimization solutions often fail to converge to the global optimum. These drawbacks can be reduced by using metaheuristic evolutionary optimization algorithms such as Ant Colony Optimization (ACO) [24], Bee Colony Optimization (BCO) [25], Particle Swarm Optimization (PSO) [26], etc. Even so, such approaches can be inefficient in reaching the global optimum with respect to convergence speed, exploration capability, and consistency of the solution [27]. An updated Cuckoo Search (CS) algorithm has been suggested that selects several characteristics for classification and uses two learning algorithms, namely K-nearest neighbor and Support Vector Machines (SVMs) [28].
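As a concrete illustration of metaheuristic feature selection, the sketch below implements a basic binary PSO in which each particle is a feature mask and the fitness is the cross-validated accuracy of a k-nearest-neighbor classifier on the selected features; the inertia and acceleration coefficients are conventional textbook values, not tuned settings from the cited works.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated k-NN accuracy on the selected feature subset."""
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(knn, X[:, mask.astype(bool)], y, cv=3).mean()

def binary_pso(X, y, n_particles=10, n_iter=20):
    n_feat = X.shape[1]
    pos = (rng.random((n_particles, n_feat)) < 0.5).astype(float)  # feature masks
    vel = rng.normal(0.0, 1.0, (n_particles, n_feat))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n_feat))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        # Sigmoid transfer function turns velocities into bit-flip probabilities.
        pos = (rng.random((n_particles, n_feat)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fit = np.array([fitness(p, X, y) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest.astype(bool)  # boolean mask of selected features
```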
Many other methods have been used in the literature to extract the salient features of an image, such as the chaotic Grey Wolf Optimizer [29] and the Whale Optimization Algorithm (WOA) [30]. Because randomization is so important to exploration and exploitation, the existing randomization technique in WOA raises computational time, especially for highly complex problems. The Multiverse Optimization (MVO) algorithm suffers from a low convergence rate and entrapment in local optima. To overcome these problems, a chaotic MVO algorithm (CMVO) has been applied that mitigates both the slow convergence and the entrapment in local optima [31].
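Here, “chaotic” means that the algorithm’s uniform random draws are replaced by a deterministic chaotic sequence; the logistic map below is one common choice, though the specific maps used in [29,31] may differ.

```python
def logistic_map(x0=0.7, r=4.0):
    """Yield a logistic-map chaotic sequence in (0, 1). Chaotic
    optimizer variants substitute such sequences for the uniform
    random numbers in their position-update rules."""
    x = x0
    while True:
        x = r * x * (1.0 - x)
        yield x

chaos = logistic_map()
print([round(next(chaos), 4) for _ in range(5)])  # deterministic, non-repeating values
```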
A graphical model for feature extraction and description using a hybrid approach to recognize a person’s facial expressions was developed in [32]. However, its main disadvantage is large memory complexity; matrix representations are only a good alternative when the graph is roughly complete (every node is connected to almost all other nodes). In [33], a model based on local Zernike moments was developed to classify a person’s expressions such as neutral, happy, sad, surprised, angry, and fearful. Recognition was performed using characteristics for speech recognition and motion change, and an SVM was used for classification. The experiments carried out showed that the integrated system achieves better results than the individual descriptors. However, it requires a long training time, and the final model, the variable weights, and their individual impacts are difficult to understand and interpret.
Recently, several methods for classifying facial expressions using neural network approaches have been suggested [34,35]. A target-oriented approach using a neural network for facial expression detection was discussed in [34]. This approach has several limitations: the stated goals may not be realistic, and unintended outcomes may be ignored. In [36], automatic recognition of facial expressions was performed using an Elman neural network to recognize feelings such as satisfaction, sadness, frustration, anxiety, disgust, and surprise; the identification rate was lower for pictures of sorrow, anxiety, and disgust. Furthermore, neural networks by their structure demand processors with parallel processing power, and the appropriate network structure is achieved only through experience and trial and error.
Inspired by the good performance of CNNs in computer vision tasks such as image classification and face recognition, several CNN-based smile classification approaches have been proposed in recent years. In [37], a deeper CNN consisting of two convolution layers, each followed by max-pooling, and four inception layers was suggested for facial expression recognition. It has a single-component architecture that takes face pictures as input and classifies them into one of seven basic expressions. Another related work [38] utilizes deep learning-based facial expression recognition to minimize the dependency on face-physics-based models. Herein, the input image is convolved with a set of filters in the convolution layers, and the CNN generates feature maps that are then passed to fully connected layers to identify the facial expression. In [39], a deep learning approach is introduced to track consumer behavior patterns. The authors in [40] presented a deep region and multi-label learning scheme for head pose estimation and facial expression analysis to report the interest of customers; they used a feedback network to isolate vast facial regions.
In general, deep learning approaches yield optimal facial features and classification. However, it is difficult to gather vast amounts of training data for facial expression recognition under different circumstances, and massive calculations are required; therefore, the computation time of deep learning algorithms needs to be reduced. To minimize the number of features and the processing time, a Deep Convolutional Neural Network (DCNN) combined with Cat Swarm Optimization (CSO) has been used for facial expression recognition [10]. Yet there is no common theory to help choose the best resources for deep learning, because doing so requires an understanding of the topology, the training process, and other parameters; as a consequence, less experienced practitioners find it difficult to follow.
In contrast to the previous methods, which rely on the deep learning concept for smile classification, and in order to avoid that concept’s need for vast amounts of training data for facial expression recognition under different circumstances, the suggested approach utilizes both the row transformation technique and the cascade classifier in a unified framework. The cascade classifier can process a large number of features. Even so, the effectiveness of this method depends fundamentally on the extracted features. In this case, row transformation is used to exclude redundant coefficients from the feature vector, thus increasing the discriminatory capacity of the derived features and reducing computational complexity, as sketched below.
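The exact row transformation is specified in Section 3; purely as a stand-in illustration of the pruning idea, the sketch below drops near-constant coefficients from a feature matrix, shrinking the vector that the cascade classifier must process.

```python
import numpy as np

def prune_redundant(features, tol=1e-3):
    """Drop feature columns whose variance across samples is below tol.
    This is an illustrative stand-in for the row-transformation-based
    reduction described above, not the paper's actual transform."""
    keep = features.var(axis=0) > tol
    return features[:, keep], keep

X = np.random.rand(100, 21)   # e.g., the 21-dimensional ROI features of Section 4
X[:, 5] = 0.5                 # plant one redundant, near-constant coefficient
X_reduced, kept = prune_redundant(X)
print(X.shape, "->", X_reduced.shape)  # (100, 21) -> (100, 20)
```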
4. Experimental Results
The proposed facial expression recognition system is tested on benchmark data sets, namely the Japanese Female Facial Expression (JAFFE), Extended Cohn–Kanade (CK+), and CK+48 data sets [43,44,45]. JAFFE is a Japanese database containing 213 images of 7 facial expressions at a resolution of 256 × 256 pixels. The CK+ database contains 10,414 images of 13 expressions at a resolution of 640 × 490 pixels, and the CK+48 data set contains 981 images of 7 facial expressions at a resolution of 48 × 48 pixels. Features are extracted from ROIs using histograms of the lip, teeth, and eye areas, producing a 21-dimensional feature vector. For each data set, 80% of the samples of each class are used for training and the remaining 20% for testing. The prototype classification methodology was developed in a modular manner and implemented and evaluated on a Dell Inspiron N5110 laptop (Dell Computer Corporation, Round Rock, TX, USA) with an Intel(R) Core(TM) i5-2410M processor running at 2.30 GHz, 4.00 GB of RAM, and Windows 7 64-bit. Recognition rate, accuracy, sensitivity (recall), specificity, and precision are used to evaluate the efficiency of the suggested model (see [39] for more details):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity\ (Recall)} = \frac{TP}{TP + FN},$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
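For reference, the four formulas above translate directly into code; the confusion counts in the usage line are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard metrics from the confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, specificity, precision

# Hypothetical counts from a 20% test split:
print(classification_metrics(tp=45, tn=40, fp=5, fn=10))
# (0.85, 0.8181..., 0.8888..., 0.9)
```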