1. Introduction
Motor vehicles have become an important means of daily travel and cargo transportation for residents. Both their total number and their annual growth show an explosive trend, which inevitably causes an increasing number of traffic safety problems and accidents. Against this background, how to reduce the probability of traffic safety problems and improve traffic safety has become a common concern of scholars. Risky driving is one of the essential factors leading to traffic safety problems: it reduces the driver's control over the vehicle, leaving the driver unable to perform normal maneuvers such as steering, gear shifting, and deceleration [1,2,3].
Statistical results indicate that more than 75% of traffic accidents and traffic safety problems are closely related to irregular and risky driving behaviors [4]. For example, phoning, drinking, and smoking while driving distract the driver, making it impossible to focus on the road conditions ahead and the environment around the vehicle, which may directly lead to safety accidents. Therefore, improving the capacity to detect the driver's driving status is important, and the timely identification and correction of risky driving behaviors can avoid traffic safety problems to the greatest extent [5]. At this stage, a large number of scholars have carried out experimental research on risky-driving detection and achieved relatively excellent performance. Early risky-driving-detection systems were mainly based on vehicle driving information, driver physiological signals, or driver facial characteristics, and they achieved relatively stable detection accuracy with the support of accurate sensor devices. However, traditional risky-driving-detection systems still suffer from application problems such as low detection efficiency, complex detection schemes, and difficult deployment.
With the further development of electronic imaging technology and the continuous innovation of computer intelligence technology, image-classification technology based on machine vision has flourished and has been applied to production tasks in various fields. In particular, deep learning, one of the hottest intelligent research directions in recent years, has shown outstanding achievements in image recognition and classification by automatically extracting features from input images with convolutional neural networks (CNNs). Deep-learning-based image-recognition and classification techniques have therefore been applied to many fields such as medicine [6], machinery [7], and agriculture [8], and different models have also been tested, experimented with, and applied by researchers in automotive image recognition and driver-status detection.
Based on deep-learning techniques, Alotaibi et al. proposed a distracted-behavior-detection system built on residual modules and recurrent neural networks (RNNs), and the experimental results proved that the method achieves high classification accuracy in the driver distracted-driving image-classification task [9]. Fusing in-vehicle sensor data with vision data, Furkan et al. proposed a system based on a CNN and transfer-learning techniques, applied it to hazardous-driving-condition detection, and achieved 96% detection accuracy on the test dataset [10]. To detect driver driving behaviors, Xing et al. designed a driver-activity-recognition system based on deep convolutional neural networks (CNNs) to detect seven common driving behaviors and compared the classification performance of three networks, AlexNet, GoogLeNet, and ResNet50, among which AlexNet performed relatively better in the detection tests [11]. In addition, various scholars have experimented with and tested the performance of different deep-learning networks in risky-driving-image-classification tasks and applied the related techniques to practical detection scenarios [12,13,14,15,16].
However, these studies focused on the recognition accuracy of existing deep-learning models for risky driving behaviors and did not consider the collaborative optimization of recognition accuracy and efficiency to ease the deployment of recognition systems. In general, the complexity of a model determines its response speed, and that complexity can be reduced by reducing the depth of the model. Therefore, this paper explores the classification accuracy of deep-learning models at different depths and introduces visual attention modules to further enhance each classification model, so as to obtain risky-driving-image-recognition models with low model complexity and high classification accuracy, which can guide model selection in different application scenarios. The key contributions of this work are:
(1) Taking drivers' risky-driving images as the research object, covering four categories (normal driving, drinking while driving, smoking while driving, and calling while driving), this paper proposes four different visual attention modules and builds ResNet image-classification models of different depths.
(2) The proposed four visual attention modules are embedded into the ResNet models to explore their effect on classification performance.
(3) The Grad-CAM algorithm is introduced for visual analysis to observe the influence of the visual attention modules on feature extraction in risky-driving images.
The rest of this paper is organized as follows: Section 2 discusses the structure of the base convolutional neural networks, pooling strategies, visual attention modules, and the data-augmentation technique; Section 3 presents the experimental results and discussion; and Section 4 concludes the paper with a summary and future research directions.
2. Methodology
This section describes the techniques involved in the visual-attention-based, deep-learning risky-driving-image-classification system, mainly covering convolutional neural networks, the ResNet architecture, different pooling strategies, different types of visual attention modules, and data-augmentation techniques.
2.1. Convolutional Neural Networks & ResNet
Convolutional neural networks (CNNs) are a kind of feed-forward neural network with a deep structure and convolution computations; they have strong learning capability and classify input information in a shift-invariant manner using stacked convolutional layers [17]. A basic CNN consists of five structures: the input layer, convolutional layer, pooling layer, fully connected layer, and classification layer. The CNN architecture is shown in Figure 1.
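To make this five-part structure concrete, the following is a minimal PyTorch sketch of such a basic CNN; the layer sizes, input resolution, and four output classes are illustrative assumptions rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal sketch of the five-part CNN structure described above."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                             # pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 112 * 112, num_classes),      # fully connected layer
        )

    def forward(self, x):
        # x: (N, 3, 224, 224) -- the input layer
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)              # classification layer
```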
ResNet [18] is a class of networks designed to mitigate the vanishing/exploding-gradient and degradation problems that arise during the model-training phase as the network deepens. The residual module (Figure 2a) adds the features extracted by the earlier layers to the output of the later layers, and through this shortcut connection (Figure 2b), ResNet effectively alleviates these training problems. At the same time, by introducing batch-normalization (BN) layers, ResNet accelerates network training and improves convergence stability. Given these application advantages, this experiment selects ResNets of different depths as the backbone architecture to complete the risky-driving-image-classification tasks.
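For reference, below is a minimal PyTorch sketch of the basic residual module of Figure 2, with the shortcut connection and BN layers described above; the fixed channel count and identity shortcut are simplifying assumptions (ResNet also uses projection shortcuts when dimensions change).

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual module: two conv-BN stages plus a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x          # shortcut: add the input to the stacked-layer output
        return self.relu(out)
```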
2.2. Pooling Operation & Different Pooling Strategies
The pooling operation is one of the most important processing units in CNN models; it extracts representative features from the captured feature maps and is therefore also called the sub-sampling or down-sampling operation. After pooling, the dimensionality of the output features is effectively reduced, which helps reduce the number of network training parameters and prevents overfitting. In CNN architectures, the common pooling strategies include max pooling, average pooling, and stochastic pooling, as shown in Figure 3.
For different types of image-classification tasks, different pooling strategies preserve different image features, such as texture, contour, or background information in the input feature maps, and researchers can select a pooling strategy to optimize a CNN model for a specific task. However, it is worth noting that a single pooling strategy often loses useful features: max pooling discards all non-maximum values in the pooling kernel, average pooling fails to retain the maximum feature values, and stochastic pooling does not focus on retaining features in any specific direction. Therefore, a single pooling strategy limits the classification performance of CNN models and needs to be compensated for by optimization methods.
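The following sketch contrasts the three strategies on a toy tensor. PyTorch provides max and average pooling directly; stochastic pooling has no built-in operator, so the version below samples one value per window with probability proportional to its activation, a common formulation that is assumed here rather than taken from this paper.

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 4, 4)  # toy non-negative feature map (e.g., post-ReLU)

max_out = F.max_pool2d(x, kernel_size=2)  # keeps the strongest response per window
avg_out = F.avg_pool2d(x, kernel_size=2)  # keeps the mean response per window

def stochastic_pool2d(x, k=2):
    """Sample one activation per k x k window, weighted by its magnitude."""
    n, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=k, stride=k)       # (N, C*k*k, L)
    patches = patches.view(n, c, k * k, -1)              # (N, C, k*k, L)
    probs = patches / patches.sum(dim=2, keepdim=True).clamp(min=1e-8)
    flat = probs.permute(0, 1, 3, 2).reshape(-1, k * k)  # one window per row
    idx = torch.multinomial(flat, num_samples=1)         # sample an index per window
    vals = patches.permute(0, 1, 3, 2).reshape(-1, k * k).gather(1, idx)
    return vals.view(n, c, h // k, w // k)

sto_out = stochastic_pool2d(x)  # randomly keeps one value per window
```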
2.3. Visual Attention Module Design
To address the feature loss caused by a single pooling strategy and to improve the classification performance of deep-learning models in risky-driving-image-classification tasks, this paper incorporates visual attention mechanisms into the risky-driving-image-classification models. This section illustrates the four visual-attention-module designs.
2.3.1. Squeeze and Excitation Visual Attention Block (SE Block)
The squeeze-and-excitation visual attention block (SE block) was first proposed by Hu et al. in SENet [19]; it adds a visual attention mechanism to the CNN model in the channel direction to capture more channel feature information. The structure of the SE block is shown in Figure 4a.
The SE block mainly contains three processing steps: squeeze, excitation, and scale. Taking the output of the previous layer as the processing object, a 1 × 1 convolution operation is first performed to obtain the feature map:

$$u_c = \mathbf{v}_c * \mathbf{X}$$

where $\mathbf{v}_c$ represents the parameters of the $c$-th filter, $\mathbf{X}$ is the input image, $*$ represents the convolution operation, and $u_c$ is the output feature map.
Afterwards, the SE block uses the convolutional output to perform the squeeze, excitation, and scale operations in sequence. The squeeze step is implemented as a global average pooling operation: each channel of the feature map is compressed into a single value that characterizes the global distribution of responses over that channel. The excitation step is implemented with a fully connected layer; its result passes through another fully connected layer to recover the feature dimensionality, and a sigmoid activation function produces a weight value between 0 and 1 for each channel. This process allows the CNN model to effectively learn the nonlinear interactions and non-mutually exclusive relationships between channels and ensures attention enhancement across multiple channels. Finally, the scale (reweight) step weights the normalized weights onto the features of each channel through a channel-wise dot product. Through the SE block, the CNN model's feature extraction in the channel direction is effectively enhanced, and the SE block can be flexibly embedded in the residual branch of the ResNet model, as shown in Figure 4b.
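For reference, here is a minimal PyTorch sketch of the squeeze-excitation-scale pipeline described above; the reduction ratio r = 16 is the common default from the SENet paper and is an assumption here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global average pool) -> excitation (two FC layers) -> scale."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # squeeze: global average pooling
        self.excitation = nn.Sequential(         # excitation: FC -> ReLU -> FC
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                        # channel weights in (0, 1)
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.squeeze(x).view(n, c)
        w = self.excitation(w).view(n, c, 1, 1)
        return x * w                             # scale: channel-wise reweighting
```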
2.3.2. Channel Visual Attention Block & Spatial Visual Attention Block (CA Block & SA Block)
Following the design idea of the SE visual attention block, Woo et al. proposed two new visual attention blocks, the channel attention module (CA block) and the spatial attention module (SA block), for the channel and spatial directions, respectively [20], which further improve the feature-extraction ability and classification performance of CNN image-classification models. The structures of the CA block and SA block are shown in Figure 5.
In CNN image-classification models, the CA block and the SA block perform visual attention in different ways: the CA block focuses on computing the intrinsic relationships between individual channels, while the SA block focuses on the intrinsic relationships of the feature maps at the spatial level.
On the one hand, the CA block performs max pooling, average pooling, and stochastic pooling on the input feature map F to simultaneously capture the texture, contour, and background information of the input image and enhance model robustness. The results are fed into a shared MLP, which sums the corresponding elements of the three feature descriptors and outputs the channel attention map; the CNN model thus not only reduces the dimensionality of the convolutional feature maps but also retains more comprehensive image features. On the other hand, the SA block performs max pooling, average pooling, and stochastic pooling on the input feature maps in turn and concatenates the results. The fused feature map then undergoes a standard convolution operation to recover the feature dimension and output the spatial attention map, so the SA block efficiently helps the CNN model determine which regions of the input image are important and which are minor. In addition, both the CA block and the SA block can be flexibly deployed in ResNet, with embedding schemes similar to that of the SE block.
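The sketch below illustrates one way the modified CA and SA blocks could be realized. Note that the original CBAM blocks use only max and average pooling; the stochastic branch here uses a softmax-weighted expectation as a simple differentiable stand-in for stochastic pooling, so the exact formulation is an assumption, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA block: three global pooling descriptors fed through a shared MLP."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        flat = x.view(n, c, -1)
        max_d = self.mlp(flat.max(dim=2).values)       # max-pooled descriptor
        avg_d = self.mlp(flat.mean(dim=2))             # average-pooled descriptor
        probs = flat.softmax(dim=2)                    # stochastic stand-in:
        sto_d = self.mlp((flat * probs).sum(dim=2))    # activation-weighted expectation
        scale = torch.sigmoid(max_d + avg_d + sto_d)   # sum the three descriptors
        return x * scale.view(n, c, 1, 1)

class SpatialAttention(nn.Module):
    """SA block: concatenate three channel-axis pooled maps, then convolve."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        max_m = x.max(dim=1, keepdim=True).values        # pool along channel axis
        avg_m = x.mean(dim=1, keepdim=True)
        probs = x.softmax(dim=1)
        sto_m = (x * probs).sum(dim=1, keepdim=True)     # stochastic stand-in
        fused = torch.cat([max_m, avg_m, sto_m], dim=1)  # concatenate the three maps
        return x * torch.sigmoid(self.conv(fused))
```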
2.3.3. Mixed Visual Attention Block (MA Block)
In the process of exploring the use of visual attention mechanisms in CNN models, Woo et al. found that there is still space for the upward improvement of CNN image-classification models, so they proposed a mixed visual attention block that combines the use of two types of visual attention blocks to improve the feature-extraction and image-classification performance of deep-learning models, as shown in
Figure 6. Meanwhile, through experiments, Woo et al. pointed out that setting the CA block in front and the SA block in the back has a more significant performance on the model enhancement, and the increase in computational complexity contributed by this MA block to the CNN model is relatively small. In addition, the embedding method of the MA block in the ResNet model is consistent with the deployment of the SE block, CA block and SA block, which indicates the high application flexibility of the MA block.
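Reusing the ChannelAttention and SpatialAttention sketches from the previous subsection, the serial ordering reported by Woo et al. can be expressed as follows; again, this is a sketch under those assumptions, not the authors' exact code.

```python
import torch.nn as nn

class MixedAttention(nn.Module):
    """MA block: channel attention first, spatial attention second."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)  # defined in the previous sketch
        self.sa = SpatialAttention()          # defined in the previous sketch

    def forward(self, x):
        return self.sa(self.ca(x))            # CA in front, SA behind
```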
2.4. Data-Augmentation Technology
As deep-learning models deepen and model complexity grows, training a new, deep, and large CNN image-classification model requires a large amount of labeled image data, and an insufficient amount of image data directly leads to overfitting and accuracy bottlenecks during the training phase. Besides, as a relatively new research area, the risky-driving-image-classification task has relatively few public datasets and insufficient image data. In addition, acquiring risky-driving images requires a professional camera at a fixed position on the driver's side of the vehicle, which imposes relatively strict requirements on imaging equipment and shooting environments and further increases the difficulty of acquiring risky-driving images and preparing datasets.
One solution to the above problem is data-augmentation (DA) technology, which is now widely used by researchers to obtain training data for deep-learning models. The classic DA methods include rotating, flipping, scaling, increasing contrast, adding Gaussian noise, and many other forms. Among them, rotation rotates the original training image by a certain angle; flipping inverts the original image horizontally or vertically; scaling enlarges or shrinks the original image by a certain proportion; increasing contrast changes the saturation (S) and value (V) of the original image in the HSV color space; and adding Gaussian noise randomly perturbs the RGB values of each pixel in the original image. By using these classic DA methods, researchers can quickly and efficiently expand the training image dataset for their CNN models, which in turn alleviates overfitting and the imbalance of data volume between groups during the training phase.
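As one hedged example, the classic DA methods listed above map naturally onto a torchvision transform pipeline; the specific angles, scales, and jitter strengths below are illustrative assumptions.

```python
import torch
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                   # rotating
    transforms.RandomHorizontalFlip(p=0.5),                  # flipping
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),     # scaling
    transforms.ColorJitter(brightness=0.2, saturation=0.2),  # contrast: S and V changes
    transforms.ToTensor(),
    # adding Gaussian noise: randomly perturb each pixel's RGB values
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0.0, 1.0)),
])
```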
4. Conclusions
To further improve the performance of deep-learning image-classification models in the risky-driving-detection task, this paper proposes embedding visual attention blocks into the deep-learning framework to improve feature-extraction ability and classification performance. Through model comparison and evaluation, it is worth noting that, by embedding the visual attention modules, shallower ResNet models can exceed the classification accuracy of deeper ResNet models while model complexity remains essentially unchanged. Therefore, embedding a visual attention module can greatly improve the recognition accuracy of the ResNet model without degrading recognition efficiency, which is of great significance for the practical application and popularization of this technology. Moreover, the results of the confusion-matrix analysis and the Grad-CAM visualization analysis confirm the superiority of the proposed model.
In future studies, we will further expand the coverage of dangerous-driving scenes and the amount of image data, optimize the configuration of the visual attention modules, and pursue practical applications and optimization on the basis of improved recognition accuracy and efficiency.