1. Introduction
Traction motors are the power source of metro trains, and the quality of their bearings directly affects normal motor operation. Frequent starting and stopping of the metro causes the speed of the traction motor bearings and the loads they carry to alternate continually. Under such harsh long-term working conditions, the inner rings, outer rings and rolling elements of the bearings develop varying degrees of pitting, cracking and more complex forms of failure. The vibrations generated by a faulty bearing, when transmitted into the entire system over an extended period, not only damage the traction motor but also endanger other structural components, posing a serious threat to the safety and reliability of metro trains. Intelligent diagnosis of bearing faults under complex working conditions enables the timely identification of fault types and early maintenance intervention, and therefore has significant engineering value.
Conventional approaches for bearing fault diagnosis predominantly rely on signal processing techniques. To address noise interference during feature extraction, wavelet thresholding was employed to remove significant noise components from the raw data [1,2]. To enhance the signal-to-noise ratio, refs. [3,4] adopted empirical mode decomposition (EMD) to decompose the signal into multiple intrinsic mode functions. Furthermore, ref. [5] introduced an optimized variational mode decomposition (VMD) method to facilitate the selection of intrinsic mode functions containing pertinent fault information. Despite the promising results of these traditional methods in bearing fault diagnosis, they have inherent limitations, including restricted generalization capability, difficulty in extracting deep fault features, and complex parameter optimization. Signal analysis remains a research hotspot, and newly introduced methods have since addressed many of these challenges [6,7].
With the development of artificial intelligence technology, machine learning and deep learning [8] have gained significant attention in various fields, and numerous researchers have begun extracting deeper features and have made notable contributions [9,10,11]. The convolutional neural network (CNN), as one of their important representatives, possesses a powerful adaptive feature extraction capability and has demonstrated remarkable performance in image processing. Scholars have therefore increasingly introduced CNNs into fault diagnosis and conducted a series of studies in this area. Ref. [12] proposed a CNN model that uses widened convolutional kernels to improve the feature extraction efficiency of the network. Ref. [13] deployed a CNN to extract features from the Mel spectrum generated from the voiceprint signals of motors. Ref. [14] presented a multiscale CNN model that effectively extracts signal features at different frequencies and combined it with LSTM to identify fault types. In the field of medical imaging, ref. [15] proposed an improved CNN architecture for identifying lung nodules and diagnosing early-stage cancer by comparing multiple images. In big data environments, some researchers have explored unsupervised learning techniques to reduce the costs associated with data collection and processing. To synchronously extract local and global structural information from raw unlabeled industrial data, ref. [16] proposed a new multiple-order graphical deep extreme learning machine (MGDELM) algorithm. Ref. [17] proposed a novel self-training semi-supervised deep learning (SSDL) approach that trains a fault diagnosis model using few labeled and abundant unlabeled samples. These studies have made notable advances in fault diagnosis. However, because they rely on single-sensor signals, they may characterize fault information incompletely, which could ultimately reduce their overall reliability.
Multisignal fusion technology enables the simultaneous processing of time-series data obtained from multiple sensors, thereby capturing a broader range of system variability while offering greater complementarity and fault tolerance. In one study, features were extracted from the original vibration and acoustic signals and then fused using a 1DCNN-based network model [18]. Another study proposed a frequency-domain multilinear principal component analysis that identifies faults by integrating diverse vibration and acoustic signals [19]. Similarly, a two-dimensional matrix was constructed from multi-axial vibration signals, and an enhanced 2DCNN model was employed for fault diagnosis [20]. These methods have improved diagnostic accuracy. However, they share a common limitation: they ignore the time correlation among signals, which may result in the loss of crucial fault-related information.
A review of the existing literature shows that deep learning-based diagnostic approaches frequently increase network depth to enhance the model’s learning capacity and improve diagnostic performance. However, progressively deeper networks may suffer from vanishing or exploding gradients. Deep residual networks were introduced to mitigate this problem [21]. Furthermore, ref. [22] proposed an activation function named STAC-tanh, which enables adaptive feature extraction in the bearing system by employing the hyperbolic tangent function with an adaptive slope and threshold. Another approach combined the Gramian angular field (GAF) with ResNet, leading to notable advances in bearing fault diagnosis [23]. Additionally, ref. [24] combined transfer learning with ResNet, using a ResNet model pretrained on ImageNet as a fault feature extractor, which yielded highly accurate results. These studies have demonstrated promising outcomes in bearing fault diagnosis. However, certain limitations persist, including reliance on a single sensor signal and the absence of experimental verification on a purpose-built platform.
In summary, most studies are based on open-source datasets with simple working conditions and failure forms, whereas the actual working conditions of bearings are complex, and failures can occur in different parts and to different degrees. Compound bearing fault diagnosis under complex working conditions faces several challenges: the low reliability of single-sensor signals, the tendency of traditional data processing methods to lose important information, the degradation of diagnostic models with increasing network depth, and the difficulty of feature extraction. To address these challenges, this paper proposes an intelligent diagnosis method for compound bearing faults in metro traction motors that fuses MTF-processed acoustic emission and vibration signals at the feature level using IFCNN and diagnoses them with an optimized version of ResNet. The main contributions of the paper are as follows:
The application of IFCNN in compound bearing fault diagnosis allows for the fusion of multiple signal features, reducing the limitations of single sensor signals and providing more reliable diagnostic results.
The optimized ResNet model improves the efficiency of feature extraction by addressing the vanishing gradient problem. Combined with the MTF data processing method, it can effectively extract complex bearing fault features under varying working conditions with good accuracy and stability.
A test platform for metro traction motor bearings was constructed, and intelligent diagnosis of compound faults under variable working conditions was conducted on it, validating the effectiveness of the proposed methods.
The remaining sections of this paper are arranged as follows.
In Section 2, the data processing method used in this study and the construction of the dataset are introduced.
Section 3 focuses on the multisignal fusion technology used in this study.
Section 4 provides a detailed description of the fault diagnosis model and the corresponding diagnostic process.
Section 5 explains the specific experimental design, as well as the diagnostic scheme adopted in this study.
Section 6 analyzes the experimental results and carries out a series of method comparisons to validate the effectiveness of the proposed approach.
Section 7 summarizes the main content of the paper and draws conclusions.
3. Multisignal Fusion
To enhance system stability and increase diagnostic reliability, this study collected vibration signals and acoustic emission signals and fused them for processing. Such fusion can establish correlations between multiple signal sources. Information fusion is usually divided into three levels: data-level fusion, feature-level fusion, and decision-level fusion. Because the sample data in this study consist of MTF-encoded images of different fault types, a CNN is well suited to processing them. Therefore, this paper adopted IFCNN for feature-level fusion of the data.
IFCNN consists of three modules, namely, the feature extraction module, the feature fusion module and the image reconstruction module [27]. The structure of this framework is shown in Figure 2.
The feature extraction module consists of two convolutional layers. The first layer uses the first convolutional layer of the ResNet101 network model, pretrained on the ImageNet dataset. This layer includes 64 convolutional kernels with a size of 7 × 7 and retains the training parameters, enabling effective extraction of image features. The second convolutional layer includes 64 convolutional kernels with a size of 3 × 3, which are used to adjust the features extracted by the first layer in order to adapt to feature fusion. For this study, the feature fusion module adopts an element-wise maximum fusion strategy. The final module is the image reconstruction module, in which the third convolutional layer includes 64 convolutional kernels with a size of 3 × 3. This layer adjusts the fused convolutional features and plays an important role in reconstructing the image. The fourth convolutional layer reconstructs the feature map with three-channel output, and it includes 3 convolutional kernels with a size of 1 × 1.
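The element-wise maximum fusion strategy can be illustrated with a minimal Python sketch. It assumes that the two branches (vibration and acoustic emission) have already produced feature maps of identical shape; the function name `elementwise_max_fusion` and the small 2 × 3 example values are hypothetical and are not taken from the IFCNN implementation.

```python
def elementwise_max_fusion(feature_maps):
    """Fuse same-shaped 2D feature maps by keeping the maximum at each position."""
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    # Compare the row count and every row length against the first map.
    reference_shape = [len(row) for row in feature_maps[0]]
    for fmap in feature_maps[1:]:
        if [len(row) for row in fmap] != reference_shape:
            raise ValueError("all feature maps must share the same shape")
    # zip(*feature_maps) pairs up corresponding rows; zip(*rows) pairs up
    # corresponding values within those rows.
    return [[max(values) for values in zip(*rows)] for rows in zip(*feature_maps)]


# Hypothetical single-channel feature maps for the two signal branches.
vibration_features = [[0.2, 0.9, 0.1], [0.5, 0.3, 0.7]]
acoustic_features = [[0.6, 0.4, 0.1], [0.2, 0.8, 0.9]]

fused = elementwise_max_fusion([vibration_features, acoustic_features])
print(fused)
```

In a real network the same operation would be applied to every one of the 64 channels; taking the maximum keeps, at each position, the strongest response from either signal source.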
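This framework uses the mean squared error (MSE) as the basic loss function and adds a perceptual loss to optimize the model. The expression for the perceptual loss ($P_{loss}$) is as follows:

$$P_{loss} = \frac{1}{C_f H_f W_f} \sum_{i=1}^{C_f} \sum_{x=1}^{H_f} \sum_{y=1}^{W_f} \left[ f_p^i(x, y) - f_g^i(x, y) \right]^2$$

where $f_p$ and $f_g$ are the feature maps of the predicted fused image and the true fused image, respectively; $i$ is the feature map channel index; $C_f$, $H_f$ and $W_f$ are the number of channels, height and width of the feature map, respectively. The expression for the basic loss ($B_{loss}$) is as follows:

$$B_{loss} = \frac{1}{3 H_g W_g} \sum_{i=1}^{3} \sum_{x=1}^{H_g} \sum_{y=1}^{W_g} \left[ I_p^i(x, y) - I_g^i(x, y) \right]^2$$

where $I_p$ and $I_g$ are the predicted fused image and the true fused image, respectively; $i$ is the RGB image channel index; $H_g$ and $W_g$ are the height and width of the true fused image, respectively. The expression for the total loss ($T_{loss}$) is as follows:

$$T_{loss} = w_1 B_{loss} + w_2 P_{loss}$$

where $w_1$ and $w_2$ are the weighting coefficients. For the fusion of MTF-encoded images in this study, $w_1$ and $w_2$ are both set to 1.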
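The combination of the basic loss and the perceptual loss can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the training code of the paper: images and feature maps are assumed to be supplied as flattened lists of numbers, so that dividing by the list length reproduces the normalization over channels, height and width, and the names `mean_squared_error` and `total_loss` are hypothetical.

```python
def mean_squared_error(predicted, target):
    """Mean of the squared differences between two equally long flat sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(pred_image, true_image, pred_features, true_features,
               w_basic=1.0, w_perceptual=1.0):
    """Weighted sum of the basic (pixel) loss and the perceptual (feature) loss."""
    basic = mean_squared_error(pred_image, true_image)
    perceptual = mean_squared_error(pred_features, true_features)
    return w_basic * basic + w_perceptual * perceptual


# Hypothetical flattened pixel values and feature values.
pred_image = [0.0, 0.5, 1.0, 0.5]
true_image = [0.0, 0.5, 0.5, 0.5]
pred_features = [1.0, 2.0]
true_features = [1.0, 1.0]

# Both weights default to 1, as stated for the MTF-encoded images.
loss = total_loss(pred_image, true_image, pred_features, true_features)
print(loss)  # 0.0625 (basic) + 0.5 (perceptual) = 0.5625
```

The perceptual term compares feature maps rather than pixels, so it penalizes differences in learned structure that a pixel-wise MSE alone may miss.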
6. Experimental Results and Comparison of Methods
During metro operation, variations in bearing speed and load are inevitable. Because steady-state tests alone have certain limitations, analyzing the results of tests under variable working conditions is crucial for validating the effectiveness of the proposed method. For the compound working condition changes, an additional analysis comparing the fusion of acoustic emission and vibration signals with a single signal was included to highlight the advantages of the proposed method. In the generic working condition tests, the feature extraction capabilities of four models, namely the proposed model, RepVGG, CBAM-CNN and ResNet, were compared.
6.1. Single Working Condition Changes
Following the fault diagnosis method proposed in Section 5.3, with the speed or the load held constant, the training set was input into the model constructed in this paper, and fault diagnosis was performed on the test set. The diagnostic results are shown in Table 6.
Table 6 shows that when the speed is held constant and the load is varied, the fault diagnosis accuracy reaches nearly 100%. Conversely, when the load is held constant and the speed is varied, the accuracy decreases, indicating that rotational speed has a substantial influence on the diagnostic outcome. Further analysis reveals that the accuracy of items 12, 15 and 18 is markedly lower than that of the others, whereas items 3, 6 and 9 achieve accuracy close to 100%, albeit slightly lower than the other items among the first nine. This discrepancy can be attributed to the fact that fault characteristics extracted under medium- to high-speed and medium- to heavy-load conditions are more discernible than those extracted under low-speed and light-load conditions.
6.2. Compound Working Condition Changes
Data mixing different speeds and loads were included in the training set, which was used to train the proposed model before fault diagnosis was performed on the testing set. The fusion of acoustic emission and vibration signals was then compared with the use of a single signal. The diagnostic results are shown in Table 7.
Table 7 shows that the diagnostic results of items 4 to 6 surpass those of items 1 to 3. The training and testing sets for items 1 to 3 cover varying rotation speeds, whereas those for items 4 to 6 cover different loads. The diagnostic accuracy of items 4 to 6 remains relatively stable, whereas item 3 exhibits significantly lower accuracy than items 1 and 2. The reason for this is consistent with the findings presented in Section 6.1 of this paper.
From the standpoint of signal acquisition, the fusion of acoustic emission and vibration signals yields higher diagnostic accuracy in fault diagnosis compared to utilizing a single signal. This finding provides further substantiation that the application of multisignal fusion technology can effectively enhance system stability and diagnostic accuracy. Furthermore, it is evident that employing a single vibration signal for diagnostics yields superior results in comparison to employing a single acoustic emission signal. This can be attributed to the fact that the acoustic emission acquisition system exhibits heightened sensitivity to environmental noise, primarily stemming from the operational testing equipment, which poses challenges in noise elimination.
6.3. Generic Working Conditions
To evaluate the performance of the proposed fault diagnosis model, all fault samples involving three different speeds and three different loads were included in both the training and testing sets. The sample ratio between the two sets was set to 9:1 to ensure the training set was large enough to enable the model to effectively learn the fault data while still reserving an adequate number of samples for testing. Subsequently, the model was applied to diagnose faults on the testing set. To visualize the diagnostic results, a confusion matrix was employed, providing an intuitive and reliable representation of classifications made by the model. The confusion matrix is presented in Figure 9.
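How a confusion matrix is built from true and predicted labels can be shown with a short Python sketch. The labels below are hypothetical three-class toy data, not the experimental results, and the function names are illustrative assumptions.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count samples per (true class, predicted class) pair; rows are true classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels: one sample of class 2 is mistaken for class 1.
true_labels = [0, 0, 1, 1, 2, 2, 2]
predicted_labels = [0, 0, 1, 1, 2, 1, 2]

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
print(matrix)
print(per_class_accuracy(matrix))
```

Off-diagonal entries show which fault types are confused with each other, which is the information used to identify a weakly separated class.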
The confusion matrix provides a clear and intuitive visualization of the model’s misclassifications and the types of errors. It can be seen that the overall diagnostic performance is good, and the accuracy rate for the fusion of acoustic emission and vibration signals is almost 100%. However, the diagnosis accuracy rate for label 6, which corresponds to the “outer ring + rolling element pitting” fault type, is relatively low. The model misclassified three test samples as “rolling element pitting”. Further analysis revealed that the two types of faults have similar features, making it difficult to extract differences between them. A comparison of panels (a–c) in Figure 9 further confirms that multisignal fusion technology is more reliable and accurate than a single signal, especially under changing working conditions.
To compare the feature extraction capabilities of different models, the training and testing samples of the above-mentioned generic working conditions were input into the RepVGG, CBAM-CNN and ResNet models for diagnosis. Two fault types were selected as examples: label 1 (corresponding to “inner ring pitting”), with better diagnostic results, and label 6 (corresponding to “outer ring + rolling element pitting”), with poorer results. The precision–recall (PR) curves and receiver operating characteristic (ROC) curves were generated for the optimized ResNet, RepVGG, CBAM-CNN and ResNet models, and evaluation indicators such as average precision (AP) and area under the curve (AUC) were introduced.
The precision–recall (PR) curve is a graphical representation of the performance of a binary classification model, with recall on the x-axis and precision on the y-axis. It illustrates the trade-off between precision and recall at various classification thresholds. The relevant theoretical formulas for the PR curve are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP represents the number of true positive instances; FP represents the number of false positive instances; and FN represents the number of false negative instances.
The principle of average precision (AP) is to summarize the Precision-Recall (PR) curve by calculating the average precision value. It can be obtained by computing the area under the PR curve. It provides a comprehensive assessment of how well the model balances precision and recall across different recall levels.
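As a hedged illustration, the following Python sketch computes precision and recall from the counts defined above and approximates AP as the step-wise area under the PR curve, that is, the sum of each recall increment multiplied by the precision at that point. The scores and labels are hypothetical, the function names are assumptions, and the exact integration scheme used in the paper is not stated; tied scores are not given special treatment here.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(labels, scores):
    """Step-wise area under the PR curve for binary labels (1 = positive)."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive sample is required")
    # Lower the threshold one sample at a time, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    area = 0.0
    previous_recall = 0.0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        precision, recall = precision_recall(tp, fp, total_positives - tp)
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Hypothetical one-vs-rest labels and classifier scores for one fault type.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

For a multi-class problem such as the eight fault types considered here, this calculation would be repeated per class in a one-vs-rest manner.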
The receiver operating characteristic (ROC) curve is a tool used to evaluate the performance of binary classification models. It plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The principle of the ROC curve can be described using the following formulas:

$$FPR = \frac{FP}{FP + TN}$$

$$TPR = \frac{TP}{TP + FN}$$

where FP represents the number of negative instances incorrectly classified as positive; TN represents the number of negative instances correctly classified as negative; TP represents the number of positive instances correctly classified as positive; and FN represents the number of positive instances incorrectly classified as negative.
Area under the curve (AUC) is obtained by calculating the area under the ROC curve. The resulting AUC value ranges from 0 to 1, where 0.5 represents a random classifier and 1 represents a perfect classifier. A higher AUC value indicates better classifier performance.
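The AUC calculation can be sketched in Python as follows. The sketch traces the ROC curve by lowering the decision threshold one sample at a time and integrates it with the trapezoidal rule; the data are hypothetical, the function names are assumptions, and tied scores are not handled specially.

```python
def roc_curve_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold downwards."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal integration over consecutive (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical one-vs-rest labels and classifier scores for one fault type.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = area_under_curve(roc_curve_points(labels, scores))
print(round(auc, 4))
```

The result equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which is why 0.5 corresponds to random ranking and 1 to perfect ranking.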
The diagnostic results are presented as PR and ROC curves in Figure 10 and Figure 11. The overall accuracy rate, AP and AUC for all fault types were calculated for the four models, and the weighted average values are recorded in Table 8.
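A weighted average over fault types can be computed as in the Python sketch below. The weighting scheme is not specified in the text; this sketch assumes that each fault type is weighted by its number of test samples, and the metric values and sample counts shown are hypothetical.

```python
def weighted_average(values, weights):
    """Average of per-class metric values, weighted by the given class weights."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical per-class AP values and test sample counts (assumed weights).
per_class_ap = [1.0, 0.5]
samples_per_class = [3, 1]
print(weighted_average(per_class_ap, samples_per_class))  # (3 * 1.0 + 1 * 0.5) / 4
```

With equal sample counts per fault type, this reduces to the plain mean of the per-class values.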
Generally, the closer the PR curve in Figure 10 is to the upper right corner, the larger the AP value and the better the model performance. The closer the ROC curve in Figure 11 is to the upper left corner, the larger the AUC value and the better the model performance. For the two selected fault types with different diagnostic results, the PR and ROC curves of the proposed model are both closer to these corners than those of RepVGG, CBAM-CNN and ResNet, indicating better performance. Combined with the data in Table 8, the three evaluation indicators of the proposed model are all higher than those of the compared models, validating the good feature extraction ability of the proposed model.
7. Conclusions
This paper focused on the study of the feature extraction ability of the model for complex working conditions, using the metro traction motor bearings as the research object. On the basis of ResNet, CBAM was introduced to optimize the ResNet model. Nine different working conditions and eight compound fault types were designed for experimentation. In addition, a dataset was constructed using MTF image encoding and IFCNN image fusion technology. During the model training process, UMAP was used for visualization to intuitively demonstrate the feature extraction effect of the proposed model. After the experiment, three evaluation indicators were used for objective evaluation of the feature extraction ability of the optimized ResNet, RepVGG, CBAM-CNN and ResNet models.
The experimental results show that the MTF-ResNet model with multisignal fusion performs well under complex working conditions, with a diagnostic accuracy of up to 99.25%. Several conclusions can be drawn from the results. In terms of sensors, using only vibration signals produces better diagnostic results than using only acoustic emission signals. In addition, compared with a single signal, the fusion of acoustic emission and vibration signals provides more comprehensive information and reduces the misclassifications caused by the limitations of a single signal, thereby improving fault diagnosis accuracy and making the diagnosis more reliable. In terms of data processing, MTF image encoding is a simple method that retains the time correlation of the data, making it easier for the model to extract more comprehensive fault features. In terms of the feature extraction model, introducing CBAM after the batch normalization layers of ResNet makes the model focus on important features, distinguish different types of fault features quickly, and diagnose more efficiently. Furthermore, the ResNet structure effectively alleviates the vanishing gradient problem that arises as the network deepens, thereby preventing model degradation.
Undoubtedly, this study presents several avenues for future research in the proposed methodologies. Firstly, the inclusion of additional sensors or exploration of different sensor types holds promise. For instance, incorporating multidirectional vibration sensors or temperature sensors could offer a more comprehensive spectrum of fault information, thereby enhancing diagnostic fault tolerance. Secondly, exploring more advanced data processing techniques warrants investigation to enhance the quality of input signals. The acoustic emission signals acquired in this study exhibited significant levels of environmental noise that proved challenging to eliminate. Therefore, employing sophisticated techniques may substantially improve the value derived from these acoustic emission signals. Moreover, conducting model testing on larger datasets utilizing more complex compound faults can effectively confirm the feature extraction capabilities and generalization of the model. This approach will serve as a more robust means of validation. Furthermore, future research focusing on feature extraction models should prioritize the development of lightweight and efficient models to facilitate practical implementation.
Despite the inherent limitations of the methods proposed in this paper, they exhibit commendable feature extraction capabilities within intricate operational scenarios. Consequently, these methods hold potential for application in fault diagnosis tasks related to metro traction motor bearings, thereby possessing appreciable value in engineering applications.