3.1. Wavelet Packet Theory
The collected voltage signals are non-stationary and contain both the normal transients of the motor's eccentric rotation and the fault transients. It is difficult to identify the fault transients directly in the measured raw signals, and the identification results are mediocre. Wavelet packet decomposition is therefore an effective way to extract the fault transients. The wavelet packet transform decomposes the signal at multiple scales and can adaptively change the structure of the time-frequency window, so that the relevant high-frequency and low-frequency parts of the signal can be selected for analysis. The variable time window provides fine time resolution in the high-frequency part and fine frequency resolution in the low-frequency part. The wavelet transform thus compensates for the Fourier transform's lack of resolution in the time dimension [26].
Wavelet packet multi-scale analysis decomposes the whole space into an orthogonal sum of subspaces according to different scale factors, as shown in Figure 11. The scale subspace $V_j$ and the wavelet subspace $W_j$ are unified into a single subspace $U_j^n$ by setting $U_j^0 = V_j$ and $U_j^1 = W_j$, so the orthogonal decomposition $V_{j+1} = V_j \oplus W_j$ becomes [27]:

$$U_{j+1}^n = U_j^{2n} \oplus U_j^{2n+1}$$

Let the subspace $U_j^n$ be the closed span of the function $u_n(t)$; then $u_n(t)$ satisfies the double-scale equations

$$u_{2n}(t) = \sqrt{2}\sum_{k} h(k)\, u_n(2t-k), \qquad u_{2n+1}(t) = \sqrt{2}\sum_{k} g(k)\, u_n(2t-k)$$

In the formula, $g(k)$ is the high-pass filter bank of the wavelet packet and $h(k)$ is the low-pass filter bank of the wavelet packet.
During fault feature selection and extraction, the intra-class dispersion should be as small as possible and the inter-class dispersion as large as possible. The current signal generated by the motor's rotation is close to periodic. The original signal is subtracted from the envelope to obtain the reconstruction. According to the principle of permutation entropy, the smaller the entropy value, the more ordered the signal; the larger the entropy value, the more disordered the signal. Since the fault current signal arises from the periodic friction between the brush and the pole piece, a smaller entropy value better reflects the fault information of the motor [28]. Several common wavelet bases are compared by their entropy values.
According to the comparison in Table 1, the Shannon entropy after bior2.2 wavelet decomposition and reconstruction is the smallest: the reconstructed signal is the most ordered, and its decomposition details contain the most feature information.
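The decomposition and entropy comparison above can be sketched in code. The snippet below implements the double-scale filtering of the wavelet packet tree and the Shannon entropy of the resulting coefficients; for self-containment it uses the Haar filter pair rather than the paper's bior2.2 wavelet (which would normally come from a library such as PyWavelets), so the numbers are illustrative only.

```python
import numpy as np

# Haar filter banks stand in for bior2.2 to keep the sketch dependency-free.
H = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass filter bank h(k)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass filter bank g(k)

def wp_step(x):
    """One wavelet packet step: filter with h(k) and g(k), downsample by 2."""
    low = np.convolve(x, H)[1::2]
    high = np.convolve(x, G)[1::2]
    return low, high

def wp_decompose(x, level):
    """Full wavelet packet tree: 2**level coefficient nodes at the bottom."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nodes = [child for node in nodes for child in wp_step(node)]
    return nodes

def shannon_entropy(coeffs):
    """Shannon entropy of the normalised squared coefficients.
    Smaller entropy = more ordered (more nearly periodic) signal."""
    energy = coeffs ** 2
    p = energy / energy.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

A constant (perfectly ordered) node such as `[2, 0, 0, 0]` concentrates all energy in one coefficient and yields entropy 0, while spread-out coefficients yield larger entropy, matching the ordering criterion used to select bior2.2.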
3.2. Improved LSTM Network
The recurrent neural network (RNN) is mainly used in time-series prediction and natural language processing; it is a neural network that models sequence data automatically. A key property of the RNN is that it connects previous information to the current task, so the output of the current step depends on the inputs of previous steps. As shown in Figure 12, the network remembers previous information and uses it to influence the outputs of later nodes. However, for long sequences, the RNN loses the ability to learn connections to distant information as the spacing increases [29,30].
The LSTM neural network is a variant of the RNN, first proposed by Hochreiter [31]. Building on the RNN, the improved LSTM network largely solves the problems of gradient explosion and gradient disappearance. Each LSTM neuron contains one or more cells that record the neuron's current state. In addition, the LSTM network has three control gates: the forget gate, the input gate, and the output gate, as shown in Figure 13.
The forget gate in the LSTM selectively computes the information that needs to be forgotten. The Sigmoid function yields a probability between 0 and 1, where 1 represents "keep everything" and 0 represents "forget everything":

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

In the formula, $\sigma$ is the Sigmoid activation function, $W_f$ is the weight matrix of the forget gate, $h_{t-1}$ is the output of the previous hidden-layer unit, $x_t$ is the input at the current moment, and $b_f$ is the bias term of the forget gate.
After deciding which information to discard, the next step is to determine what new information will be stored in the cell state. This has two parts. First, a Sigmoid layer called the "input gate layer" decides which values will be updated. Second, a tanh layer creates a vector of new candidate values that may be added to the state:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Here, $i_t$ represents how much of the current input should be stored in the current cell state, $b_i$ and $b_C$ are the corresponding bias terms, and $\tilde{C}_t$ is the new information generated from the current input to be added to the cell state.
The forget gate and the input gate together determine the updated information, and the old cell state is then updated accordingly. A Sigmoid layer decides which part of the cell state will be output; finally, the cell state is passed through tanh to obtain a value between −1 and 1 that determines the final output:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t, \qquad o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t * \tanh(C_t)$$

In the formula, $h_t$ is the hidden-layer state, $W_o$ is the weight matrix of the output gate unit corresponding to the input $[h_{t-1}, x_t]$, $b_o$ is the bias term, and $o_t$ is the output value of the output gate unit.
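The three gate computations described above can be collected into a single step function. This is a minimal numpy sketch of a standard LSTM cell (not the paper's implementation); the weight and bias names mirror the gate subscripts used in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One step of a standard LSTM cell.
    Each weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to keep of C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much new info to write
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # cell state update
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # hidden state, in (-1, 1)
    return h_t, c_t
```

With all weights and biases at zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state simply halves at each step; this is a convenient sanity check on the gate wiring.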
In recent years, the LSTM has achieved good results in signal fault diagnosis, but it still has notable shortcomings: it has many tunable parameters, and LSTM networks can be difficult to converge. Many researchers have improved the LSTM for their own data models [32]. This article proposes an improved structure based on the LSTM, shown in Figure 14.
The input gate was removed, and the amount of new information written and the amount of old state retained were constrained to be complementary, summing to 1. Thus, the network forgets only when it needs to add new information, and adds new information only when it needs to forget.
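The paper describes this modification only in prose; one plausible reading is the coupled-gate form below, where the input gate activation is replaced by $1 - f_t$. This sketch shows that single change relative to the standard cell and is an interpretation, not the authors' exact network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coupled_lstm_cell(x_t, h_prev, c_prev, W_f, W_c, W_o, b_f, b_c, b_o):
    """LSTM step with the input gate removed: the write weight is tied to
    1 - f_t, so old state is forgotten exactly when new info is written."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)                   # forget gate
    c_tilde = np.tanh(W_c @ z + b_c)               # candidate cell state
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde     # retain + write sum to 1
    o_t = sigmoid(W_o @ z + b_o)                   # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Compared with the standard cell, one gate (and its weight matrix and bias) disappears, which reduces the parameter count and couples forgetting to writing as the text describes.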
In a classification task, when one class has far fewer samples than the others, the classifier usually performs poorly on it. In actual factory production, defective products account for less than 2% of total output, which leads to an extremely uneven distribution of good and defective samples. The traditional binary cross-entropy loss is shown in Equation (11):

$$L = -[\,y \ln \hat{y} + (1-y) \ln(1-\hat{y})\,] \qquad (11)$$

Here $\hat{y}$ is the output of the activation function, with a value between 0 and 1, and $y$ is the label. For positive samples, the larger the predicted probability, the smaller the loss; for negative samples, the smaller the predicted probability, the smaller the loss. As a result, the loss decreases slowly during iteration over a large number of easy samples, and the optimization may not reach the best solution.
This paper proposes a loss function that can automatically adjust the risk penalty factor. This method increases the mining of difficult-to-classify samples and adjusts a weighting factor to mitigate the poor classification caused by sample imbalance, improving the accuracy of fault diagnosis. In the formula, $\alpha$ is a balance factor used to balance the uneven proportion of positive and negative samples, and $\gamma$ adjusts the loss of easy-to-classify samples so that more attention is put on difficult and misclassified samples during training. The larger the weight given to the minority class, the better its accuracy; otherwise, samples with a small probability of occurrence are more likely to be misjudged.
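The paper does not print the exact formula, but the described behaviour (a balance factor plus a modulating term that down-weights easy samples) matches the standard focal-loss construction. The sketch below follows that construction with assumed parameter names `alpha` and `gamma`; it is an illustration of the idea, not the authors' exact loss.

```python
import numpy as np

def balanced_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary cross-entropy with a balance factor alpha for class imbalance
    and an exponent gamma that shrinks the loss of easy samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)          # avoid log(0)
    pos = -alpha * (1.0 - y_pred) ** gamma * y_true * np.log(y_pred)
    neg = -(1.0 - alpha) * y_pred ** gamma * (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(np.mean(pos + neg))
```

With `gamma = 0` and `alpha = 0.5` this reduces to half the ordinary cross-entropy of Equation (11); with `gamma > 0`, a confidently classified positive (e.g., $\hat{y}=0.9$) contributes far less loss than a hard one ($\hat{y}=0.5$), which is the desired mining of difficult samples.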
3.3. Data Collection and Processing
It was more convenient for the acquisition card to collect voltage signals, so a sampling resistor was added to the acquisition circuit to convert the current signal into a voltage signal. The resistance can be neither too large nor too small. Too large a resistance affects the motor power, and the resistor heats up more during operation. Too small a resistance reduces the output voltage across the resistor, so the proportion of error offset and interference noise in the collected signal increases, reducing the sampling accuracy. After many tests and comparisons, a 30-ohm resistor gave the best results. The specific collection device is shown in Figure 15.
Data acquisition used LabVIEW 2018 software with an NI USB-6211 acquisition card at a sampling rate of 50 kS/s. We collected data for nine motor conditions: armature sticking, phase disconnection, brush fault, wave fall, wave height, wave length, magnetic field fault, armature confusion, and good-quality motors. We sampled 500 good motors and 100 motors for each defective class; the acquisition time was 0.48 s, giving 24,000 sampling points per motor. The rotation frequency of each motor was around 220 Hz. The number of sampling points per rotation cycle is then

$$N = \frac{f_s}{f_r}$$

In the above formula, $N$ is the number of points collected in one revolution of the motor, $f_r$ is the rotation frequency of the motor, and $f_s$ is the sampling rate set by the acquisition card.
A segment of 240 points generally includes one complete waveform cycle. Because the motor's rotation is non-stationary, even the same motor produces slightly different waveforms at different moments. Therefore, the same motor can be sampled continuously and the resulting data segmented: every 240 points were entered as one sequence, in the data format (6, 40). After dividing the collected data into 240-point segments, the dataset contained 50,000 good-product samples in total and 10,000 samples for each defective category. The training and test sets were split in a 4:1 ratio, as shown in
Table 2.