The electric traction system is the core power control system of a high-speed train and consists mainly of a pantograph, traction transformer, traction inverter, traction motor, and transmission gear. It has the highest failure rate among all vehicle control systems, accounting for more than half of all fault types, which poses a significant risk to the safe operation of high-speed trains. Real-time state monitoring and fault diagnosis of the electric traction system are therefore of great theoretical significance and engineering value for enhancing the safety and reliability of high-speed trains [1].
Over the past twenty years, researchers and engineers in the safety field worldwide have conducted extensive research on fault diagnosis of high-speed train electric traction systems. The mainstream methods fall into two categories: methods based on analytical models [2] and data-driven methods [3]. The former rely on the design of observers and filters, which requires an accurate system model, whereas the latter do not require precise modeling and can automatically learn and adapt to different fault modes. Data-driven fault diagnosis methods therefore have broader applications. Common data-driven fault diagnosis methods for high-speed train electric traction systems include signal processing [4], statistical analysis [5], and artificial intelligence [6]. Quantitative diagnosis based on artificial intelligence can be further subdivided into methods such as support vector machines (SVMs), neural networks, and fuzzy logic. Among machine learning-based fault diagnosis methods, SVM offers strong classification ability, robustness, and interpretability, and can identify key fault features from a relatively small volume of data [7]. To compensate for the scarcity of fault samples when classifying high-voltage circuit breaker features, a multi-class relevance vector machine algorithm was designed based on the “one-to-one” multi-class model [8]; a large number of binary relevance vector machine models were trained on the extracted data features to diagnose multiple faults. Ref. [9] applied wavelet packet decomposition and reconstruction to the data, calculated the distribution of wavelet packet energy, and extracted the energy features of faults. Combined with a genetic algorithm-support vector machine (GA-SVM), this approach improved both classification speed and classification accuracy. Ref. [10] decomposed the original bearing vibration signal into multiple intrinsic mode functions using variational mode decomposition (VMD) and reconstructed the signal by introducing a feature energy ratio (FER) criterion. The multiscale entropy of the reconstructed signal was then calculated to construct multidimensional feature vectors, which were fed into a particle swarm optimization-support vector machine classification model to identify different fault modes in rolling bearings. However, the above methods all target balanced fault datasets and do not consider imbalanced data conditions. In the actual operation of high-speed trains, the proportion of fault data in the overall data is imbalanced, so the imbalance issue is difficult to avoid when such data are used, which objectively increases the difficulty of fault diagnosis in the electric traction system.
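To make the notion of a wavelet packet energy feature concrete, the following Python sketch computes the relative energy of each terminal node of a full wavelet packet tree. It is only an illustration under stated assumptions: the Haar wavelet, the decomposition depth, and the function names `haar_step` and `wavelet_packet_energy` are chosen here for simplicity and are not taken from ref. [9], which may use a different wavelet basis and preprocessing.

```python
import math
from typing import List, Tuple


def haar_step(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet_energy(signal: List[float], levels: int) -> List[float]:
    """Relative energy of each terminal node of a full Haar packet tree.

    Unlike the plain wavelet transform, both the approximation and the
    detail branch are split again at every level, giving 2**levels nodes.
    Nodes are returned in natural (filter-bank) order, not frequency order.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_step(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    energies = [sum(v * v for v in node) for node in nodes]
    total = sum(energies)
    if total == 0.0:
        return [0.0] * len(energies)
    # Normalising by the total energy yields a scale-free feature vector.
    return [e / total for e in energies]
```

Because the Haar step is orthonormal, the node energies sum to the energy of the original signal, so the returned ratios sum to one. A vector of this kind is what would be passed to a classifier such as an SVM as the fault feature.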
Data imbalance typically takes two forms: class imbalance and misclassification cost imbalance [11]. The difficulty of classifying imbalanced data stems mainly from the imbalance of the data itself and from the limitations of traditional classification algorithms. On the data side, the numbers of samples in the two classes are unequal, which makes it difficult for a classifier to learn the features of minority-class samples during training. Meanwhile, majority-class samples can blur the boundaries of the minority class, making the two classes hard to distinguish in the regions where they overlap. On the classifier side, SVM aims to minimize the structural risk and maximize the margin. When the dataset is imbalanced, the classification hyperplane inevitably shifts towards the minority-class samples in order to reduce the loss, so minority-class samples in the overlapping regions are misclassified as majority-class samples. Current fault diagnosis research on imbalanced data focuses mainly on three aspects: data, fault features, and classification algorithms [12]. For example, to eliminate the impact of data imbalance on fault diagnosis accuracy, ref. [13] adopted under-sampling and over-sampling methods to balance the classes in terms of quantity. However, under-sampling may lead to information loss, which reduces the ability of classifiers to learn from majority-class samples. Ref. [14] enhanced diagnostic performance by retaining key features of the imbalanced dataset to improve the discrimination between the majority and minority classes, further mitigating the impact of data imbalance from the classifier perspective. However, the diagnostic performance of such a model relies heavily on the rationality of feature selection. By contrast, improving the classification algorithm retains all sample information when addressing data imbalance, and the trained model can adapt to various imbalanced datasets while avoiding complex data preprocessing. Ref. [15] proposed a scheme based on the biased support vector machine (Biased-SVM) model. By assigning different penalty factors to the two classes of samples, increasing the penalty factor for the minority class and reducing it for the majority class, the scheme resolves to some extent the low classification accuracy caused by data imbalance. Inspired by the above schemes, this paper proposes an approach based on a self-adjusting support vector machine (St-SVM). The specific design steps, the relevant evaluation metrics, and the analysis of classification performance are also presented. To examine the data imbalance issue, simulation experiments for CRH2 high-speed trains are conducted on the Traction Drive Control System-Fault Injection Benchmark (TDCS-FIB) platform under three different imbalance ratios.
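The class-dependent penalty idea behind Biased-SVM can be sketched in a few lines of Python. The code below is a minimal illustration, not the St-SVM proposed in this paper and not the exact scheme of ref. [15]: it trains a linear soft-margin SVM by full-batch subgradient descent on the objective 0.5·||w||² + Σ C(y_i)·max(0, 1 − y_i(w·x_i + b)), where C(y_i) depends on the class of sample i. The inverse-frequency rule in `class_penalties`, the linear kernel, the learning rate, and all function names are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def class_penalties(labels: Sequence[int], base_c: float = 1.0) -> Dict[int, float]:
    """Assumed rule: penalty inversely proportional to class size.

    Labels must be +1 / -1 and both classes must be present.
    """
    n = len(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    return {1: base_c * n / (2.0 * n_pos), -1: base_c * n / (2.0 * n_neg)}


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def objective(w, b, xs, ys, c: Dict[int, float]) -> float:
    """Regulariser plus class-weighted hinge loss."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = sum(c[y] * max(0.0, 1.0 - y * score(w, b, x)) for x, y in zip(xs, ys))
    return reg + loss


def train_biased_svm(xs, ys, c, lr: float = 0.01, epochs: int = 500) -> Tuple[List[float], float]:
    """Full-batch subgradient descent; returns the best iterate seen."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    best = (objective(w, b, xs, ys, c), list(w), b)
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of 0.5*||w||^2 is w
        for x, y in zip(xs, ys):
            if y * score(w, b, x) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= c[y] * y * x[j]
                grad_b -= c[y] * y
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
        value = objective(w, b, xs, ys, c)
        if value < best[0]:  # subgradient steps are not monotone
            best = (value, list(w), b)
    return best[1], best[2]


def recall(w, b, xs, ys, label: int) -> float:
    """Fraction of samples of class `label` that are predicted correctly."""
    hits = total = 0
    for x, y in zip(xs, ys):
        if y != label:
            continue
        total += 1
        pred = 1 if score(w, b, x) >= 0.0 else -1
        hits += pred == label
    return hits / total
```

Training once with `class_penalties(ys)` and once with equal penalties `{1: 1.0, -1: 1.0}`, then comparing `recall(..., label=1)` for the minority class, is one simple way to observe how a larger minority penalty counteracts the hyperplane shift described above.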