Power System Fault Classification and Prediction Based On A Three-Layer Data Mining Structure
ABSTRACT In traditional fault diagnosis methods in power systems, it is difficult to accurately classify
and predict the types of faults. With the emergence of big data technology, the fault classification and
prediction methods based on big data analysis and processing have been applied in power systems.
To make the classification and prediction of the fault types more accurate, this paper proposes a hybrid
data mining method for power system fault classification and prediction based on clustering, association
rules and stochastic gradient descent. This method uses a three-layer data mining model: The first layer
uses the K-means clustering algorithm to preprocess the original fault data source, and it proposes to use
self-encoding to simplify the data form. The second layer effectively eliminates the data that have little
impact on the prediction results by using association rules, and the highly correlated data are mined to
become the regression training data. The third layer first uses the cross-validation method to obtain the
optimal parameters of each fault model, and then, it uses stochastic gradient descent for data regression
training to obtain a classification and prediction model for each fault type. Finally, a verification example
shows that compared with a single data mining algorithm model, the proposed method is more comprehensive
in terms of the data mining, and the established power system fault classification and prediction model
has global optimality and higher prediction accuracy, which has a certain feasibility for real-time online
power system fault classification and prediction. This method reduces the disturbances from low-impact or
irrelevant data by mining the fault data three times, and it uses cross-validation to optimize the multiple
regression parameters of the regression model, which addresses the low accuracy, large errors and tendency
to fall into a local optimum that affect fault classification and prediction.
INDEX TERMS Association rules, data mining, K-means, machine learning, power system fault, stochastic
gradient descent algorithm.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 200897
Y. Wang et al.: Power System Fault Classification and Prediction Based on a Three-Layer Data Mining Structure
functions, the control and protection ability of the power system was improved [1], [2]. Although the performance of these new mathematical analysis models and equipment is enhanced, the intelligence, interaction and automation of the equipment are not sufficient. The equipment can judge the occurrence of a fault and take protective actions for the power system in time, but it cannot predict the type of the fault, and thus, the adopted protective measures could cause protection failure due to inappropriate choices, and even enlarge the fault loss. Therefore, it is necessary to further study the prediction of the power system fault types, which will help the operators to take correct protection and remedial measures in time to minimize the fault loss.

Compared with early fault diagnosis methods, artificial intelligence diagnosis methods have been applied in the fields of fault diagnosis and prediction, such as fuzzy diagnosis methods [3], diagnosis methods based on genetic algorithms [4], [5], fault diagnosis methods using expert systems [6], [7], methods based on neural networks [8]–[10], and diagnosis methods using the support vector machine (SVM) [11], [12]. These artificial intelligence methods have been superior to early diagnosis methods to a certain extent.

However, with a power system that generates massive amounts of data every moment, the traditional artificial intelligence diagnosis method cannot process the big data systematically, and the accuracy of the system fault diagnosis results cannot be further improved, which affects the efficiency of the diagnosis. The emergence of the data mining methods [13], [14] improved the performance of the fault diagnosis to a large extent. Data mining is a cutting-edge technology of data analysis, which can quickly obtain valuable information from various types of data. Its functions are mainly the following: 1) Automatic prediction of trends and behaviors. 2) Association analysis, which can find hidden associations in the data. 3) Clustering, which can enhance people’s understanding of the similarities among things. 4) Deviation detection, which can look for meaningful differences between the observations and the reference values. However, most of the data mining diagnosis methods are implemented using a single algorithm model. For example, one study [15] proposed a fault diagnosis method based on decision trees for vehicle test data mining. Since the decision tree ignores the correlations of the attributes in the vehicle test data set, overfitting is prone to occur. Another study [16] developed a social network analysis management framework for the industry environmental risks using association rules based on frequent patterns, which is suitable for discrete data, but it is more difficult to implement, and its performance will decrease on some data sets. Therefore, the models achieved by a single algorithm are not ideal.

Some researchers began to pay attention to the improvement and optimization of the selected algorithms [17]. Some optimization algorithms used to solve the optimal solution problem of the algorithm model have been applied, which mainly include the gradient descent method [18], Newton method [19], and the meta-heuristic algorithm [20]–[25]. Gradient descent is one of the most commonly used methods when solving for the model parameters of machine learning algorithms, especially for unconstrained optimization problems. Newton’s method solves nonlinear optimization problems with a fast convergence rate, but each iteration requires solving a complex Hessian matrix. Meta-heuristic algorithms are based on an intuitive or empirical construction, which can give a feasible solution to the problem within an acceptable calculation time and space, although the degree of deviation of the feasible solution from the optimal solution cannot necessarily be predicted in advance. However, a meta-heuristic algorithm cannot guarantee that the global optimal solution will be obtained, and it often falls into a local optimum on some problems. As a result, the hybrid data mining method, which combines multiple algorithms, has emerged. One study [26] proposed a power system line trip fault prediction method based on a long short-term memory (LSTM) network and SVM. Another study [27] proposed an optimized neural network fault diagnosis strategy for heating systems based on data mining, which used an association rule mining method to optimize the selection of the feature sets. A data driven modeling method for an aeroengine aerodynamic model that combined stochastic gradient descent (SGD) and support vector regression was proposed [28]. In addition, one study [29] proposed a port cargo throughput prediction method based on empirical mode decomposition (EMD), a recurrent neural network and an adaptive grouping algorithm. Another study [30] proposed a similarity grouping-guided neural network modeling method for maritime time series prediction. The experiments on both port cargo throughput and vessel traffic flow have illustrated its superior performance in terms of prediction accuracy and robustness. These studies indicate that fault diagnosis and prediction models built with hybrid data mining methods can outperform models built with a single algorithm.

Cluster analysis is one of the most important research branches in the field of data mining, and it classifies clustered objects according to their own characteristics. Cluster analysis has been widely used in software engineering, machine learning, statistics, image analysis, web clustering engines and text mining. Association rules, as an inductive learning algorithm, have a strong ability to discover certain rules and associations in the data. As the representative algorithm of association rules, Apriori [31] uses a layer-by-layer search strategy to traverse the solution space. SGD is often used to train various machine learning models due to its fast learning speed and its support for online updates [32]. When addressing big data, SGD has a small number of calculations in a single iteration, and thus, the convergence speed is significantly higher than that of other algorithms. The optimization efficiency is better than that of the classic algorithm, and therefore, the application of SGD in data regression training is extended to many different fields.

Based on the above-mentioned considerations, this paper proposes a hybrid data mining algorithm based on K-means clustering, Apriori association rules and SGD to classify
and predict power system faults. The hybrid algorithm performs three-layer mining on the fault data to establish different fault prediction models: Firstly, K-means clustering and self-coding are used to preprocess the raw data. Then, the association rules filter the samples for the second layer of the data mining. Finally, SGD is used for the data regression training and completes the third layer of the data mining. This mining mode solves some of the current problems faced by data mining. Firstly, it reduces the interference between the complex data and avoids locally optimal results. Secondly, the complementary functions between the algorithms ensure the integrity of the data mining. Thirdly, the method adjusts parameters according to the different fault prediction models, in such a way that the fault prediction model has good robustness and fault tolerance, which can be applied to various actual fault prediction scenarios. Compared with the single algorithm model, the proposed method has greatly improved the accuracy and reliability of power system fault classification and prediction, which can be used to optimize parameters online and can be applied to different operating states.

This paper proposes an original three-layer data mining structure in which each layer has a specific data mining function, and the layers cooperate with each other to complete the classification and prediction of the power system fault types. The main contributions are outlined as follows:

1) The clustering algorithm and self-encoding are used to preprocess complex source data, which classifies the source data and simplifies the form of the classified data.
2) The method uses association rules to filter the samples in advance and classifies them according to the type of fault, which increases the correlations in the data.
3) The cross-validation method finds the optimal parameters that correspond to different fault models, and then, stochastic gradient descent is used to train the fault models, which improves the accuracy of the power system’s fault prediction.
4) A multi-layer data mining model based on K-means, association rules and stochastic gradient descent is built, which improves the completeness of the data mining.

The remainder of this paper is organized as follows: the description of the problem is presented in Section II. The proposed algorithm model framework and the theory of each part are explained in Section III. Then, in the fourth section, the whole test example is introduced, and the results are verified. Finally, the fifth section concludes the study.

II. PROBLEM STATEMENT
Short-circuit faults are very common faults in power systems, which can cause large-scale power outages. When faults occur, the power protection components can decide only whether to act according to the current operating conditions, but they can fail to determine what type of fault has occurred, which affects the timely handling of the fault. Therefore, this paper uses three layers of data mining on the original data of the power system short-circuit faults, and it establishes a fault classification and prediction model (FCPM) to predict whether a fault is about to occur and to predict the type of fault that will occur.

III. FAULT CLASSIFICATION AND PREDICTION METHOD
This section introduces the structure and implementation process of the proposed method, and the mathematical model of each algorithm is introduced in detail.

A. OVERALL METHOD ARCHITECTURE
This paper proposes a fault classification and prediction method based on K-means clustering, association rules and SGD. The source samples are the node voltage data after a certain fault occurs in the power system. The fault types are mainly single-phase ground fault (SPGF), two-phase phase-phase fault (TPPF), two-phase ground fault (TPGF) and three-phase fault (TPF). After the data collection is completed, the source sample library is written as $\{AQ, G_i\}$, where $AQ = \{X_1, X_2, \dots, X_i\}$ is the voltage data set in the source sample library, and $\{G_i\}$ is the fault type of the fault node, $G_i \in \{1, 2, 3, 4\}$, where $G_i = 1$ is SPGF, $G_i = 2$ is TPPF, $G_i = 3$ is TPGF, and $G_i = 4$ is TPF.

The overall architecture of the three-layer data mining method is shown in Fig. 1. The proposed method integrates three data mining algorithms: K-means clustering, Apriori association rules and SGD. In the process of three-layer data mining, the K-means method and self-coding method are used to preprocess the raw data, simplify the data form, reduce the complexity of the data set, and accelerate the data processing speed. After using the Apriori algorithm to mine the data for the second time, the relevant samples are sorted out according to the fault type for regression training, which can prevent the SGD from falling into a local optimum due to using random data samples and improves the accuracy of the regression training.

B. THE FIRST LAYER OF THE DATA MINING PROCESSING METHODS AND RULES
After obtaining the source samples, the K-means clustering algorithm clusters the source samples and preprocesses the data. Moreover, a data encoding rule is proposed to encode the clustered data samples and simplify the data form, which cooperates with K-means clustering to conduct first-layer data mining and the sorting of samples to obtain sample library I. The specific methods and rules are as follows:

1) K-MEANS CLUSTERING METHOD
The K-means clustering method in this paper includes three main aspects: the Euclidean distance is used to classify the data samples; the criterion function is used to judge whether the sample clustering is completed; and the best number of classification clusters is determined by comparing contour coefficients.
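
The three aspects listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`euclidean`, `kmeans`, `contour_coefficient`, `best_k`), the use of the first K samples as initial centers, the score of 0 assigned to singleton clusters, and the toy `points` are all hypothetical. The contour coefficient follows the definitions given with formula (3), with the separation degree taken as the mean distance to all points outside the sample's own cluster.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors, as in formula (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(samples, k, max_iter=100, tol=1e-9):
    """Cluster samples into k groups; returns (labels, centers, sse)."""
    # Assumption: the first k samples seed the centers so that runs are repeatable.
    centers = [list(s) for s in samples[:k]]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Assign every sample to the cluster whose center is nearest.
        labels = [min(range(k), key=lambda j: euclidean(s, centers[j])) for s in samples]
        new_centers = []
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        shift = max(euclidean(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers no longer move: the criterion function has converged
            break
    # Criterion function of formula (2): sum of squared errors to the cluster centers.
    sse = sum(euclidean(s, centers[lab]) ** 2 for s, lab in zip(samples, labels))
    return labels, centers, sse

def contour_coefficient(samples, labels):
    """Average contour coefficient of all samples, following formula (3)."""
    scores = []
    for i, x in enumerate(samples):
        same = [euclidean(x, y) for j, y in enumerate(samples)
                if j != i and labels[j] == labels[i]]
        other = [euclidean(x, y) for j, y in enumerate(samples) if labels[j] != labels[i]]
        if not same or not other:
            scores.append(0.0)  # assumption: singleton or single-cluster cases score 0
            continue
        a = sum(same) / len(same)    # cohesion
        b = sum(other) / len(other)  # separation degree
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(samples, candidates):
    """Return the candidate K with the largest average contour coefficient."""
    scored = {k: contour_coefficient(samples, kmeans(samples, k)[0]) for k in candidates}
    return max(scored, key=scored.get), scored

# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centers, sse = kmeans(points, 2)
k, scores = best_k(points, [2, 3, 4])
```

On this toy set the two groups are recovered with K = 2, which is also the candidate that `best_k` returns. Real voltage samples would need a more careful initialization than taking the first K samples.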
a: EUCLIDEAN DISTANCE JUDGMENT METHOD
The K-means clustering method classifies the samples according to the Euclidean distance between the data sample and the center of each cluster, and each sample is classified into the cluster with the minimum Euclidean distance. The Euclidean distance is calculated as in formula (1):

$$d(X, Y^j) = \sqrt{(x_1 - y_1^j)^2 + (x_2 - y_2^j)^2 + \dots + (x_n - y_n^j)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i^j)^2} \tag{1}$$

where $X = (x_1, x_2, \dots, x_n)$ is any unclassified sample in n-dimensional space that corresponds to the elements (the voltage data on the non-faulty node) in the AQ of the source sample library. $Y^j = (y_1^j, y_2^j, \dots, y_n^j)$ is the center of the jth cluster. When classifying the samples for the first time, any sample can be randomly selected as the cluster center.

b: CRITERION FUNCTION
The average of all samples in each cluster is used to update the cluster center, and the criterion function is used to determine whether the cluster center stops updating. The criterion function is to minimize the sum of the squared errors between the samples in the cluster and the cluster center, which is shown in formula (2):

$$\min \sum_{j=1}^{K} \sum_{x_i^j \in X^j,\; y_i^j \in Y^j} \left(x_i^j - y_i^j\right)^2 \tag{2}$$

where $Y^j$ is the jth cluster center, $y_i^j$ is the ith element data in $Y^j$, $K$ is the number of the clusters, $X^j$ is any sample in the jth cluster, and $x_i^j$ is the ith element data in $X^j$.

When the criterion function of formula (2) converges, which is when the cluster center does not change significantly, the cluster center stops updating. At this time, the sample classification into K clusters is completed.

c: THE CONTOUR COEFFICIENT
To obtain the optimal number of clusters in K-means clustering in the first-layer data mining, the contour coefficient (commonly called the silhouette coefficient) is calculated for different numbers of clusters. Then, by comparing those contour coefficients, the number of clusters with the largest contour coefficient is found to be the optimal number of clusters. For each sample $x$ of a cluster, the contour coefficient calculation method is shown in formula (3):

(1) First, the cluster cohesion $\alpha_k$ is calculated (the average distance from $x$ to all other points in the cluster to which it belongs).
(2) Then, the separation degree $b_k$ between the cluster and the other clusters is calculated (the average distance between $x$ and all points that are not in the same cluster).
(3) Lastly, the contour coefficient $S_k$ is calculated (the difference between $b_k$ and $\alpha_k$ is divided by the larger of the two).

$$S_k = \frac{b_k - \alpha_k}{\max(b_k, \alpha_k)} \tag{3}$$

The value of the contour coefficient is in the range [−1, 1]. The closer $S_k$ is to 1, the better the sample matches its own cluster relative to the others. The average value of the contour coefficients of all samples is used as the contour coefficient under the current cluster number K. The larger the contour coefficient is, the farther the distance between the clusters, and the better the classification effect. Therefore, the K value with the largest contour coefficient
the source samples, makes the data sample in the same cluster as relevant as possible, and prepares for the second-layer data mining.

C. THE SECOND LAYER OF THE DATA MINING PROCESSING METHODS AND RULES
Because there are some potential laws between the voltage at the node and the fault types in the power system, the association rules are used in the second-layer data mining to find the samples that are highly correlated with a certain fault type. Training the FCPM with these highly correlated samples will greatly improve the accuracy of the FCPM.

1) APRIORI ASSOCIATION RULE METHOD
The Apriori algorithm is an association rule algorithm that is based on mining frequent item sets: the elements in BQ and $G_i$ in sample library I are correspondingly combined into a whole sample library M: $\{Z_1, Z_2, \dots, Z_i\}$, and each row of the sample library M is taken as a sample group. The association rules for frequent item sets are used to find the association between two or more samples in the sample group. By calculating the support, the confidence, and the lift of these frequent item sets, the correlation degree between the samples is measured, and the non-empty sets that meet the requirements of the support, the confidence and the lift are selected.

Assuming that $Z_x$ and $Z_y$ are non-empty sets of M, the support, the confidence and the lift are calculated as follows:

a: SUPPORT
Support is the probability of $Z_x$ and $Z_y$ appearing simultaneously.

$$\mathrm{Support}(Z_x \rightarrow Z_y) = P(Z_x \cap Z_y) \tag{4}$$

b: CONFIDENCE
Confidence is the probability that $Z_y$ appears at the same time when $Z_x$ appears.

$$\mathrm{Confidence}(Z_x \rightarrow Z_y) = \frac{P(Z_x \cap Z_y)}{P(Z_x)} \tag{5}$$

c: LIFT
Lift represents the ratio of the probability of $Z_y$ appearing at the same time that $Z_x$ appears and the probability of $Z_y$ appearing.

$$\mathrm{Lift}(Z_x \rightarrow Z_y) = \frac{P(Z_x \cap Z_y)}{P(Z_x) P(Z_y)} \tag{6}$$

2) THE SECOND-LAYER MINING RULES BASED ON APRIORI ASSOCIATION RULE
The Apriori association rule method is used to conduct the second-layer data mining of BQ in the sample library I:

1) Firstly, the minimum support and the minimum confidence are set, and the sample library M is scanned to find all of the frequent N item sets. (N increases from 1.)
2) The candidate N+1 item sets are found by connecting and pruning based on the frequent N item sets (N + 1 = 2, 3, ...).
3) By scanning the sample library M, all of the non-empty sets in the candidate N+1 item sets whose support is larger than the minimum support are found as the frequent N+1 item sets.
4) If the frequent N+1 item sets are empty sets, then the confidence and the lift of the rules composed of all of the frequent item sets are calculated, and the rules that meet the minimum confidence and that have a lift greater than 1 are found to be the strong association rules. Otherwise, return to 2) to search the higher order frequent item sets.

The sample sets that satisfy the strong association rules constitute the association library; then, all of the sample sets related to $G_i$ are extracted, and the samples are sorted out according to the fault types. These samples form the sample library II: $\{CQ_j, G_j\}$, where $G_j$ is the jth fault type, and $CQ_j$ is the strong association sample sets that correspond to $G_j$. The difference between the source sample library, the sample library I and the sample library II is as follows: the source sample library and the sample library I are the same in their dimension and in the number of samples, and sample library I is obtained from the source sample library by standardizing the form of the samples through clustering preprocessing and self-encoding. After association mining, the associated library obtained from sample library I is very large, but only the samples related to $G_i$ are extracted to form sample library II, and thus, the data size of sample library II is much smaller than that of the complete association library.

After the association rules mining, the samples are highly correlated in their attributes, and the information associated with the fault types is stored, which is helpful for mining valuable results during the SGD data regression training. In this way, the result deviation caused by data redundancy is avoided, and the performance and accuracy of the regression analysis are improved.

D. THE THIRD LAYER OF THE DATA MINING PROCESSING METHODS AND RULES
After the first two layers of data mining, K-means clustering and Apriori association rules have mined the strong correlation samples that correspond to the different types of power system faults. The third layer of the data mining uses these strong association samples of sample library II to establish the FCPM for each fault type, and it achieves the goal of fault classification and prediction. To accelerate the prediction speed and further improve the prediction accuracy, the cross-validation method is used to obtain the optimal parameters in each fault prediction model. Then, the SGD obtains the solution of the optimal parameters for each fault prediction model by performing regression training on the strong association samples. The specific description is as follows:
1) FAULT CLASSIFICATION AND PREDICTION MODEL BASED ON STOCHASTIC GRADIENT DESCENT
SGD is an iterative optimization algorithm that is often used to solve and optimize model parameters of machine learning algorithms. SGD is a variant of the gradient descent algorithm, which has been successfully applied to text classification [33] and large-scale sparse machine learning problems in natural language processing [34], [35]. The gradient is the vector composed of the partial derivatives of a multivariate function with respect to its unknown parameters. When all of the partial derivatives in the gradient are 0, the optimal solution of the model parameters can be obtained. SGD uses only one sample per iteration. When processing large-volume samples, only a small number of samples are needed to iterate the model parameters toward the optimal solution. Therefore, SGD has the advantage of having a fast training speed.

a: PREDICTION MODEL FUNCTION
Given sample library II: $\{CQ_j, G_j\}$, assuming that the weight coefficients of the samples at each node are linear, a linear model function is obtained:

$$f(CQ_j) = w^T CQ_j + b \tag{7}$$

where $w$ is the model parameter vector, and $b$ is the intercept. $w^T CQ_j$ is the inner product of $CQ_j$ and $w$.

b: THE PARAMETER OPTIMIZATION METHOD BASED ON SGD OF THE FAULT PREDICTION MODEL
i) LOSS FUNCTION
The loss function is used to estimate the difference between the actual value $G_j$ and the model predicted value $f(CQ_j)$ that corresponds to the sample, which is expressed by $L(G_j, f(CQ_j))$. This article uses the following two loss functions: the SVM type loss function is shown in formula (8), and the logistic regression type loss function is shown in formula (9):

Hinge: equivalent to SVM classification:

$$L(G_j, f(CQ_j)) = \max\left(0,\, 1 - G_j f(CQ_j)\right) \tag{8}$$

Log: equivalent to Logistic regression:

$$L(G_j, f(CQ_j)) = \log\left(1 + \exp\left(-G_j f(CQ_j)\right)\right) \tag{9}$$

ii) RISK FUNCTION
The risk function is the expectation of the loss function, and it is also called the empirical risk:

$$E_r = \frac{1}{n} \sum_{j=1}^{n} L(G_j, f(CQ_j)) \tag{10}$$

Although the objective is to minimize the empirical risk, learning from historical data with complex functions could lead to overfitting of the prediction results. Therefore, the structural risk is used to avoid overfitting:

$$S_r = \alpha R(w) \tag{11}$$

where $\alpha$ is a hyperparameter. By setting $\alpha$ to reduce the parameter scale, the purpose of model simplification is achieved, which means that the model has better generalization ability. The regularization term $R(w)$ measures the complexity of the model and constrains its parameters. The regularization terms $R(w)$ mainly include L1 regularization and L2 regularization:

$$L1 = \sum_{j=1}^{m} |w_j| = \|w\|_1 \tag{12}$$

$$L2 = \frac{1}{2} \sum_{j=1}^{m} w_j^2 = \frac{1}{2}\|w\|_2^2 \tag{13}$$

where L1 regularization can produce a sparse weight matrix, which can be used for feature selection. L2 regularization can prevent the model from overfitting by reducing the weight coefficient. To a certain extent, L1 can also prevent overfitting, but the effect is not as good as L2.

c: THE OPTIMIZED OBJECTIVE FUNCTION
The smaller the empirical risk and structural risk are, the better the model fit; as a result, the final objective optimization function is

$$\min:\; E(w, b) = \frac{1}{n} \sum_{j=1}^{n} L(G_j, f(CQ_j)) + \alpha R(w) \tag{14}$$

SGD considers one set of training samples at a time to approximate the true gradient of the objective optimization function. For each set of samples, the model parameters are updated by the update rule given by formula (15):

$$w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L\left(G_j,\, w^T CQ_j + b\right)}{\partial w} \right) \tag{15}$$

where $\eta$ is the learning rate, which controls the step size in the parameter space. To prevent the parameter $w$ from oscillating near the solution, $\eta$ is decreased according to the following formula (16):

$$\eta(t) = \frac{1}{\alpha (t_0 + t)} \tag{16}$$

where $t$ is the time step, and $t_0$ is the initial step size, which is the same as the initial value of the weight by default; additionally, $\alpha$ and $t$ jointly affect the learning rate.

d: K-FOLD CROSS-VALIDATION PARAMETER OPTIMIZATION METHOD
The K-fold cross-validation method is used to find the optimal parameter group (loss function $L$, hyperparameter $\alpha$, regularization term $R(w)$ and iteration number $N$); then, the optimal model parameter $w$ is solved by the iteration calculation of SGD. The strong correlation samples that correspond to a certain fault type in sample library II are used to train the parameter group, and the solution with the highest cross-validation score under the fault type is regarded as the optimal solution of the parameter group. The optimal value of the parameter group and its cross-validation scores that
1) CONFUSION MATRIX
The confusion matrices are also called the probability tables or the error matrices. This type of matrix is a specific matrix that is used to visualize the performance of the algorithm. The calculation formulas of the overall model accuracy of the FCPM, the precision of each fault type, the recall rate, and the F1 score are as follows:

Assuming that the test sample set has a total of S samples:

1) Accuracy: the ratio between the number of correct predictions and the total number of predictions:

$$\mathrm{Accuracy} = \frac{TP + TN}{S} \tag{17}$$

2) Precision: the ratio of the number of true positives to the sum of the true positives and the false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{18}$$

3) Recall: the ratio of the number of true positives to the sum of the true positives and the false negatives:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{19}$$

4) F1: the harmonic average of the Precision and the Recall:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{20}$$
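
Formulas (17) to (20) can be illustrated with a short sketch in which one fault type is treated as the positive class and every other type as negative. The function name `one_vs_rest_metrics`, the zero fallback for empty denominators, and the label lists are hypothetical and serve only as an example; they are not taken from the paper's test data.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    s = len(pairs)
    accuracy = (tp + tn) / s                                   # formula (17)
    # Assumption: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0             # formula (18)
    recall = tp / (tp + fn) if tp + fn else 0.0                # formula (19)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                      # formula (20)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: fault types coded 1 to 4, type 4 taken as the positive class.
y_true = [4, 4, 4, 1, 2, 3, 4, 1]
y_pred = [4, 4, 1, 1, 2, 4, 4, 3]
m = one_vs_rest_metrics(y_true, y_pred, positive=4)
```

Here the counts are TP = 3, TN = 3, FP = 1 and FN = 1, so all four indicators equal 0.75. Note that the last sample is counted as a true negative even though its predicted type is wrong, because neither its true nor its predicted type is the positive class.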
The multi-class classification confusion matrix of the Then, the test statistics are constructed as follows:
model is converted into a binary classification confusion
yi = βxi + εi (25)
matrix to calculate the above indicators. Each type of fault
is considered separately from the other three types of fault. where xi is the sample vector, yi is the predicted value vector,
The three-phase fault (Gj = 4) is taken as an example: β is the variable coefficient, and εi is the difference between
the average value of a single sample and the average value of
TABLE 1. The meaning of TP, TN, FN and FP in the confusion matrix. the overall sample.
Regression sum of squares:
n
X 2
SSR = yi − ya (26)
i=1
Sum of squared residuals for regression:
n
where TP, TN, FN, and FP in formulas (17) (18) (19) are the X
SSE = (yi − y)2 (27)
number of samples that meet the above.
i=1
According to these performance indicators of the FCPM,
it can be compared with other methods to find the advantages Then, the F statistic is constructed:
and disadvantages of the method’s performance. SSR/p
F= (F ≥ 0) (28)
SSE/ (n − p − 1)
2) ROC CURVE where y is the actual value that corresponds to the sample
The Receiver Operating Characteristic Curve (ROC) is an vector, ya is the average value of y, p is the degree of freedom,
important and common model evaluation method to judge the and n is a small number of samples extracted from the sample
classification results. The ROC space defines the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis.
TPR: the rate of being correctly judged to be positive among all of the actually positive samples.
$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{21}$$
FPR: the rate of being falsely judged to be positive among all of the actually negative samples.
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{22}$$
Given a classification model, a coordinate point (X=FPR, Y=TPR) can be calculated from the true and predicted values of all of the samples. For a given model, the coordinates (FPR, TPR) under different thresholds are drawn in the ROC space, and together they form the ROC curve of that model.

F. STATISTICAL TEST AND ALGORITHM TIME COMPLEXITY
To judge the significance of the results, a statistical test is added to the discussion. In addition, to assess the effectiveness of the proposed method, the time complexity and the computational running time are also discussed.

library.
The F value is used to test and measure the overall significance of the model. When the F statistic is close to zero, the original hypothesis H0 cannot be rejected, which means that the overall significance of the model is low. The larger the F statistic is, the higher the significance of the model, which indicates that the model fits well and has been built successfully.

2) BIG O NOTATION
The more statements that are executed in an algorithm, the more time the computation takes. The number of times a statement is executed in an algorithm is called the time frequency, which is denoted as V(n), where n is the number of samples. If there is an auxiliary function f(n) such that, as n approaches infinity, the limit of V(n)/f(n) is a nonzero constant, then f(n) is said to be a function of the same order of magnitude as V(n). This is denoted as V(n) = O(f(n)), which is called the time complexity. This notation is called Big O notation, and its derivation rules are as follows: 1) O(1) represents the time complexity of all constant functions. 2) For all other functions, only the highest-order term is retained, and its coefficient is set to 1.
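As a concrete illustration of the TPR and FPR definitions in (21) and (22), the following minimal sketch computes one (FPR, TPR) point per decision threshold. The function name `roc_points`, the toy labels and the scores are illustrative assumptions and are not taken from the paper's experiments.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold, following Eqs. (21) and (22)."""
    points = []
    for t in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        y_pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
        fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
        fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
        tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # Eq. (21)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # Eq. (22)
        points.append((fpr, tpr))
    return points


# Toy data (hypothetical): 3 positive and 3 negative samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(y_true, scores, thresholds=[0.85, 0.5, 0.0])
```

Plotting these points with FPR on the X axis and TPR on the Y axis, and connecting them in threshold order, gives the ROC curve for the toy classifier.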
TABLE 2. The self-encoding form of some samples in sample library I.

highest cross-validation score are taken as the optimal parameters.
2) Sample library II is divided into training set A and test set B.
3) Training set A retrains the model under the optimal parameters.
4) Finally, test set B tests the model and obtains the results.
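The parameter selection by 10-fold cross-validation described in these steps can be sketched roughly as follows. This is only an outline under stated assumptions: `k_fold_indices`, `grid_search`, the parameter grid and `toy_score` are hypothetical names introduced here, and `toy_score` is a placeholder for training the SGD model on the training folds and scoring it on the validation fold; its values are artificial and are not results from the paper.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k (train, validation) index pairs."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    pairs = []
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train, val))
    return pairs


def grid_search(param_grid, score_fn, n_samples, k=10):
    """Return the parameter group with the highest mean cross-validation score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid over (L, R, alpha, N): loss, regularization, alpha, iterations.
param_grid = {
    "loss": ["log", "hinge"],
    "penalty": ["l1", "l2"],
    "alpha": [0.1, 0.01],
    "n_iter": [500, 1000],
}


def toy_score(params, train_idx, val_idx):
    # Placeholder score with artificial values; a real score_fn would fit the
    # model on train_idx and evaluate it on val_idx.
    score = 0.5
    score += 0.2 if params["loss"] == "log" else 0.0
    score += 0.1 if params["penalty"] == "l2" else 0.0
    score += 0.05 if params["alpha"] == 0.1 else 0.0
    score += 0.02 if params["n_iter"] == 500 else 0.0
    return score


best_params, best_score = grid_search(param_grid, toy_score, n_samples=100, k=10)
```

After the best parameter group is found, the model would be retrained on training set A with `best_params` and evaluated once on test set B, which keeps the test data out of the parameter search.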
TABLE 3. Partial rules mined from the frequent itemsets.

To prove that the regression training accuracy of the sample library after the clustering and association rule mining is higher than after only clustering (without mining by association rules), sample library I is also subjected to 10-fold cross-validation. For example, in the parameter optimization process of the single-phase short-circuit fault model, after each group of parameters is substituted into the SGD algorithm program, the cross-validation scores of sample library I and sample library II that correspond to these solutions are listed, as shown in Table 5. Then, the cross-validation scores are compared, and the optimal solution of the parameters (L, R, α, N) of each fault model is selected. It can be seen that the highest cross-validation score during the regression of sample library I is 0.556, and the corresponding optimal parameter group is (Log, L1, 0.1, 1000). During the regression of sample library II, the highest cross-validation score is 0.788, and the corresponding optimal parameter group is (Log, L2, 0.1, 500). In addition, the cross-validation scores of sample library II are higher than those of sample library I under the same parameter groups.
TABLE 4. The self-encoding form of some samples of sample library II.

It can be seen from Table 5 that under the mathematical model of SGD, logistic regression is selected as the optimal loss function for both sample library I and sample library II. Because logistic regression requires little computation and few storage resources for classification, the training data can be quickly integrated into the model. Compared with SVM, it is also easy to obtain the probability scores of the samples. For the optimal regularization term, sample library I chooses L1, because the features between the samples are not obvious in sample library I, and thus, the features must be made sparse, which reduces the number of weight parameters and the complexity of the model. Sample library II selects L2, which reflects that the features between the samples in sample library II have a certain similarity after the association rules are applied, and therefore, the complexity of the model is reduced only by reducing the values of the weights. In addition, L2 can also be combined with logistic regression to solve multicollinearity problems. Both choices for α are 0.1, which reflects the same degree of simplification of the parameter scale. For the optimal number of iterations N, sample library I iterated 1000 times, while sample library II iterated only 500 times. It can be seen that the cross-validation training of sample library II has a shorter convergence time, a faster fitting speed, and lower model complexity.
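The contrast between the two regularization terms can be illustrated with a small sketch. It assumes a soft-thresholding step for L1 and a multiplicative decay step for L2, which are common ways to apply these penalties but are not stated in the paper; the function names and the toy weights are hypothetical.

```python
def l1_step(weights, alpha, lr):
    """L1 penalty as soft-thresholding: small weights are set exactly to zero."""
    shrink = lr * alpha
    out = []
    for w in weights:
        if w > shrink:
            out.append(w - shrink)
        elif w < -shrink:
            out.append(w + shrink)
        else:
            out.append(0.0)  # weight removed, so the feature is dropped
    return out


def l2_step(weights, alpha, lr):
    """L2 penalty as weight decay: every weight is scaled down, none is removed."""
    return [w * (1.0 - lr * alpha) for w in weights]


# Toy weights (hypothetical): two strong features and two weak ones.
weights = [0.8, -0.03, 0.02, -0.5]
after_l1 = l1_step(weights, alpha=0.1, lr=0.5)  # shrinks each weight by 0.05
after_l2 = l2_step(weights, alpha=0.1, lr=0.5)  # scales each weight by 0.95
```

In this toy case the L1 step zeroes the two weak weights, which corresponds to the sparsity described for sample library I, while the L2 step keeps all four weights and only reduces their magnitude, as described for sample library II.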
To further explain that the data mining process using the clustering and association rules is more accurate than that using only clustering, training set A in sample library I and training set A in sample library II are used to retrain the model separately under each group of parameters, and the SGD test

samples are divided into 10 parts, each of which is used as a cross-validation set in turn, and the other 9 parts are used as a training set. The samples were trained 10 times in total. The parameters with the
TABLE 5. Cross-validation scores of the sample library I and the sample library II in the single-phase short-circuit fault model.

in advance by adopting the cross-validation method, which greatly accelerates the training speed of the fault prediction model and improves the accuracy of the model.
Through the obtained optimal parameter set (L, R, α, N), the optimal solutions of the model parameters w of the different fault types are solved by SGD iterations according to formulas (14) and (15), and they are shown in Table 6. A positive value indicates a positive correlation, in which the variable and the dependent variable change in the same direction; a negative value indicates a negative correlation, in which the variable and the dependent variable change in opposite directions.
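The sign interpretation of the weights can be seen in a minimal SGD sketch for logistic regression with an L2 penalty. Formulas (14) and (15) are not reproduced in this part of the paper, so the update rule below is the standard logistic-loss gradient step and should be read as an assumption; `sgd_logistic` and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(samples, labels, alpha=0.1, lr=0.1, n_iter=500, seed=0):
    """Fit weights w and bias b with one randomly chosen sample per iteration."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(n_iter):
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        error = p - y  # gradient of the logistic loss with respect to the score
        # Gradient step on the loss plus the L2 penalty term alpha * w.
        w = [wj - lr * (error * xj + alpha * wj) for wj, xj in zip(w, x)]
        b -= lr * error
    return w, b


# Toy data (hypothetical): feature 0 rises with the label, feature 1 falls with it.
samples = [[1.0, -1.0], [2.0, -2.0], [-1.0, 1.0], [-2.0, 2.0]]
labels = [1, 1, 0, 0]
w, b = sgd_logistic(samples, labels)
```

For this toy data the first weight comes out positive and the second negative, matching the reading that a positive weight moves the prediction in the same direction as the variable and a negative weight moves it in the opposite direction.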
FIGURE 9. SGD test scores of the different fault models. (a) Single-phase ground fault model. (b) Two-phase phase-phase fault model. (c) Two-phase ground fault model. (d) Three-phase fault model.
FIGURE 13. The F1 score, precision and recall of the algorithm model.

proposed method also proves that the performance index scores of SGD are not high when it is directly used for regression. However, after the source samples are processed by the clustering preprocessing and association rules, SGD after parameter optimization is used to train the processed samples again, and the performance index scores are significantly improved. The reason is that the loss function used by SGD at each step is determined only by a small batch of data and differs from the loss function over the complete set; thus, its gradient also contains a certain degree of randomness. At a saddle point or local minimum point, it will oscillate and jump, and as a result, the prediction accuracy is not high. Applying clustering and association rules to filter the samples in advance can reduce the irrelevance between small batches of data and reduce the oscillation. Therefore, the performance indexes of SGD are improved based on the fast data training speed of the clustering preprocessing and association rules.

G. PRACTICAL APPLICATIONS
To prove the scalability of the proposed method, a short-circuit fault classification and prediction test is carried out with the help of a power grid cooperation project in a certain urban area distribution network. The distribution network has 8 generators, 12 transformers, and 56 data collection points. According to the method in the article, more than 20,000 sets of voltage data on the different fault types in the distribution network are used as source sample data. The amount of source sample data and the types of faults are the same as for the WSCC 9 bus system. In this experiment, the voltage data at the 56 collection points in the distribution network jointly affect the fault types, and the data at each collection point are processed by the algorithm model a total of 210 times. Therefore, the computational cost of the classification and prediction model in the distribution network is 56 × 20000 × 210 ≈ 0.235B. In addition, the computational cost of the classification and prediction model in the WSCC 9 bus system is 9 × 20000 × 78 ≈ 0.014B. This experiment was also done on a personal computer. After the

Among them, B stands for billion, which is the unit of the number of times the computer runs. It can be seen from the table that, compared with the WSCC 9 bus system, the accuracy of the fault classification results changes little, being 93.8% and 89.2%, respectively. The computational cost is 0.235B, which suggests that the proposed method is computationally efficient enough to be used in practical applications. It is concluded that as the number of system nodes increases, the runtime and the computational cost will increase, because they are related to the complexity of the actual system, such as the number of nodes and the number of line branches. Therefore, it can be predicted that in a more complex actual system, the computational cost will be greater.

V. CONCLUSION
Through multiple data mining methods, including clustering, association rules, cross-validation optimization and SGD, the proposed method identifies the data samples that are strongly related to the specific faults and determines the potential laws for building a more accurate fault classification and prediction model. Moreover, through the existing operating data, the proposed method can predict which type of fault will occur soon, and it plays an important role in the classification and prediction of fault types. For the specific fault models, the proposed method uses an optimization algorithm to determine the optimal parameters of the fault models after clustering and association mining of the training samples; then, the fault models are obtained from the training samples under the optimal parameters, and thus, the effect of the classification and prediction is better than that of the other methods mentioned in this paper.
The proposed method processes the source data in advance, avoiding the low accuracy of the fault classification and prediction model caused by low-impact or irrelevant data, and it can realize timely and accurate fault classification and prediction for the power system. Moreover, the proposed method can be widely applied to the fault classification and prediction of various busbars, transformers and transmission lines in the power system and to the classification and prediction of other systems that involve multi-attribute classification. In addition, it can also be extended to medical
disease prediction, electronic communication fault detection and other fields.
However, the proposed method has some limitations: through the experimental test on the distribution network in an urban area, it is found that the proposed method cannot achieve rapid and real-time fault prediction in practical applications, and the timeliness needs to be further improved. In addition, this paper only considers the voltage data of each node in the same time period under different fault conditions, without further analyzing the high-dimensional data samples composed of the voltage data of each node, the current data of each branch and time. Therefore, future research work will focus on the practical engineering applications of the multi-dimensional data of power systems for fault classification and prediction.
APPENDIX
The nomenclature of the article:

Terms                                                     Abbreviations
support vector machine                                    SVM
long-short term memory                                    LSTM
stochastic gradient descent                               SGD
empirical mode decomposition                              EMD
fault classification and prediction model                 FCPM
single-phase ground fault                                 SPGF
two-phase phase-phase fault                               TPPF
two-phase ground fault                                    TPGF
three-phase fault                                         TPF
structural risks                                          SR
empirical risk                                            ER
loss function                                             L
receiver operating characteristic                         ROC
true positive                                             TP
true negative                                             TN
false positive                                            FP
false negative                                            FN
true positive rate                                        TPR
false positive rate                                       FPR
regression sum of squares                                 SSR
sum of squared residuals for regression                   SSE
area under receiver operating characteristic curve        AUC
REFERENCES Magn., vol. 50, no. 11, pp. 1–4, Nov. 2014.
[1] G. M. Ali and S. A. Al-Mawsawi, ‘‘Multiple UPFCs mathematical model [22] H. Jia, J. Li, W. Song, X. Peng, C. Lang, and Y. Li, ‘‘Spotted hyena
enhancing multi-machine power system control,’’ in Proc. 10th Jordanian optimization algorithm with simulated annealing for feature selection,’’
Int. Electr. Electron. Eng. Conf. (JIEEEC), May 2017, pp. 1–4. IEEE Access, vol. 7, pp. 71943–71962, 2019.
[2] Q. Wang and P. Qiu, ‘‘The application of equipment overheating and arcing [23] A. Y. Abdelaziz, R. A. Osama, and S. M. El-Khodary, ‘‘Reconfiguration
fault warning and protection systems of switchgear in power systems,’’ of distribution systems for loss reduction using the hyper-cube ant colony
in Proc. IEEE Innov. Smart Grid Technol.-Asia (ISGT Asia), May 2019, optimisation algorithm,’’ IET Gener., Transmiss. Distrib., vol. 6, no. 2,
pp. 1135–1137. pp. 176–187, 2012.
[3] L. Song, H. Wang, and P. Chen, ‘‘Step-by-step fuzzy diagnosis method for [24] Z. Wang, Y. Fu, C. Song, P. Zeng, and L. Qiao, ‘‘Power system anomaly
equipment based on symptom extraction and trivalent logic fuzzy diagnosis detection based on OCSVM optimized by improved particle swarm opti-
theory,’’ IEEE Trans. Fuzzy Syst., vol. 26, no. 6, pp. 3467–3478, Dec. 2018. mization,’’ IEEE Access, vol. 7, pp. 181580–181588, 2019.
YUNLIANG WANG received the M.S. degree in power systems and automation from Tianjin University, Tianjin, China, in 1988. He is currently a Professor with the School of Electrical and Electronic Engineering, Tianjin University of Technology. His current research interests include intelligent control, data mining, multi-motor coordinated control, microcomputer control, and power electronics technology.

XIAODONG WANG was born in Handan, Hebei, China. In 2018, he joined the School of Electrical and Electronic Engineering, Tianjin University of Technology, where he is currently a Graduate Student. His major is electrical engineering. His main research interests include smart grid, artificial intelligence, data mining, and power system fault prediction.

YANJUAN WU received the M.S. and Ph.D. degrees in power systems and automation from Tianjin University, Tianjin, China, in 2005 and 2013, respectively. She is currently an Associate Professor with the School of Electrical and Electronic Engineering, Tianjin University of Technology. Her current research interests include intelligent control, data mining, smart grids, and grid optimization and control.

YANNAN GUO was born in Chifeng, Inner Mongolia. He received the master's degree in electrical and electronic engineering from the Tianjin University of Technology. He currently works with Tianjin Tianda Qiushi Electric Power High Technology Company Ltd. His main research interests include power system automation, smart grid, and power fault detection.