Statistical Analysis and Development of an Ensemble-Based Machine Learning Model for Photovoltaic Fault Detection

Hussain, Muhammad; Al-Aqrabi, Hussain; Hill, Richard

doi:10.3390/en15155492


by Muhammad Hussain, Hussain Al-Aqrabi * and Richard Hill

Centre for Industrial Analytics, Department of Computer Science, School of Computing and Engineering, University of Huddersfield, Queensgate, Huddersfield HD1 3DH, UK

* Author to whom correspondence should be addressed.

Energies 2022, 15(15), 5492; https://doi.org/10.3390/en15155492

Submission received: 10 July 2022 / Revised: 24 July 2022 / Accepted: 27 July 2022 / Published: 29 July 2022


Abstract:

This paper presents a framework for photovoltaic (PV) fault detection based on statistical, supervised, and unsupervised machine learning (ML) approaches. The research is motivated by the need for a cost-effective solution that detects the fault types within PV systems from a real dataset with a minimum number of input features. We identify the appropriate conditions for method selection and establish how to minimize the computational demand of different ML approaches. The PV dataset is then labelled through clustering and classification. Various ML models are trained on the labelled dataset and each is evaluated on accuracy, precision, and a confusion matrix. Notably, an accuracy ranging from 94% to 100% is achieved with datasets from two different PV systems. The robustness of the models is affirmed by applying the approach to an additional real-world dataset that exhibits noise and missing values.

Keywords:

photovoltaics; hierarchical clustering; unsupervised learning

1. Introduction

The emphasis on producing greener energy is a key pillar of the environmental policy of many developed and developing countries around the world. As a result, grid-connected photovoltaic (GCPV) systems have seen a considerable rise in popularity over the past decade. The wide-scale implementation of PV systems has exposed areas for improvement such as methods for fault detection [1]. An important economic factor is system maintenance costs. Conventional approaches utilize manual inspections by personnel, which is expensive and can introduce human error.

The use of artificial intelligence (AI) for solving complex problems has encouraged a wide range of industries to adopt AI-based algorithms for their needs. This trend can also be seen in the field of PV fault detection.

PV fault detection can be divided into three key categories: electrical; thermal; and visual [1]. This work focuses on a sub-category found within the electrical approach: investigating the deployment of AI algorithms on key electrical parameters obtained from a PV system for fault detection classification [2]. A deeper look into the electrical category brings to light further divisions into sub-categories, namely:

Approaches that are not dependent on environmental data such as solar irradiance, temperature, and wind speed. For example, time-domain reflectometry (TDR) is proposed in [3] for the detection of PV string disconnection.
Methodologies based on the analysis of electrical parameters; primarily, current and voltage characteristics. Silvestre et al. [3] calculated the series resistance (Rs), fill factor (FF), and shunt resistance (Rsh) on the basis of I–V characteristics.
Approaches based on maximum power point tracking (MPPT). Li et al. [4] presented an automated monitoring and fault detection model utilizing a power loss analysis, leading to the identification of problems such as faulty modules, strings, partial shading, MPPT failure, and aging.
Artificial intelligence (AI)-based methodologies. The authors in [5] explored the effectiveness of BP neural networks with the aim of diagnosing faults occurring in PV systems, and compared the results with those of fuzzy logic approaches. In their conclusion, the authors presented BP neural networks as a solution to most of the limitations faced whilst implementing fuzzy logic for PV fault detection.

1.1. Literature Review

Much of the research literature focuses on reducing the number of input features and the complexity of machine learning-based algorithms for PV fault detection. This requires practitioners to evaluate and select the most important input variables for the network to achieve optimal accuracy. Defining a set of relevant features makes data collection and pre-processing more effective for the practitioner.

Millit et al. [6] in their work demonstrated the implementation of artificial neural networks (ANNs) for the modelling and estimation of output power from GCPV installations. The research was based on measurements for one year (1 January 2011 to 24 February 2012) from a PV system located at Marmara University, Turkey. The parameters used for the model training were solar irradiance, voltage, current, and temperature. Dhimish [7] proposed an ANN-based approach for energy harvesting and failure mode prediction in a PV installation to aid dynamic maintenance tasks. The model inputs included environmental features such as external temperature and parameters of the module such as internal temperature and operating times.

Furthermore, Moreno et al. [8] presented an ANN-based algorithm for forecasting global horizontal irradiance (GHI). The focus of the ANN model was to predict local GHI for four neighboring locations from weather data provided by the US National Oceanic and Atmospheric Administration (NOAA). This research focused on forecasting a decisive parameter required for any further forecasting or even fault detection in PV systems. Chine et al. [9] proposed dual algorithms (MLP and RBF) for fault detection in the DC part of a PV system. Based on the solar irradiance and temperature of the PV modules, several features were created to be used as inputs for the model, including the PV current, voltage, and the number of peaks in the current. The results obtained from both models, evaluated through confusion matrices, showed that the multi-layer perceptron (MLP) model had an accuracy of 90.3%, compared with 68.4% for the radial basis function (RBF) model. Notably, the dataset used for testing was relatively small (775 samples) and was obtained through a MATLAB simulation. As a result, the model may not have captured key relationships that would be evident in a real-world dataset. The two algorithms are distinctly different: MLP can accommodate many hidden layers within its architecture, but demands more computational power. Conversely, RBF is a network with a single hidden layer and can be more effective, depending on the application.

In contrast, Muhammad et al. [10] demonstrated an RBF-based algorithm for PV fault detection; its accuracy was within 96.5–98.1%. The model required only two inputs, solar irradiance and PV output power. Testing the algorithm on a dataset obtained from a live installation provided further confidence in the accuracy of the model. Hussain and Chen [11] proposed a gradient-guided convolutional neural network architecture for enhanced micro-crack fault detection in PV systems. The proposed model analyzed the PV cell surface, and the authors demonstrated that the algorithm could achieve an F1 score of 98.8% for micro-crack detection.

ML methodologies fall into three broad categories, from which a specific approach can be derived based on the type of application: supervised, unsupervised, and reinforcement learning [12]. The content of the data, coupled with the end goal, assists researchers with the adoption of one of the above techniques [13]. There are cases where a hybrid architecture can be deployed to achieve the end goal; a commonly known approach is semi-supervised learning. In their paper, Yao et al. [14] highlighted the drawbacks of purely supervised learning, especially for sensor data, as the process of data annotation can become cumbersome. They presented a semi-supervised ML model based on probabilistic modelling for PV condition monitoring.

Ye et al. [15] presented a graph-based semi-supervised algorithm for fault detection and the classification of PV systems. The authors highlighted the non-linear characteristics in the PV systems, suggesting that ML was a suitable approach for this application. Furthermore, they justified a semi-supervised model rather than the conventional approach: first, due to the lack of availability of actual PV data that were labelled; and second, due to the difficulties faced whilst trying to update the deployed models. The proposed algorithm could detect PV faults and apply fault labelling to expedite a system recovery. A key highlight of the proposed system was its ability to learn over time as the weather changed.

Another work was presented by Giovanni et al. [16], who developed a neural network-based solution to detect system faults or cyber attacks within a PV system connected to the grid. The proposed model was based on anomaly detection, extracting the critical PV parameters of the system before being processed through an auto-encoder-based neural network for behavioral classification. A similar work was also presented by Sunme et al. [17]; they developed an imputation and fault detection model for a fleet of small-scale PV systems. K-means was implemented to cluster the neighboring data along with unlabeled data and to detect abnormal data points. The work was based on actual roof-top PV data. The clustering results provided error rates of 12.6% (with neighboring data) and 22.3% (without neighboring PV data).

Imtiaz et al. [18] proposed the use of support vector machines (SVM) for online power quality disturbance detection in smart meters. Their methodology included the segregation of a power disturbance from regular readings using a one-class support vector machine (OCSVM) [19]. To accurately detect the power disturbances of a voltage wave, practical wavelet filters were applied. Due to the unlimited types of waveform abnormalities, the OCSVM was selected as a semi-supervised machine learning algorithm that required training on a relatively large sample of standard data. Their model autonomously detected various types of disturbances in real-time, including unknown disturbances that were not catered for in the training dataset.

1.2. Paper Contribution and Organization

Summarizing the above literature, we observed that the selection of a specific algorithm by PV developers was arbitrary rather than based on a methodological framework. As a result, the performance of the developed classifier may be impressive, yet the classifier is not necessarily the most computationally effective. In this paper, we did not arbitrarily select and train machine learning models for PV fault detection. Instead, this paper contributes to research in the field of AI-based PV fault detection by presenting a direct comparison between conventional statistical approaches and emerging ML algorithms. By doing so, this paper encourages researchers to deploy a ‘bottom-up’ approach for selecting the correct backbone architecture for PV fault detection. By presenting the results of both statistical and ML models, the reader can appreciate the importance of choosing the optimal methodology and the impact this has on factors such as computational demand, latency, and outlier handling.

Based on the premise above, we present a hybrid approach of statistics and ML for tackling PV fault detection. It is also important to note that within ML itself, we used an ensemble approach by combining unsupervised and supervised learning for data pre-processing, model training, testing, and post-deployment validation. This allowed us to address the real-world constraints of acquiring data such as missing data points, human errors, and outliers.

The paper is organized as follows: Section 2 presents our proposed methodology, featuring an in-depth analysis of statistical processing, clustering, and ML-based modelling. In Section 3, we present our model results for the initial dataset and examine the accuracy of the ML models using another PV system that included a noisy and missing dataset. Finally, Section 4 presents the critical outcomes/conclusions of the proposed work.

2. Methodology

2.1. Dataset

The complete dataset comprised the solar irradiance (G), extracted from the Davis weather station located near the PV system, and the total power (P), extracted from the maximum power point tracking (MPPT) unit connected to a 2.2 kW PV system, as shown in Figure 1a. The PV system comprised 10 series-connected crystalline silicon PV modules, each of which had a power capacity of 220 W. A pure resistive load and a battery bank were interfaced with the output pins of the MPPT unit, which had a tolerance rate (error) of 2%.

Figure 1b,c show the distribution of the correlation values. From the figure, it is evident that irradiance and power were strongly correlated for all days, with an average value of 98.81% (Figure 1b). It can also be seen that the cases of the best and the worst correlations were not dependent on a weekday. We suggest that they could depend on the weather conditions. The highest correlation case at 99.71% is shown in Figure 2c and the lowest correlation of 82.47% is presented in Figure 2d.

The dataset contained minute-by-minute values for the above parameters for 10 weeks. The experiment started at full operation, with all modules connected in series. Faults were applied by manually disconnecting one or more modules, and the number of applied faults was incremented by one, with each fault condition lasting for one week. The PV fault corresponded with an actual disconnection (open circuit) in the PV modules; this also can be called “dc arcing”.

Table 1 shows the considered PV conditions. The normal operation (NO) condition represented the PV system working under normal operation, including any shading or overcast contingencies. When 30% or fewer of the PV modules were disconnected, the condition was represented as a low fault (LF). In contrast, any fault above this threshold was considered to be a high fault (HF). When an entire PV string was disconnected/faulty, this was acknowledged as a string fault (SF). This article presents how these different types of faults can be categorized using a statistical analysis approach (t-test), hierarchical clustering, and seven other machine learning-based models. According to Table 1, the PV system was subjected to normal operation for one week and to an LF faulty condition for three weeks. During weeks 5 to 9, a high percentage (above 30%) of PV modules were disconnected from the PV string. In the last week of the experiment, we entirely disconnected the PV string.

We considered only values recorded in the time period of 9 a.m.–3 p.m. The rationale behind this specific time window was its ability to capture solar activity throughout all seasons. Figure 2b shows the raw signal for one day. Around 6 a.m., the irradiance was low as the sun was rising. Around 5 p.m., the irradiance began to decline towards sunset, diminishing the total output power due to the proportional relationship between the irradiance and power. To focus on the vital part of the signal, a customized time filter was applied so that the signal was only analyzed during the daytime from 9 a.m.–3 p.m. The purpose was to take the data from all fault cases and analyze how the characteristics of the signal changed for each fault type.
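A time-window filter of this kind can be sketched in a few lines of Python. The record layout (timestamp, irradiance, power) and the function name `filter_daytime` are assumptions made for illustration; the paper does not describe the actual implementation of its filter.

```python
from datetime import datetime, time, timedelta

def filter_daytime(records, start=time(9, 0), end=time(15, 0)):
    """Keep only records whose timestamp falls within [start, end]."""
    return [r for r in records if start <= r[0].time() <= end]

# Hypothetical one-day, minute-by-minute log: (timestamp, G in W/m^2, P in W).
# The zero readings are placeholders; only the timestamps matter here.
day_start = datetime(2022, 1, 1, 0, 0)
records = [(day_start + timedelta(minutes=m), 0.0, 0.0) for m in range(24 * 60)]

daytime = filter_daytime(records)
first_kept = daytime[0][0].time()   # earliest timestamp kept
last_kept = daytime[-1][0].time()   # latest timestamp kept
```

Both window edges are treated as inclusive in this sketch; whether the original filter included the 3 p.m. sample is not stated in the text.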

2.2. Statistical Approach

Machine learning algorithms can be very demanding in terms of power and computational time, depending on the nature of the dataset that is being used. Therefore, it is standard practice to exercise a statistical approach wherever possible.

Figure 3 represents the measured powers during the daytime for a chosen faulty case. For two different days within one faulty case, it was possible to have overlapping curves (Figure 3a) or well-separated curves (Figure 3b). The t-test is a powerful statistical method to test whether there is a significant difference between two datasets, as explained by Dhimish and Holmes [20]. Generally, in the t-test analysis calculated by Equation (1), we were looking for the T-value to be larger than 2.59 (though this threshold was problem-specific) to conclude that we had enough evidence that the datasets were different.

T = \frac{\bar{x_{1}} - \bar{x_{2}}}{\sqrt{\frac{{s_{1}}^{2}}{n_{1}} + \frac{{s_{2}}^{2}}{n_{2}}}}

(1)

where \bar{x_{1}} and \bar{x_{2}} were the observed mean values of samples 1 and 2, respectively; s_{1} and s_{2} were the standard deviations of the samples; and n_{1} and n_{2} were the sample sizes.

It was evident that the first dataset of two different PV operations had no significant difference (a T-value of 0.22, computed using Equation (1), which was below the 2.59 threshold). In the second dataset, day 4 presented the data of the PV system with five faulty PV modules in the string; the T-value was equal to 41.8, indicating a faulty condition in the PV system.

We concluded that the statistical t-test could be used to detect a fault in the PV systems. However, statistical methods are not usually advised for PV fault detection simply because they cannot classify the PV fault condition. For example, for the dataset in Figure 3b, the t-test confirmed that there was a fault in the PV system; however, it could not determine whether the faulty condition was associated with the disconnection of one or more PV modules.
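As an illustration of Equation (1), the following Python sketch computes the T-value for two samples and compares its magnitude with the 2.59 threshold quoted above. The power readings are invented for the example and the function name `t_statistic` is an assumption; only the formula and the threshold come from the text.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(sample_1, sample_2):
    """Two-sample T-value of Equation (1), using the sample variances."""
    n1, n2 = len(sample_1), len(sample_2)
    s1_sq, s2_sq = variance(sample_1), variance(sample_2)
    return (mean(sample_1) - mean(sample_2)) / sqrt(s1_sq / n1 + s2_sq / n2)

THRESHOLD = 2.59  # problem-specific threshold quoted in the text

# Hypothetical power readings in watts (not taken from the paper's dataset)
day_a = [1500.0, 1520.0, 1480.0, 1510.0, 1490.0]
day_b = [1505.0, 1515.0, 1485.0, 1495.0, 1500.0]  # same condition as day_a
day_c = [750.0, 760.0, 740.0, 755.0, 745.0]       # roughly half the power

differs_ab = abs(t_statistic(day_a, day_b)) > THRESHOLD
differs_ac = abs(t_statistic(day_a, day_c)) > THRESHOLD
```

With these numbers the first pair stays below the threshold and the second pair exceeds it by a wide margin. As the text notes, the test flags that a difference exists but says nothing about which fault class caused it.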

2.3. Average of the Daily Measured Ratios and Hierarchical Clustering

As an extension to the above t-test statistical approach, the averages of the ratios of the G and P values were derived from the dataset for all the values within the specified time window (9 a.m.–3 p.m.). The result was a one-dimensional array, which was then passed to a hierarchical clustering algorithm. An agglomerative clustering technique [21] was selected due to its bottom-up approach: each data point was initially treated as a separate cluster, and clusters were then merged step by step until all of them formed a single cluster.
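The feature extraction described above can be sketched as follows. The text does not state the direction of the ratio, so this example assumes P/G (power divided by irradiance) and skips samples with zero irradiance; both choices, as well as the function name `daily_mean_ratio` and the sample values, are assumptions.

```python
from statistics import mean

def daily_mean_ratio(samples):
    """Average of P/G over one day's filtered samples.

    `samples` is an iterable of (G, P) pairs. Pairs with G == 0 are skipped
    to avoid division by zero (an assumption of this sketch).
    """
    ratios = [p / g for g, p in samples if g > 0]
    return mean(ratios)

# Hypothetical values: irradiance in W/m^2, power in W
healthy_day = [(800.0, 1600.0), (900.0, 1800.0), (1000.0, 2000.0)]
faulty_day = [(800.0, 800.0), (900.0, 900.0), (1000.0, 1000.0)]

# One value per day: this is the one-dimensional array passed to clustering
feature_vector = [daily_mean_ratio(healthy_day), daily_mean_ratio(faulty_day)]
```

Each day collapses to a single number, which is why 10 weeks of minute-level data reduce to a short one-dimensional array.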

Hierarchical clustering is a powerful data mining tool that helps to identify groups of similar objects [21]. It requires the definition of the linkage parameter and the distance metric before the clustering process can be initiated. Often, the Euclidean distance is chosen as the metric. For our model, we also chose the Euclidean distance by applying Equation (2), where p and q were two samples of different PV days, q_{i} - p_{i} represented the difference of the PV samples in dimension i, and n was the dimension of the data (in our case, n = 1).

d (p, q) = \sqrt{\sum_{i = 1}^{n} {(q_{i} - p_{i})}^{2}}

(2)

The linkage parameter determines how the distance between clusters is measured when the proximity matrix is built. Figure 4 shows the hierarchical clustering results using the single linkage clustering technique. The figure shows that 10 different cases were identified within the given dataset, each representing 7 days.

The single linkage dendrogram was able to show perfect clustering for all cases in the dataset. This was due to the grouping of clusters using the minimum distance between the neighbors of the dataset (PV fault conditions), resulting in optimal clustering for the one-dimensional data type.
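A minimal sketch of single-linkage agglomerative clustering on a one-dimensional feature is given below, using SciPy. The paper does not name the library it used, so the choice of `scipy.cluster.hierarchy` is an assumption, and the data are synthetic: 10 hypothetical fault levels with 7 daily values each.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in for the daily mean ratios: 10 fault levels x 7 days.
# The level values and the noise scale are invented for illustration.
levels = np.linspace(2.0, 0.2, 10)
ratios = np.concatenate([lvl + rng.normal(0.0, 0.005, size=7) for lvl in levels])

# Single linkage merges the two clusters whose closest members are nearest,
# with the Euclidean distance of Equation (2) as the metric.
Z = linkage(ratios.reshape(-1, 1), method="single", metric="euclidean")

# Cut the dendrogram so that at most 10 flat clusters remain.
labels = fcluster(Z, t=10, criterion="maxclust")

# One row per week: a perfect partition gives a single label per row.
weekly = labels.reshape(10, 7)
```

On well-separated one-dimensional data such as this, single linkage recovers each week as its own cluster because the gaps within a week are much smaller than the gaps between weeks.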

2.4. K-Means on the Average of the Daily Measured Ratios

After obtaining perfect clustering through the agglomerative hierarchical clustering approach using the average ratios, the same dataset was passed to the K-means clustering algorithm. K-means clustering is an iterative algorithm used to identify groups of similar objects. Unlike hierarchical clustering, it requires the number of clusters as an initialization parameter. Another parameter to choose is the metric. We chose the Euclidean distance (previously shown in Equation (2)) as the metric for our K-means clustering model.

We used a Python function to implement the K-means clustering using ‘m’ initializations and chose the one that provided the best solution. We advise the use of either hierarchical clustering with a single linkage or K-means with a reasonably large number of initializations (m = 100 worked well for our model).
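The multiple-initialization strategy can be sketched with scikit-learn, whose `KMeans` estimator exposes it through the `n_init` parameter. The text only says that a Python function was used, so scikit-learn is an assumption here, and the data are the same kind of synthetic stand-in as before.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic daily mean ratios: 10 hypothetical fault levels x 7 days each
levels = np.linspace(2.0, 0.2, 10)
X = np.concatenate(
    [lvl + rng.normal(0.0, 0.005, size=7) for lvl in levels]
).reshape(-1, 1)

# n_init=100 runs K-means from 100 different initializations and keeps the
# run with the lowest inertia (within-cluster sum of squared distances).
km = KMeans(n_clusters=10, n_init=100, random_state=0).fit(X)

weekly = km.labels_.reshape(10, 7)  # one row per week
best_inertia = km.inertia_
```

A single run can converge to a poor local optimum in which two weeks share a cluster; repeating the initialization and keeping the lowest-inertia run makes that outcome much less likely.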

For some PV system datasets, the total number of clusters might be unknown. Therefore, the validation of the clustering results is required. We recommend using evaluation methods such as the Dunn or Silhouette indexes. Table 2 shows the Dunn and Silhouette indexes [22] calculated for the partitions made by hierarchical and K-means clustering performed for different numbers of clusters. For both indexes, the maximum value corresponded with the perfect partition; therefore, we confirmed our suggestion that about 10 was the optimal number of clusters.

For a few complicated datasets, sophisticated evaluation methods may be required. For example, Ref. [23] uses a decision combined from a few evaluation methods; in [24], a new method using a Sugeno fuzzy function was suggested.
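The two validation indexes named above can be compared across candidate cluster counts as in the sketch below. The Silhouette index comes from scikit-learn; the Dunn index has no scikit-learn implementation, so `dunn_index` is a hand-written helper that assumes one-dimensional data. The data are synthetic and the library choices are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    dist = np.abs(X - X.T)  # pairwise distances; valid because X is (n, 1)
    same = labels[:, None] == labels[None, :]
    return dist[~same].min() / dist[same].max()

rng = np.random.default_rng(0)
levels = np.linspace(2.0, 0.2, 10)  # 10 hypothetical fault levels
X = np.concatenate(
    [lvl + rng.normal(0.0, 0.005, size=7) for lvl in levels]
).reshape(-1, 1)

Z = linkage(X, method="single")

# Score the partition obtained for each candidate number of clusters.
scores = {}
for k in range(2, 15):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = (silhouette_score(X, labels), dunn_index(X, labels))

best_by_silhouette = max(scores, key=lambda k: scores[k][0])
best_by_dunn = max(scores, key=lambda k: scores[k][1])
```

For both indexes a higher value means a better partition, so the candidate with the maximum score is taken as the estimated number of clusters.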

2.5. Machine Learning-Based Network

Clustering showed that the average of the daily measured ratios was a compelling feature to characterize the level of fault and could be used for classification. We used the averages of the daily measured ratios as the input for the machine learning-based models; the process is demonstrated in Figure 5. Our model did not require all the collected data to be labelled. Clustering performed before ML training helped to identify faults (classes) with a very high accuracy level.

It is essential to mention that the ML framework in this work was intended to set the foundations for future training on datasets that involve more features, missing data, and a requirement for feature engineering. The statistical approach coupled with the clustering technique was sufficient for this problem statement. The ML algorithms were included to examine how the models performed on this dataset, as we planned to test the PV system with larger and more complicated feature sets.

It is also important to mention the use of cross-validation (CV) for our application. There are many types of CV such as hold-out, K-fold, stratified K-fold, and leave-one-out CV (LOOCV). These techniques provided us with the tools for validating the performance of our models within the development stage, assisting with the selection of the most appropriate model for further testing and potential deployment. Hold-out is the most basic form of CV. K-fold CV works by splitting the data points into K folds, so that every data point is used for both training and testing. There is no formula for choosing K; we selected 10, as this is the usual convention, depending on the size of the data. A K-value of 5 would be satisfactory with smaller datasets; a larger K-value may be required for larger datasets. When selecting the K-value for smaller datasets, it is essential to keep in mind the trade-off between bias and computational efficiency.
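A 10-fold cross-validation run of the kind described can be sketched with scikit-learn as follows. The stratified variant, the Gaussian Naïve Bayes classifier, the class means, and the synthetic one-feature dataset are all assumptions chosen for illustration; the text specifies only that K = 10 was used.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic one-feature dataset: 100 daily mean ratios per condition.
# The class means and noise scale are invented for this example.
class_means = {"NO": 2.0, "LF": 1.6, "HF": 1.0, "SF": 0.2}
X = np.concatenate(
    [m + rng.normal(0.0, 0.02, size=100) for m in class_means.values()]
).reshape(-1, 1)
y = np.repeat(list(class_means.keys()), 100)

# Each of the 10 folds serves once as the test set while the other nine are
# used for training; stratification keeps the class proportions per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
```

The spread of the 10 fold scores, not only their mean, is worth inspecting, since a large spread suggests the model is sensitive to which samples it sees during training.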

3. Results

3.1. ML-Based Detection Accuracy

Our initial dataset consisted of only 70 samples. Although these samples were a high-level representation of the 97,200 instances that had been reduced through the statistical and machine learning techniques discussed in Section 2 and clustered through hierarchical (agglomerative) clustering, they were not enough to provide confidence in the high accuracy of the model. Therefore, the dataset was increased to around 400 instances, each representing one of the PV system conditions (NO, LF, HF, and SF).

The metric selected for validating the performance of the classifiers was accuracy. It was paramount to select the metric correctly in order to evaluate the models properly. Because our dataset could be regarded as balanced, accuracy was an appropriate metric to capture the performance of the models.

The accuracy of all models under consideration is shown in Table 3. It is important to note that all models were trained using the original dataset shown previously in Figure 2a without any additional noise.

If accuracy were the only criterion for model selection, we could select any of the tested ML models that achieved 100%. However, this would not provide us with the most effective model for our application. For example, MLP could be selected, which is based on a deep learning architecture consisting of many hidden layers. Although it would provide high accuracy rates, its power consumption would require specialist hardware for deployment on an edge device. In contrast, RF would also offer high accuracy without demanding vast amounts of energy from the device resources.

To verify our findings in Table 3, the output confusion matrices of two of the ML algorithms are presented in Figure 6. According to Figure 6a, the faulty conditions of the PV system were correctly classified by the KNN model (100%). However, there was a minor drop in the PV fault detection accuracy (97%) using the LDR model (Figure 6b). Nine samples of the high percentage of PV fault (HF) were identified as either LF or SF faulty conditions, and three samples were identified as HF rather than SF. This was expected as there was a correlation between all PV faulty conditions.

In summary, it is worth noting that the high precision of the ML models was related to the fact that the examined PV system data had no noisy or missing samples.
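For readers unfamiliar with how accuracy and precision are read from a confusion matrix, the following Python sketch builds one for the four PV conditions. The predictions are invented and do not reproduce Figure 6; the helper names are assumptions.

```python
from collections import Counter

CLASSES = ["NO", "LF", "HF", "SF"]

def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def accuracy(matrix):
    """Fraction of samples on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def precision(matrix, class_index):
    """Correct predictions of a class divided by all predictions of it."""
    predicted = sum(row[class_index] for row in matrix)
    return matrix[class_index][class_index] / predicted if predicted else 0.0

# Hypothetical labels: one HF sample is misclassified as LF
y_true = ["NO", "NO", "LF", "LF", "HF", "HF", "HF", "SF"]
y_pred = ["NO", "NO", "LF", "LF", "HF", "LF", "HF", "SF"]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
lf_precision = precision(cm, CLASSES.index("LF"))
```

Off-diagonal entries show which conditions are confused with each other, which a single accuracy figure cannot reveal.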

3.2. ML Model Accuracy Using the Past 15 Years of PV Installation Data: A Case Study for a Noisy Dataset

To further validate the accuracy of the ML algorithms, the data of a relatively old PV installation (installed in 2006) were used (Figure 7a). The PV system contained 32 polycrystalline silicon PV modules with a nominal peak power of 4.16 kW. The PV modules were organized into eight PV strings and each string was made of four series-connected PV modules. The PV system was connected to an MPPT unit, which had an output efficiency of not less than 95%.

We collected the data over one week (Figure 7b). As explained earlier, the data from every day were filtered between 9:00 a.m.–3:00 p.m.; hence, 420 samples were collected for every day (based on a resolution of one minute). The PV system was affected by different types of faults:

(1): Days 1 and 6: normal operation, “NO”.
(2): Days 2 and 5: four PV modules were disconnected, which corresponded with an LF faulty condition.
(3): Days 3 and 7: twelve PV modules were disconnected, which corresponded with an HF faulty condition.
(4): Day 4: five PV strings were disconnected, which corresponded with an SF faulty condition.

Following both stages illustrated earlier in Figure 5, the accuracy of the ML algorithms was evaluated and is summarized in Table 4. The accuracy of the CART, RF, and NB models remained the highest at 100%, demonstrating that these ML models were computationally effective even with noisy datasets.

There was a notable drop in the accuracy of the remaining ML models, with the lowest accuracy (MLP model) approaching 44%. Figure 8 demonstrates the confusion matrix for this model. In almost every PV faulty condition, the MLP model could not predict the correct fault.

3.3. ML Model Accuracy Using the Past 15 Years of PV Installation Data: A Case Study for a Missing Dataset

We further evaluated the ML models using the PV dataset shown in Figure 7b with several periods of data removed. The missing periods introduced into the dataset are summarized in Table 5. The resulting dataset is shown in Figure 9.

Following the procedure in Figure 5, we passed the data shown in Figure 9 through every ML model to check its detection accuracy. The results of this test are shown in Table 6. We observed that the NB model accuracy was the highest (94%) among the ML models. Notably, CART and RF achieved 100% in the previous case studies; however, the missing data reduced their detection accuracy to 83% and 81%, respectively.

The confusion matrix of the best-performing ML model (NB) is shown in Figure 10. As can be seen, few inaccurate predictions were made, making the NB model the most appropriate to use with any data for PV fault detection. For further information, this model is the Gaussian extension of the classical Naïve Bayes model. The Gaussian (in other words, normal distribution) model was the easiest to work with because we only required the estimation of the mean and standard deviation of the samples, making it well suited to the PV dataset, especially when dealing with noisy or missing data.
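To make the point about mean and standard deviation concrete, the sketch below implements a one-feature Gaussian Naïve Bayes classifier from scratch. It is an illustrative reconstruction, not the authors' code: the function names and the sample values are assumptions, and library implementations typically add variance smoothing that is omitted here.

```python
from collections import defaultdict
from math import log, pi
from statistics import mean, stdev

def fit_gaussian_nb(values, labels):
    """Estimate prior, mean, and standard deviation of the feature per class."""
    grouped = defaultdict(list)
    for x, label in zip(values, labels):
        grouped[label].append(x)
    total = len(values)
    return {
        label: (len(xs) / total, mean(xs), stdev(xs))
        for label, xs in grouped.items()
    }

def predict(model, x):
    """Return the class with the highest log-posterior for a single value x."""
    def log_posterior(params):
        prior, mu, sigma = params
        log_likelihood = (
            -0.5 * log(2 * pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        )
        return log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical daily mean ratios, three training days per condition
values = [2.00, 1.98, 2.02, 1.60, 1.58, 1.62, 1.00, 0.98, 1.02, 0.20, 0.22, 0.18]
labels = ["NO"] * 3 + ["LF"] * 3 + ["HF"] * 3 + ["SF"] * 3

model = fit_gaussian_nb(values, labels)
predictions = [predict(model, x) for x in (1.97, 1.05, 0.25)]
```

Because the fitted model consists of only three numbers per class, removing some training samples shifts the estimates slightly rather than changing the structure of the model, which is one plausible reason for its stability on the missing-data case.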

In contrast, the CART model depends on the linkage of a different dataset and an entire exposition of the data, as explained well by Jaworski et al. [23]. When missing data were present, this model was unreliable. This observation was also valid for the RF model.

MLP provided the lowest classification accuracy in both scenarios. The primary justification for this was the size of the dataset. MLP is categorized as a deep learning algorithm. It is trained through back-propagation, which iteratively optimizes the weights until the loss converges to a minimum. For this process to work well, vast amounts of data are required to train the algorithm. Otherwise, the model does not have enough data points to carry out the weight optimization and provide a model that generalizes.

Conversely, KNN does not require huge amounts of data and its performance is improved in cases where low dimensional data are used. The likely reason for the low classification accuracy of this algorithm on a relatively small dataset is that the algorithm requires the normalization of the input values, because the model relies on calculating the distance between data points. Normalization was seen as an additional pre-processing step and was not implemented.

Furthermore, the distance metric selected for KNN was the Euclidean distance, as opposed to the Manhattan distance or any other criterion. The statistical characteristics of the dataset may have contributed to the low accuracy obtained. The accuracy could potentially be increased by performing hyper-parameter tuning of the distance metric; however, the purpose of this research was to present robust models that did not require specific parameter tuning for an acceptable level of accuracy.

The same could be said for SVM, as a ‘linear’ kernel defines the hyper-plane and margins rather than a radial basis function (RBF) or any other kernel. Again, the objective here was to achieve optimally performing models that could be generalizable without requiring model-specific parameter optimization.

4. Conclusions

In conclusion, we believe that our incremental approach to arriving at the most effective ML model was successful. Through statistical testing, we demonstrated a workflow to establish an optimal approach for fault detection. Although we implemented this workflow for PV fault detection, it could be followed for other uses. Depending on the type of application, scale of the dataset, and availability of the target class, developers can implement the presented framework for determining the most effective model. As shown in [25], the selection of an architecture can also have a significant impact on the deployed system; hence, determining the most effective architecture not only provides high accuracy, but is also effective in terms of computational demand.

The importance of testing developed models on unseen or noisy data is shown in our results section. When tested on the original dataset, most models boasted 100% accuracy. However, the transformation of the original dataset by removing specific data points significantly diminished the accuracy of the algorithms, demonstrating their vulnerability and sensitivity to manipulated data. The existence of missing values may have also affected the resultant performance, which shall be explored in future work by using advanced interpolation techniques [26].

Naïve Bayes (NB) provided a detection accuracy ranging from 94% to 100% across the original datasets and the dataset with missing samples, based on our testing. We observed various points that may have contributed to its high efficacy. Firstly, NB is a strong contender when the training dataset consists of a small number of input features, as is the case for our chosen dataset. For other applications where a high number of input features is required for model training, decision-tree-based models such as Random Forest (RF) usually perform better. This is evident from the widespread implementation of RF models in various industry-specific applications that require a larger number of input features. It is important to note that the Gaussian NB variant used here assumes that each input feature follows a Gaussian distribution within each class.
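The Gaussian assumption can be made concrete with a minimal from-scratch sketch of Gaussian NB. This is an illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, and the two-feature samples are invented values rather than measurements from the PV systems.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log-posterior under independent Gaussians."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda c: log_posterior(*model[c]))

# Hypothetical two-feature samples (invented, not from the PV dataset).
X = [(1.00, 0.90), (0.95, 1.00), (1.05, 0.95),
     (0.40, 0.50), (0.50, 0.45), (0.45, 0.55)]
y = ["normal", "normal", "normal", "fault", "fault", "fault"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (0.98, 0.97)))
print(predict_gaussian_nb(model, (0.48, 0.50)))
```

Because only a mean and a variance per feature and class are estimated, the number of parameters stays small, which is one plausible reason the method suits datasets with few input features.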

Author Contributions

Conceptualization, M.H., H.A.-A. and R.H.; Formal analysis, M.H.; Funding acquisition, H.A.-A. and R.H.; Investigation, M.H. and H.A.-A.; Methodology, M.H.; Project administration, H.A.-A. and R.H.; Visualization, H.A.-A. and R.H.; Writing—original draft, M.H.; Writing—review & editing, H.A.-A. and R.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hu, Y.; Cao, W.; Ma, J.; Finney, S.J.; Li, D. Identifying PV Module Mismatch Faults by a Thermography-Based Temperature Distribution Analysis. IEEE Trans. Device Mater. Reliab. 2014, 14, 951–960.
2. Schirone, L.; Califano, F.P.; Pastena, M. Fault Detection in a photovoltaic plant by time domain reflectometry. Prog. Photovolt. Res. Appl. 1994, 2, 35–44.
3. Silvestre, S.; da Silva, M.A.; Chouder, A.; Guasch, D.; Karatepe, E. New procedure for fault detection in grid connected PV systems based on the evaluation of current and voltage indicators. Energy Convers. Manag. 2014, 86, 241–249.
4. Li, X.; Wen, H.; Hu, Y.; Jiang, L. Drift-free current sensorless MPPT algorithm in photovoltaic systems. Sol. Energy 2019, 177, 118–126.
5. Yuchuan, W.; Qinli, L.; Yaqin, S. Application of BP neural network fault diagnosis in solar Photovoltaic System. In Proceedings of the IEEE International Conference on Mechatronics and Automation, Changchun, China, 9–12 August 2009; pp. 9–12.
6. Mellit, A.; Sağlam, Ş.; Kalogirou, S.A. Artificial neural network-based model for estimating the produced power of a photovoltaic module. Renew. Energy 2013, 60, 71–78.
7. Dhimish, M. Defining the best-fit machine learning classifier to early diagnose photovoltaic solar cells hot-spots. Case Stud. Therm. Eng. 2021, 25, 100980.
8. Moreno, G.; Martin, P.; Santos, C.; Rodriguez, F.J.; Santiso, E. A Day-Ahead Irradiance Forecasting Strategy for the Integration of Photovoltaic Systems in Virtual Power Plants. IEEE Access 2020, 8, 204226–204240.
9. Chine, W.; Mellit, A.; Lughi, V.; Malek, A.; Sulligoi, G.; Pavan, A.M. A novel fault diagnosis technique for photovoltaic systems based on artificial neural networks. Renew. Energy 2016, 90, 501–512.
10. Hussain, M.; Dhimish, M.; Holmes, V.; Mather, P. Deployment of AI-based RBF network for photovoltaics fault detection procedure. AIMS Electron. Electr. Eng. 2020, 4, 1–18.
11. Hussain, M.; Chen, T.; Titrenko, S.; Su, P.; Mahmud, M. A Gradient Guided Architecture Coupled with Filter Fused Representations for Micro-Crack Detection in Photovoltaic Cell Surfaces. IEEE Access 2022, 10, 58950–58964.
12. Garoudja, E.; Harrou, F.; Sun, Y.; Kara, K.; Chouder, A.; Silvestre, S. Statistical fault detection in photovoltaic systems. Sol. Energy 2017, 150, 485–499.
13. Mhamdi, S.; Girard, P.; Virazel, A.; Bosio, A.; Faehn, E.; Ladhar, A. Cell-Aware Defect Diagnosis of Customer Returns Based on Supervised Learning. IEEE Trans. Device Mater. Reliab. 2020, 20, 329–340.
14. Dhimish, M.; Chen, Z. Novel Open-Circuit Photovoltaic Bypass Diode Fault Detection Algorithm. IEEE J. Photovolt. 2019, 9, 1819–1827.
15. Yao, H.; Fu, D.; Zhang, P.; Li, M.; Liu, Y. MSML: A Novel Multilevel Semi-Supervised Machine Learning Framework for Intrusion Detection System. IEEE Internet Things J. 2018, 6, 1949–1959.
16. Zhao, Y.; Ball, R.; Mosesian, J.; de Palma, J.-F.; Lehman, B. Graph-Based Semi-supervised Learning for Fault Detection and Classification in Solar Photovoltaic Arrays. IEEE Trans. Power Electron. 2014, 30, 2848–2858.
17. Gaggero, G.B.; Rossi, M.; Girdinio, P.; Marchese, M. Detecting System Fault/Cyberattack within a Photovoltaic System Connected to the Grid: A Neural Network-Based Solution. J. Sens. Actuator Netw. 2020, 9, 20.
18. Park, S.; Park, S.; Kim, M.; Hwang, E. Clustering-Based Self-Imputation of Unlabeled Fault Data in a Fleet of Photovoltaic Generation Systems. Energies 2020, 13, 737.
19. Parvez, I.; Aghili, M.; Sarwat, A.I.; Rahman, S.; Alam, F. Online power quality disturbance detection by support vector machine in smart meter. J. Mod. Power Syst. Clean Energy 2018, 7, 1328–1339.
20. Rezvani, S.; Wang, X.; Pourpanah, F. Intuitionistic Fuzzy Twin Support Vector Machines. IEEE Trans. Fuzzy Syst. 2019, 27, 2140–2151.
21. Dhimish, M.; Holmes, V. Fault detection algorithm for grid-connected photovoltaic plants. Sol. Energy 2016, 137, 236–245.
22. Li, M.J.; Ng, M.K.; Cheung, Y.-M.; Huang, J.Z. Agglomerative Fuzzy K-Means Clustering Algorithm with Selection of Number of Clusters. IEEE Trans. Knowl. Data Eng. 2008, 20, 1519–1534.
23. Jaworski, M.; Duda, P.; Rutkowski, L. New Splitting Criteria for Decision Trees in Stationary Data Streams. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 2516–2529.
24. Dhimish, M.; Badran, G. Photovoltaic Hot-Spots Fault Detection Algorithm Using Fuzzy Systems. IEEE Trans. Device Mater. Reliab. 2019, 19, 671–679.
25. Vieira, R.G.; Dhimish, M.; De Araújo, F.M.U.; Guerra, M.I.S. PV Module Fault Detection Using Combined Artificial Neural Network and Sugeno Fuzzy Logic. Electronics 2020, 9, 2150.
26. Hussain, M.; Chen, T.; Hill, R. Moving toward Smart Manufacturing with an Autonomous Pallet Racking Inspection System Based on MobileNetV2. J. Manuf. Mater. Process. 2022, 6, 75.

Figure 1. (a) Examined PV system; (b) distribution of power–irradiance correlation for 70 days; (c) distribution of power–irradiance correlation over 10 weeks.

Figure 2. (a) All datasets captured over 10 weeks; (b) visual presentation of how the measured power changed over the day; (c) optimal correlation in the dataset; (d) worst correlation in the dataset.

Figure 3. Power measured for the same fault in different ways: (a) measurement overlap; (b) measurement differences.

Figure 4. Hierarchical clustering using the single linkage clustering technique.

Figure 5. ML flowchart for PV fault identification.

Figure 6. Output confusion matrix for (a) KNN and (b) LDR.

Figure 7. (a) Image of the second examined PV system; (b) output power measured over one week.

Figure 8. Output confusion matrix for MLP model.

Figure 9. PV dataset with missing samples.

Figure 10. Output confusion matrix for the most accurate ML model, Gaussian Naïve Bayes (NB).

Table 1. Considered PV conditions.

Period	Type	Description
Week 1	NO	Normal operation: no faulty PV module(s)
Week 2–4	LF	Low percentage of PV faults
Week 5–9	HF	High percentage of PV faults
Week 10	SF	Faulty PV string

Table 2. Dunn and Silhouette indexes for various numbers of classes.

Classification Type	Index	8 Classes	9 Classes	10 Classes	11 Classes	12 Classes
Hierarchical	Dunn	0.72	1.14	4.44	0.57	0.72
K-Means	Silhouette	0.86	0.93	0.99	0.97	0.94
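The selection implied by Table 2 can be sketched as follows: both the Dunn and Silhouette indexes are "higher is better" measures, so the candidate number of classes with the largest index is preferred. The variable and function names are hypothetical; the values are copied from Table 2.

```python
# Values copied from Table 2; names are illustrative only.
class_counts = [8, 9, 10, 11, 12]
dunn_hierarchical = [0.72, 1.14, 4.44, 0.57, 0.72]
silhouette_kmeans = [0.86, 0.93, 0.99, 0.97, 0.94]

def best_class_count(counts, scores):
    """Return the class count whose index value is largest."""
    return max(zip(counts, scores), key=lambda pair: pair[1])[0]

best_by_dunn = best_class_count(class_counts, dunn_hierarchical)
best_by_silhouette = best_class_count(class_counts, silhouette_kmeans)
print(best_by_dunn, best_by_silhouette)  # 10 10
```

Both indexes peak at 10 classes, so the two clustering approaches agree on the number of classes.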

Table 3. Accuracy of fault prediction for different ML models.

Model	Accuracy (%)
Linear Discriminant Analysis (LDR)	97
K-Nearest Neighbor (KNN)	100
Decision Tree (CART)	100
Random Forest (RF)	100
Gaussian Naïve Bayes (NB)	100
Support Vector Machine (SVM)	100
Multi-Layer Perceptron (MLP)	100

Table 4. Accuracy of fault prediction for different ML models using a 10-year-old PV installation (Figure 7a).

Model	Accuracy (%)
Linear Discriminant Analysis (LDR)	86
K-Nearest Neighbor (KNN)	58
Decision Tree (CART)	100
Random Forest (RF)	100
Gaussian Naïve Bayes (NB)	100
Support Vector Machine (SVM)	81
Multi-Layer Perceptron (MLP)	44

Table 5. Day number vs. count of missing hours.

Day No.	Period of Missing Data Samples	Count of Missing Hours in the Filtered Time Window (from 9:00 a.m.–3:00 p.m.)
Day 1	11:00–14:00	3
Day 2	7:00–11:00	2
Day 3	16:00–19:00	0
Day 4	12:00–17:00	3
Day 5	10:00–15:00	5
Day 6	12:00–19:00	3
Day 7	6:00–12:00	3
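The last column of Table 5 is the overlap, in hours, between each missing period and the filtered 9:00–15:00 window. A minimal sketch of that arithmetic is given below; the function name and the integer-hour representation are assumptions made for illustration, while the periods are copied from Table 5.

```python
def missing_hours_in_window(start, end, window_start=9, window_end=15):
    """Hours of the interval [start, end) that fall inside the filtered window."""
    return max(0, min(end, window_end) - max(start, window_start))

# Missing periods from Table 5, as (start hour, end hour) on a 24 h clock.
missing_periods = {
    1: (11, 14),
    2: (7, 11),
    3: (16, 19),
    4: (12, 17),
    5: (10, 15),
    6: (12, 19),
    7: (6, 12),
}

missing_counts = {day: missing_hours_in_window(*period)
                  for day, period in missing_periods.items()}
print(missing_counts)  # {1: 3, 2: 2, 3: 0, 4: 3, 5: 5, 6: 3, 7: 3}
```

Day 3 contributes no missing hours because its gap (16:00–19:00) lies entirely outside the filtered window.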

Table 6. Accuracy of fault prediction for different ML models using a dataset with missing samples.

Model	Accuracy (%)
Linear Discriminant Analysis (LDR)	77
K-Nearest Neighbor (KNN)	51
Decision Tree (CART)	83
Random Forest (RF)	81
Gaussian Naïve Bayes (NB)	94
Support Vector Machine (SVM)	73
Multi-Layer Perceptron (MLP)	42
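One simple way to compare robustness across Tables 3, 4, and 6 is to take each model's worst-case accuracy over the three experiments. This is an illustrative summary, not an analysis performed in the paper; the dictionary names are hypothetical and the values are copied from the tables.

```python
# Accuracies (%) copied from Tables 3, 4 and 6; names are illustrative only.
table3 = {"LDR": 97, "KNN": 100, "CART": 100, "RF": 100, "NB": 100, "SVM": 100, "MLP": 100}
table4 = {"LDR": 86, "KNN": 58, "CART": 100, "RF": 100, "NB": 100, "SVM": 81, "MLP": 44}
table6 = {"LDR": 77, "KNN": 51, "CART": 83, "RF": 81, "NB": 94, "SVM": 73, "MLP": 42}

# Worst-case accuracy of each model over the three experiments.
worst_case = {m: min(table3[m], table4[m], table6[m]) for m in table3}

# The model whose worst-case accuracy is highest.
most_robust = max(worst_case, key=worst_case.get)
print(worst_case)
print(most_robust)  # NB
```

Under this criterion Gaussian NB ranks first with a worst case of 94%, which matches the 94% to 100% range reported for it.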

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

