Search Results (282)

Search Parameters:
Keywords = imbalanced data analysis

22 pages, 5791 KiB  
Article
Vibration Analysis Using Multi-Layer Perceptron Neural Networks for Rotor Imbalance Detection in Quadrotor UAV
by Ba Tarfi Salem Abdullah Salem, Mohd Na’im Abdullah, Faizal Mustapha, Nur Shahirah Atifah Kanirai and Mazli Mustapha
Drones 2025, 9(2), 102; https://doi.org/10.3390/drones9020102 - 30 Jan 2025
Viewed by 451
Abstract
Rotor imbalance in quadrotor UAVs poses a critical challenge, compromising flight stability, increasing maintenance demands, and reducing overall operational efficiency. Traditional vibration analysis methods, such as Fast Fourier Transform (FFT) and wavelet analysis, often struggle with non-stationary signals and real-time data processing, limiting their effectiveness under dynamic UAV operating conditions. To address these challenges, this study develops a machine learning-based vibration analysis system using a Multi-Layer Perceptron (MLP) neural network for real-time rotor imbalance detection. The system integrates Micro-Electro-Mechanical Systems (MEMS) sensors for vibration data acquisition, preprocessing techniques for noise reduction and feature extraction, and an optimized MLP architecture tailored to high-dimensional vibration data. Experimental validation was conducted under controlled flight scenarios, collecting a comprehensive dataset of 800 samples representing both balanced and imbalanced rotor conditions. The optimized MLP model, featuring five hidden layers, achieved a Root Mean Squared Error (RMSE) of 0.1414 and a correlation coefficient (R²) of 0.9224 on the test dataset, demonstrating high accuracy and reliability. This study highlights the potential of MLP-based diagnostics to enhance UAV reliability, safety, and operational efficiency, providing a scalable and effective solution for rotor imbalance detection in dynamic environments. The findings offer significant implications for improving UAV performance in addition to minimizing downtime in various industrial and commercial applications.
32 pages, 503 KiB  
Article
A Survey of Methods for Addressing Imbalance Data Problems in Agriculture Applications
by Tajul Miftahushudur, Halil Mertkan Sahin, Bruce Grieve and Hujun Yin
Remote Sens. 2025, 17(3), 454; https://doi.org/10.3390/rs17030454 - 29 Jan 2025
Viewed by 294
Abstract
This survey explores recent advances in addressing class imbalance issues for developing machine learning models in precision agriculture, with a focus on techniques used for plant disease detection, soil management, and crop classification. We examine the impact of class imbalance on agricultural data and evaluate various resampling methods, such as oversampling and undersampling, as well as algorithm-level approaches, to mitigate this challenge. The paper also highlights the importance of evaluation metrics, including F1-score, G-mean, and MCC, in assessing the performance of machine learning models under imbalanced conditions. Additionally, the review provides an in-depth analysis of emerging trends in the use of generative models, like GANs and VAEs, for data augmentation in agricultural applications. Despite the significant progress, challenges such as noisy data, incomplete datasets, and lack of publicly available datasets remain. This survey concludes with recommendations for future research directions, including the need for robust methods that can handle high-dimensional agricultural data effectively.
(This article belongs to the Section Remote Sensing in Agriculture and Vegetation)
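The three metrics named in the abstract above (F1-score, G-mean, and MCC) can be computed directly from a binary confusion matrix. The following is a minimal sketch using their standard textbook definitions; the function name `imbalance_metrics` and the example counts are hypothetical and are not taken from the survey.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Return F1-score, G-mean, and MCC for a binary confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # G-mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    # MCC: correlation between predicted and true labels; 0.0 if undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"f1": f1, "g_mean": g_mean, "mcc": mcc}

# Hypothetical counts: 10 positives in 1000 samples, and a classifier
# that predicts "negative" for everything (accuracy would be 0.99).
all_negative = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
print(all_negative)
```

In this illustrative case accuracy is 99% while all three metrics are zero, which is the usual argument for reporting them under imbalanced conditions.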
25 pages, 1815 KiB  
Article
Spatio-Temporal Agnostic Sampling for Imbalanced Multivariate Seasonal Time Series Data: A Study on Forest Fires
by Abdul Mutakabbir, Chung-Horng Lung, Kshirasagar Naik, Marzia Zaman, Samuel A. Ajila, Thambirajah Ravichandran, Richard Purcell and Srinivas Sampalli
Sensors 2025, 25(3), 792; https://doi.org/10.3390/s25030792 - 28 Jan 2025
Viewed by 386
Abstract
Natural disasters are mostly seasonal and caused by anthropological, climatic, and geological factors that impact human life, economy, ecology, and natural resources. This paper focuses on increasingly widespread forest fires, causing greater destruction in recent years. Data obtained from sensors for predicting forest fires and assessing fire severity, i.e., area burned, are multivariate, seasonal, and highly imbalanced with a ratio of 100,000+ non-fire events to 1 fire event. This paper presents Spatio-Temporal Agnostic Sampling (STAS) to overcome the challenge of highly imbalanced data. This paper first presents a mathematical understanding of fire and non-fire events and then a thorough complexity analysis of the proposed STAS framework and two existing methods, NearMiss and SMOTE. Further, to investigate the applicability of STAS, binary classification models (to determine the probability of forest fire) and regression models (to assess the severity of forest fire) were built on the data generated from STAS. A total of 432 experiments were conducted to validate the robustness of the STAS parameters. Additional experiments with a temporal data split were conducted to further validate the results. The results show that 180 of the 216 binary classification models had an F1 score > 0.9 and 150 of the 216 regression models had an R² score > 0.75. These results indicate the applicability of STAS for fire prediction with highly imbalanced multivariate seasonal time series data.
(This article belongs to the Special Issue Feature Papers in the Internet of Things Section 2024)
23 pages, 1615 KiB  
Article
Enhancing Student Academic Success Prediction Through Ensemble Learning and Image-Based Behavioral Data Transformation
by Shuai Zhao, Dongbo Zhou, Huan Wang, Di Chen and Lin Yu
Appl. Sci. 2025, 15(3), 1231; https://doi.org/10.3390/app15031231 - 25 Jan 2025
Viewed by 497
Abstract
Predicting student academic success is a significant task in the field of educational data analysis, offering insights for personalized learning interventions. However, the existing research faces challenges such as imbalanced datasets, inefficient feature transformation methods, and limited exploration data integration. This research introduces an innovative method for predicting student performance by transforming one-dimensional student online learning behavior data into two-dimensional images using four distinct text-to-image encoding methods: Pixel Representation (PR), Sine Wave Transformation (SWT), Recurrence Plot (RP), and Gramian Angular Field (GAF). We evaluated the transformed images using CNN and FCN individually as well as an ensemble network, EnCF. Additionally, traditional machine learning methods, such as Random Forest, Naive Bayes, AdaBoost, Decision Tree, SVM, Logistic Regression, Extra Trees, K-Nearest Neighbors, Gradient Boosting, and Stochastic Gradient Descent, were employed on the raw, untransformed data with the SMOTE method for comparison. The experimental results demonstrated that the Recurrence Plot (RP) method outperformed other transformation techniques when using CNN and achieved the highest classification accuracy of 0.9528 under the EnCF ensemble framework. Furthermore, the deep learning approaches consistently achieved better results than traditional machine learning, underscoring the advantages of image-based data transformation combined with advanced ensemble learning approaches.
18 pages, 1641 KiB  
Article
User Profile Construction Based on High-Dimensional Features Extracted by Stacking Ensemble Learning
by Zhaoyang Wang, Li Li, Ketai He and Zhenyang Zhu
Appl. Sci. 2025, 15(3), 1224; https://doi.org/10.3390/app15031224 - 25 Jan 2025
Viewed by 393
Abstract
Online social networks, as platforms for personal expression, have evolved into complex networks integrating political and social dimensions. This evolution has shifted the focus of network governance from addressing hacking activities to mitigating unpredictable social behaviors, such as the malicious manipulation of public opinion, the doxing of ordinary users, and cyberbullying. However, the sparsity of data and the concealed nature of user behavior pose significant challenges to existing network reconnaissance technologies. In this study, we focus on constructing user profiles on online social network platforms by extracting features to build deep user profiles based on behavioral patterns. Drawing inspiration from the 5Cs principle of credit evaluation, we refine it into a 3Cs principle tailored for user profiling on social network platforms and associate it with user behavioral patterns. To further analyze user behavior, a high-dimensional feature extraction method is proposed using an improved stacking ensemble learning model. Based on experimental data analysis, the most suitable base algorithms for high-dimensional feature extraction are identified. Experimental results demonstrate that the integration of high-dimensional features improved the behavior prediction accuracy of the profiling model by 9.26% on balanced datasets and enhanced the AUC (area under the curve) metric by 3.69% on imbalanced datasets. The proposed method effectively increases the depth and generalization performance of user profiling.
(This article belongs to the Special Issue AI Technology and Security in Cloud/Big Data)
40 pages, 1215 KiB  
Article
Major Issues in High-Frequency Financial Data Analysis: A Survey of Solutions
by Lu Zhang and Lei Hua
Mathematics 2025, 13(3), 347; https://doi.org/10.3390/math13030347 - 22 Jan 2025
Viewed by 579
Abstract
We review recent articles that focus on the main issues identified in high-frequency financial data analysis. The issues to be addressed include nonstationarity, low signal-to-noise ratios, asynchronous data, imbalanced data, and intraday seasonality. We focus on the research articles and survey papers published since 2020 on recent developments and new ideas that address the issues, while commonly used approaches in the literature are also reviewed. The methods for addressing the issues are mainly classified into two groups: data preprocessing methods and quantitative methods. The latter include various statistical, econometric, and machine learning methods. We also provide easy-to-read charts and tables to summarize all the surveyed methods and articles.
(This article belongs to the Special Issue Recent Advances in Statistical Machine Learning)
23 pages, 5680 KiB  
Article
Machine Learning-Based Alzheimer’s Disease Stage Diagnosis Utilizing Blood Gene Expression and Clinical Data: A Comparative Investigation
by Manash Sarma and Subarna Chatterjee
Diagnostics 2025, 15(2), 211; https://doi.org/10.3390/diagnostics15020211 - 17 Jan 2025
Viewed by 682
Abstract
Background/Objectives: This study presents a comparative analysis of the multistage diagnosis of Alzheimer’s disease (AD), including mild cognitive impairment (MCI), utilizing two distinct types of biomarkers: blood gene expression and clinical biomarker samples. Both of these samples, obtained from participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), were independently analyzed utilizing machine learning (ML)-based multiclassifiers. This study applied novel machine learning-based data augmentation techniques to gene expression profile data that are high-dimensional, low-sample-size (HDLSS) and inherently highly imbalanced. The investigation obtained the highest multiclassification performance to date in the multistage diagnosis of Alzheimer’s disease utilizing the blood gene expression profiles of Alzheimer’s Disease Neuroimaging Initiative (ADNI) participants. Based on the performance results obtained, and other factors such as early prediction capabilities, this study compares the efficacies of the two types of biomarkers for multistage diagnosis. This study presents the sole investigation in which multiclassification-based AD stage diagnosis was conducted utilizing blood gene expression data. We obtained the best multiclassification result in both modalities of the ADNI data in terms of F1-score and were able to identify new genetic biomarkers. Methods: The combination of the XGBoost and SFBS (Sequential Floating Backward Selection) methods was used to select the features. We were able to select the 95 most effective gene probe sets out of 49,386. For the clinical study data, eight of the most effective biomarkers were selected using SFBS. A deep learning (DL) classifier was used to identify the stages: cognitive normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD)/dementia. DL, support vector machine (SVM), gradient boosting (GB), and random forest (RF) classifiers were used for the AD stage detection from gene expression profile data. Because of the high data imbalance in genomic data, borderline oversampling/data augmentation was applied in the model training and original samples for validation. Results: Utilizing clinical data, the highest ROC AUC scores attained were 0.989, 0.927, and 0.907 for the identification of the CN, MCI, and dementia stages, respectively. The highest F1 scores achieved were 0.971, 0.939, and 0.886. Employing gene expression data, we obtained ROC AUC scores of 0.763, 0.761, and 0.706 for the CN, MCI, and dementia stages, respectively, and F1 scores of 0.71, 0.77, and 0.53 for CN, MCI, and dementia, respectively. Conclusions: This represents the best outcome to date for AD stage diagnosis from ADNI blood gene expression profile data utilizing multiclassification techniques. The results indicated that our multiclassification model effectively manages the imbalanced data of a high-dimension, low-sample-size (HDLSS) nature to identify samples of the minority class. MAPK14, PLG, FZD2, FXYD6, and TEP1 are among the novel genes identified as being associated with AD risk.
(This article belongs to the Special Issue Artificial Intelligence in Alzheimer’s Disease Diagnosis)
24 pages, 4267 KiB  
Article
The Use of Machine Learning Methods in Road Safety Research in Poland
by Anna Borucka and Sebastian Sobczuk
Appl. Sci. 2025, 15(2), 861; https://doi.org/10.3390/app15020861 - 16 Jan 2025
Viewed by 592
Abstract
Every year, thousands of accidents occur in Poland, often resulting in severe injuries or even death. The implementation of solutions supporting road safety analysis and management processes is necessary to reduce the risk of accidents and minimize their consequences. One of the rapidly developing tools that can play a key role in this area is machine learning. The aim of this study was to develop mathematical models based on ML algorithms describing road safety in Poland. First, variables with the strongest impact on safety were extracted. Then, mathematical modeling was performed using the k-Nearest Neighbors, Random Forest, and RPart algorithms. The best choice for imbalanced data, especially when the goal is to identify rare classes, is the RF model. The KNN model provides a compromise in situations where the highest overall accuracy is desired. On the other hand, the RPart model can be used as a fast, basic model, but it requires improvements to handle rare classes. The results not only identified factors that significantly affect the severity of injuries or the number of fatalities in accidents but, above all, also demonstrated the ability of ML-based models to predict threats and their consequences.
16 pages, 239 KiB  
Article
SMOTE vs. SMOTEENN: A Study on the Performance of Resampling Algorithms for Addressing Class Imbalance in Regression Models
by Gazi Husain, Daniel Nasef, Rejath Jose, Jonathan Mayer, Molly Bekbolatova, Timothy Devine and Milan Toma
Algorithms 2025, 18(1), 37; https://doi.org/10.3390/a18010037 - 10 Jan 2025
Viewed by 596
Abstract
Class imbalance is a prevalent challenge in machine learning that arises from skewed data distributions in one class over another, causing models to prioritize the majority class and underperform on the minority classes. This bias can significantly undermine accurate predictions in real-world scenarios, highlighting the importance of the robust handling of imbalanced data for dependable results. This study examines one such scenario of real-time monitoring systems for fall risk assessment in bedridden patients where class imbalance may compromise the effectiveness of machine learning. It compares the effectiveness of two resampling techniques, the Synthetic Minority Oversampling Technique (SMOTE) and SMOTE combined with Edited Nearest Neighbors (SMOTEENN), in mitigating class imbalance and improving predictive performance. Using a controlled sampling strategy across various instance levels, the performance of both methods in conjunction with decision tree regression, gradient boosting regression, and Bayesian regression models was evaluated. The results indicate that SMOTEENN consistently outperforms SMOTE in terms of accuracy and mean squared error across all sample sizes and models. SMOTEENN also demonstrates healthier learning curves, suggesting improved generalization capabilities, particularly for a sampling strategy with a given number of instances. Furthermore, cross-validation analysis reveals that SMOTEENN achieves higher mean accuracy and lower standard deviation compared to SMOTE, indicating more stable and reliable performance. These findings suggest that SMOTEENN is a more effective technique for handling class imbalance, potentially contributing to the development of more accurate and generalizable predictive models in various applications.
(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))
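For readers unfamiliar with SMOTE, its core step is interpolation between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that step in plain Python. It is not the study's implementation: the function name `smote_sketch`, the sample points, and the parameter values are hypothetical, and the Edited Nearest Neighbors cleaning pass that SMOTEENN adds afterwards is not shown.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority samples, for illustration only.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
print(smote_sketch(points, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region the minority class already occupies.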
29 pages, 4960 KiB  
Article
Effective Text Classification Through Supervised Rough Set-Based Term Weighting
by Rasım Çekik
Symmetry 2025, 17(1), 90; https://doi.org/10.3390/sym17010090 - 9 Jan 2025
Viewed by 529
Abstract
This research presents an innovative approach in text mining based on rough set theory. This study fundamentally utilizes the concept of symmetry from rough set theory to construct indiscernibility matrices and model uncertainties in data analysis, ensuring both methodological structure and solution processes remain symmetric. The effective management and analysis of large-scale textual data heavily relies on automated text classification technologies. In this context, term weighting plays a crucial role in determining classification performance. Particularly, supervised term weighting methods that utilize class information have emerged as the most effective approaches. However, the optimal representation of class–term relationships remains an area requiring further research. This study proposes the Rough Multivariate Weighting Scheme (RMWS) and presents its mathematical derivative, the Square Root Rough Multivariate Weighting Scheme (SRMWS). The RMWS model employs rough sets to identify information-carrying documents within the document–term–class space and adopts a computational methodology incorporating α, β, and γ coefficients. Moreover, the distribution of the term among classes is again effectively revealed. Comprehensive experimental studies were conducted on three different datasets featuring imbalanced-multiclass, balanced-multiclass, and imbalanced-binary class structures to evaluate the model’s effectiveness. The results show that RMWS and its derivative SRMWS methods outperform existing approaches by exhibiting superior performance on balanced and unbalanced datasets without being affected by class imbalance and number of classes. Furthermore, the SRMWS method is found to be the most effective for SVM and KNN classifiers, while the RMWS method achieves the best results for NB classifiers. These results show that the proposed methods significantly improve the text classification performance.
(This article belongs to the Section Engineering and Materials)
17 pages, 1306 KiB  
Article
An Effective Methodology for Diabetes Prediction in the Case of Class Imbalance
by Borislava Toleva, Ivan Atanasov, Ivan Ivanov and Vincent Hooper
Bioengineering 2025, 12(1), 35; https://doi.org/10.3390/bioengineering12010035 - 6 Jan 2025
Viewed by 556
Abstract
Diabetes causes an increase in the level of blood sugar, which leads to damage to various parts of the human body. Diabetes data are used not only for providing a deeper understanding of the treatment mechanisms but also for predicting the probability that one might become sick. This paper proposes a novel methodology to perform classification in the case of heavy class imbalance, as observed in the PIMA diabetes dataset. The proposed methodology uses two novel steps, namely resampling and random shuffling prior to defining the classification model. The methodology is tested with two versions of cross validation that are appropriate in cases of class imbalance: k-fold cross validation and stratified k-fold cross validation. Our findings suggest that when having imbalanced data, shuffling the data randomly prior to a train/test split can help improve estimation metrics. Our methodology can outperform existing machine learning algorithms and complex deep learning models. Applying our proposed methodology is a simple and fast way to predict labels with class imbalance. It does not require additional techniques to balance classes. It does not involve preselecting important variables, which saves time and makes the model easy for analysis. This makes it an effective methodology for initial and further modeling of data with class imbalance. Moreover, our methodologies show how to increase the effectiveness of the machine learning models based on the standard approaches and make them more reliable.
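The combination of random shuffling and stratified k-fold cross validation mentioned in the abstract above can be illustrated with a short sketch. This is a generic illustration of stratified fold assignment, not the paper's methodology: the function name `stratified_folds` and the label counts are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)                  # shuffle within each class
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # deal indices out round-robin
    return folds

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives.
labels = [0] * 90 + [1] * 10
for fold in stratified_folds(labels, k=5):
    print(len(fold), sum(labels[i] for i in fold))
```

With these counts every fold receives 18 negatives and 2 positives. A plain unshuffled k-fold split of the same ordered list would put all positives into the last fold, which is why shuffling before splitting matters for ordered, imbalanced data.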
23 pages, 1682 KiB  
Review
Wind Turbine SCADA Data Imbalance: A Review of Its Impact on Health Condition Analyses and Mitigation Strategies
by Adaiton Oliveira-Filho, Monelle Comeau, James Cave, Charbel Nasr, Pavel Côté and Antoine Tahan
Energies 2025, 18(1), 59; https://doi.org/10.3390/en18010059 - 27 Dec 2024
Viewed by 588
Abstract
The rapidly increasing installed capacity of Wind Turbines (WTs) worldwide emphasizes the need for Operation and Maintenance (O&M) strategies favoring high availability, reliability, and cost-effective operation. Optimal decision-making and planning are supported by WT health condition analyses based on data from the Supervisory Control and Data Acquisition (SCADA) system. However, SCADA data are highly imbalanced, with a predominance of healthy condition samples. Although this imbalance can negatively impact analyses such as detection, Condition Monitoring (CM), diagnosis, and prognosis, it is often overlooked in the literature. This review specifically addresses the problem of SCADA data imbalance, focusing on strategies to mitigate this condition. Five categories of such strategies were identified: Normal Behavior Models (NBMs), data-level strategies, algorithm-level strategies, cost-sensitive learning, and data augmentation techniques. This review evidenced that the choice among these strategies is mainly dictated by the availability of data and the intended analysis. Moreover, algorithm-level strategies are predominant in analyzing SCADA data because these strategies do not require the costly and time-consuming task of data labeling. An extensive public SCADA database could ease the problem of abnormal data scarcity and help handle the problem of data imbalance. However, long-dated requests to create such a database are still unaddressed.
(This article belongs to the Special Issue Computational and Experimental Fluid Dynamics for Wind Energy)
39 pages, 4291 KiB  
Review
Machine Learning and Deep Learning for Crop Disease Diagnosis: Performance Analysis and Review
by Habiba Njeri Ngugi, Andronicus A. Akinyelu and Absalom E. Ezugwu
Agronomy 2024, 14(12), 3001; https://doi.org/10.3390/agronomy14123001 - 17 Dec 2024
Viewed by 1466
Abstract
Crop diseases pose a significant threat to global food security, with both economic and environmental consequences. Early and accurate detection is essential for timely intervention and sustainable farming. This paper presents a review of machine learning (ML) and deep learning (DL) techniques for crop disease diagnosis, focusing on Support Vector Machines (SVMs), Random Forest (RF), k-Nearest Neighbors (KNNs), and deep models like VGG16, ResNet50, and DenseNet121. The review method includes an in-depth analysis of algorithm performance using key metrics such as accuracy, precision, recall, and F1 score across various datasets. We also highlight the data imbalances in commonly used datasets, particularly PlantVillage, and discuss the challenges posed by these imbalances. The research highlights critical insights regarding ML and DL models in crop disease detection. A primary challenge identified is the imbalance in the PlantVillage dataset, with a high number of healthy images and a strong bias toward certain disease categories like fungi, leaving other categories like mites and molds underrepresented. This imbalance complicates model generalization, indicating a need for preprocessing steps to enhance performance. This study also shows that combining Vision Transformers (ViTs) with Green Chromatic Coordinates and hybridizing these with SVM achieves high classification accuracy, emphasizing the value of advanced feature extraction techniques in improving model efficacy. In terms of comparative performance, DL architectures like ResNet50, VGG16, and convolutional neural network demonstrated robust accuracy (95–99%) across diverse datasets, underscoring their effectiveness in managing complex image data. Additionally, traditional ML models exhibited varied strengths; for instance, SVM performed better on balanced datasets, while RF excelled with imbalanced data. Preprocessing methods like K-means clustering, Fuzzy C-Means, and PCA, along with ensemble approaches, further improved model accuracy. Lastly, the study underscores that high-quality, well-labeled datasets, stakeholder involvement, and comprehensive evaluation metrics such as F1 score and precision are crucial for optimizing ML and DL models, making them more effective for real-world applications in sustainable agriculture.
(This article belongs to the Collection Machine Learning in Digital Agriculture)
29 pages, 4651 KiB  
Article
Hybrid Vision Transformer and Convolutional Neural Network for Multi-Class and Multi-Label Classification of Tuberculosis Anomalies on Chest X-Ray
by Rizka Yulvina, Stefanus Andika Putra, Mia Rizkinia, Arierta Pujitresnani, Eric Daniel Tenda, Reyhan Eddy Yunus, Dean Handimulya Djumaryo, Prasandhya Astagiri Yusuf and Vanya Valindria
Computers 2024, 13(12), 343; https://doi.org/10.3390/computers13120343 - 17 Dec 2024
Viewed by 855
Abstract
Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a leading cause of global mortality. While TB detection can be performed through chest X-ray (CXR) analysis, numerous studies have leveraged AI to automate and enhance the diagnostic process. However, existing approaches often focus on partial or incomplete lesion detection, lacking comprehensive multi-class and multi-label solutions for the full range of TB-related anomalies. To address this, we present a hybrid AI model combining vision transformer (ViT) and convolutional neural network (CNN) architectures for efficient multi-class and multi-label classification of 14 TB-related anomalies. Using 133 CXR images from Dr. Cipto Mangunkusumo National Central General Hospital and 214 images from the NIH datasets, we tackled data imbalance with augmentation, class weighting, and focal loss. The model achieved an accuracy of 0.911, a loss of 0.285, and an AUC of 0.510. Given the complexity of handling not only multi-class but also multi-label data with imbalanced and limited samples, the AUC score reflects the challenging nature of the task rather than any shortcoming of the model itself. By classifying the most distinct TB-related labels in a single AI study, this research highlights the potential of AI to enhance both the accuracy and efficiency of detecting TB-related anomalies, offering valuable advancements in combating this global health burden.
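
The abstract above names focal loss as one of its remedies for class imbalance but does not give its form. As a rough illustration only, the sketch below implements the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), averaged over independent labels as is typical in a multi-label setting. The function names, the defaults gamma=2.0 and alpha=0.25, and the per-label averaging are assumptions made for this sketch, not details taken from the paper.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one predicted probability p and one binary label y.

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def multilabel_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean focal loss over independent labels (hypothetical helper)."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted binary cross-entropy, which makes it easy to compare the two losses on the same predictions.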

29 pages, 9712 KiB  
Article
Cloud–Edge–End Collaborative Federated Learning: Enhancing Model Accuracy and Privacy in Non-IID Environments
by Ling Li, Lidong Zhu and Weibang Li
Sensors 2024, 24(24), 8028; https://doi.org/10.3390/s24248028 - 16 Dec 2024
Viewed by 549
Abstract
Cloud–edge–end computing architecture is crucial for large-scale edge data processing and analysis. However, the diversity of terminal nodes and task complexity in this architecture often result in non-independent and identically distributed (non-IID) data, making it challenging to balance data heterogeneity and privacy protection. To address this, we propose a privacy-preserving federated learning method based on cloud–edge–end collaboration. Our method fully considers the three-tier architecture of cloud–edge–end systems and the non-IID nature of terminal node data. It enhances model accuracy while protecting the privacy of terminal node data. The proposed method groups terminal nodes based on the similarity of their data distributions and constructs edge subnetworks for training in collaboration with edge nodes, thereby mitigating the negative impact of non-IID data. Furthermore, we enhance WGAN-GP with an attention mechanism to generate balanced synthetic data while preserving key patterns from the original datasets, reducing the adverse effects of non-IID data on global model accuracy while preserving data privacy. In addition, we introduce data resampling and loss function weighting strategies to mitigate model bias caused by imbalanced data distribution. Experimental results on real-world datasets demonstrate that our proposed method significantly outperforms existing approaches in terms of model accuracy, F1-score, and other metrics.
(This article belongs to the Section Sensor Networks)
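
The abstract above mentions loss function weighting for imbalanced data without specifying a scheme. One common heuristic, shown here purely as an assumption and not as the authors' method, weights each class by n_samples / (n_classes * class_count), so that every class contributes the same total weight to the loss. The helper names below are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs is a list of dicts mapping class -> predicted probability;
    this data layout is an assumption made for the sketch.
    """
    total = sum(weights[y] * -math.log(max(p[y], 1e-12))
                for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Usage: an 8-to-2 imbalance gives the minority class four times the weight.
labels = ["a"] * 8 + ["b"] * 2
weights = balanced_class_weights(labels)
```

Normalizing by the sum of the weights keeps the loss on the same scale as unweighted cross-entropy, so learning rates tuned for one remain roughly usable for the other.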