
Search Results (829)

Search Parameters:
Keywords = imbalanced datasets

19 pages, 1272 KiB  
Article
Hybrid Oversampling and Undersampling Method (HOUM) via Safe-Level SMOTE and Support Vector Machine
by Duygu Yilmaz Eroglu and Mestan Sahin Pir
Appl. Sci. 2024, 14(22), 10438; https://doi.org/10.3390/app142210438 - 13 Nov 2024
Abstract
The improvements in collecting and processing data using machine learning algorithms have increased the interest in data mining. This trend has led to the development of real-life decision support systems (DSSs) in diverse areas such as biomedical informatics, fraud detection, natural language processing, face recognition, autonomous vehicles, image processing, and each part of the real production environment. The imbalanced datasets in some of these studies, which result in low performance measures, have highlighted the need for additional efforts to address this issue. The proposed method (HOUM) is used to address the issue of imbalanced datasets for classification problems in this study. The aim of the model is to prevent the overfitting problem caused by oversampling and valuable data loss caused by undersampling in imbalanced data and obtain successful classification results. The HOUM is a hybrid approach that tackles imbalanced class distribution challenges, refines datasets, and improves model robustness. In the first step, majority-class data points that are distant from the decision boundary obtained via SVM are reduced. If the data are not balanced, SLS is employed to augment the minority-class data. This loop continues until the dataset becomes balanced. The main contribution of the proposed method is reproducing informative minority data using SLS and diminishing non-informative majority data using the SVM before applying classification techniques. Firstly, the efficiency of the proposed method, the HOUM, is verified by comparison with the SMOTE, SMOTEENN, and SMOTETomek techniques using eight datasets. Then, the results of the W-SIMO and RusAda algorithms, which were developed for imbalanced datasets, are compared with those of the HOUM. The strength of the HOUM is revealed through this comparison. The proposed HOUM algorithm utilizes a real dataset obtained from a project endorsed by The Scientific and Technical Research Council of Turkey. 
The collected data include quality control and processing parameters of yarn data. The aim of this project is to prevent yarn breakage errors during the weaving process on looms. This study introduces a decision support system (DSS) designed to prevent yarn breakage during fabric weaving. The high performance of the algorithm may encourage producers to manage yarn flow and enhance the HOUM’s efficiency as a DSS. Full article
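The oversample/undersample loop that HOUM describes can be sketched in a heavily simplified, stdlib-only form. This is not the authors' implementation: it works on 1-D features, and the midpoint of the two class means stands in for the paper's SVM decision boundary; all names are illustrative.

```python
import random
from statistics import mean

def hybrid_balance(majority, minority, rng=random.Random(0)):
    # Toy HOUM-style loop: alternately trim uninformative majority points
    # and synthesize minority points until the two classes are equal-sized.
    majority, minority = list(majority), list(minority)
    # Stand-in for the SVM decision boundary: midpoint of the class means.
    boundary = (mean(majority) + mean(minority)) / 2
    while len(majority) > len(minority):
        # Undersample: drop the majority point farthest from the boundary,
        # i.e. the one carrying the least boundary information.
        majority.remove(max(majority, key=lambda x: abs(x - boundary)))
        if len(majority) > len(minority):
            # Oversample: interpolate between two minority points, mimicking
            # the "safe" interpolation of Safe-Level SMOTE.
            a, b = rng.sample(minority, 2)
            minority.append(a + rng.random() * (b - a))
    return majority, minority

majority = [5.0, 5.5, 6.0, 6.5, 7.0, 9.0, 10.0, 11.0]
minority = [1.0, 1.5, 2.0]
maj_bal, min_bal = hybrid_balance(majority, minority)
```

Because each synthetic value is an interpolation, the augmented minority class never leaves the range spanned by the original minority points.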

14 pages, 6173 KiB  
Article
Enhancing Cover Management Factor Classification Through Imbalanced Data Resolution
by Kieu Anh Nguyen and Walter Chen
Environments 2024, 11(11), 250; https://doi.org/10.3390/environments11110250 - 12 Nov 2024
Abstract
This study addresses the persistent challenge of class imbalance in land use and land cover (LULC) classification within the Shihmen Reservoir watershed in Taiwan, where LULC is used to map the Cover Management factor (C-factor). The dominance of forests in the LULC categories leads to an imbalanced dataset, resulting in poor prediction performance for minority classes when using machine learning techniques. To overcome this limitation, we applied the Synthetic Minority Over-sampling Technique (SMOTE) and the 90-model SMOTE-variants package in Python to balance the dataset. Due to the multi-class nature of the data and memory constraints, 42 models were successfully used to create a balanced dataset, which was then integrated with a Random Forest algorithm for C-factor classification. The results show a marked improvement in model accuracy across most SMOTE variants, with the Selected Synthetic Minority Over-sampling Technique (Selected_SMOTE) emerging as the best-performing method, achieving an overall accuracy of 0.9524 and a sensitivity of 0.6892. Importantly, the previously observed issue of poor minority class prediction was resolved using the balanced dataset. This study provides a robust solution to the class imbalance issue in C-factor classification, demonstrating the effectiveness of SMOTE variants and the Random Forest algorithm in improving model performance and addressing imbalanced class distributions. The success of Selected_SMOTE underscores the potential of balanced datasets in enhancing machine learning outcomes, particularly in datasets dominated by a majority class. Additionally, by addressing imbalance in LULC classification, this research contributes to Sustainable Development Goal 15, which focuses on the protection, restoration, and sustainable use of terrestrial ecosystems. Full article
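The core SMOTE step that all the variants in this study build on can be written in a few lines: each synthetic sample is a random interpolation between a minority point and one of its k nearest minority neighbours. A minimal stdlib-only sketch (not any of the 90 packaged variants):

```python
import random

def smote(minority, n_new, k=2, rng=random.Random(42)):
    # Generate n_new synthetic minority samples by interpolating between
    # a random minority point and one of its k nearest minority neighbours.
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.6)]
new_points = smote(minority, n_new=4)
```

Synthetic points always fall on segments between existing minority points, so they stay inside the minority class's convex hull.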

20 pages, 10441 KiB  
Article
Proto-DS: A Self-Supervised Learning-Based Nondestructive Testing Approach for Food Adulteration with Imbalanced Hyperspectral Data
by Kunkun Pang, Yisen Liu, Songbin Zhou, Yixiao Liao, Zexuan Yin, Lulu Zhao and Hong Chen
Foods 2024, 13(22), 3598; https://doi.org/10.3390/foods13223598 - 11 Nov 2024
Abstract
Conventional food fraud detection using hyperspectral imaging (HSI) relies on the discriminative power of machine learning. However, these approaches often assume a balanced class distribution in an ideal laboratory environment, which is impractical in real-world scenarios with diverse label distributions. This results in suboptimal performance when less frequent classes are overshadowed by the majority class during training. Thus, the critical research challenge emerges of how to develop an effective classifier on a small-scale imbalanced dataset without significant bias from the dominant class. In this paper, we propose a novel nondestructive detection approach, which we call the Dice Loss Improved Self-Supervised Learning-Based Prototypical Network (Proto-DS), designed to address this imbalanced learning challenge. The proposed amalgamation mitigates the label bias on the most frequent class, further improving robustness. We validate our proposed method on three collected hyperspectral food image datasets with varying degrees of data imbalance: Citri Reticulatae Pericarpium (Chenpi), Chinese herbs, and coffee beans. Comparisons with state-of-the-art imbalanced learning techniques, including the Synthetic Minority Oversampling Technique (SMOTE) and class-importance reweighting, reveal our method’s superiority. Notably, our experiments demonstrate that Proto-DS consistently outperforms conventional approaches, achieving the best average balanced accuracy of 88.18% across various training sample sizes, whereas the Logistic Model Tree (LMT), Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN) approaches attain only 59.42%, 60.38%, and 66.34%, respectively. Overall, self-supervised learning is key to improving imbalanced learning performance and outperforms related approaches, while both prototypical networks and the Dice loss can further enhance classification performance. 
Intriguingly, self-supervised learning can provide complementary information to existing imbalanced learning approaches. Combining these approaches may serve as a potential solution for building effective models with limited training data. Full article
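Proto-DS combines self-supervised pretraining, a prototypical network, and the Dice loss; the prototypical-classification step alone can be sketched as follows (toy data and names are illustrative, and real embeddings would come from the trained encoder). Note that a prototype is a single per-class mean, computed the same way for a 2-sample class as for a 200-sample one, which is one reason prototype-based classifiers tolerate imbalance well.

```python
def prototypes(embeddings, labels):
    # One prototype per class: the mean of that class's embeddings.
    protos = {}
    for lab in set(labels):
        vecs = [e for e, l in zip(embeddings, labels) if l == lab]
        protos[lab] = tuple(sum(c) / len(vecs) for c in zip(*vecs))
    return protos

def classify(x, protos):
    # Predict the label of the nearest prototype (squared Euclidean).
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(protos, key=lambda lab: d2(x, protos[lab]))

# Imbalanced toy "embeddings": 5 samples of one class, 2 of another.
embs = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.0, 0.0), (0.1, 0.1),
        (5.0, 5.1), (5.2, 4.9)]
labs = ['chenpi'] * 5 + ['adulterated'] * 2
protos = prototypes(embs, labs)
```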

21 pages, 7007 KiB  
Article
LEM-Detector: An Efficient Detector for Photovoltaic Panel Defect Detection
by Xinwen Zhou, Xiang Li, Wenfu Huang and Ran Wei
Appl. Sci. 2024, 14(22), 10290; https://doi.org/10.3390/app142210290 - 8 Nov 2024
Abstract
Photovoltaic panel defect detection presents significant challenges due to the wide range of defect scales, diverse defect types, and severe background interference, often leading to a high rate of false positives and missed detections. To address these challenges, this paper proposes the LEM-Detector, an efficient end-to-end photovoltaic panel defect detector based on the transformer architecture. To address the low detection accuracy for Crack and Star crack defects and the imbalanced dataset, a novel data augmentation method, the Linear Feature Augmentation (LFA) module, specifically designed for linear features, is introduced. LFA effectively improves model training performance and robustness. Furthermore, the Efficient Feature Enhancement Module (EFEM) is presented to enhance the receptive field, suppress redundant information, and emphasize meaningful features. To handle defects of varying scales, complementary semantic information from different feature layers is leveraged for enhanced feature fusion. A Multi-Scale Multi-Feature Pyramid Network (MMFPN) is employed to selectively aggregate boundary and category information, thereby improving the accuracy of multi-scale target recognition. Experimental results on a large-scale photovoltaic panel dataset demonstrate that the LEM-Detector achieves a detection accuracy of 94.7% for multi-scale defects, outperforming several state-of-the-art methods. This approach effectively addresses the challenges of photovoltaic panel defect detection, paving the way for more reliable and accurate defect identification systems. This research will contribute to the automatic detection of surface defects in industrial production, ultimately enhancing production efficiency. Full article

23 pages, 632 KiB  
Article
Filtering Useful App Reviews Using Naïve Bayes—Which Naïve Bayes?
by Pouya Ataei, Sri Regula, Daniel Staegemann and Saurabh Malgaonkar
AI 2024, 5(4), 2237-2259; https://doi.org/10.3390/ai5040110 - 5 Nov 2024
Abstract
App reviews provide crucial feedback for software maintenance and evolution, but manually extracting useful reviews from vast volumes is time-consuming and challenging. This study investigates the effectiveness of six Naïve Bayes variants for automatically filtering useful app reviews. We evaluated these variants on datasets from five popular apps, comparing their performance in terms of accuracy, precision, recall, F-measure, and processing time. Our results show that Expectation Maximization-Multinomial Naïve Bayes with Laplace smoothing performed best overall, achieving up to 89.2% accuracy and 0.89 F-measure. Complement Naïve Bayes with Laplace smoothing demonstrated particular effectiveness for imbalanced datasets. Generally, incorporating Laplace smoothing and Expectation Maximization improved performance, albeit with increased processing time. This study also examined the impact of data imbalance on classification performance. Our findings suggest that these advanced Naïve Bayes variants hold promise for filtering useful app reviews, especially when dealing with limited labeled data or imbalanced datasets. This research contributes to the body of evidence around app review mining and provides insights for enhancing software maintenance and evolution processes. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
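The Laplace smoothing that runs through these Naïve Bayes variants is easy to show concretely: each word receives alpha phantom counts, so a word unseen in one class cannot drive that class's probability to zero. A minimal multinomial Naïve Bayes sketch (toy review data; not the paper's pipeline):

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    # Multinomial Naive Bayes with Laplace (add-alpha) smoothing.
    vocab = {w for d in docs for w in d}
    log_prior, log_like = {}, {}
    for c in set(labels):
        c_docs = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(c_docs) / len(docs))
        counts = Counter(w for d in c_docs for w in d)
        total = sum(counts.values())
        log_like[c] = {w: math.log((counts[w] + alpha) /
                                   (total + alpha * len(vocab)))
                       for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    # Out-of-vocabulary words are simply ignored.
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_like[c].get(w, 0.0) for w in doc))

docs = [["app", "crashes", "on", "login"], ["please", "fix", "the", "bug"],
        ["love", "this", "app"], ["great", "app"]]
labels = ["useful", "useful", "noise", "noise"]
prior, like = train_mnb(docs, labels)
```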

22 pages, 4307 KiB  
Article
Cervical Cancer Prediction Based on Imbalanced Data Using Machine Learning Algorithms with a Variety of Sampling Methods
by Mădălina Maria Muraru, Zsuzsa Simó and László Barna Iantovics
Appl. Sci. 2024, 14(22), 10085; https://doi.org/10.3390/app142210085 - 5 Nov 2024
Abstract
Cervical cancer affects a large portion of the female population, making the prediction of this disease using Machine Learning (ML) of utmost importance. ML algorithms can be integrated into complex, intelligent, agent-based systems that can offer decision support to resident medical doctors or even experienced medical doctors. For instance, an experienced medical doctor may diagnose a case but need expert support related to another medical specialty. Data imbalance is frequent in healthcare data and has a negative influence on predictions made using ML algorithms. Cancer data, in general, and cervical cancer data, in particular, are frequently imbalanced. For this study, we chose a messy, real-life cervical cancer dataset available in the Kaggle repository that includes large amounts of missing and noisy values. To identify the best imbalance-handling technique for this medical dataset, the performances of eleven important resampling methods are compared, combined with the following state-of-the-art ML models that are frequently applied in predictive healthcare research: K-Nearest Neighbors (KNN) (with k values of 2 and 3), binary Logistic Regression (bLR), and Random Forest (RF). The studied resampling methods include seven undersampling methods and four oversampling methods. For this dataset, the imbalance ratio was 12.73, with a 95% confidence interval ranging from 9.23% to 16.22%. The obtained results show that resampling methods help improve the classification ability of prediction models applied to cervical cancer data. The applied oversampling techniques for handling imbalanced data generally outperformed the undersampling methods. The average balanced accuracy for oversampling was 77.44%, compared to 62.28% for undersampling. When detecting the minority class, oversampling achieved an average score of 60.80%, while undersampling scored 41.36%. 
The logistic regression classifier had the greatest impact on balanced techniques, while random forest achieved promising performance, even before applying balancing techniques. Initially, KNN2 outperformed KNN3 across all metrics, including balanced accuracy, for which KNN2 achieved 53.57%, compared to 52.71% for KNN3. However, after applying oversampling techniques, KNN3 significantly improved its balanced accuracy to 73.78%, while that of KNN2 increased to 63.89%. Additionally, KNN3 outperformed KNN2 in minority class performance, scoring 55.72% compared to KNN2’s 33.93%. Full article
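Balanced accuracy, the headline metric above, is simply the mean of per-class recalls, which is why it is the right yardstick on imbalanced medical data: a classifier that always predicts the majority class scores 0.5 on a binary problem no matter how extreme the imbalance. A stdlib-only sketch (the toy ratio below loosely mirrors the paper's 12.73):

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall.
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Heavy imbalance, and a predictor that always says "healthy" (0):
y_true = [0] * 13 + [1]
y_pred = [0] * 14
```

Plain accuracy here is 13/14 (about 0.93), while balanced accuracy correctly reports 0.5.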

17 pages, 1440 KiB  
Article
Electroencephalogram Emotion Recognition via AUC Maximization
by Minheng Xiao and Shi Bo
Algorithms 2024, 17(11), 489; https://doi.org/10.3390/a17110489 - 1 Nov 2024
Abstract
Imbalanced datasets pose significant challenges in areas including neuroscience, cognitive science, and medical diagnostics, where accurately detecting minority classes is essential for robust model performance. This study addressed the issue of class imbalance, using the ‘liking’ label in the DEAP dataset as an example. Such imbalances were often overlooked by prior research, which typically focused on the more balanced arousal and valence labels and predominantly used accuracy metrics to measure model performance. To tackle this issue, we adopted numerical optimization techniques aimed at maximizing the area under the curve (AUC), thus enhancing the detection of underrepresented classes. Our approach, which began with a linear classifier, was compared against traditional linear classifiers, including logistic regression and support vector machines (SVMs). Our method significantly outperformed these models, increasing recall from 41.6% to 79.7% and improving the F1-score from 0.506 to 0.632. These results underscore the effectiveness of AUC maximization methods in neuroscience research by offering a robust solution for managing imbalanced datasets, developing more precise diagnostic tools and interventions for detecting critical minority classes in real-world scenarios. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
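The quantity this study maximizes has a simple empirical form: AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why it is insensitive to the class ratio. A stdlib-only sketch of the pairwise-ranking estimate (not the paper's optimization procedure):

```python
def auc(pos_scores, neg_scores):
    # Empirical AUC: the fraction of (positive, negative) pairs ranked
    # correctly, with ties counted as half.
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

With three positives and three negatives, one swapped pair out of nine gives 8/9; perfect separation gives 1.0.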

32 pages, 5045 KiB  
Article
Ensemble-Based Machine Learning Algorithm for Loan Default Risk Prediction
by Abisola Akinjole, Olamilekan Shobayo, Jumoke Popoola, Obinna Okoyeigbo and Bayode Ogunleye
Mathematics 2024, 12(21), 3423; https://doi.org/10.3390/math12213423 - 31 Oct 2024
Abstract
Predicting credit default risk is important to financial institutions, as accurately predicting the likelihood of a borrower defaulting on their loans will help to reduce financial losses, thereby maintaining profitability and stability. Although machine learning models have been used in assessing large applications with complex attributes for these predictions, there is still a need to identify the most effective techniques for the model development process, including the technique to address the issue of data imbalance. In this research, we conducted a comparative analysis of random forest, decision tree, SVMs (Support Vector Machines), XGBoost (Extreme Gradient Boosting), ADABoost (Adaptive Boosting) and the multi-layered perceptron, to predict credit defaults using loan data from LendingClub. Additionally, XGBoost was used as a framework for testing and evaluating various techniques. Moreover, we applied this XGBoost framework to handle the issue of class imbalance observed, by testing various resampling methods such as Random Over-Sampling (ROS), the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Under-Sampling (RUS), and hybrid approaches like the SMOTE with Tomek Links and the SMOTE with Edited Nearest Neighbours (SMOTE + ENNs). The results showed that balanced datasets significantly outperformed the imbalanced dataset, with the SMOTE + ENNs delivering the best overall performance, achieving an accuracy of 90.49%, a precision of 94.61% and a recall of 92.02%. Furthermore, ensemble methods such as voting and stacking were employed to enhance performance further. Our proposed model achieved an accuracy of 93.7%, a precision of 95.6% and a recall of 95.5%, which shows the potential of ensemble methods in improving credit default predictions and can provide lending platforms with the tool to reduce default rates and financial losses. 
In conclusion, the findings from this study have broader implications for financial institutions, offering a robust approach to risk assessment beyond the LendingClub dataset. Full article
(This article belongs to the Special Issue Data-Driven Approaches in Revenue Management and Pricing Analytics)
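The ENN half of the best-performing SMOTE + ENNs combination is a cleaning pass: after oversampling, any sample whose label disagrees with the majority of its k nearest neighbours is removed, trimming noisy or overlapping points. A stdlib-only sketch on toy 2-D data (illustrative, not the paper's pipeline):

```python
def enn(points, labels, k=3):
    # Edited Nearest Neighbours: keep a sample only if its label matches
    # the majority label among its k nearest neighbours.
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    keep = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        nn = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: dist(p, points[j]))[:k]
        votes = [labels[j] for j in nn]
        if votes.count(lab) * 2 >= len(votes):
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

# One class-1 point sits deep inside the class-0 cluster and gets removed.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
clean_pts, clean_labs = enn(points, labels)
```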

16 pages, 3506 KiB  
Article
HADNet: A Novel Lightweight Approach for Abnormal Sound Detection on Highway Based on 1D Convolutional Neural Network and Multi-Head Self-Attention Mechanism
by Cong Liang, Qian Chen, Qiran Li, Qingnan Wang, Kang Zhao, Jihui Tu and Ammar Jafaripournimchahi
Electronics 2024, 13(21), 4229; https://doi.org/10.3390/electronics13214229 - 28 Oct 2024
Abstract
Video surveillance is an effective tool for traffic management and safety, but it may face challenges in extreme weather, low visibility, areas outside the monitoring field of view, or during nighttime conditions. Therefore, abnormal sound detection is used in traffic management and safety as an auxiliary tool to complement video surveillance. In this paper, a novel lightweight method for abnormal sound detection based on 1D CNN and Multi-Head Self-Attention Mechanism on the embedded system is proposed, which is named HADNet. First, 1D CNN is employed for local feature extraction, which minimizes information loss from the audio signal during time-frequency conversion and reduces computational complexity. Second, the proposed block based on Multi-Head Self-Attention Mechanism not only effectively mitigates the issue of disappearing gradients, but also enhances detection accuracy. Finally, the joint loss function is employed to detect abnormal audio. This choice helps address issues related to unbalanced training data and class overlap, thereby improving model performance on imbalanced datasets. The proposed HADNet method was evaluated on the MIVIA Road Events and UrbanSound8K datasets. The results demonstrate that the proposed method for abnormal audio detection on embedded systems achieves high accuracy of 99.6% and an efficient detection time of 0.06 s. This approach proves to be robust and suitable for practical applications in traffic management and safety. By addressing the challenges posed by traditional video surveillance methods, HADNet offers a valuable and complementary solution for enhancing safety measures in diverse traffic conditions. Full article
(This article belongs to the Special Issue Fault Detection Technology Based on Deep Learning)
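The self-attention at the heart of HADNet's block can be illustrated for a single head; multi-head attention runs several such heads on learned projections of the input and concatenates the results. A stdlib-only sketch in which, for brevity, queries, keys, and values are the raw tokens rather than learned projections:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention, one head, Q = K = V = tokens.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # each row of weights sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Because each output is a convex combination of the value vectors, every output component stays within the range of the corresponding input components.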

21 pages, 48158 KiB  
Article
ETFT: Equiangular Tight Frame Transformer for Imbalanced Semantic Segmentation
by Seonggyun Jeong and Yong Seok Heo
Sensors 2024, 24(21), 6913; https://doi.org/10.3390/s24216913 - 28 Oct 2024
Abstract
Semantic segmentation often suffers from class imbalance, where the label ratio for each class in the dataset is not uniform. Recent studies have addressed the issue of class imbalance in semantic segmentation by leveraging the neural collapse phenomenon in conjunction with an Equiangular Tight Frame (ETF). While the use of ETF aids in enhancing the discriminability of minor classes, class correlation is another crucial factor that must be taken into account. However, managing the balance between class correlation and discrimination through neural collapse remains challenging, as these properties inherently conflict with one another. Moreover, this control is established during the training stage, resulting in a fixed classifier. There is no guarantee that this classifier will consistently perform well with different input images. To address this problem, we propose an Equiangular Tight Frame Transformer (ETFT), a transformer-based model that jointly processes the features and classifier using ETF structure, and dynamically generates the classifier as a function of the input for imbalanced semantic segmentation. Specifically, the classifier initialized with the ETF structure is jointly processed with the input patch tokens during the attention process. As a result, the transformed patch tokens, aided by the ETF structure, achieve discriminability between classes while preserving contextual correlation. The classifier, initially structured as an ETF, is adjusted to incorporate the correlation information, benefiting from the attention mechanism. Furthermore, the learned classifier is combined with the fixed ETF classifier, leveraging the advantages of both. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods for imbalanced semantic segmentation on both the ADE20K and Cityscapes datasets. Full article
(This article belongs to the Section Intelligent Sensors)
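The ETF geometry the classifier is initialized with can be constructed directly: the simplex ETF in R^k consists of k unit-norm, zero-mean class vectors whose pairwise cosines all equal -1/(k-1), i.e. every pair of classes is equally and maximally separated. A small sketch of that construction (the standard formulation, not the authors' code):

```python
import math

def simplex_etf(k):
    # Row i is sqrt(k/(k-1)) * (e_i - (1/k) * ones), which has unit norm
    # and pairwise dot product -1/(k-1) with every other row.
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if j == i else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vecs = simplex_etf(4)
```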

21 pages, 6992 KiB  
Article
Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores
by Maria Cristina Hinojosa Lee, Johan Braet and Johan Springael
Appl. Sci. 2024, 14(21), 9863; https://doi.org/10.3390/app14219863 - 28 Oct 2024
Abstract
This study compares various F1-score variants—micro, macro, and weighted—to assess their performance in evaluating text-based emotion classification. Lexicon distillation is employed using the multilabel emotion-annotated datasets XED and GoEmotions. The aim of this paper is to understand when each F1-score variant is better suited for evaluating text-based multilabel emotion classification. Unigram lexicons were derived from the annotated GoEmotions and XED datasets through a binary classification approach. The distilled lexicons were then applied to the GoEmotions and XED annotated datasets to calculate their emotional content, and the results were compared. The findings highlight the behavior of each F1-score variant under different class distributions, emphasizing the importance of appropriate metric selection for reliable model performance evaluation in imbalanced multilabel datasets. Additionally, this study also investigates the effect of the aggregation of negative emotions into broader categories on said F1 metrics. The contribution of this study is to provide insights into how different F1-score variants could improve the reliability of multilabel emotion classifier evaluation, particularly in the context of class imbalance present in the case of phishing emails. Full article
(This article belongs to the Special Issue Affective Computing: Technology and Application)
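The three F1 variants compared in this study differ only in how per-class scores are pooled: micro-averaging pools the raw TP/FP/FN counts (and for single-label predictions equals accuracy), macro-averaging takes an unweighted mean over classes, and weighted averaging weights each class by its support. A stdlib-only sketch for the single-label case (toy emotion labels; the paper's setting is multilabel, which pools per-label counts the same way):

```python
from collections import Counter

def f1_report(y_true, y_pred):
    # Per-class F1 from one-vs-rest counts, then micro/macro/weighted means.
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(c):
        denom = 2 * tp[c] + fp[c] + fn[c]
        return 2 * tp[c] / denom if denom else 0.0
    per_class = {c: f1(c) for c in classes}
    support = Counter(y_true)
    micro = (2 * sum(tp.values()) /
             (2 * sum(tp.values()) + sum(fp.values()) + sum(fn.values())))
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return micro, macro, weighted

y_true = ['joy'] * 8 + ['fear'] * 2
y_pred = ['joy'] * 9 + ['fear']          # one 'fear' mislabelled as 'joy'
micro, macro, weighted = f1_report(y_true, y_pred)
```

On this imbalanced example, macro is dragged down by the minority class's poor F1 while micro is not, which is exactly the behavioral difference the paper analyzes.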

14 pages, 339 KiB  
Article
OUCH: Oversampling and Undersampling Cannot Help Improve Accuracy in Our Bayesian Classifiers That Predict Preeclampsia
by Franklin Parrales-Bravo, Rosangela Caicedo-Quiroz, Elena Tolozano-Benitez, Víctor Gómez-Rodríguez, Lorenzo Cevallos-Torres, Jorge Charco-Aguirre and Leonel Vasquez-Cevallos
Mathematics 2024, 12(21), 3351; https://doi.org/10.3390/math12213351 - 25 Oct 2024
Abstract
Unbalanced data can have an impact on the machine learning (ML) algorithms that build predictive models. This manuscript studies the influence of oversampling and undersampling strategies on the learning of the Bayesian classification models that predict the risk of suffering preeclampsia. Given the properties of our dataset, only the oversampling and undersampling methods that operate with numerical and categorical attributes will be taken into consideration. In particular, synthetic minority oversampling techniques for nominal and continuous data (SMOTE-NC), SMOTE—Encoded Nominal and Continuous (SMOTE-ENC), random oversampling examples (ROSE), random undersampling examples (UNDER), and random oversampling techniques (OVER) are considered. According to the results, when balancing the class in the training dataset, the accuracy percentages do not improve. However, in the test dataset, both positive and negative cases of preeclampsia were accurately classified by the models, which were built on a balanced training dataset. In contrast, models built on the imbalanced training dataset were not good at detecting positive cases of preeclampsia. We can conclude that while imbalanced training datasets can be addressed by using oversampling and undersampling techniques before building prediction models, an improvement in model accuracy is not always guaranteed. Despite this, the sensitivity and specificity percentages improve in binary classification problems in most cases, such as the one we are dealing with in this manuscript. Full article
(This article belongs to the Special Issue Applied Statistics in Real-World Problems)
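What distinguishes SMOTE-NC-style methods from plain SMOTE is the handling of mixed attribute types: numeric attributes are interpolated, categorical ones are not. A simplified stdlib-only sketch of generating one synthetic record (here the categorical value is copied from one of the two parents at random; SMOTE-NC proper uses a neighbourhood vote, and the patient attributes below are hypothetical):

```python
import random

def smote_nc_sample(a, b, cat_idx, rng=random.Random(7)):
    # One synthetic minority record from parents a and b: interpolate
    # numeric attributes, copy categorical attributes (indices in cat_idx).
    t = rng.random()
    new = []
    for i, (ai, bi) in enumerate(zip(a, b)):
        if i in cat_idx:
            new.append(rng.choice((ai, bi)))
        else:
            new.append(ai + t * (bi - ai))
    return tuple(new)

# Hypothetical patient records: (age, BMI, smoker)
rec = smote_nc_sample((24, 21.0, 'no'), (30, 27.0, 'yes'), cat_idx={2})
```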

22 pages, 4938 KiB  
Article
Enhancing Machine Learning Models Through PCA, SMOTE-ENN, and Stochastic Weighted Averaging
by Youngjin Han and Inwhee Joe
Appl. Sci. 2024, 14(21), 9772; https://doi.org/10.3390/app14219772 - 25 Oct 2024
Abstract
Predicting survival outcomes in critical accidents has been a focal point in machine learning research. This study addresses several limitations of existing methods, including insufficient management of data imbalance, lack of emphasis on hyperparameter tuning, and proneness to overfitting. Many existing models struggle to generalize effectively on imbalanced datasets or depend on default hyperparameter settings, resulting in biased predictions. By integrating Principal Component Analysis (PCA), hyperparameter optimization, and resampling methods, as well as combining Edited Nearest Neighbors (ENN) with the Synthetic Minority Oversampling Technique (SMOTE), the model significantly improves predictive accuracy and model generalization. An ensemble model combining seven machine learning algorithms—Logistic Regression, Support Vector Machine, KNN, Random Forest, XGBoost, LightGBM, and CatBoost—was applied to predict survival outcomes. Stochastic Weighted Averaging (SWA) was applied to mitigate overfitting and enhance generalization. The accuracy increased from 91.97% to 94.89% after SWA was applied in this specific scenario. The combination of PCA-based dimensionality reduction, hyperparameter tuning, and resampling techniques (ENN + SMOTE) ensured the model handled data imbalance and optimized predictive accuracy. The final model demonstrated excellent performance, with Area Under the Curve (AUC) and Average Precision (AP) values both reaching 0.98, indicating high accuracy and precision. These improvements were validated using the Titanic dataset in a binary classification problem of predicting passenger survival. The results emphasize that ensemble learning, enhanced by SWA, offers a powerful framework for handling imbalanced and complex datasets, providing significant advancements in predictive modeling accuracy. This study provides insights into how machine learning techniques can be effectively combined to solve classification challenges in real-world scenarios. 
Full article
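The ENN + SMOTE resampling described above can be illustrated with a minimal, stdlib-only sketch: SMOTE-style oversampling interpolates new minority points between nearest minority neighbors, and ENN-style cleaning then drops samples whose neighbors mostly disagree with their label. This is a toy stand-in, not the paper's implementation; the dataset, `k` values, and helper names are assumptions for illustration (a real pipeline would typically use `imblearn.combine.SMOTEENN`).

```python
import math
import random

def euclidean(a, b):
    return math.dist(a, b)

def smote(minority, n_new, k=3, seed=0):
    """Generate synthetic minority points by interpolating toward a random
    one of each point's k nearest minority neighbors (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: euclidean(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

def enn_clean(samples, labels, k=3):
    """Drop samples whose k nearest neighbors mostly disagree with their
    label (Edited Nearest Neighbors cleaning)."""
    kept, kept_labels = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        neighbors = sorted((j for j in range(len(samples)) if j != i),
                           key=lambda j: euclidean(x, samples[j]))[:k]
        votes = sum(1 for j in neighbors if labels[j] == y)
        if votes >= (k + 1) // 2:  # keep only if most neighbors agree
            kept.append(x)
            kept_labels.append(y)
    return kept, kept_labels

# Toy imbalanced 2-D dataset: 4 minority vs. 12 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
majority = [(3.0 + 0.1 * i, 3.0 - 0.1 * i) for i in range(12)]

new_points = smote(minority, n_new=8)
samples = minority + new_points + majority
labels = [1] * (len(minority) + len(new_points)) + [0] * len(majority)
samples, labels = enn_clean(samples, labels)
print(len(new_points), sum(labels), len(labels) - sum(labels))
```

Because the two toy clusters are well separated, ENN removes nothing here; on noisy real data it prunes boundary samples that SMOTE interpolation can blur.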
17 pages, 2458 KiB  
Article
Data Augmentation Method Using Room Transfer Function for Monitoring of Domestic Activities
by Minhan Kim and Seokjin Lee
Appl. Sci. 2024, 14(21), 9644; https://doi.org/10.3390/app14219644 - 22 Oct 2024
Viewed by 476
Abstract
Monitoring domestic activities helps us to understand user behaviors in indoor environments, which has garnered interest as it aids in understanding human activities in context-aware computing. In the field of acoustics, this goal has been achieved through studies employing machine learning techniques, which [...] Read more.
Monitoring domestic activities helps us to understand user behaviors in indoor environments, and it has garnered interest in context-aware computing because it aids in understanding human activities. In the field of acoustics, this goal has been pursued in studies employing machine learning techniques, which are widely used for classification tasks involving sound recognition and other objectives. Machine learning typically achieves better performance with large amounts of high-quality training data. Given the high cost of data collection, development datasets often suffer from imbalanced data or a lack of high-quality samples, leading to performance degradation in machine learning models. The present study addresses this data issue through data augmentation. Specifically, since the proposed method targets indoor activities in domestic activity detection, room transfer functions were used for augmentation. The results show that the proposed method achieves a 0.59% improvement in the micro F1-Score over the baseline system on the development dataset. Additionally, on test data that included microphones not used during training, the method achieved an F1-Score improvement of 0.78% over the baseline system. This demonstrates the improved generalization of the proposed method to samples having different room transfer functions from those of the training dataset. Full article
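Augmenting audio with a room transfer function amounts, in the time domain, to convolving a "dry" recording with a room impulse response. The sketch below shows that operation with plain Python lists; the abstract does not give the authors' actual pipeline, so the toy clip and the made-up echo-like impulse response are illustrative assumptions only (real systems would convolve measured or simulated RIRs, e.g. via `scipy.signal.fftconvolve`).

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution: y[n] = sum_k h[k] * x[n - k].
    Convolving with a room impulse response is the time-domain
    equivalent of multiplying the spectrum by the room transfer function."""
    n_out = len(signal) + len(impulse_response) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        for k, h in enumerate(impulse_response):
            if 0 <= n - k < len(signal):
                out[n] += h * signal[n - k]
    return out

# Toy example: a short "dry" clip and a delayed reflection standing in
# for a measured room impulse response.
dry = [1.0, 0.5, -0.25, 0.0]
rir = [1.0, 0.0, 0.3]  # direct path plus one attenuated reflection
wet = convolve(dry, rir)
print(wet)
```

Convolving the same clip with impulse responses from different rooms (or microphone positions) yields multiple augmented copies, which is what lets the model generalize to unseen microphones.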
20 pages, 1607 KiB  
Article
Securing the Edge: CatBoost Classifier Optimized by the Lyrebird Algorithm to Detect Denial of Service Attacks in Internet of Things-Based Wireless Sensor Networks
by Sennanur Srinivasan Abinayaa, Prakash Arumugam, Divya Bhavani Mohan, Anand Rajendran, Abderezak Lashab, Baoze Wei and Josep M. Guerrero
Future Internet 2024, 16(10), 381; https://doi.org/10.3390/fi16100381 - 19 Oct 2024
Viewed by 647
Abstract
The security of Wireless Sensor Networks (WSNs) is of the utmost importance because of their widespread use in various applications. Protecting WSNs from harmful activity is a vital function of intrusion detection systems (IDSs). An innovative approach to WSN intrusion detection (ID) utilizing [...] Read more.
The security of Wireless Sensor Networks (WSNs) is of the utmost importance because of their widespread use in various applications. Protecting WSNs from harmful activity is a vital function of intrusion detection systems (IDSs). An innovative approach to WSN intrusion detection (ID) utilizing the CatBoost classifier (Cb-C) and the Lyrebird Optimization Algorithm (LOA) is presented in this work. Cb-C excels at handling the imbalanced datasets that are typical in ID settings. The LOA is a metaheuristic optimization algorithm inspired by the lyrebird's remarkable capacity to imitate the sounds of its surroundings. The WSN-DS dataset, acquired from Prince Sultan University in Saudi Arabia, is used to assess the suggested method. Among the models presented, LOA-Cb-C achieves the highest accuracy, 99.66%, and the lowest error value, 0.34%, compared with the other methods discussed in this article. Experimental results reveal that the suggested strategy improves WSN-IoT security over existing methods in terms of detection accuracy and the false alarm rate. Full article
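Using a metaheuristic such as the LOA to tune a classifier follows a common pattern: maintain a population of candidate hyperparameter sets, score each by (cross-validated) model error, and iteratively move candidates toward better regions. The LOA's specific update rules are not given in this abstract, so the sketch below uses a generic best-guided mutation loop as a stand-in, and `surrogate_loss` is a hypothetical proxy for training and scoring CatBoost at given `(depth, learning_rate)` values.

```python
import random

def surrogate_loss(params):
    """Hypothetical stand-in for the cross-validated error of a classifier
    at the given hyperparameters (lower is better); a real tuner would
    train and score CatBoost here."""
    depth, lr = params
    return (depth - 6) ** 2 * 0.01 + (lr - 0.1) ** 2 * 5.0

def metaheuristic_tune(n_agents=10, n_iters=50, seed=1):
    """Generic population-based search loop (a simplified stand-in for
    the LOA): mutate each agent toward the best-so-far candidate and
    keep the move only if it improves the objective."""
    rng = random.Random(seed)
    pop = [(rng.uniform(1, 12), rng.uniform(0.01, 0.5)) for _ in range(n_agents)]
    best = min(pop, key=surrogate_loss)
    for _ in range(n_iters):
        for i, agent in enumerate(pop):
            cand = tuple(a + 0.5 * (b - a) + rng.gauss(0, 0.05)
                         for a, b in zip(agent, best))
            if surrogate_loss(cand) < surrogate_loss(agent):
                pop[i] = cand
        best = min(pop, key=surrogate_loss)
    return best

depth, lr = metaheuristic_tune()
print(round(depth, 1), round(lr, 2))
```

The surrogate here has its optimum at depth 6 and learning rate 0.1, so the loop should settle near those values; swapping the surrogate for real CatBoost cross-validation error gives the LOA-Cb-C tuning setup described above.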
