-
Towards Understanding the Survival of Patients with High-Grade Gastroenteropancreatic Neuroendocrine Neoplasms: An Investigation of Ensemble Feature Selection in the Prediction of Overall Survival
Authors:
Anna Jenul,
Henning Langen Stokmo,
Stefan Schrunner,
Mona-Elisabeth Revheim,
Geir Olav Hjortland,
Oliver Tomic
Abstract:
Determining the most informative features for predicting the overall survival of patients diagnosed with high-grade gastroenteropancreatic neuroendocrine neoplasms is crucial to improve individual treatment plans for patients, as well as the biological understanding of the disease. Recently developed ensemble feature selectors like the Repeated Elastic Net Technique for Feature Selection (RENT) and the User-Guided Bayesian Framework for Feature Selection (UBayFS) allow the user to identify such features in datasets with low sample sizes. While RENT is purely data-driven, UBayFS is capable of integrating expert knowledge a priori in the feature selection process. In this work, we compare both feature selectors on a dataset comprising 63 patients and 134 features from multiple sources, including basic patient characteristics, baseline blood values, tumor histology, imaging, and treatment information. Our experiments involve data-driven and expert-driven setups, as well as combinations of both. We use findings from clinical literature as a source of expert knowledge. Our results demonstrate that both feature selectors allow accurate predictions, and that expert knowledge has a stabilizing effect on the feature set, while the impact on predictive performance is limited. The features WHO Performance Status, Albumin, Platelets, Ki-67, Tumor Morphology, Total MTV, Total TLG, and SUVmax are the most stable and predictive features in our study.
Submitted 20 February, 2023;
originally announced February 2023.
-
Ranking Feature-Block Importance in Artificial Multiblock Neural Networks
Authors:
Anna Jenul,
Stefan Schrunner,
Bao Ngoc Huynh,
Runar Helin,
Cecilia Marie Futsæther,
Kristian Hovde Liland,
Oliver Tomic
Abstract:
In artificial neural networks, understanding the contributions of input features to the prediction fosters model explainability and delivers relevant information about the dataset. While typical setups for feature importance ranking assess input features individually, in this study, we go one step further and rank the importance of groups of features, denoted as feature-blocks. A feature-block can contain features of a specific type or features derived from a particular source, which are presented to the neural network in separate input branches (multiblock ANNs). This work presents three methods pursuing distinct strategies to rank features in multiblock ANNs by their importance: (1) a composite strategy building on individual feature importance rankings, (2) a knock-in strategy, and (3) a knock-out strategy. While the composite strategy builds on state-of-the-art feature importance rankings, the knock-in and knock-out strategies evaluate the block as a whole via a mutual information criterion. Our experiments consist of a simulation study validating all three approaches, followed by a case study on two distinct real-world datasets to compare the strategies. We conclude that each strategy has its merits for specific application scenarios.
Submitted 14 April, 2022; v1 submitted 21 September, 2021;
originally announced September 2021.
-
A User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS)
Authors:
Anna Jenul,
Stefan Schrunner,
Jürgen Pilz,
Oliver Tomic
Abstract:
Feature selection represents a measure to reduce the complexity of high-dimensional datasets and gain insights into the systematic variation in the data. This aspect is of specific importance in domains that rely on model interpretability, such as life sciences. We propose UBayFS, an ensemble feature selection technique embedded in a Bayesian statistical framework. Our approach considers two sources of information: data and domain knowledge. We build a meta-model from an ensemble of elementary feature selectors and aggregate this information in a multinomial likelihood. The user guides UBayFS by weighting features and penalizing specific feature blocks or combinations, implemented via a Dirichlet-type prior distribution and a regularization term. In a quantitative evaluation, we demonstrate that our framework (a) allows for a balanced trade-off between user knowledge and data observations, and (b) achieves competitive performance with state-of-the-art methods.
Submitted 11 December, 2021; v1 submitted 30 April, 2021;
originally announced April 2021.
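As a rough illustration of the idea in the UBayFS abstract above (a multinomial likelihood built from ensemble votes, combined with a Dirichlet-type prior that encodes user weights), the following minimal sketch uses the standard conjugate Dirichlet-multinomial update. It is not the authors' implementation: the function name, the feature names, and the numbers are hypothetical, and the regularization term for block or combination constraints mentioned in the abstract is omitted entirely.

```python
from collections import Counter


def posterior_feature_weights(selections, prior_weights):
    """Combine ensemble votes with user-defined prior weights.

    selections: list of sets, one per elementary feature selector,
        each holding the names of the features that selector picked.
    prior_weights: dict mapping feature name to a positive Dirichlet
        concentration parameter (larger means stronger prior belief).

    Returns the posterior mean importance of each feature under a
    conjugate Dirichlet-multinomial model (assumed here for illustration).
    """
    # Count how often each feature was selected across the ensemble.
    counts = Counter(f for selected in selections for f in selected)
    # Conjugate update: posterior concentration = prior + observed counts.
    posterior = {f: a + counts.get(f, 0) for f, a in prior_weights.items()}
    total = sum(posterior.values())
    # The posterior mean of a Dirichlet is the normalized concentration.
    return {f: a / total for f, a in posterior.items()}


# Hypothetical example: three elementary selectors, uninformative prior.
selections = [{"albumin", "ki67"}, {"albumin", "age"}, {"albumin", "ki67"}]
prior_weights = {"albumin": 1.0, "ki67": 1.0, "age": 1.0, "sex": 1.0}

weights = posterior_feature_weights(selections, prior_weights)
top_two = sorted(weights, key=weights.get, reverse=True)[:2]
print(weights)
print(top_two)
```

Raising the prior weight of a feature shifts its posterior importance upward even when the ensemble rarely selects it, which is one simple way to picture the trade-off between user knowledge and data observations that the abstract describes.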
-
Towards a General Framework to Embed Advanced Machine Learning in Process Control Systems
Authors:
Stefan Schrunner,
Michael Scheiber,
Anna Jenul,
Anja Zernig,
Andre Kästner,
Roman Kern
Abstract:
Since the high data volumes and complex data formats delivered in modern high-end production environments go beyond the scope of classical process control systems, more advanced tools involving machine learning are required to reliably recognize failure patterns. However, currently, such systems lack a general setup and are only available as application-specific solutions. We propose a process control framework entitled Health Factor for Process Control (HFPC) to bridge the gap between conventional statistical tools and novel machine learning (ML) algorithms. HFPC comprises two main concepts: (a) pattern type to account for qualitative characteristics (error patterns) and (b) intensity to quantify the level of a deviation. While the system retains large model generality, allowing a broad scope of potential application areas, we demonstrate its favorable mathematical properties in a theoretical analysis. In a case study from the semiconductor industry, we underline that our framework (a) is of practical relevance and goes beyond conventional process control, and (b) achieves high-quality experimental results. We conclude that our work contributes to the integration of ML in real-world process control and paves the way to automated decision support in manufacturing.
Submitted 31 March, 2022; v1 submitted 24 March, 2021;
originally announced March 2021.
-
Principal component-based image segmentation: a new approach to outline in vitro cell colonies
Authors:
Delmon Arous,
Stefan Schrunner,
Ingunn Hanson,
Nina F. J. Edin,
Eirik Malinen
Abstract:
The in vitro clonogenic assay is a technique to study the ability of a cell to form a colony in a culture dish. By optical imaging, dishes with stained colonies can be scanned and assessed digitally. Identification, segmentation and counting of stained colonies play a vital part in high-throughput screening and quantitative assessment of biological assays. Image processing of such photographed or scanned assays can be affected by acquisition artifacts such as background noise and spatially varying illumination, as well as by contaminants in the suspension medium. Although existing approaches tackle these issues, the segmentation quality requires further improvement, particularly on noisy and low-contrast images. In this work, we present an objective and versatile machine learning procedure to address these issues by characterizing, extracting and segmenting the colonies of interest using principal component analysis, k-means clustering and a modified watershed segmentation algorithm. The intention is to automatically identify visible colonies through spatial texture assessment and accordingly discriminate them from the background in preparation for subsequent segmentation. The proposed segmentation algorithm achieved quality similar to manual counting by human observers. High F1 scores (>0.9) and low root-mean-square errors (around 14%) underlined good agreement with ground truth data. Moreover, it outperformed a recent state-of-the-art method. The methodology will be an important tool in future cancer research applications.
Submitted 10 March, 2021;
originally announced March 2021.
-
RENT -- Repeated Elastic Net Technique for Feature Selection
Authors:
Anna Jenul,
Stefan Schrunner,
Kristian Hovde Liland,
Ulf Geir Indahl,
Cecilia Marie Futsaether,
Oliver Tomic
Abstract:
Feature selection is an essential step in data science pipelines to reduce the complexity associated with large datasets. While much research on this topic focuses on optimizing predictive performance, few studies investigate stability in the context of the feature selection process. In this study, we present the Repeated Elastic Net Technique (RENT) for Feature Selection. RENT uses an ensemble of generalized linear models with elastic net regularization, each trained on a distinct subset of the training data. The feature selection is based on three criteria evaluating the weight distributions of features across all elementary models. This leads to the selection of features with high stability that improve the robustness of the final model. Furthermore, unlike established feature selectors, RENT provides valuable information for model interpretation by identifying objects in the data that are difficult to predict during training. In our experiments, we benchmark RENT against six established feature selectors on eight multivariate datasets for binary classification and regression. In the experimental comparison, RENT shows a well-balanced trade-off between predictive performance and stability. Finally, we underline the additional interpretational value of RENT with an exploratory post-hoc analysis of a healthcare dataset.
Submitted 22 November, 2021; v1 submitted 27 September, 2020;
originally announced September 2020.
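The selection step described in the RENT abstract above (criteria evaluated on the distribution of each feature's weight across all elementary models) can be pictured with the minimal sketch below. The abstract does not spell out the three criteria, so the ones used here (frequency of non-zero weights, sign consistency, and a t-like ratio of mean to standard error) are assumptions chosen for illustration, as are the function name, the cutoffs, and the weight matrix. Training the elastic net models on data subsets is left out; the sketch starts from weights that are assumed to be already fitted.

```python
import math
import statistics


def stable_features(weights, min_nonzero=0.9, min_sign=0.9, min_t=2.0):
    """Return indices of features whose weights are stable across models.

    weights: list of rows, one per elementary model, each row holding
        that model's coefficient for every feature.
    The three cutoffs are hypothetical defaults, not values from the paper.
    """
    n_models = len(weights)
    selected = []
    for j in range(len(weights[0])):
        column = [row[j] for row in weights]
        # Criterion 1 (assumed): share of models with a non-zero weight.
        nonzero = sum(w != 0 for w in column) / n_models
        # Criterion 2 (assumed): share of models agreeing on the sign.
        positive = sum(w > 0 for w in column)
        negative = sum(w < 0 for w in column)
        sign = max(positive, negative) / n_models
        # Criterion 3 (assumed): mean weight relative to its standard error.
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)
        if sd > 0:
            t = abs(mean) / (sd / math.sqrt(n_models))
        else:
            t = math.inf if mean != 0 else 0.0
        if nonzero >= min_nonzero and sign >= min_sign and t >= min_t:
            selected.append(j)
    return selected


# Hypothetical weights from five elementary models over three features.
weights = [
    [0.80, 0.0, 0.3],
    [0.90, 0.1, -0.4],
    [0.70, 0.0, 0.2],
    [0.85, 0.0, -0.3],
    [0.75, -0.2, 0.1],
]
print(stable_features(weights))
```

In this toy matrix only the first feature passes all three checks: the second is zero in most models, and the third flips sign between models.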