Version 1
Received: 26 August 2023 / Approved: 28 August 2023 / Online: 29 August 2023 (10:11:39 CEST)
How to cite:
Dutta, A. Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NIH PLCO Dataset. Preprints 2023, 2023081933. https://doi.org/10.20944/preprints202308.1933.v1
APA Style
Dutta, A. (2023). Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NIH PLCO Dataset. Preprints. https://doi.org/10.20944/preprints202308.1933.v1
Chicago/Turabian Style
Dutta, A. 2023. "Using Machine Learning to Identify the Risk Factors of Pancreatic Cancer from the NIH PLCO Dataset." Preprints. https://doi.org/10.20944/preprints202308.1933.v1
Abstract
Background: Pancreatic cancer (PC) is a disease with a poor prognosis and a low survival rate. There is a pertinent need to identify the risk factors of this disease. The purpose of this study is to identify a subset of factors (a.k.a. features) as predictors of PC from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset, which consists of responses to 65 questions about demographics, cancer and health history, medication usage, and smoking habits from 154,897 participants.
Method: There are two challenges to selecting the subset of features that predicts PC with the highest probability: the problem is computationally intractable, and the PLCO dataset is highly imbalanced. We apply an innovative method that uses the dataset in a balanced way, without up- or down-sampling. We then use nine feature selection methods to select the optimal subset of features from the preprocessed and balanced dataset.
Results: Our preprocessed dataset consists of 32 risk factors (8 demographics, 5 cancer history, 13 health history, 2 medication usage, 4 smoking habits). Risk factors belonging to cancer and health history, followed by smoking habits, were consistently chosen by the feature selection methods. We also discuss findings in the medical sciences literature that corroborate our results.
Conclusions: The study found that risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, having been previously diagnosed with PC is chosen as the most prominent risk factor by the majority of methods. While most of our findings are consistent with the literature, some shed light on novel factors that may not have received due attention from the research community.
Engineering, Electrical and Electronic Engineering
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.