Revolutionizing Biomarker Discovery: Leveraging Generative AI for Bio-Knowledge-Embedded Continuous Space Exploration

Authors:

Yanjie Fu

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

Pages 5046 - 5053

https://doi.org/10.1145/3627673.3680041

Published: 21 October 2024

Abstract

Biomarker discovery is vital in advancing personalized medicine, offering insights into disease diagnosis, prognosis, and therapeutic efficacy. Traditionally, the identification and validation of biomarkers depend heavily on extensive experiments and statistical analyses. These approaches are time-consuming, demand extensive domain expertise, and are constrained by the complexity of biological systems. These limitations motivate us to ask: can we automatically identify an effective biomarker subset without substantial human effort? Inspired by the success of generative AI, we hypothesize that the intricate knowledge of biomarker identification can be compressed into a continuous embedding space, making the search for better biomarkers more effective. We therefore propose a new biomarker identification framework with two modules: 1) training data preparation and 2) embedding-optimization-generation. The first module uses a multi-agent system to automatically collect pairs of biomarker subsets and their corresponding prediction accuracy as training data. These data establish a strong knowledge base for biomarker identification. The second module employs an encoder-evaluator-decoder learning paradigm to compress the knowledge in the collected data into a continuous space. It then uses gradient-based search and autoregressive reconstruction to efficiently identify the optimal subset of biomarkers. Finally, we conduct extensive experiments on three real-world datasets to show the efficiency, robustness, and effectiveness of our method.
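The search step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the quadratic `evaluator`, the fixed `GOOD_POINT`, the threshold-based `decode`, and all parameter values are hypothetical stand-ins chosen so the example runs with the standard library alone. In the paper, the encoder, evaluator, and decoder are learned models and the decoder reconstructs the subset autoregressively.

```python
from typing import List

# Hypothetical stand-in for a learned evaluator: an embedding scores higher
# the closer it lies to a fixed "good" point. This quadratic is an assumption
# made for illustration, not the evaluator used in the paper.
GOOD_POINT = [0.9, 0.1, 0.8, 0.2]


def evaluator(z: List[float]) -> float:
    """Predicted quality of the biomarker subset embedded at z (toy surrogate)."""
    return -sum((zi - gi) ** 2 for zi, gi in zip(z, GOOD_POINT))


def evaluator_grad(z: List[float]) -> List[float]:
    """Analytic gradient of the toy evaluator with respect to z."""
    return [-2.0 * (zi - gi) for zi, gi in zip(z, GOOD_POINT)]


def gradient_search(z0: List[float], step: float = 0.1, n_steps: int = 50) -> List[float]:
    """Move an embedding uphill on the evaluator by plain gradient ascent."""
    z = list(z0)
    for _ in range(n_steps):
        grad = evaluator_grad(z)
        z = [zi + step * gi for zi, gi in zip(z, grad)]
    return z


def decode(z: List[float], threshold: float = 0.5) -> List[int]:
    """Toy decoder: keep biomarker i when coordinate i exceeds the threshold.
    (The paper decodes autoregressively; thresholding is a simplification.)"""
    return [i for i, zi in enumerate(z) if zi > threshold]


seed = [0.0, 1.0, 0.0, 1.0]   # embedding of a poor starting subset {1, 3}
best = gradient_search(seed)  # improved embedding after the search
selected = decode(best)       # indices of the selected biomarkers
```

Here the search starts from the embedding of a known subset and follows the evaluator's gradient, so the decoded subset differs from the seed only because the continuous point has moved toward a region the evaluator scores more highly.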

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection

Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

October 2024, 5705 pages

ISBN: 9798400704369

DOI: 10.1145/3627673

General Chairs: Edoardo Serra (Boise State University, USA), Francesca Spezzano (Boise State University, USA)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management

Sponsor: SIGIR

October 21 - 25, 2024

Boise, ID, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
