Abstract
The analysis of high-throughput gene expression experiments comparing patients and controls relies on identifying differentially expressed genes with standard statistical tests. A typical bioinformatics approach to this problem consists of two separate steps: first, a subset of genes with altered expression levels is identified; then, the pathways that are statistically enriched by those genes are selected, on the assumption that they play a relevant role in the biological condition under study. Often, the set of selected pathways contains elements that are unrelated to the condition, because statistical significance is not sufficient for biological relevance. To overcome this problem, we propose a method based on a large mixed integer program that implements a new feature selection model to simultaneously identify the genes whose over- and under-expression, taken together, discriminate between different cancer subtypes, as well as the pathways that are enriched by these genes. The innovation of this model is that the solutions are driven towards the enrichment of pathways. This may introduce a bias in the search; such a bias is counterbalanced by a wide exploration of the solution space, obtained by varying the parameters involved over their feasible region and then using a global optimization approach. The joint analysis of the pool of solutions obtained by this exploration should provide a robust final set of genes and pathways, overcoming the potential drawbacks of relying solely on statistical significance. Experimental results on transcriptomes for different types of cancer from The Cancer Genome Atlas are presented. The method is able to identify crisp relations between the considered cancer subtypes and a few selected pathways, which are then validated by biological analysis.
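To make the idea of a pathway-driven feature selection model more concrete, the following is a minimal toy sketch and not the authors' formulation. It assumes binarized expression values, a covering-style constraint (every pair of samples from different classes must differ on at least one selected gene), and an objective that counts selected genes minus a reward for each pathway containing at least `min_hits` selected genes. The data, the pathway names `P1` and `P2`, the `reward` parameter, and the exhaustive enumeration (standing in for a mixed integer programming solver) are all hypothetical choices made for illustration.

```python
from itertools import combinations, product

def separates(X, y, selected):
    """True if every pair of samples from different classes differs on a selected gene."""
    for i, j in combinations(range(len(X)), 2):
        if y[i] != y[j] and not any(X[i][g] != X[j][g] for g in selected):
            return False
    return True

def enriched_pathways(pathways, selected, min_hits):
    """Names of pathways that contain at least min_hits selected genes."""
    chosen = set(selected)
    return sorted(name for name, genes in pathways.items()
                  if len(chosen & genes) >= min_hits)

def solve(X, y, pathways, reward, min_hits=2):
    """Enumerate all gene subsets (toy sizes only) and return the cheapest feasible one.

    Cost = number of selected genes - reward * number of enriched pathways.
    """
    n_genes = len(X[0])
    best = None
    for mask in product((0, 1), repeat=n_genes):
        selected = [g for g in range(n_genes) if mask[g]]
        if not separates(X, y, selected):
            continue  # infeasible: some pair of classes is not discriminated
        hits = enriched_pathways(pathways, selected, min_hits)
        cost = len(selected) - reward * len(hits)
        if best is None or cost < best[0]:
            best = (cost, selected, hits)
    return best

# Hypothetical binarized expression data: 4 samples x 5 genes, two subtypes.
X = [
    [1, 0, 1, 0, 0],  # subtype A
    [1, 1, 0, 0, 1],  # subtype A
    [0, 0, 1, 1, 0],  # subtype B
    [0, 1, 0, 0, 1],  # subtype B
]
y = ["A", "A", "B", "B"]
pathways = {"P1": {0, 1}, "P2": {2, 3}}  # hypothetical gene-to-pathway map

# Explore the solution space by varying the reward, then pool the solutions.
pool = [solve(X, y, pathways, reward) for reward in (0.0, 1.5, 2.5)]
gene_counts = {}
for _, selected, _ in pool:
    for g in selected:
        gene_counts[g] = gene_counts.get(g, 0) + 1

for cost, selected, hits in pool:
    print(cost, selected, hits)
print(gene_counts)
```

With no reward the sketch returns the smallest discriminating gene set; as the reward grows, solutions are pulled towards gene sets that enrich more pathways. Counting how often each gene appears across the pool is one simple way to read such a pool jointly.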
Acknowledgements
This work was funded by the INTEROMICS Italian flagship project, PON02-00612-3461281 and PON02-00619-3470457; and the SysBioNet project, a MIUR initiative from the Italian Roadmap Research Infrastructures 2012. The work of Mario R. Guarracino was conducted at the National Research University Higher School of Economics and supported by RSF Grant 14-41-00039.
Cite this article
Felici, G., Tripathi, K.P., Evangelista, D. et al. A mixed integer programming-based global optimization framework for analyzing gene expression data. J Glob Optim 69, 727–744 (2017). https://doi.org/10.1007/s10898-017-0530-0
DOI: https://doi.org/10.1007/s10898-017-0530-0