
Analysis of complexity indices for classification problems: Cancer gene expression data

Published: 01 January 2012

    Abstract

    Cancer diagnosis at the molecular level has been made possible through the analysis of gene expression data. More specifically, machine learning (ML) techniques are typically used to build automatic diagnosis models (classifiers) from cancer gene expression data. Such data often present characteristics that can have a negative impact on the generalization ability of the resulting classifiers, among them data sparsity and an unbalanced class distribution. We investigate the results of a set of indices able to extract intrinsic complexity information from the data. Such measures can be used to analyze, among other things, which particular characteristics of cancer gene expression data most affect the predictive ability of support vector machine classifiers. In this context, we also show that, by applying a proper feature selection procedure to the data, one can reduce the influence of those characteristics on the error rates of the induced classifiers.
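The abstract does not say which complexity indices are used, so the following is only an illustrative sketch of the kind of quantity involved. It computes the maximum Fisher's discriminant ratio (the F1 measure from Ho and Basu's family of complexity measures) together with two simple descriptors of the properties the abstract mentions: a samples-per-feature ratio as a rough proxy for data sparsity, and a class imbalance ratio. The function names and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from statistics import mean, pvariance


def fisher_ratio_f1(X, y):
    """Maximum Fisher's discriminant ratio over all features (two classes).

    Per feature: (mean_a - mean_b)^2 / (var_a + var_b).
    Larger values suggest at least one feature separates the classes well.
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    class_a = [row for row, lab in zip(X, y) if lab == labels[0]]
    class_b = [row for row, lab in zip(X, y) if lab == labels[1]]
    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row in class_a]
        col_b = [row[j] for row in class_b]
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0:
            if gap > 0:
                # Zero within-class variance, different means:
                # this feature separates the classes perfectly.
                return float("inf")
            continue  # constant feature, carries no information
        best = max(best, gap / spread)
    return best


def samples_per_feature(X):
    """Number of samples divided by number of features (low = sparse)."""
    return len(X) / len(X[0])


def imbalance_ratio(y):
    """Size of the largest class divided by size of the smallest class."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy data: 5 samples, 3 "genes", two classes.
X = [
    [1.0, 5.0, 0.2],
    [2.0, 5.1, 0.1],
    [3.0, 4.9, 0.3],
    [7.0, 5.0, 0.2],
    [9.0, 5.2, 0.1],
]
y = ["tumor", "tumor", "tumor", "normal", "normal"]

print("F1 (max Fisher ratio):", fisher_ratio_f1(X, y))
print("samples per feature:  ", samples_per_feature(X))
print("imbalance ratio:      ", imbalance_ratio(y))
```

Real gene expression data sets typically have thousands of features and only tens or hundreds of samples, so the samples-per-feature value would be far below 1. Recomputing such descriptors after feature selection is one way to observe the kind of effect the abstract describes.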



    Published In

    Neurocomputing  Volume 75, Issue 1
    January, 2012
    226 pages

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands


    Author Tags

    1. Classification
    2. Complexity indices
    3. Gene expression data
    4. Linear separability

    Qualifiers

    • Article
