research-article

Leveraging class hierarchy for detecting missing annotations on hierarchical multi-label classification

Published: 01 January 2023

Abstract

With the development of new sequencing technologies, the availability of genomic data has grown exponentially. Over the past decade, numerous studies have used genomic data to identify associations between genes and biological functions. While these studies have been successful in annotating genes with functions, they often assume that genes are completely annotated and fail to take into account that datasets are sparse and noisy. This work proposes a method to detect missing annotations in the context of hierarchical multi-label classification (HMC). More precisely, our method exploits the relations between functions, represented as a hierarchy, by computing probabilities based on the paths of functions in the hierarchy. Through several experiments on a variety of rice (Oryza sativa Japonica), we show that the proposed method accurately detects missing annotations and yields superior results compared to state-of-the-art methods from the literature.

Highlights

One of the first works to address detection of missing annotations in HMC datasets.
A novel state-of-the-art method based on post-processing predicted probabilities.
An experimental evaluation including 8 new datasets related to Oryza sativa Japonica.
Exploiting the hierarchy of functions helps to better identify missing gene functions.
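
The abstract and highlights describe post-processing predicted probabilities along the paths of functions in the hierarchy, but do not give the formula here. The following is only a minimal sketch of the general idea, under loud assumptions: the hierarchy is simplified to a tree given as a child-to-parent mapping, the path score is the plain mean of per-function probabilities from a function up to the root, and the names `path_to_root`, `path_score`, and `rank_missing_candidates` are hypothetical. It is not the authors' method.

```python
from statistics import mean


def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def path_score(node, probs, parent):
    """Aggregate probabilities along the path (mean is an assumed choice)."""
    return mean(probs[n] for n in path_to_root(node, parent))


def rank_missing_candidates(probs, annotated, parent):
    """Rank functions that are not annotated by path score, highest first."""
    candidates = [n for n in probs if n not in annotated]
    return sorted(
        candidates,
        key=lambda n: path_score(n, probs, parent),
        reverse=True,
    )


# Toy hierarchy for one gene (illustrative values, not data from the paper).
parent = {"root": None, "A": "root", "B": "root", "A1": "A", "B1": "B"}
# Per-function probabilities as a base classifier might predict them.
probs = {"root": 1.0, "A": 0.9, "B": 0.2, "A1": 0.6, "B1": 0.7}
# Functions the gene is currently annotated with.
annotated = {"root", "A"}

ranking = rank_missing_candidates(probs, annotated, parent)
print(ranking)
```

In this toy case, `A1` ranks above `B1` even though `B1` has the higher raw probability, because the ancestor `A` of `A1` is predicted with high confidence while the ancestor `B` of `B1` is not. This illustrates why path information can change which unannotated functions look like plausible missing annotations.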

Cited By

  • (2024) Follow the Path: Hierarchy-Aware Extreme Multi-Label Completion for Semantic Text Tagging, Proceedings of the ACM Web Conference 2024, pp. 2094-2105, doi:10.1145/3589334.3645558. Online publication date: 13-May-2024

          Published In

          Computers in Biology and Medicine  Volume 152, Issue C
          Jan 2023
          1242 pages

          Publisher

          Pergamon Press, Inc.

          United States

          Author Tags

          1. Detecting missing annotations
          2. Hierarchical multi-label classification
          3. Structured output prediction
          4. Gene function prediction
          5. Gene ontology hierarchy
          6. Random forest
          7. Tree ensembles
