k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents

Published: 01 January 2012

    Abstract

    Publicly accessible databases containing protein-protein interaction (PPI)-related information are important resources for bench and in silico research scientists alike, but the time and effort required to keep them up to date is often burdensome. Text-mining tools from the machine learning discipline can help by identifying relevant PPI publications. Here, we describe and evaluate two document classification algorithms that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated the submitted classifiers against one another. One of our systems was the overall best-performing system submitted to the challenge task. It uses a novel approach to k-nearest neighbor classification, which we describe here. We compare its performance to that of two support vector machine-based classification systems, one of which was also evaluated in the challenge task.

        Published In

        IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 9, Issue 1
        January 2012
        341 pages

        Publisher

        IEEE Computer Society Press

        Washington, DC, United States

        Author Tags

        1. Protein-protein interaction
        2. information gain
        3. k-nearest neighbor
        4. support vector machine
        5. text classification
