article

k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents

Authors:

Kyle H. Ambert,

Aaron M. Cohen

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 9, Issue 1

Pages 305 - 310

https://doi.org/10.1109/TCBB.2011.32

Published: 01 January 2012

Abstract

Although publicly accessible databases containing protein-protein interaction (PPI)-related information are important resources for bench and in silico research scientists alike, the time and effort required to keep them up to date is often burdensome. Text-mining tools from the machine learning discipline can help by identifying relevant PPI publications. Here, we describe and evaluate two document classification algorithms that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated the classifiers against one another. One of our systems was the overall best-performing system submitted to the challenge task. It uses a novel approach to k-nearest neighbor classification, which we describe here, and we compare its performance to that of two support vector machine-based classification systems, one of which was also evaluated in the challenge task.
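
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea the title suggests: a k-nearest neighbor classifier whose distance function scales each feature by its information gain. It is not the authors' implementation. The binary term-presence features, the toy data, the weighted Euclidean distance, the majority vote, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one binary (0/1) feature with respect to the labels."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain


def ig_scaled_knn_predict(train_X, train_y, query, k=3):
    """Majority vote over the k nearest training documents, using a
    Euclidean distance in which each feature is weighted by its
    information gain (an assumed weighting scheme, for illustration)."""
    n_features = len(train_X[0])
    weights = [
        information_gain([row[j] for row in train_X], train_y)
        for j in range(n_features)
    ]

    def distance(a, b):
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    neighbors = sorted(zip(train_X, train_y), key=lambda p: distance(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: each row marks the presence (1) or absence (0)
# of three terms in a document. The first term separates the classes.
train_X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]
train_y = ["ppi", "ppi", "ppi", "other", "other", "other"]

print(ig_scaled_knn_predict(train_X, train_y, [1, 1, 1], k=3))
```

In this toy corpus the first term receives an information gain of 1 bit while the other two receive far less, so documents that agree on the first term end up closest to one another. A real system would need a principled choice of k, feature extraction, and handling of class imbalance, none of which is shown here.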

Index Terms

k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Index terms have been assigned to the content through auto-classification.

Recommendations

Prediction of Protein-Protein Interaction Using Distance Frequency of Amino Acids Grouped with their Physicochemical Properties
BIC-TA '11: Proceedings of the 2011 Sixth International Conference on Bio-Inspired Computing: Theories and Applications

Protein-protein interactions (PPIs) play a key role in many cellular processes. These interactions form the basis of phenomena such as DNA replication and transcription, metabolic pathway, signaling pathway, and cell cycle control. Knowing how proteins ...

Prediction of protein-protein interaction types using the decision templates based on multiple classifier fusion

Protein-protein interactions (PPIs) play a key role in many cellular processes, such as the regulation of enzymes, signal transduction or mediating the adhesion of cells. Knowing about the multitude of PPIs that allow the cell to function can help the ...

Kernel Difference-Weighted k-Nearest Neighbors Classification
ICIC '07: Proceedings of the 3rd International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence

Nearest Neighbor (NN) rule is one of the simplest and most important methods in pattern recognition. In this paper, we propose a kernel difference-weighted k-nearest neighbor method (KDF-WKNN) for pattern classification. The proposed method ...

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 9, Issue 1

January 2012

341 pages

ISSN:1545-5963

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2012

Published in TCBB Volume 9, Issue 1

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 189
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 12 Aug 2024

Citations

Cited By

Zhang G, Zhou X, Yu X, Hao X, Yu L (2017) Enhancing Protein Conformational Space Sampling Using Distance Profile-Guided Differential Evolution. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14:6 (1288-1301). DOI: 10.1109/TCBB.2016.2566617. Online publication date: 1-Nov-2017
https://dl.acm.org/doi/10.1109/TCBB.2016.2566617

Zhu Q, Feng J, Huang J (2016) Natural neighbor. Pattern Recognition Letters 80:C (30-36). DOI: 10.1016/j.patrec.2016.05.007. Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1016/j.patrec.2016.05.007

Zhang F, Chen H (2016) An ensemble method for detecting shilling attacks based on ordered item sequences. Security and Communication Networks 9:7 (680-696). DOI: 10.1002/sec.1389. Online publication date: 10-May-2016
https://dl.acm.org/doi/10.1002/sec.1389

Tomašev N, Mladenić D (2014) Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowledge and Information Systems 39:1 (89-122). DOI: 10.1007/s10115-012-0607-5. Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1007/s10115-012-0607-5

Zhu Y, Zhou W, Dai D, Yan H (2013) Identification of DNA-Binding and Protein-Binding Proteins Using Enhanced Graph Wavelet Features. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10:4 (1017-1031). DOI: 10.1109/TCBB.2013.117. Online publication date: 1-Jul-2013
https://dl.acm.org/doi/10.1109/TCBB.2013.117

Keretna S, Lim C, Creighton D. Enhancement of Medical Named Entity Recognition Using Graph-Based Features. 2015 IEEE International Conference on Systems, Man, and Cybernetics (1895-1900). DOI: 10.1109/SMC.2015.331
https://dl.acm.org/doi/10.1109/SMC.2015.331
