Clustering of highly homologous sequences to reduce the size of large protein databases

W Li; L Jaroszewski; A Godzik

doi:10.1093/bioinformatics/17.3.282

Bioinformatics. 2001 Mar;17(3):282-3. doi: 10.1093/bioinformatics/17.3.282.

Authors

W Li¹, L Jaroszewski, A Godzik

Affiliation

¹ San Diego Supercomputer Center, La Jolla, CA 92093, USA. liwz@sdsc.edu

PMID: 11294794

Abstract

We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.
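
The abstract does not describe the clustering procedure, so the following is only a minimal sketch of one way to reduce a sequence set to representatives at a chosen identity level. It assumes a greedy incremental scheme: sequences are processed longest first, and each one either joins the first representative it matches at or above the threshold or becomes a new representative. The names `identity` and `greedy_cluster`, the 0.9 default threshold, and the toy sequences are all hypothetical. The identity measure is a crude stand-in built on Python's `difflib`, not a real sequence alignment, and the loop makes no attempt at the speed reported above.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Crude identity estimate (assumption, not a real alignment):
    residues in common matching blocks / length of the shorter sequence."""
    if not a or not b:
        return 0.0
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Map each representative name to the names of its cluster members."""
    clusters: Dict[str, List[str]] = {}
    # Longest first, so each representative is the longest member of its cluster.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy input (hypothetical sequences, for illustration only).
example = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",
    "c": "GGGGGWWWWWPPPPPCCCCC",
}
clusters = greedy_cluster(example, threshold=0.9)
representatives = {name: example[name] for name in clusters}
```

Here `representatives` plays the role of the reduced output database: only one sequence per cluster is kept, and the threshold controls how aggressively near-duplicates are merged.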

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms
Databases, Factual*
Proteins / analysis*
Sequence Analysis
Software*

Substances

Proteins

Grants and funding

GM60049/GM/NIGMS NIH HHS/United States