Parallel and Scalable Precise Clustering for Homologous Protein Discovery

Byma, Stuart; Dhasade, Akash; Altenhoff, Adrian; Dessimoz, Christophe; Larus, James R.

arXiv:1908.10574 (cs)

[Submitted on 28 Aug 2019]

Abstract: This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the identification of homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of $n$ elements, such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison within each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of the similar pairs found by a full $O(n^2)$ comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604$\times$ on 768 cores (1400$\times$ faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.
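The idea that precise clustering lets similar pairs be found by all-to-all comparison within each cluster, instead of over the full set, can be illustrated with a minimal Python sketch. This is not ClusterMerge: the clusters are assumed to be given, and the names `similar`, `pairs_all_to_all`, `pairs_within_clusters`, `SEQS`, and `CLUSTERS` are hypothetical. The 3-mer overlap test is only a toy stand-in for a real sequence alignment score.

```python
from itertools import combinations


def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Toy similarity test (assumption): Jaccard overlap of 3-mers.

    A real pipeline would use an alignment score instead.
    """
    kmers_a = {a[i:i + 3] for i in range(len(a) - 2)}
    kmers_b = {b[i:i + 3] for i in range(len(b) - 2)}
    if not kmers_a or not kmers_b:
        return False
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b) >= threshold


def pairs_all_to_all(seqs):
    """Baseline: compare every pair in the full set, O(n^2) comparisons."""
    found, comparisons = set(), 0
    for i, j in combinations(range(len(seqs)), 2):
        comparisons += 1
        if similar(seqs[i], seqs[j]):
            found.add((i, j))
    return found, comparisons


def pairs_within_clusters(seqs, clusters):
    """Compare pairs only inside each cluster.

    Clusters may overlap, so pairs already compared are skipped.
    """
    found, seen, comparisons = set(), set(), 0
    for cluster in clusters:
        for i, j in combinations(sorted(cluster), 2):
            if (i, j) in seen:
                continue
            seen.add((i, j))
            comparisons += 1
            if similar(seqs[i], seqs[j]):
                found.add((i, j))
    return found, comparisons


# Made-up sequences and a hand-written clustering, for illustration only.
SEQS = ["MKTAYIAKQR", "MKTAYIAKQL", "GGSLWPQVNC", "GGSLWPQVNA"]
CLUSTERS = [{0, 1}, {2, 3}]

if __name__ == "__main__":
    full_pairs, full_ops = pairs_all_to_all(SEQS)
    clustered_pairs, clustered_ops = pairs_within_clusters(SEQS, CLUSTERS)
    # Fraction of the baseline's similar pairs recovered by the clustering.
    recall = len(clustered_pairs & full_pairs) / len(full_pairs)
    print("comparisons (full, clustered):", full_ops, clustered_ops)
    print("recall:", recall)
```

In this sketch the clustering is precise by construction, so the within-cluster pass recovers every pair the baseline finds while performing fewer comparisons. How ClusterMerge actually builds and merges clusters is described in the paper, not here.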

Comments:	11 pages, 11 figures. Submitted for publication
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:1908.10574 [cs.DC]
	(or arXiv:1908.10574v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.1908.10574

Submission history

From: Stuart Byma
[v1] Wed, 28 Aug 2019 07:11:38 UTC (481 KB)
