DOI: 10.1145/2382936.2382978
short-paper

DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences

Published: 07 October 2012

Abstract

Recent advances in next generation sequencing (NGS) techniques have enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing of individual microbial species in the lab. One can profile the 16S rRNA marker genes encoded in a sample by amplifying highly variable regions of the genes and sequencing them with Roche/454 sequencers, generating from half a million to a few million 16S rRNA fragments of about 400 base pairs. The main computational challenge in analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) have quadratic space and time complexity, which makes them unsuitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analysis, their hard-cutoff similarity threshold and random starting seeds can result in reduced accuracy and overestimation (too many clusters). In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline that addresses the overestimation problem along with the space and time complexity issues while giving robust results. The pipeline starts with a parallel pairwise sequence alignment analysis, followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed during clustering. Experiments with our system also showed that the quadratic time and space complexity issue can be addressed with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirements. Furthermore, SSP-Tree speeds up fine-tuning of an existing result, which makes recursive clustering feasible for achieving accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods.
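
The deterministic annealing step named in the abstract can be illustrated with a minimal sketch. This is not the DACIDR implementation, which clusters pairwise sequence distances in parallel; it assumes the points already live in a small vector space, and the function name `da_cluster` and the parameters `t_start`, `t_end`, `cooling`, and `iters` are hypothetical choices made for illustration. Each point is softly assigned to every center with a weight proportional to exp(-d²/T), and the temperature T is lowered gradually, so clusters separate without a hard similarity cutoff.

```python
import math
import random


def da_cluster(points, k, t_start=10.0, t_end=0.01, cooling=0.9, iters=20, seed=0):
    """Soft-assign points to k centers while lowering a temperature T.

    Illustrative vector-space sketch only; all parameter names are assumptions.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # All centers start near the global mean; a tiny jitter lets them split as T drops.
    centers = [[m + rng.uniform(-1e-3, 1e-3) for m in mean] for _ in range(k)]

    def sq_dist(p, c):
        return sum((p[d] - c[d]) ** 2 for d in range(dim))

    t = t_start
    while t > t_end:
        for _ in range(iters):
            sums = [[0.0] * dim for _ in range(k)]
            weights = [0.0] * k
            for p in points:
                d2 = [sq_dist(p, c) for c in centers]
                lo = min(d2)  # subtract the minimum for numerical stability
                g = [math.exp(-(x - lo) / t) for x in d2]
                z = sum(g)
                for j in range(k):
                    w = g[j] / z  # soft membership of p in cluster j
                    weights[j] += w
                    for d in range(dim):
                        sums[j][d] += w * p[d]
            # Each center moves to the membership-weighted mean of all points.
            centers = [
                [sums[j][d] / weights[j] for d in range(dim)] if weights[j] > 0 else centers[j]
                for j in range(k)
            ]
        t *= cooling  # annealing schedule: cool geometrically

    # Final hard labels: nearest center for each point.
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return centers, labels


# Usage: two well-separated groups of 2D points (made-up data).
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (10.0, 10.0), (10.1, 9.8), (9.9, 10.2)]
centers, labels = da_cluster(points, 2)
```

The sketch fixes the number of centers in advance for brevity; the only tuning knobs are the temperature schedule and iteration count, not a similarity threshold.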

Published In

BCB '12: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
October 2012
725 pages
ISBN: 9781450316705
DOI: 10.1145/2382936
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deterministic annealing
  2. exploratory data analysis
  3. interpolation
  4. multidimensional scaling
  5. pairwise data clustering

Qualifiers

  • Short-paper

Conference

BCB '12

Acceptance Rates

BCB '12 Paper Acceptance Rate 33 of 159 submissions, 21%;
Overall Acceptance Rate 254 of 885 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 16 Jan 2025

Cited By

  • (2020) Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.1109/SC41405.2020.00079 (1-14). Online publication date: Nov-2020
  • (2017) Sparse Inductive Embedding: An Explorative Data Visualization Technique. 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). DOI: 10.1109/ICTAI.2017.00099 (618-622). Online publication date: Nov-2017
  • (2016) Phylogenetically Structured Differences in rRNA Gene Sequence Variation among Species of Arbuscular Mycorrhizal Fungi and Their Implications for Sequence Clustering. Applied and Environmental Microbiology. DOI: 10.1128/AEM.00816-16. 82:16 (4921-4930). Online publication date: 3-Jun-2016
  • (2016) TSmap3D: Browser visualization of high dimensional time series data. 2016 IEEE International Conference on Big Data (Big Data). DOI: 10.1109/BigData.2016.7841022 (3583-3592). Online publication date: Dec-2016
  • (2016) Multidimensional Scaling for Genomic Data. Advances in Stochastic and Deterministic Global Optimization. DOI: 10.1007/978-3-319-29975-4_7 (129-139). Online publication date: 5-Nov-2016
  • (2014) Integration of clustering and multidimensional scaling to determine phylogenetic trees as spherical phylograms visualized in 3 dimensions. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. DOI: 10.1109/CCGrid.2014.126 (720-729). Online publication date: 26-May-2014
  • (2013) A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of the 2013 IEEE 9th International Conference on e-Science. DOI: 10.1109/eScience.2013.30 (61-69). Online publication date: 22-Oct-2013
  • (2013) EDR: An energy-aware runtime load distribution system for data-intensive applications in the cloud. 2013 IEEE International Conference on Cluster Computing (CLUSTER). DOI: 10.1109/CLUSTER.2013.6702674 (1-8). Online publication date: Sep-2013
  • (2013) Co-processing SPMD computation on CPUs and GPUs cluster. 2013 IEEE International Conference on Cluster Computing (CLUSTER). DOI: 10.1109/CLUSTER.2013.6702632 (1-10). Online publication date: Sep-2013
  • (2013) Parallel deterministic annealing clustering and its application to LC-MS data analysis. 2013 IEEE International Conference on Big Data. DOI: 10.1109/BigData.2013.6691636 (665-673). Online publication date: Oct-2013
