Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems

Authors:

Srinivas Aluru

BCB '16: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Pages 422 - 433

https://doi.org/10.1145/2975167.2975211

Published: 02 October 2016

Abstract

Counting and indexing fixed-length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks, including genome alignment and mapping, genome assembly, and error correction. While advances in next-generation sequencing technologies have dramatically reduced cost and improved latency and throughput, few bioinformatics tools and libraries can efficiently process data sets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high-performance k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Using Kmerind, a user can easily instantiate application-specific indices, such as a k-mer counter or a position index, from built-in or user-supplied components without extensive high-performance computing expertise. Kmerind's k-mer counter performs comparably to or better than existing best-in-class k-mer counting tools, even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read data set in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. To our knowledge, Kmerind is the first k-mer indexing library for distributed memory environments, and the first fully customizable and extensible library for general k-mer indexing and counting. Kmerind is available from https://github.com/ParBLiSS/kmerind.
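
To make the two index types named in the abstract concrete, the following is a minimal sequential sketch in Python of a k-mer count index and a k-mer position index. It is not Kmerind's API or implementation: the function names (`kmers`, `count_index`, `position_index`, `owner_rank`, `partition_by_owner`) are hypothetical, and the hash-based assignment of k-mers to ranks is only an assumption about how a distributed index might partition its keys, shown here without any actual message passing.

```python
import zlib
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every length-k window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def count_index(reads, k):
    """Map each k-mer to its number of occurrences across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmer for _, kmer in kmers(read, k))
    return counts


def position_index(reads, k):
    """Map each k-mer to the list of (read_id, offset) where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset, kmer in kmers(read, k):
            index[kmer].append((read_id, offset))
    return index


def owner_rank(kmer, num_ranks):
    """Assumed partitioning rule: a stable hash of the k-mer modulo the
    number of ranks. zlib.crc32 is used because Python's built-in hash()
    for strings is randomized per process."""
    return zlib.crc32(kmer.encode("ascii")) % num_ranks


def partition_by_owner(index, num_ranks):
    """Split an index into one dict per rank, as a stand-in for the
    data exchange a distributed implementation would perform."""
    parts = [dict() for _ in range(num_ranks)]
    for kmer, value in index.items():
        parts[owner_rank(kmer, num_ranks)][kmer] = value
    return parts


# Toy input: two short reads, k = 3.
reads = ["ACGTAC", "GTACG"]
counts = count_index(reads, 3)
positions = position_index(reads, 3)
parts = partition_by_owner(counts, 4)
```

In this sketch a query for a k-mer would first compute `owner_rank` and then look the k-mer up only in that rank's partition, which is the general idea behind answering lookups in a hash-partitioned index.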

Index Terms

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems
1. Applied computing
  1. Document management and text processing
    1. Document preparation
  2. Life and medical sciences
2. Software and its engineering
  1. Software notations and tools
    1. Software libraries and repositories

Published In

BCB '16: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

October 2016

675 pages

ISBN:9781450342254

DOI:10.1145/2975167

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

BCB '16

Sponsor:

SIGBio

BCB '16: ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

October 2 - 5, 2016

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%

Bibliometrics

Total Citations: 9
Total Downloads: 498
Downloads (Last 12 months): 76
Downloads (Last 6 weeks): 13

Reflects downloads up to 27 Jul 2024

Cited By

Mutlu O, Firtina C (2023). Invited: Accelerating Genome Analysis via Algorithm-Architecture Co-Design. 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-4. Online publication date: 9-Jul-2023.
https://doi.org/10.1109/DAC56929.2023.10247887
Ferraro Petrillo U, Sorella M, Cattaneo G, Giancarlo R, Rombo S (2019). Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinformatics 20:S4. Online publication date: 18-Apr-2019.
https://doi.org/10.1186/s12859-019-2694-8
Ge J, Meng J, Guo N, Wei Y, Balaji P, Feng S (2019). Counting Kmers for Biological Sequences at Large Scale. Interdisciplinary Sciences: Computational Life Sciences. Online publication date: 16-Nov-2019.
https://doi.org/10.1007/s12539-019-00348-5
Xiao M, Li J, Hong S, Yang Y, Li J, Wang J, Yang J, Ding W, Zhang L (2018). K-mer Counting: memory-efficient strategy, parallel computing and field of application for Bioinformatics. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2561-2567. Online publication date: Dec-2018.
https://doi.org/10.1109/BIBM.2018.8621325
Ge J, Guo N, Meng J, Wang B, Balaji P, Feng S, Zhou J, Wei Y (2018). K-mer Counting for Genomic Big Data. Big Data – BigData 2018, pp. 345-351. Online publication date: 21-Jun-2018.
https://doi.org/10.1007/978-3-319-94301-5_28
Jammula N, Chockalingam S, Aluru S, Haspel N, Cowen L, Shehu A, Kahveci T, Pozzi G (2017). Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 417-424. Online publication date: 20-Aug-2017.
https://dl.acm.org/doi/10.1145/3107411.3107491
Jain C, Flick P, Pan T, Green O, Aluru S (2017). An Adaptive Parallel Algorithm for Computing Connected Components. IEEE Transactions on Parallel and Distributed Systems 28:9, pp. 2428-2439. Online publication date: 1-Sep-2017.
https://doi.org/10.1109/TPDS.2017.2672739
Gao T, Guo Y, Wei Y, Wang B, Lu Y, Cicotti P, Balaji P, Taufer M (2017). Bloomfish: A Highly Scalable Distributed K-mer Counting Framework. 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pp. 170-179. Online publication date: Dec-2017.
https://doi.org/10.1109/ICPADS.2017.00033
Nihalani R, Chockalingam S, Zhu S, Vazirani V, Aluru S (2017). Probabilistic estimation of overlap graphs for large sequence datasets. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 247-252. Online publication date: Nov-2017.
https://doi.org/10.1109/BIBM.2017.8217657
