DOI: 10.1145/2975167.2975211

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems

Published: 02 October 2016

    Abstract

    Counting and indexing fixed-length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks, including genome alignment and mapping, genome assembly, and error correction. While advances in next-generation sequencing technologies have dramatically reduced cost and improved latency and throughput, few bioinformatics tools and libraries can efficiently process data sets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high-performance k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Using Kmerind, a user can easily instantiate application-specific indices, such as k-mer counters and position indices, from built-in or user-supplied components without extensive high-performance computing expertise. Kmerind's k-mer counter performs similarly to or better than existing, best-in-class k-mer counting tools, even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read data set in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. To our knowledge, Kmerind is the first k-mer indexing library for distributed memory environments, and the first fully customizable and extensible library for general k-mer indexing and counting. Kmerind is available from https://github.com/ParBLiSS/kmerind.
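
    To make the two index types named in the abstract concrete, the following is a minimal single-process sketch in Python. It is not Kmerind's API or implementation (Kmerind is a parallel library for distributed memory systems); the function names iter_kmers, count_kmers, and index_positions and the toy reads are hypothetical and chosen only for illustration. A count index maps each k-mer to its number of occurrences, and a position index maps each k-mer to the (read id, offset) pairs where it occurs.

```python
from collections import Counter, defaultdict


def iter_kmers(seq, k):
    """Yield (offset, k-mer) for every length-k substring of seq."""
    # Hypothetical helper; yields nothing when seq is shorter than k.
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def count_kmers(reads, k):
    """Count index: k-mer -> number of occurrences across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmer for _, kmer in iter_kmers(read, k))
    return counts


def index_positions(reads, k):
    """Position index: k-mer -> list of (read id, offset) pairs."""
    positions = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset, kmer in iter_kmers(read, k):
            positions[kmer].append((read_id, offset))
    return dict(positions)


# Toy input (an assumption for illustration, not data from the paper).
reads = ["ACGTAC", "GTACG"]
counts = count_kmers(reads, 3)
positions = index_positions(reads, 3)
```

    In this sketch a query is a plain dictionary lookup, for example counts["ACG"] or positions["GTA"]. A distributed implementation would additionally have to decide which process owns each k-mer, which this sketch does not attempt to show.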

    Published In

    BCB '16: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
    October 2016
    675 pages
    ISBN: 9781450342254
    DOI: 10.1145/2975167

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. MPI
    2. SIMD
    3. distributed computing
    4. k-mer counting
    5. k-mer index
    6. next generation sequencing
    7. parallel computing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    BCB '16

    Acceptance Rates

    Overall Acceptance Rate 254 of 885 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 76
    • Downloads (last 6 weeks): 13
    Reflects downloads up to 27 Jul 2024.

    Cited By

    • (2023) Invited: Accelerating Genome Analysis via Algorithm-Architecture Co-Design. 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1-4. DOI: 10.1109/DAC56929.2023.10247887. Online publication date: 9-Jul-2023.
    • (2019) Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinformatics, 20:S4. DOI: 10.1186/s12859-019-2694-8. Online publication date: 18-Apr-2019.
    • (2019) Counting Kmers for Biological Sequences at Large Scale. Interdisciplinary Sciences: Computational Life Sciences. DOI: 10.1007/s12539-019-00348-5. Online publication date: 16-Nov-2019.
    • (2018) K-mer Counting: memory-efficient strategy, parallel computing and field of application for Bioinformatics. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2561-2567. DOI: 10.1109/BIBM.2018.8621325. Online publication date: Dec-2018.
    • (2018) K-mer Counting for Genomic Big Data. Big Data – BigData 2018, pages 345-351. DOI: 10.1007/978-3-319-94301-5_28. Online publication date: 21-Jun-2018.
    • (2017) Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 417-424. DOI: 10.1145/3107411.3107491. Online publication date: 20-Aug-2017.
    • (2017) An Adaptive Parallel Algorithm for Computing Connected Components. IEEE Transactions on Parallel and Distributed Systems, 28:9, pages 2428-2439. DOI: 10.1109/TPDS.2017.2672739. Online publication date: 1-Sep-2017.
    • (2017) Bloomfish: A Highly Scalable Distributed K-mer Counting Framework. 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pages 170-179. DOI: 10.1109/ICPADS.2017.00033. Online publication date: Dec-2017.
    • (2017) Probabilistic estimation of overlap graphs for large sequence datasets. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 247-252. DOI: 10.1109/BIBM.2017.8217657. Online publication date: Nov-2017.
