K-mer Mapping and RDBMS Indexes

de Armas, Elvismary Molina; Ferreira, Paulo Cavalcanti Gomes; Haeusler, Edward Hermann; de Holanda, Maristela Terto; Lifschitz, Sérgio

doi:10.1007/978-3-030-46417-2_7

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11347)

Included in the following conference series:

Brazilian Symposium on Bioinformatics

Abstract

K-mer mapping, an internal step of de novo NGS genome fragment assembly methods, is computationally challenging because of its high main-memory consumption. We study index-based methods for dealing with this problem in an RDBMS environment. We propose an ad hoc I/O cost model and analyze the performance of hash and B-tree index structures. Furthermore, we present a novel hash-based index that takes into account the notion of minimum substrings. Experiments with an actual RDBMS implementation on a sugarcane dataset show that one can obtain considerable performance gains while reducing main-memory requirements.
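
As an illustration of the terms used above, the following minimal Python sketch extracts the k-mers of a read and groups them by their minimum substring, taken here to mean the lexicographically smallest length-p substring of a k-mer. It is not the index proposed in the chapter, whose design is not shown on this page; the function names and the parameters k and p are assumptions made for this example only.

```python
def kmers(read: str, k: int):
    """Yield every length-k substring (k-mer) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def minimum_substring(kmer: str, p: int) -> str:
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_by_minimum_substring(read: str, k: int, p: int) -> dict:
    """Group the k-mers of a read by their minimum p-substring (hypothetical helper)."""
    buckets = {}
    for kmer in kmers(read, k):
        buckets.setdefault(minimum_substring(kmer, p), []).append(kmer)
    return buckets
```

Adjacent k-mers overlap in k-1 characters and so often share the same minimum substring, which is why a key of this kind tends to place neighbouring k-mers in the same bucket.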

Author information

Authors and Affiliations

Depto. Informática, PUC-Rio, Rio de Janeiro, Brazil
Elvismary Molina de Armas, Edward Hermann Haeusler & Sérgio Lifschitz
Depto. Ciência da Computação, UNB, Brasília, Brazil
Maristela Terto de Holanda
Depto. Bioquimica Médica, UFRJ, Rio de Janeiro, Brazil
Paulo Cavalcanti Gomes Ferreira

Corresponding author

Correspondence to Elvismary Molina de Armas.

Editor information

Editors and Affiliations

Fluminense Federal University, Niterói, Brazil
Luis Kowada
Fluminense Federal University, Niterói, Brazil
Daniel de Oliveira

About this paper

Cite this paper

de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S. (2020). K-mer Mapping and RDBMS Indexes. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_7

DOI: https://doi.org/10.1007/978-3-030-46417-2_7
Published: 29 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46416-5
Online ISBN: 978-3-030-46417-2
eBook Packages: Computer Science, Computer Science (R0)
