Open access

Data Structures to Represent a Set of k-long DNA Sequences

Published: 08 March 2021


The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
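To make the central object concrete, the following minimal Python sketch extracts the k-mers of a few strings and answers membership queries with a plain hash set. It is only an illustrative baseline, not one of the specialized representations the survey compares. The helper names `kmers` and `build_kmer_set` and the toy reads are assumptions introduced here.

```python
def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(reads, k):
    """Collect the distinct k-mers of all reads in a plain hash set."""
    kmer_set = set()
    for read in reads:
        kmer_set.update(kmers(read, k))
    return kmer_set


# Toy input, invented for illustration; real datasets contain millions of reads.
reads = ["ACGTAC", "GTACGG"]
kmer_set = build_kmer_set(reads, k=4)

# Membership query: is a given k-mer present in the set?
print("GTAC" in kmer_set)
print("TTTT" in kmer_set)
```

A hash set of strings like this stores every k-mer explicitly, which is simple but memory-hungry at the scale of sequencing data.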


Publication History

Published: 08 March 2021
Accepted: 01 October 2020
Revised: 01 June 2020
Received: 01 April 2019
Published in CSUR Volume 54, Issue 1 (January 2022)


