Indexing Finite-State Automata Using Forward-Stable Partitions

Authors: Ruben Becker, Sung-Hwan Kim, Nicola Prezza, Carlo Tosoni

Abstract: An index on a finite-state automaton is a data structure able to locate specific patterns on the automaton's paths and consequently on the regular language accepted by the automaton itself. Cotumaccio and Prezza [SODA '21] introduced a data structure able to solve pattern matching queries on automata, generalizing the famous FM-index for strings of Ferragina and Manzini [FOCS '00]. The efficiency of their index depends on the width of a particular partial order of the automaton's states: the smaller the width of the partial order, the faster the index. However, computing the partial order of minimal width is NP-hard. This problem was mitigated by Cotumaccio [DCC '22], who relaxed the conditions on the partial order, allowing it to be a partial preorder. This relaxation yields the existence of a unique partial preorder of minimal width that can be computed in polynomial time. In the paper at hand, we present a new class of partial preorders and show that they have the following useful properties: (i) they can be computed in polynomial time, (ii) their width is never larger than the width of Cotumaccio's preorders, and (iii) there exist infinite classes of automata on which the width of Cotumaccio's preorder is linearly larger than the width of our preorder.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 20 pages, 3 figures, submitted to SPIRE 2024

arXiv:2404.14235 [pdf, other]

Computing the LCP Array of a Labeled Graph

Authors: Jarno Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza

Abstract: The LCP array is an important tool in stringology, allowing to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the definition of this array to Wheeler DFAs and, ultimately, to arbitrary labeled graphs, proving that it can be used to efficiently solve matching statistics queries on the graph's paths. In this paper, we provide the first efficient algorithm building the LCP array of a directed labeled graph with $n$ nodes and $m$ edges labeled over an alphabet of size $σ$. After arguing that the natural generalization of a compact-space LCP-construction algorithm by Beller et al. [J. Discrete Algorithms 2013] runs in time $Ω(nσ)$, we present a new algorithm based on dynamic range stabbing building the LCP array in $O(n\log σ)$ time and $O(n\logσ)$ bits of working space.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2312.01359 [pdf, other]

Suffixient Sets

Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most $2 \bar{r}$, where $\bar{r}$ is the number of runs in the Burrows-Wheeler Transform of the reverse of $T$. We then show that, given a straight-line program for $T$ with $g$ rules, we can build an $O (\bar{r} + g)$-space index with which, given a pattern $P [1..m]$, we can find the maximal exact matches (MEMs) of $P$ with respect to $T$ in $O (m \log (σ) / \log n + d \log n)$ time, where $σ$ is the size of the alphabet and $d$ is the number of times we would fully or partially descend edges in the suffix tree of $T$ while finding those MEMs.

Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2310.17980 [pdf, other]

Sketching and Streaming for Dictionary Compression

Authors: Ruben Becker, Matteo Canton, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza

Abstract: We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function $δ$. We present a data sketch of $O(ε^{-3}\log n + ε^{-1}\log^2 n)$ words that allows computing a multiplicative $(1\pm ε)$-approximation of $δ$ with high probability, where $n$ is the string length. The sketches of two strings $S_1,S_2$ can be merged in $O(ε^{-1}\log^2 n)$ time to yield the sketch of $\{S_1,S_2\}$, speeding up by orders of magnitude tasks such as the computation of all-pairs \emph{Normalized Compression Distances} (NCD). If random access is available on the input, our sketch can be updated in $O(ε^{-1}\log^2 n)$ time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating $δ$, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and $O(ε^{-1}\log^2 n)$ worst-case delay per character.
We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.

Submitted 9 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

arXiv:2307.07267 [pdf, other]

Random Wheeler Automata

Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, Nicola Prezza

Abstract: Wheeler automata were introduced in 2017 as a tool to generalize existing indexing and compression techniques based on the Burrows-Wheeler transform. Intuitively, an automaton is said to be Wheeler if there exists a total order on its states reflecting the co-lexicographic order of the strings labeling the automaton's paths; this property makes it possible to represent the automaton's topology in a constant number of bits per transition, as well as efficiently solving pattern matching queries on its accepted regular language. After their introduction, Wheeler automata have been the subject of a prolific line of research, both from the algorithmic and language-theoretic points of view. A recurring issue faced in these studies is the lack of large datasets of Wheeler automata on which the developed algorithms and theories could be tested. One possible way to overcome this issue is to generate random Wheeler automata. Motivated by this observation, in this paper we initiate the theoretical study of random Wheeler automata, focusing on the deterministic case (Wheeler DFAs -- WDFAs). We start by extending the Erdős-Rényi random graph model to WDFAs, and proceed by providing an algorithm generating uniform WDFAs according to this model. Our algorithm generates a uniform WDFA with $n$ states, $m$ transitions, and alphabet's cardinality $σ$ in $O(m)$ expected time ($O(m\log m)$ worst-case time w.h.p.) and constant working space for all alphabets of size $σ\le m/\ln m$.
As a by-product, we also give formulas for the number of distinct WDFAs and obtain that $ nσ+ (n - σ) \log σ$ bits are necessary and sufficient to encode a WDFA with $n$ states and alphabet of size $σ$, up to an additive $Θ(n)$ term. We present an implementation of our algorithm and show that it is extremely fast in practice, with a throughput of over 8 million transitions per second.

Submitted 7 June, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

Comments: 17 pages, 3 figures

arXiv:2306.05684 [pdf, ps, other]

Space-time Trade-offs for the LCP Array of Wheeler DFAs

Authors: Nicola Cotumaccio, Travis Gagie, Dominik Köppl, Nicola Prezza

Abstract: Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $ O(n \log n) $ bits, $ n $ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of alphabet is constant. In this paper, we propose a sampling technique that allows to access an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time trade-off to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a $ k $-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in $ k $, thus improving a previous bound by Boucher et al. which was linear in $ k $ [DCC 2015].

Submitted 9 June, 2023; originally announced June 2023.

arXiv:2306.04737 [pdf, other]

Optimal Wheeler Language Recognition

Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Alberto Policriti, Nicola Prezza

Abstract: A Wheeler automaton is a finite state automaton whose states admit a total Wheeler order, reflecting the co-lexicographic order of the strings labeling source-to-node paths. A Wheeler language is a regular language admitting an accepting Wheeler automaton. Wheeler languages admit efficient and elegant solutions to hard problems such as automata compression and regular expression matching, therefore deciding whether a regular language is Wheeler is relevant in applications requiring efficient solutions to those problems. In this paper, we show that it is possible to decide whether a DFA with n states and m transitions recognizes a Wheeler language in $O(mn)$ time. This is a significant improvement over the running time $O(n^{13} + m\log n)$ of the previous polynomial-time algorithm (Alanko et al., Information and Computation 2021). A proof-of-concept implementation of this algorithm is available in a public repository. We complement this upper bound with a conditional matching lower bound stating that, unless the strong exponential time hypothesis (SETH) fails, the problem cannot be solved in strongly subquadratic time. The same problem is known to be PSPACE-complete when the input is an NFA (D'Agostino et al., Theoretical Computer Science 2023). Together with that result, our paper essentially closes the algorithmic problem of Wheeler language recognition.

Submitted 18 December, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

arXiv:2305.05129 [pdf, other]

Sorting Finite Automata via Partition Refinement

Authors: Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, Nicola Prezza

Abstract: Wheeler nondeterministic finite automata (WNFAs) were introduced as a generalization of prefix sorting from strings to labeled graphs. WNFAs admit optimal solutions to classic hard problems on labeled graphs and languages. The problem of deciding whether a given NFA is Wheeler is known to be NP-complete. Recently, however, Alanko et al. showed how to side-step this complexity by switching to preorders: letting $Q$ be the set of states, $E$ the set of transitions, $|Q|=n$, and $|E|=m$, they provided a $O(mn^2)$-time algorithm computing a totally-ordered partition of the WNFA's states such that (1) equivalent states recognize the same regular language, and (2) the order of non-equivalent states is consistent with any Wheeler order, when one exists. Then, the output is a preorder of the states as useful for pattern matching as standard Wheeler orders. Further research generalized these concepts to arbitrary NFAs by introducing co-lex partial preorders: any NFA admits a partial preorder of its states reflecting the co-lex order of their accepted strings; the smaller the width of such preorder is, the faster regular expression matching queries can be performed. To date, the fastest algorithm for computing the smallest-width partial preorder on NFAs runs in $O(m^2+n^{5/2})$ time, while on DFAs the same can be done in $O(\min(n^2\log n,mn))$ time. In this paper, we provide much more efficient solutions to the problem above.
Our results are achieved by extending a classic algorithm for the relational coarsest partition refinement problem to work with ordered partitions. Specifically, we provide a $O(m\log n)$-time algorithm computing a co-lex total preorder when the input is a WNFA, and an algorithm with the same time complexity computing the smallest-width co-lex partial order of any DFA. Also, we present implementations of our algorithms and show that they are very efficient in practice.

Submitted 18 December, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

arXiv:2305.03626 [pdf, other]

Verifiable Learning for Robust Tree Ensembles

Authors: Stefano Calzavara, Lorenzo Cazzaro, Giulio Ermanno Pibiri, Nicola Prezza

Abstract: Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on public datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, at the cost of an acceptable loss of accuracy in the non-adversarial setting.

Submitted 11 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: 19 pages, 5 figures; full version of the revised paper accepted at ACM CCS 2023 with corrected typo in footnote 1

arXiv:2304.10962 [pdf, other]

Faster Prefix-Sorting Algorithms for Deterministic Finite Automata

Authors: Sung-Hwan Kim, Francisco Olivares, Nicola Prezza

Abstract: Sorting is a fundamental algorithmic pre-processing technique which often allows to represent data more compactly and, at the same time, speeds up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these techniques were extended to sets of strings described by means of finite automata: the theory of Wheeler graphs [Gagie et al., TCS'17] introduced automata whose states can be totally-sorted according to the co-lexicographic (co-lex in the following) order of the prefixes of words accepted by the automaton. More recently, in [Cotumaccio, Prezza, SODA'21] it was shown how to extend these ideas to arbitrary automata by means of partial co-lex orders. This work showed that a co-lex order of minimum width (thus optimizing search query times) on deterministic finite automata (DFAs) can be computed in $O(m^2 + n^{5/2})$ time, $m$ being the number of transitions and $n$ the number of states of the input DFA. In this paper, we exhibit new combinatorial properties of the minimum-width co-lex order of DFAs and exploit them to design faster prefix sorting algorithms. In particular, we describe two algorithms sorting arbitrary DFAs in $O(mn)$ and $O(n^2\log n)$ time, respectively, and an algorithm sorting acyclic DFAs in $O(m\log n)$ time. Within these running times, all algorithms compute also a smallest chain partition of the partial order (required to index the DFA).
We present an experiment result to show that an optimized implementation of the $O(n^2\log n)$-time algorithm exhibits a nearly-linear behaviour on large deterministic pan-genomic graphs and is thus also of practical interest.

Submitted 21 April, 2023; originally announced April 2023.

arXiv:2301.05338 [pdf, ps, other]

Computing matching statistics on Wheeler DFAs

Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

Submitted 12 January, 2023; originally announced January 2023.

arXiv:2301.00754 [pdf, other]

Algorithms for Massive Data -- Lecture Notes

Authors: Nicola Prezza

Abstract: These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The goal of this course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. There are two main solutions to deal with massive data: (lossless) compressed data structures and (lossy) data sketches. These notes cover both topics: compressed suffix arrays, probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, algorithms on streams (pattern matching, counting).

Submitted 25 March, 2024; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: added chapter 1 on compressed data structures. Fixed a few mistakes (Bloom filter analysis) and typos

arXiv:2208.04931 [pdf, ps, other]

Co-lexicographically Ordering Automata and Regular Languages -- Part I

Authors: Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

Abstract: In the present work, we lay out a new theory showing that all automata can always be co-lexicographically partially ordered, and an intrinsic measure of their complexity can be defined and effectively determined, namely, the minimum width $p$ of one of their admissible co-lex partial orders - dubbed here the automaton's co-lex width. We first show that this new measure captures at once the complexity of several seemingly-unrelated hard problems on automata. Any NFA of co-lex width $p$: (i) has an equivalent powerset DFA whose size is exponential in $p$ rather than (as a classic analysis shows) in the NFA's size; (ii) can be encoded using just $Θ(\log p)$ bits per transition; (iii) admits a linear-space data structure solving regular expression matching queries in time proportional to $p^2$ per matched character. Some consequences of this new parametrization of automata are that PSPACE-hard problems such as NFA equivalence are FPT in $p$, and quadratic lower bounds for the regular expression matching problem do not hold for sufficiently small $p$. We prove that a canonical minimum-width DFA accepting a language $\mathcal L$ - dubbed the Hasse automaton $\mathcal H$ of $\mathcal L$ - can be exhibited. Finally, we explore the relationship between two conflicting objectives: minimizing the width and minimizing the number of states of a DFA. In this context, we provide an analogue of the Myhill-Nerode Theorem for co-lexicographically ordered regular languages.

Submitted 3 May, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

Comments: arXiv admin note: text overlap with arXiv:2106.02309

arXiv:2111.02480 [pdf, ps, other]

Linear-time Minimization of Wheeler DFAs

Authors: Jarno Alanko, Nicola Cotumaccio, Nicola Prezza

Abstract: Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just $\logσ+ O(1)$ bits per edge, $σ$ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input $\mathcal A$ is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in $O(|\mathcal A|\log |\mathcal A|)$ time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear $O(|\mathcal A|)$ time. When run on de Bruijn WDFAs built from real DNA datasets, an implementation of our algorithm reduces the number of nodes from 14% to 51% at a speed of more than 1 million nodes per second.

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2111.02478 [pdf, other]

HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances

Authors: Dominik Köppl, Gonzalo Navarro, Nicola Prezza

Abstract: We propose a new representation of the offsets of the Lempel-Ziv (LZ) factorization based on the co-lexicographic order of the processed prefixes. The selected offsets tend to approach the k-th order empirical entropy. Our evaluations show that this choice of offsets is superior to the rightmost LZ parsing and the bit-optimal LZ parsing on datasets with small high-order entropy.

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2106.02309 [pdf, ps, other]

On (co-lex) Ordering Automata

Authors: Giovanna D'Agostino, Nicola Cotumaccio, Alberto Policriti, Nicola Prezza

Abstract: The states of a deterministic finite automaton A can be identified with collections of words in Pf(L(A)) -- the set of prefixes of words belonging to the regular language accepted by A. But words can be ordered and among the many possible orders a very natural one is the co-lexicographic one. Such naturalness stems from the fact that it suggests a transfer of the order from words to the automaton's states. In a number of papers automata admitting a total ordering of states coherent with the ordering of the set of words reaching them have been proposed. Such class of ordered automata -- the Wheeler automata -- turned out to be efficiently stored/searched using an index. Unfortunately not all automata can be totally ordered as previously outlined. However, automata can always be partially ordered and an intrinsic measure of their complexity can be defined and effectively determined, as the minimum width of one of their admissible partial orders. As shown in previous works, this new concept of width of an automaton has useful consequences in the fields of graph compression, indexing data structures, and automata theory. In this paper we prove that a canonical, minimum-width, partially-ordered automaton accepting a language L -- dubbed the Hasse automaton H of L -- can be exhibited. H provides, in a precise sense, the best possible way to (partially) order the states of any automaton accepting L, as long as we want to maintain an operational link with the (co-lexicographic) order of Pf(L(A)).
Using H we prove that the width of the language can be effectively computed from the minimum automaton recognizing the language. Finally, we explore the relationship between two (often conflicting) objectives: minimizing the width and minimizing the number of states of an automaton.

Submitted 4 June, 2021; originally announced June 2021.

arXiv:2102.06798 [pdf, ps, other]

Co-lexicographically Ordering Automata and Regular Languages -- Part II

Authors: Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

Abstract: In the present work, we tackle the regular language indexing problem by first studying the hierarchy of $p$-sortable languages: regular languages accepted by automata of width $p$. We show that the hierarchy is strict and does not collapse, and provide (exponential in $p$) upper and lower bounds relating the minimum widths of equivalent NFAs and DFAs. Our bounds indicate the importance of being able to index NFAs, as they enable indexing regular languages with much faster and smaller indexes. Our second contribution solves precisely this problem, optimally: we devise a polynomial-time algorithm that indexes any NFA with the optimal value $p$ for its width, without explicitly computing $p$ (NP-hard to find). In particular, this implies that we can index in polynomial time the well-studied case $p=1$ (Wheeler NFAs). More generally, in polynomial time we can build an index breaking the worst-case conditional lower bound of $Ω(|P| m)$, whenever the input NFA's width is $p \in o(\sqrt{m})$.

Submitted 10 March, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

arXiv:2011.10008 [pdf, other]

Subpath Queries on Compressed Graphs: a Survey

Authors: Nicola Prezza

Abstract: Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text $T$, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in $T$ in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in a space proportional to the text's entropy. These contributions had an enormous impact in bioinformatics: nowadays, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing regular languages to be indexed and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.

Submitted 13 December, 2020; v1 submitted 19 November, 2020; originally announced November 2020.

Comments: Fixed some typos and references to Boyer-Moore-Galil's and Apostolico-Giancarlo's algorithms

arXiv:2011.07143 [pdf, ps, other]

Adaptive Learning of Compressible Strings

Authors: Gabriele Fici, Nicola Prezza, Rossano Venturini

Abstract: Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle $σn/4 -O(n)$ queries in order to be able to reconstruct the hidden string, where $σ$ is the size of the alphabet of $S$ and $n$ its length, and gave an algorithm that spends $(σ-1)n+O(σ\sqrt{n})$ queries to reconstruct $S$. The main contribution of our paper is to improve the above upper-bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to $τ$ bits, performs $q=O(τ)$ substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length $n$ over an integer alphabet of size $σ$ with $rle$ runs can be reconstructed with $q=O(rle (σ+ \log \frac{n}{rle}))$ substring queries in linear time and space. We then present an algorithm that spends $q \in O(σg\log n)$ substring queries and runs in $O(n(\log n + \log σ)+ q)$ time using linear space, where $g$ is the size of a smallest straight-line program generating the string.

Submitted 19 October, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

Comments: Accepted for publication in Theoretical Computer Science

arXiv:2008.08506 [pdf, ps, other]

doi 10.1007/978-3-030-67731-2_18

Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

Authors: Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, Anna Toffanello

Abstract: The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter-runs (usually referred to as $r$) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the $r$ parameter as a measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate $ρ(v)$, the ratio of $r$ and of the number of runs of the BWT of the reverse of $v$. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as $ρ(v) = O(\log^2(n))$, for any string $v$ of length $n$. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which $ρ(v) = Θ(\log n)$ holds, thus giving the first non-trivial lower bound on $ρ(n)$, the maximum over all strings of length $n$. Our results suggest that $r$ is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 14 pages, 2 figures

Report number: 47th Int. Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021), LNCS 12607: 249--262 (2021)
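
The quantities $r$ and $ρ(v)$ from the abstract above can be computed naively for small strings. The following Python sketch assumes the rotation-based BWT without an end-of-string marker; the paper may use a different BWT variant, and the helper names `bwt`, `runs` and `rho` are illustrative only:

```python
from fractions import Fraction
from itertools import groupby


def bwt(s: str) -> str:
    """Last column of the lexicographically sorted rotations of s (no end marker)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def runs(s: str) -> int:
    """Number of maximal equal-letter runs in s."""
    return sum(1 for _ in groupby(s))


def rho(v: str) -> Fraction:
    """Ratio r(v) / r(reverse(v)) for a non-empty string v."""
    return Fraction(runs(bwt(v)), runs(bwt(v[::-1])))


if __name__ == "__main__":
    print(bwt("banana"), runs(bwt("banana")), rho("banana"))
```

Sorting all rotations takes quadratic space and is only suitable for experimenting with short strings; practical tools build the BWT through suffix sorting instead.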

arXiv:2007.07718 [pdf, ps, other]

On Indexing and Compressing Finite Automata

Authors: Nicola Cotumaccio, Nicola Prezza

Abstract: An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. In this paper, we solve the long-standing problem of indexing arbitrary finite automata. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that states reached by a given string form one interval on the partial order, thus enabling indexing. We provide a lower bound stating that such an interval requires $O(p)$ words to be represented, $p$ being the order's width (i.e. the size of its largest antichain). Indeed, we show that $p$ determines the complexity of several fundamental problems on finite automata: (i) Letting $σ$ be the alphabet size, we provide an encoding for NFAs using $\lceil\log σ\rceil + 2\lceil\log p\rceil + 2$ bits per transition and a smaller encoding for DFAs using $\lceil\log σ\rceil + \lceil\log p\rceil + 2$ bits per transition. This is achieved by generalizing the Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed pattern matching can be solved in $\tilde O(m\cdot p^2)$ query time on NFAs. (iii) We provide a polynomial-time algorithm to index DFAs, while matching the optimal value for $p$. On the other hand, we prove that the problem is NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset construction algorithm for NFA determinization generates an equivalent DFA of size $2^p(n-p+1)-1$, where $n$ is the number of NFA's states.

Submitted 15 July, 2020; originally announced July 2020.
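
The abstract above defines the width $p$ of a partial order as the size of its largest antichain. A brute-force Python sketch of that definition follows; it assumes the strict order is given as a transitively closed set of pairs, runs in exponential time, and is not the algorithm of the paper (the name `width` and the input encoding are hypothetical):

```python
from itertools import combinations


def width(elements, less_than):
    """Size of the largest antichain of a strict partial order.

    less_than is a set of pairs (a, b) meaning a < b, assumed to be
    transitively closed. Exhaustive search: small inputs only.
    """
    def comparable(a, b):
        return (a, b) in less_than or (b, a) in less_than

    # Try subset sizes from largest to smallest and stop at the first antichain.
    for size in range(len(elements), 0, -1):
        for subset in combinations(elements, size):
            if not any(comparable(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


if __name__ == "__main__":
    # Diamond order 1 < {2, 3} < 4: elements 2 and 3 are incomparable.
    diamond = {(1, 2), (1, 3), (1, 4), (2, 4), (3, 4)}
    print(width([1, 2, 3, 4], diamond))
```

A total order has width 1, which corresponds to the Wheeler case; polynomial-time computation of the width is possible through Dilworth's theorem and bipartite matching, which this sketch deliberately avoids for brevity.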

arXiv:2004.01120 [pdf, other]

On Locating Paths in Compressed Tries

Authors: Nicola Prezza

Abstract: In this paper, we consider the problem of compressing a trie while supporting the powerful \emph{locate} queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our first contribution is to propose a suitable generalization of the run-length BWT to tries. We show that this natural generalization enjoys several of the useful properties of its counterpart on strings: in particular, the transform natively supports counting occurrences of a query pattern on the trie's paths and its size $r$ captures the trie's repetitiveness and lower-bounds a natural notion of trie entropy. Our main contribution is a much deeper insight into the combinatorial structure of this object. In detail, we show that a data structure of $O(r\log n) + 2n + o(n)$ bits, where $n$ is the number of nodes, allows locating the $occ$ occurrences of a pattern of length $m$ in nearly-optimal $O(m\logσ+ occ)$ time, where $σ$ is the alphabet's size. Our solution consists in sampling $O(r)$ nodes that can be used as "anchor points" during the locate process. Once the pre-order identifier of the first pattern occurrence (in co-lexicographic order) has been obtained, we show that a constant number of constant-time jumps between those anchor points lead to the identifier of the next pattern occurrence, thus enabling locating in optimal $O(1)$ time per occurrence.

Submitted 16 December, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: Improved toehold lemma running time; added more detailed proofs that take care of all border cases in the locate strategy; postprint version to appear in SODA 2020

arXiv:2002.10303 [pdf, ps, other]

Wheeler Languages

Authors: Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

Abstract: The recently introduced class of Wheeler graphs, inspired by the Burrows-Wheeler Transform (BWT) of a given string, admits an efficient index data structure for searching for subpaths with a given path label, and lifts the applicability of the Burrows-Wheeler transform from strings to languages. In this paper we study the regular languages accepted by automata having a Wheeler graph as transition function, and prove results on determination, Myhill-Nerode characterization, decidability, and closure properties for this class of languages.

Submitted 24 February, 2020; originally announced February 2020.

arXiv:1912.11944 [pdf, ps, other]

doi 10.1016/j.is.2019.03.007

On the Reproducibility of Experiments of Indexing Repetitive Document Collections

Authors: Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, Diego Seco

Abstract: This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.

Submitted 26 December, 2019; originally announced December 2019.

Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Replication framework available at: https://github.com/migumar2/uiHRDC/

Journal ref: Information Systems; Volume 83, July 2019; pages 181-194

arXiv:1910.02151 [pdf, ps, other]

Towards a Definitive Compressibility Measure for Repetitive Sequences

Authors: Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size $γ$ of the smallest string \emph{attractor}, was introduced. The measure $γ\le b$ lower bounds all the previous relevant ones, yet length-$n$ strings can be represented and efficiently indexed within space $O(γ\log\frac{n}γ)$, which also upper bounds most measures. While $γ$ is certainly a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in $o(γ\log n)$ space. In this paper, we study an even smaller measure, $δ\le γ$, which can be computed in linear time, is monotonic, and allows encoding every string in $O(δ\log\frac{n}δ)$ space because $z = O(δ\log\frac{n}δ)$. We show that $δ$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $δ$ can be strictly smaller than $γ$, by up to a logarithmic factor; (2) there are string families needing $Ω(δ\log\frac{n}δ)$ space to be encoded, so this space is optimal for every $n$ and $δ$; (3) one can build run-length context-free grammars of size $O(δ\log\frac{n}δ)$, whereas the smallest (non-run-length) grammar can be up to $Θ(\log n/\log\log n)$ times larger; and (4) within $O(δ\log\frac{n}δ)$ space we can not only...

Submitted 15 January, 2021; v1 submitted 4 October, 2019; originally announced October 2019.
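
The abstract above does not spell out how $δ$ is defined. In the paper it is the substring complexity $δ = \max_k d_k(S)/k$, where $d_k(S)$ is the number of distinct length-$k$ substrings of $S$. The Python sketch below evaluates that formula directly; it is a naive computation for illustration, not the linear-time algorithm the abstract mentions, and the function name `delta` is an assumption:

```python
from fractions import Fraction


def delta(s: str) -> Fraction:
    """Substring complexity: max over k of (#distinct length-k substrings) / k."""
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        # d_k: number of distinct substrings of length k.
        d_k = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(d_k, k))
    return best


if __name__ == "__main__":
    print(delta("abab"), delta("aaaa"))
```

Exact fractions are used so that the maximum is not affected by floating-point rounding.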

arXiv:1908.04686 [pdf, ps, other]

Space-Efficient Construction of Compressed Suffix Trees

Authors: Nicola Prezza, Giovanna Rosone

Abstract: We show how to build several data structures of central importance to string processing, taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let $n$ be the text length and $σ$ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in $O(n\logσ)$ time using just $o(n\logσ)$ bits of working space on top of the input BWT. Using these algorithms as building blocks, for any parameter $0 < ε\leq 1$ we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in $O\left(n(\logσ+ ε^{-1}\cdot \log\log n)\right)$ time using at most $n\logσ\cdot(ε+ o(1))$ bits of working space on top of the input BWT and the output. In particular, this implies that we can build a compressed suffix tree from the BWT using just succinct working space (i.e. $o(n\logσ)$ bits) and any time in $Θ(n\logσ) + ω(n\log\log n)$. This improves the previous most space-efficient algorithms, which worked in $O(n)$ bits and $O(n\log n)$ time. We also consider the problem of merging BWTs of string collections, and provide a solution running in $O(n\logσ)$ time and using just $o(n\logσ)$ bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as $n$ bits on top of a packed representation of the input/output and processes data as fast as $2.92$ megabases per second.

Submitted 12 August, 2019; originally announced August 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1901.05226

arXiv:1902.01088 [pdf, other]

doi 10.1137/1.9781611975994.55

Regular Languages meet Prefix Sorting

Authors: Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

Abstract: Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017] -- which extends naturally the concept of prefix sorting to labeled graphs -- we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. Interestingly, we characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: when sorted, the strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with $n$ states admits an equivalent Wheeler DFA (WDFA) with at most $2n-1-|Σ|$ states that can be computed in $O(n^3)$ time. This is in sharp contrast with general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, an $O(n\log n)$-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By contribution (i), our algorithms can also be used to index any WNFA at the moderate price of doubling the automaton's size. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language as any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in $O(n\log n)$ time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time.

Submitted 9 July, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

Comments: added minimization theorems; uploaded submitted version; New version with new results (W-MH theorem, linear determinization), added author: Giovanna D'Agostino

arXiv:1901.05226 [pdf, other]

Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

Authors: Nicola Prezza, Giovanna Rosone

Abstract: We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1, σ] can be computed from the Burrows-Wheeler transformed collection in O(n log σ) time using o(n log σ) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.

Submitted 22 January, 2019; v1 submitted 16 January, 2019; originally announced January 2019.

arXiv:1811.12779 [pdf, other]

Optimal-Time Dictionary-Compressed Indexes

Authors: Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Abstract: We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based on \emph{locally-consistent parsing}. In more detail, let $γ$ be the size of the smallest attractor for a text $T$ of length $n$. The measure $γ$ is an (asymptotic) lower bound to the size of dictionary compressors based on Lempel--Ziv, context-free grammars, and many others. The smallest known text representations in terms of attractors use space $O(γ\log(n/γ))$, and our lightest indexes work within the same asymptotic space. Let $ε>0$ be a suitably small constant fixed at construction time, $m$ be the pattern length, and $occ$ be the number of its text occurrences. Our index counts pattern occurrences in $O(m+\log^{2+ε}n)$ time, and locates them in $O(m+(occ+1)\log^εn)$ time. These times already outperform those of most dictionary-compressed indexes, while obtaining the least asymptotic space for any index searching within $O((m+occ)\,\textrm{polylog}\,n)$ time. Further, by increasing the space to $O(γ\log(n/γ)\log^εn)$, we reduce the locating time to the optimal $O(m+occ)$, and within $O(γ\log(n/γ)\log n)$ space we can also count in optimal $O(m)$ time. No dictionary-compressed index had obtained this time before. All our indexes can be constructed in $O(n)$ space and $O(n\log n)$ expected time. As a byproduct of independent interest...

Submitted 4 September, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

arXiv:1811.01209 [pdf, other]

Optimal Rank and Select Queries on Dictionary-Compressed Text

Authors: Nicola Prezza

Abstract: We study the problem of supporting queries on a string $S$ of length $n$ within a space bounded by the size $γ$ of a string attractor for $S$. Recent works showed that random access on $S$ can be supported in optimal $O(\log(n/γ)/\log\log n)$ time within $O\left (γ \rm{polylog}\ n \right)$ space. In this paper, we extend this result to \emph{rank} and \emph{select} queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a $\log\log n$ time-factor in \emph{select} queries. We also provide matching lower and upper bounds for \emph{partial sum} and \emph{predecessor} queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.

Submitted 21 December, 2018; v1 submitted 3 November, 2018; originally announced November 2018.

Comments: improved select bound with reduction to psum. Added lower bounds on trees

arXiv:1809.02792 [pdf, other]

doi 10.1145/3375890

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

Abstract: Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.

Submitted 4 July, 2019; v1 submitted 8 September, 2018; originally announced September 2018.

Comments: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma))

arXiv:1805.01876 [pdf, other]

Detecting Mutations by eBWT

Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone

Abstract: In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.

Submitted 10 May, 2018; v1 submitted 4 May, 2018; originally announced May 2018.

Comments: simplified Proposition 4; extended Thm 2 to ambiguous clusters

arXiv:1803.09520 [pdf, other]

doi 10.1016/j.tcs.2018.09.007

Universal Compressed Text Indexing

Authors: Gonzalo Navarro, Nicola Prezza

Abstract: The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $γ$ be the size of a string attractor for a text of length $n$. Our index takes $O(γ\log(n/γ))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^εn)$ time, for any constant $ε>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.

Submitted 6 September, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

Comments: Fixed with reviewer's comments

arXiv:1803.09517 [pdf, ps, other]

On the Approximation Ratio of Ordered Parsings

Authors: Gonzalo Navarro, Carlos Ochoa, Nicola Prezza

Abstract: Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in the Lempel-Ziv parse of the text, which is the optimal one when phrases can be copied only from the left. While $z$ can be computed in linear time with a greedy algorithm, almost nothing has been known for decades about its approximation ratio with respect to $b$. In this paper we prove that $z=O(b\log(n/b))$, where $n$ is the text length. We also show that the bound is tight as a function of $n$, by exhibiting a text family where $z = Ω(b\log n)$. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating $b$ with $r$, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We proceed by observing that Lempel-Ziv is just one particular case of greedy parses, meaning that the optimal value of $z$ is obtained by scanning the text and maximizing the phrase length at each step, and of ordered parses, meaning that there is an increasing order between phrases and their sources. As a new example of ordered greedy parses, we introduce {\em lexicographical} parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size $v$ of the optimal lexicographical parse is also obtained greedily in $O(n)$ time, that $v=O(b\log(n/b))$, and that there exists a text family where $v = Ω(b\log n)$.

Submitted 25 October, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

arXiv:1803.01723 [pdf, ps, other]

Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Authors: Nicola Prezza

Abstract: We consider the problem of encoding a string of length $n$ from an integer alphabet of size $σ$ so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take $n\logσ+ Θ(\log (n\logσ))$ bits. We describe a new data structure matching this lower bound when $σ\leq n^{O(1)}$ while supporting both queries in optimal $O(1)$ time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of $Θ(\log n)$ bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in $O(n\log n)$ time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in $O(n^{1.5}\sqrt{\log σ})$ time. Running times of these Las Vegas algorithms hold in the worst case with high probability.

Submitted 11 May, 2020; v1 submitted 5 March, 2018; originally announced March 2018.

Comments: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithms

arXiv:1803.01695 [pdf, other]

doi 10.4230/LIPIcs.ESA.2018.52

String Attractors: Verification and Optimization

Authors: Dominik Kempa, Alberto Policriti, Nicola Prezza, Eva Rotenberg

Abstract: String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $Γ\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..σ]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $Γ$. Finding the smallest $k$-attractor is NP-hard for $k\geq3$, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the $k$-attractor problem to a set-cover instance where string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a $k$-attractor in near-optimal time and how to quickly compute exact and approximate solutions. For example, we prove that a minimum $3$-attractor can be found in optimal $O(n)$ time when $σ\in O(\sqrt[3+ε]{\log n})$ for any constant $ε>0$, and $2.45$-approximation can be computed in $O(n)$ time on general alphabets. To conclude, we introduce and study the complexity of the closely-related sharp-$k$-attractor problem: to find the smallest set of positions capturing all distinct substrings of length exactly $k$. We show that the problem is in P for $k=1,2$ and is NP-complete for constant $k\geq 3$.

Submitted 17 April, 2018; v1 submitted 5 March, 2018; originally announced March 2018.
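
The $k$-attractor definition in the abstract above can be checked directly by brute force. The sketch below is only a literal reading of that definition, not the near-optimal verification algorithm the paper describes; the function name `is_k_attractor` is hypothetical, and positions are 1-based as in the abstract.

```python
def is_k_attractor(s: str, positions: set, k: int) -> bool:
    """Brute-force check of the k-attractor definition.

    `positions` holds 1-based text positions. Every distinct substring of
    length at most k must have at least one occurrence that covers one of them.
    """
    n = len(s)
    all_substrings = set()
    covered = set()
    for start in range(n):  # 0-based start of an occurrence
        for length in range(1, min(k, n - start) + 1):
            sub = s[start:start + length]
            all_substrings.add(sub)
            # This occurrence spans 1-based positions start+1 .. start+length.
            if any(start + 1 <= p <= start + length for p in positions):
                covered.add(sub)
    return covered == all_substrings


ok = is_k_attractor("abab", {1, 2}, 4)
bad = is_k_attractor("abab", {2}, 4)
```

In the second call the substring "a" occurs only at positions 1 and 3, so position 2 alone cannot be an attractor.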

arXiv:1802.10347 [pdf, other]

Decompressing Lempel-Ziv Compressed Text

Authors: Philip Bille, Mikko Berggren Ettienne, Travis Gagie, Inge Li Gørtz, Nicola Prezza

Abstract: We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $σ$, we describe (i) a trade-off achieving $O(n\log^δσ)$ time and $O(z\log^{1-δ}σ)$ space for any $0\leq δ\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.

Submitted 4 November, 2019; v1 submitted 28 February, 2018; originally announced February 2018.
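
The first folklore solution mentioned in the abstract above, linear time with random access to the whole decompressed text, can be sketched as follows. The phrase encoding used here, a `(source, length, char)` triple with a 0-based source and an optional trailing character, is an assumption made for illustration; the paper's small-space algorithms are not shown.

```python
def lz77_decompress(phrases):
    """Folklore O(n)-time LZ77 decompression with the full output in memory.

    Each phrase is (source, length, char): copy `length` characters starting
    at 0-based output index `source`, then append `char` if it is not None.
    """
    out = []
    for source, length, char in phrases:
        for j in range(length):
            # Copying one character at a time lets a phrase overlap itself.
            out.append(out[source + j])
        if char is not None:
            out.append(char)
    return "".join(out)


text = lz77_decompress([(0, 0, "a"), (0, 0, "b"), (0, 5, None)])
```

The third phrase copies five characters from a source that starts only two characters back, which works because each copied character is appended before the next one is read.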

arXiv:1711.07270 [pdf, ps, other]

A Separation Between Run-Length SLPs and LZ77

Authors: Philip Bille, Travis Gagie, Inge Li Gørtz, Nicola Prezza

Abstract: In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $Ω(\log n/\log\log n)$ smaller than the smallest run-length grammar.

Submitted 20 November, 2017; originally announced November 2017.

arXiv:1710.10964 [pdf, ps, other]

doi 10.1145/3188745.3188814

At the Roots of Dictionary Compression: String Attractors

Authors: Dominik Kempa, Nicola Prezza

Abstract: A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, followed by straight-line programs, run-length Burrows-Wheeler transform, macro schemes, collage systems, and the compact directed acyclic word graph. In this paper, we show that these techniques are different solutions to the same, elegant, combinatorial problem: to find a small set of positions capturing all text's substrings. We call such a set a string attractor. We first show reductions between dictionary compressors and string attractors. This gives the approximation ratios of dictionary compressors with respect to the smallest string attractor and uncovers new relations between the output sizes of different compressors. We show that the $k$-attractor problem: deciding whether a text has a size-$t$ set of positions capturing substrings of length at most $k$, is NP-complete for $k\geq 3$. We provide several approximation techniques for the smallest $k$-attractor, show that the problem is APX-complete for constant $k$, and give strong inapproximability results. To conclude, we provide matching lower and upper bounds for the random access problem on string attractors. The upper bound is proved by showing a data structure supporting queries in optimal time. Our data structure is universal: by our reductions to string attractors, it supports random access on any dictionary-compression scheme. In particular, it matches the lower bound also on LZ77, straight-line programs, collage systems, and macro schemes, and therefore closes (at once) the random access problem for all these compressors.

Submitted 28 May, 2019; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: In Proceedings of 50th Annual ACM SIGACT Symposium on the Theory of Computing (STOC'18)

arXiv:1709.05314 [pdf, ps, other]

String Attractors

Authors: Nicola Prezza

Abstract: Let $S$ be a string of length $n$. In this paper we introduce the notion of \emph{string attractor}: a subset of the string's positions $[1,n]$ such that every distinct substring of $S$ has an occurrence crossing one of the attractor's elements. We first show that the minimum attractor's size yields upper-bounds to the string's repetitiveness as measured by its linguistic complexity and by the length of its longest repeated substring. We then prove that all known compressors for repetitive strings induce a string attractor whose size is bounded by their associated repetitiveness measure, and can therefore be considered as approximations of the smallest one. Using further reductions, we derive the approximation ratios of these compressors with respect to the smallest attractor and solve several open problems related to the asymptotic relations between repetitiveness measures (in particular, between the sizes of the Lempel-Ziv factorization, the run-length Burrows-Wheeler transform, the smallest grammar, and the smallest macro scheme). These reductions directly provide approximation algorithms for the smallest string attractor. We then apply string attractors to solve efficiently a fundamental problem in the field of compressed computation: we present a universal compressed data structure for text extraction that improves existing strategies simultaneously for \emph{all} known dictionary compressors and that, by recent lower bounds, almost matches the optimal running time within the resulting space. To conclude, we consider generalizations of string attractors to labeled graphs, show that the attractor problem is NP-complete on trees, and provide a logarithmic approximation computable in polynomial time.

Submitted 19 September, 2017; v1 submitted 15 September, 2017; originally announced September 2017.

arXiv:1705.10987 [pdf, other]

Succinct Partial Sums and Fenwick Trees

Authors: Philip Bille, Anders Roy Christiansen, Nicola Prezza, Frederik Rye Skjoldjensen

Abstract: We consider the well-studied partial sums problem in succinct space where one is to maintain an array of n k-bit integers subject to updates such that partial sums queries can be efficiently answered. We present two succinct versions of the Fenwick Tree - which is known for its simplicity and practicality. Our results hold in the encoding model where one is allowed to reuse the space from the input data. Our main result is the first that only requires nk + o(n) bits of space while still supporting sum/update in O(log_b n) / O(b log_b n) time where 2 <= b <= log^O(1) n. The second result shows how optimal time for sum/update can be achieved while only slightly increasing the space usage to nk + o(nk) bits. Beyond Fenwick Trees, the results are primarily based on bit-packing and sampling - making them very practical - and they also allow for simple optimal parallelization.

Submitted 31 May, 2017; originally announced May 2017.
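
For readers unfamiliar with the structure the abstract above builds on, here is a minimal sketch of the classical, non-succinct Fenwick tree supporting sum and update in O(log n) time. It is the textbook structure, not the space-efficient variants of the paper; the class and method names are hypothetical.

```python
class FenwickTree:
    """Textbook Fenwick tree over n integers, all initially zero."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-based internal array

    def update(self, index: int, delta: int) -> None:
        """Add delta to the element at 0-based position `index`."""
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # move to the next node responsible for this index

    def prefix_sum(self, count: int) -> int:
        """Sum of the first `count` elements (0-based positions 0..count-1)."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total


ft = FenwickTree(5)
for position, value in enumerate([3, 1, 4, 1, 5]):
    ft.update(position, value)
```

Both operations touch at most one node per bit of the index, which is where the logarithmic bound comes from.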

arXiv:1705.10382 [pdf, other]

Optimal-Time Text Indexing in BWT-runs Bounded Space

Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

Abstract: Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=Ω(\log n)$ bits. Within $O(r\log (n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_σ(n/r))$, we support count and locate in $O(m\log(σ)/w)$ and $O(m\log(σ)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(σ)/w)$. (...continues...)

Submitted 11 July, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

arXiv:1704.08558 [pdf, other]

Practical and Effective Re-Pair Compression

Authors: Philip Bille, Inge Li Gørtz, Nicola Prezza

Abstract: Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+ε)n+\sqrt n$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $ε>0$; in practice however, this solution uses complex sub-procedures preventing it from being practical. In this paper, we present an implementation of the above-mentioned result making use of more practical solutions; our tool further improves the working space to $(1.5+ε)n$ words (text included), for some small constant $ε$. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with $d$ rules is $\log(d!)+2d\approx d\log d+0.557 d$ bits, and the most efficient encoding algorithm in the literature uses at most $d\log d + 2d$ bits and runs in $\mathcal O(d^{1.5})$ time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses---on average---only $2.8\%$ more bits than the information-theoretic minimum. In half of the tested cases, our compressor improves the output size of 7-Zip with maximum compression rate turned on.

Submitted 27 April, 2017; originally announced April 2017.
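
The pair-replacement idea behind Re-Pair, as summarized in the abstract above, can be sketched naively as follows. This is a quadratic-time illustration, not the linear-time, space-efficient algorithm of the paper: the names `repair` and `expand` are hypothetical, nonterminals are represented as `("R", id)` tuples, and pair frequencies are counted with overlaps, a simplification that affects only which pair is chosen.

```python
from collections import Counter


def repair(text: str):
    """Naive Re-Pair: repeatedly replace a most frequent adjacent pair."""
    seq = list(text)
    rules = {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so the grammar is final
        symbol = ("R", len(rules))
        rules[symbol] = pair
        out, i = [], 0
        while i < len(seq):
            # Greedy left-to-right replacement of non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules) -> str:
    """Expand a terminal or nonterminal back into the string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


seq, rules = repair("abababab")
restored = "".join(expand(s, rules) for s in seq)
```

Expanding the final sequence with the collected rules reproduces the input, which is a convenient sanity check for any variant of the pair-selection rule.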

arXiv:1702.01340 [pdf, other]

From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back

Authors: Alberto Policriti, Nicola Prezza

Abstract: The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing, as their sizes $z$ and $r$ are closely related to the amount of text self-repetitiveness. In this paper we consider the problem of converting the two representations into each other within a working space proportional to the input and the output. Let $n$ be the text length. We show that $RLBWT$ can be converted to $LZ77$ in $\mathcal{O}(n\log r)$ time and $\mathcal{O}(r)$ words of working space. Conversely, we provide an algorithm to convert $LZ77$ to $RLBWT$ in $\mathcal{O}\big(n(\log r + \log z)\big)$ time and $\mathcal{O}(r+z)$ words of working space. Note that $r$ and $z$ can be \emph{constant} if the text is highly repetitive, and our algorithms can operate with (up to) \emph{exponentially} less space than naive solutions based on full decompression.

Submitted 4 February, 2017; originally announced February 2017.

arXiv:1701.07238 [pdf, other]

A Framework of Dynamic Data Structures for String Processing

Authors: Nicola Prezza

Abstract: In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM-indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of four recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.

Submitted 25 January, 2017; originally announced January 2017.

arXiv:1611.01479 [pdf, other]

Space-Efficient Re-Pair Compression

Authors: Philip Bille, Inge Li Gørtz, Nicola Prezza

Abstract: Re-Pair is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $σ$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and $5n + 4σ^2 + 4d + \sqrt{n}$ words of working space on top of the text. In this work, we propose two algorithms improving on the space of their original solution. Our model assumes a memory word of $\lceil\log_2 n\rceil$ bits and a re-writable input text composed of $n$ such words. Our first algorithm runs in expected $\mathcal O(n/ε)$ time and uses $(1+ε)n +\sqrt n$ words of space on top of the text for any parameter $0<ε\leq 1$ chosen in advance. Our second algorithm runs in expected $\mathcal O(n\log n)$ time and improves the space to $n +\sqrt n$ words.

Submitted 4 November, 2016; originally announced November 2016.

arXiv:1608.05100 [pdf, ps, other]

In-Place Sparse Suffix Sorting

Authors: Nicola Prezza

Abstract: Suffix arrays encode the lexicographical order of all suffixes of a text and are often combined with the Longest Common Prefix array (LCP) to simulate navigational queries on the suffix tree in reduced space. In space-critical applications such as sparse and compressed text indexing, only information regarding the lexicographical order of a size-$b$ subset of all $n$ text suffixes is often needed. Such information can be stored space-efficiently (in $b$ words) in the sparse suffix array (SSA). The SSA and its relative sparse LCP array (SLCP) can be used as a space-efficient substitute of the sparse suffix tree. Very recently, Gawrychowski and Kociumaka [SODA 2017] showed that the sparse suffix tree (and therefore SSA and SLCP) can be built in asymptotically optimal $O(b)$ space with a Monte Carlo algorithm running in $O(n)$ time. The main reason for using the SSA and SLCP arrays in place of the sparse suffix tree is, however, their reduced space of $b$ words each. This leads naturally to the quest for in-place algorithms building these arrays. Franceschini and Muthukrishnan [ICALP 2007] showed that the full suffix array can be built in-place and in optimal running time. On the other hand, finding sub-quadratic in-place algorithms for building the SSA and SLCP for \emph{general} subsets of suffixes has been an elusive task for decades. In this paper, we give the first solution to this problem. We provide the first in-place algorithm building the full LCP array in $O(n\log n)$ expected time and the first Monte Carlo in-place algorithms building the SSA and SLCP in $O(n + b\log^2 n)$ expected time. We moreover describe the first in-place solution for the suffix selection problem: to compute the $i$-th smallest text suffix.

Submitted 1 November, 2017; v1 submitted 17 August, 2016; originally announced August 2016.

Comments: ACM-SIAM Symposium on Discrete Algorithms 2018; arXiv admin note: text overlap with arXiv:1607.06660 Comment: new style (lipics); using Heath-Brown theorem for number of primes in Z; improved bounds for LCP array computation and sparse suffix sorting; added construction of the LCE structure using radix sort; added reference to lower bound for LCE query times; uploaded version accepted at SODA 2018

arXiv:1607.06660 [pdf, ps, other]

Fast Longest Common Extensions in Small Space

Authors: Alberto Policriti, Nicola Prezza

Abstract: In this paper we address the longest common extension (LCE) problem: to compute the length $\ell$ of the longest common prefix between any two suffixes of $T\in Σ^n$ with $ Σ= \{0, \ldots σ-1\} $. We present two fast and space-efficient solutions based on (Karp-Rabin) \textit{fingerprinting} and \textit{sampling}. Our first data structure exploits properties of Mersenne prime numbers when used as moduli of the Karp-Rabin hash function and takes $n\lceil \log_2σ\rceil$ bits of space. Our second structure works with any prime modulus and takes $n\lceil \log_2σ\rceil + n/w + w\log_2 n$ bits of space ($ w $ memory-word size). Both structures support $\mathcal O\left(m\logσ/w \right)$-time extraction of any length-$m$ text substring, $\mathcal O(\log\ell)$-time LCE queries with high probability, and can be built in optimal $\mathcal O(n)$ time. In the first case, ours is the first result showing that it is possible to answer LCE queries in $o(n)$ time while using only $\mathcal O(1)$ words on top of the space required to store the text. Our results improve the state of the art in space usage, query times, and preprocessing times and are extremely practical: we present a C++ implementation that is very fast and space-efficient in practice.

Submitted 22 July, 2016; originally announced July 2016.
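
As a rough illustration of the fingerprinting idea in the abstract above, the following sketch answers LCE queries by comparing Karp-Rabin fingerprints with an exponential search followed by a binary search. It is not the paper's data structure: it stores all prefix fingerprints explicitly (so it uses far more space than the bounds quoted above), and the names `LCEIndex`, `MOD`, and `BASE`, the fixed base, and the choice of the Mersenne prime 2^61 - 1 are assumptions made only for this example.

```python
# Illustrative sketch only: prefix fingerprints are stored explicitly,
# which is NOT the space-efficient layout described in the abstract.
MOD = (1 << 61) - 1   # a Mersenne prime (assumed modulus for this sketch)
BASE = 1_000_003      # fixed base for reproducibility; Karp-Rabin normally picks it at random


class LCEIndex:
    def __init__(self, text: str):
        self.n = len(text)
        self.pref = [0] * (self.n + 1)   # pref[k] = fingerprint of text[:k]
        self.pow = [1] * (self.n + 1)    # pow[k] = BASE**k mod MOD
        for k, ch in enumerate(text):
            self.pref[k + 1] = (self.pref[k] * BASE + ord(ch) + 1) % MOD
            self.pow[k + 1] = (self.pow[k] * BASE) % MOD

    def fingerprint(self, i: int, length: int) -> int:
        """Fingerprint of text[i : i + length]."""
        return (self.pref[i + length] - self.pref[i] * self.pow[length]) % MOD

    def lce(self, i: int, j: int) -> int:
        """Length of the longest common prefix of text[i:] and text[j:]
        (correct unless two different substrings collide under the hash)."""
        limit = self.n - max(i, j)
        # Exponential search: double the length while the fingerprints agree.
        hi = 1
        while hi < limit and self.fingerprint(i, hi) == self.fingerprint(j, hi):
            hi *= 2
        lo = hi // 2          # a length already known to match
        hi = min(hi, limit)
        # Binary search for the largest matching length in [lo, hi].
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.fingerprint(i, mid) == self.fingerprint(j, mid):
                lo = mid
            else:
                hi = mid - 1
        return lo
```

For example, `LCEIndex("abracadabra").lce(0, 7)` compares the suffixes `abracadabra` and `abra`. The exponential phase keeps the number of fingerprint comparisons logarithmic in the answer rather than in the text length.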

arXiv:1604.06002 [pdf, other]

Practical combinations of repetition-aware data structures

Authors: Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

Abstract: Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors. Such variants use, respectively, the RLBWT of a string and the RLBWT of its reverse, or just one RLBWT inside a bidirectional index, or just one RLBWT with support for unidirectional extraction. We also study the practical advantages of combining RLBWT with the compact directed acyclic word graph of a string, a data structure that takes space proportional to the number of one-character extensions of maximal repeats. Our approaches are easy to implement, and provide competitive tradeoffs on significant datasets.

Submitted 21 April, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

Comments: arXiv admin note: text overlap with arXiv:1502.05937
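
To make the "number of runs in the Burrows-Wheeler transform" from the abstract above concrete, here is a minimal sketch that builds the BWT by sorting all rotations and then run-length encodes it. The function names and the `$` sentinel are assumptions for this example, and sorting rotations is a deliberately naive construction that practical indexes avoid.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration).
    Assumes the sentinel does not occur in text and sorts before every symbol."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each maximal run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


transformed = bwt("banana")
runs = run_length_encode(transformed)
print(transformed, runs, len(runs))
```

The length of `runs` is the quantity that governs the size of a run-length encoded BWT: on repetitive inputs it can be much smaller than the length of the string.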

arXiv:1510.06257 [pdf, other]

Computing LZ77 in Run-Compressed Space

Authors: Nicola Prezza, Alberto Policriti

Abstract: In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing.

Submitted 21 October, 2015; originally announced October 2015.
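
For readers unfamiliar with the parsing whose size z appears in the abstract above, the sketch below computes one common variant of the LZ77 factorization: each factor is either a fresh character or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed). This is a naive quadratic-time illustration that keeps the whole text in memory, not the run-compressed-space algorithm of the paper. The function name and the output format are assumptions for this example.

```python
def lz77_factorize(text: str) -> list:
    """Naive LZ77 factorization (illustrative variant).
    Returns a list whose items are either a single new character
    or a (source_position, length) pair copying from earlier text."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position and keep the longest match.
        for src in range(i):
            length = 0
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            factors.append(text[i])          # character not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


print(lz77_factorize("abababab"))
```

On the highly repetitive input `abababab`, the parsing consists of two fresh characters followed by a single self-overlapping copy, so z stays small while the text grows.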

Showing 1–50 of 52 results for author: Prezza, N