Showing 1–50 of 52 results for author: Prezza, N

  1. arXiv:2406.02763  [pdf, ps, other]

    cs.FL cs.DS

    Indexing Finite-State Automata Using Forward-Stable Partitions

    Authors: Ruben Becker, Sung-Hwan Kim, Nicola Prezza, Carlo Tosoni

    Abstract: An index on a finite-state automaton is a data structure able to locate specific patterns on the automaton's paths and consequently on the regular language accepted by the automaton itself. Cotumaccio and Prezza [SODA '21] introduced a data structure able to solve pattern matching queries on automata, generalizing the famous FM-index for strings of Ferragina and Manzini [FOCS '00]. The efficiency…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 20 pages, 3 figures, submitted to SPIRE 2024

  2. arXiv:2404.14235  [pdf, other]

    cs.DS

    Computing the LCP Array of a Labeled Graph

    Authors: Jarno Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza

    Abstract: The LCP array is an important tool in stringology, allowing to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the definition of this array to Wheeler DFAs and, ultimately, to arbitrary labeled graphs, proving that it can be used to efficiently solve matching statistics queri…

    Submitted 22 April, 2024; originally announced April 2024.

  3. arXiv:2312.01359  [pdf, other]

    cs.DS

    Suffixient Sets

    Authors: Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, Nicola Prezza

    Abstract: We define a suffixient set for a text $T [1..n]$ to be a set $S$ of positions between 1 and $n$ such that, for any edge descending from a node $u$ to a node $v$ in the suffix tree of $T$, there is an element $s \in S$ such that $u$'s path label is a suffix of $T [1..s - 1]$ and $T [s]$ is the first character of $(u, v)$'s edge label. We first show there is a suffixient set of cardinality at most…

    Submitted 4 June, 2024; v1 submitted 3 December, 2023; originally announced December 2023.

  4. arXiv:2310.17980  [pdf, other]

    cs.DS

    Sketching and Streaming for Dictionary Compression

    Authors: Ruben Becker, Matteo Canton, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza

    Abstract: We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicativ…

    Submitted 9 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

  5. arXiv:2307.07267  [pdf, other]

    cs.DS

    Random Wheeler Automata

    Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, Nicola Prezza

    Abstract: Wheeler automata were introduced in 2017 as a tool to generalize existing indexing and compression techniques based on the Burrows-Wheeler transform. Intuitively, an automaton is said to be Wheeler if there exists a total order on its states reflecting the co-lexicographic order of the strings labeling the automaton's paths; this property makes it possible to represent the automaton's topology in…

    Submitted 7 June, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: 17 pages, 3 figures

  6. arXiv:2306.05684  [pdf, ps, other]

    cs.DS

    Space-time Trade-offs for the LCP Array of Wheeler DFAs

    Authors: Nicola Cotumaccio, Travis Gagie, Dominik Köppl, Nicola Prezza

    Abstract: Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires $O(n \log n)$ bits, $n$ being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particu…

    Submitted 9 June, 2023; originally announced June 2023.

  7. arXiv:2306.04737  [pdf, other]

    cs.FL

    Optimal Wheeler Language Recognition

    Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Alberto Policriti, Nicola Prezza

    Abstract: A Wheeler automaton is a finite state automaton whose states admit a total Wheeler order, reflecting the co-lexicographic order of the strings labeling source-to-node paths. A Wheeler language is a regular language admitting an accepting Wheeler automaton. Wheeler languages admit efficient and elegant solutions to hard problems such as automata compression and regular expression matching, therefor…

    Submitted 18 December, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

  8. arXiv:2305.05129  [pdf, other]

    cs.DS

    Sorting Finite Automata via Partition Refinement

    Authors: Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, Nicola Prezza

    Abstract: Wheeler nondeterministic finite automata (WNFAs) were introduced as a generalization of prefix sorting from strings to labeled graphs. WNFAs admit optimal solutions to classic hard problems on labeled graphs and languages. The problem of deciding whether a given NFA is Wheeler is known to be NP-complete. Recently, however, Alanko et al. showed how to side-step this complexity by switching to preor…

    Submitted 18 December, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

  9. arXiv:2305.03626  [pdf, other]

    cs.LG cs.CR cs.LO stat.ML

    Verifiable Learning for Robust Tree Ensembles

    Authors: Stefano Calzavara, Lorenzo Cazzaro, Giulio Ermanno Pibiri, Nicola Prezza

    Abstract: Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security ve…

    Submitted 11 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 19 pages, 5 figures; full version of the revised paper accepted at ACM CCS 2023 with corrected typo in footnote 1

  10. arXiv:2304.10962  [pdf, other]

    cs.DS

    Faster Prefix-Sorting Algorithms for Deterministic Finite Automata

    Authors: Sung-Hwan Kim, Francisco Olivares, Nicola Prezza

    Abstract: Sorting is a fundamental algorithmic pre-processing technique which often allows to represent data more compactly and, at the same time, speeds up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these tech…

    Submitted 21 April, 2023; originally announced April 2023.

  11. arXiv:2301.05338  [pdf, ps, other]

    cs.DS

    Computing matching statistics on Wheeler DFAs

    Authors: Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, Marinella Sciortino

    Abstract: Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we sho…

    Submitted 12 January, 2023; originally announced January 2023.

  12. arXiv:2301.00754  [pdf, other]

    cs.DS

    Algorithms for Massive Data -- Lecture Notes

    Authors: Nicola Prezza

    Abstract: These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The goal of this course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. There are two main solutions to deal with massive data: (lossless) compressed data structures and (lossy) data sketches. These not…

    Submitted 25 March, 2024; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: added chapter 1 on compressed data structures. Fixed a few mistakes (Bloom filter analysis) and typos

  13. arXiv:2208.04931  [pdf, ps, other]

    cs.FL cs.DS

    Co-lexicographically Ordering Automata and Regular Languages -- Part I

    Authors: Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

    Abstract: In the present work, we lay out a new theory showing that all automata can always be co-lexicographically partially ordered, and an intrinsic measure of their complexity can be defined and effectively determined, namely, the minimum width $p$ of one of their admissible co-lex partial orders - dubbed here the automaton's co-lex width. We first show that this new measure captures at once the complex…

    Submitted 3 May, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: arXiv admin note: text overlap with arXiv:2106.02309

  14. arXiv:2111.02480  [pdf, ps, other]

    cs.DS

    Linear-time Minimization of Wheeler DFAs

    Authors: Jarno Alanko, Nicola Cotumaccio, Nicola Prezza

    Abstract: Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just $\log σ + O(1)$ bits per edge, $σ$ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to ac…

    Submitted 3 November, 2021; originally announced November 2021.

  15. arXiv:2111.02478  [pdf, other]

    cs.DS

    HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances

    Authors: Dominik Köppl, Gonzalo Navarro, Nicola Prezza

    Abstract: We propose a new representation of the offsets of the Lempel-Ziv (LZ) factorization based on the co-lexicographic order of the processed prefixes. The selected offsets tend to approach the k-th order empirical entropy. Our evaluations show that this choice of offsets is superior to the rightmost LZ parsing and the bit-optimal LZ parsing on datasets with small high-order entropy.

    Submitted 3 November, 2021; originally announced November 2021.

  16. arXiv:2106.02309  [pdf, ps, other]

    cs.FL cs.CL

    On (co-lex) Ordering Automata

    Authors: Giovanna D'Agostino, Nicola Cotumaccio, Alberto Policriti, Nicola Prezza

    Abstract: The states of a deterministic finite automaton A can be identified with collections of words in Pf(L(A)) -- the set of prefixes of words belonging to the regular language accepted by A. But words can be ordered and among the many possible orders a very natural one is the co-lexicographic one. Such naturalness stems from the fact that it suggests a transfer of the order from words to the automaton'…

    Submitted 4 June, 2021; originally announced June 2021.

  17. arXiv:2102.06798  [pdf, ps, other]

    cs.FL cs.DS

    Co-lexicographically Ordering Automata and Regular Languages -- Part II

    Authors: Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

    Abstract: In the present work, we tackle the regular language indexing problem by first studying the hierarchy of $p$-sortable languages: regular languages accepted by automata of width $p$. We show that the hierarchy is strict and does not collapse, and provide (exponential in $p$) upper and lower bounds relating the minimum widths of equivalent NFAs and DFAs. Our bounds indicate the importance of being ab…

    Submitted 10 March, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

  18. arXiv:2011.10008  [pdf, other]

    cs.DS

    Subpath Queries on Compressed Graphs: a Survey

    Authors: Nicola Prezza

    Abstract: Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text $T$, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in $T$ in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two order…

    Submitted 13 December, 2020; v1 submitted 19 November, 2020; originally announced November 2020.

    Comments: Fixed some typos and references to Boyer-Moore-Galil's and Apostolico-Giancarlo's algorithms

  19. arXiv:2011.07143  [pdf, ps, other]

    cs.DS

    Adaptive Learning of Compressible Strings

    Authors: Gabriele Fici, Nicola Prezza, Rossano Venturini

    Abstract: Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle $σn/4 - O(n)$ queries in order to be able to reconstruct the hidden string, where $σ$ is the size of the alphabet of $S$ and $n$ its length,…

    Submitted 19 October, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

    Comments: Accepted for publication in Theoretical Computer Science

  20. Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

    Authors: Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, Anna Toffanello

    Abstract: The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its nu…

    Submitted 19 August, 2020; originally announced August 2020.

    Comments: 14 pages, 2 figures

    Report number: 47th Int. Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021), LNCS 12607: 249--262 (2021)

  21. arXiv:2007.07718  [pdf, ps, other]

    cs.DS

    On Indexing and Compressing Finite Automata

    Authors: Nicola Cotumaccio, Nicola Prezza

    Abstract: An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. In this paper, we solve the long-standing problem of indexing arbitrary finite automata. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that…

    Submitted 15 July, 2020; originally announced July 2020.

  22. arXiv:2004.01120  [pdf, other]

    cs.DS

    On Locating Paths in Compressed Tries

    Authors: Nicola Prezza

    Abstract: In this paper, we consider the problem of compressing a trie while supporting the powerful \emph{locate} queries: to return the pre-order identifiers of all nodes reached by a path labeled with a given query pattern. Our result builds on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018, JACM 2020] based on…

    Submitted 16 December, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

    Comments: Improved toehold lemma running time; added more detailed proofs that take care of all border cases in the locate strategy; postprint version to appear in SODA 2020

  23. arXiv:2002.10303  [pdf, ps, other]

    cs.FL

    Wheeler Languages

    Authors: Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

    Abstract: The recently introduced class of Wheeler graphs, inspired by the Burrows-Wheeler Transform (BWT) of a given string, admits an efficient index data structure for searching for subpaths with a given path label, and lifts the applicability of the Burrows-Wheeler transform from strings to languages. In this paper we study the regular languages accepted by automata having a Wheeler graph as transition…

    Submitted 24 February, 2020; originally announced February 2020.

  24. On the Reproducibility of Experiments of Indexing Repetitive Document Collections

    Authors: Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, Diego Seco

    Abstract: This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work [5]. In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe…

    Submitted 26 December, 2019; originally announced December 2019.

    Comments: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Replication framework available at: https://github.com/migumar2/uiHRDC/

    Journal ref: Information Systems; Volume 83, July 2019; pages 181-194

  25. arXiv:1910.02151  [pdf, ps, other]

    cs.DS

    Towards a Definitive Compressibility Measure for Repetitive Sequences

    Authors: Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

    Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better wh…

    Submitted 15 January, 2021; v1 submitted 4 October, 2019; originally announced October 2019.

  26. arXiv:1908.04686  [pdf, ps, other]

    cs.DS

    Space-Efficient Construction of Compressed Suffix Trees

    Authors: Nicola Prezza, Giovanna Rosone

    Abstract: We show how to build several data structures of central importance to string processing, taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let $n$ be the text length and $σ$ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in $O(n\log σ)$ time using just $o(n\log σ)$ bits of working space on top of th…

    Submitted 12 August, 2019; originally announced August 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1901.05226

  27. Regular Languages meet Prefix Sorting

    Authors: Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza

    Abstract: Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et…

    Submitted 9 July, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: added minimization theorems; uploaded submitted version; New version with new results (W-MH theorem, linear determinization), added author: Giovanna D'Agostino

  28. arXiv:1901.05226  [pdf, other]

    cs.DS

    Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

    Authors: Nicola Prezza, Giovanna Rosone

    Abstract: We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1, σ] can be computed from the Burrows-Wheeler transformed collection in O(n log σ) time using o(n log σ) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bi…

    Submitted 22 January, 2019; v1 submitted 16 January, 2019; originally announced January 2019.

  29. arXiv:1811.12779  [pdf, other]

    cs.DS

    Optimal-Time Dictionary-Compressed Indexes

    Authors: Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

    Abstract: We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings, including \emph{string attractors} --- new combinatorial objects encompassing most known compressibility measures for highly repetitive texts ---, and grammars based…

    Submitted 4 September, 2019; v1 submitted 30 November, 2018; originally announced November 2018.

  30. arXiv:1811.01209  [pdf, other]

    cs.DS

    Optimal Rank and Select Queries on Dictionary-Compressed Text

    Authors: Nicola Prezza

    Abstract: We study the problem of supporting queries on a string $S$ of length $n$ within a space bounded by the size $γ$ of a string attractor for $S$. Recent works showed that random access on $S$ can be supported in optimal $O(\log(n/γ)/\log\log n)$ time within $O\left (γ \rm{polylog}\ n \right)$ space. In this paper, we extend this result to \emph{rank} and \emph{select} queries and provide lower bounds…

    Submitted 21 December, 2018; v1 submitted 3 November, 2018; originally announced November 2018.

    Comments: improved select bound with reduction to psum. Added lower bounds on trees

  31. Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

    Abstract: Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) s…

    Submitted 4 July, 2019; v1 submitted 8 September, 2018; originally announced September 2018.

    Comments: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma))

  32. arXiv:1805.01876  [pdf, other]

    cs.DS

    Detecting Mutations by eBWT

    Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone

    Abstract: In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the…

    Submitted 10 May, 2018; v1 submitted 4 May, 2018; originally announced May 2018.

    Comments: simplified Proposition 4; extended Thm 2 to ambiguous clusters

  33. Universal Compressed Text Indexing

    Authors: Gonzalo Navarro, Nicola Prezza

    Abstract: The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-l…

    Submitted 6 September, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: Fixed with reviewer's comments

  34. arXiv:1803.09517  [pdf, ps, other]

    cs.DS

    On the Approximation Ratio of Ordered Parsings

    Authors: Gonzalo Navarro, Carlos Ochoa, Nicola Prezza

    Abstract: Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in…

    Submitted 25 October, 2019; v1 submitted 26 March, 2018; originally announced March 2018.

  35. arXiv:1803.01723  [pdf, ps, other]

    cs.DS

    Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

    Authors: Nicola Prezza

    Abstract: We consider the problem of encoding a string of length $n$ from an integer alphabet of size $σ$ so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take $n\log σ + Θ(\log (n\log σ))$ bits. We describe a new data structure matching this lower bound when…

    Submitted 11 May, 2020; v1 submitted 5 March, 2018; originally announced March 2018.

    Comments: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithms

  36. String Attractors: Verification and Optimization

    Authors: Dominik Kempa, Alberto Policriti, Nicola Prezza, Eva Rotenberg

    Abstract: String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $Γ\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..σ]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $Γ$. Finding the smallest $k$-attractor is NP-h…

    Submitted 17 April, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

  37. arXiv:1802.10347  [pdf, other]

    cs.DS

    Decompressing Lempel-Ziv Compressed Text

    Authors: Philip Bille, Mikko Berggren Ettienne, Travis Gagie, Inge Li Gørtz, Nicola Prezza

    Abstract: We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ i…

    Submitted 4 November, 2019; v1 submitted 28 February, 2018; originally announced February 2018.

  38. arXiv:1711.07270  [pdf, ps, other]

    cs.DS

    A Separation Between Run-Length SLPs and LZ77

    Authors: Philip Bille, Travis Gagie, Inge Li Gørtz, Nicola Prezza

    Abstract: In this paper we give an infinite family of strings for which the length of the Lempel-Ziv'77 parse is a factor $Ω(\log n/\log\log n)$ smaller than the smallest run-length grammar.

    Submitted 20 November, 2017; originally announced November 2017.

  39. At the Roots of Dictionary Compression: String Attractors

    Authors: Dominik Kempa, Nicola Prezza

    Abstract: A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary compressors: algorithms able to reduce the text's size by exploiting its repetitiveness. Lempel-Ziv 77 is one of the most successful and well-known tools of this kind, f…

    Submitted 28 May, 2019; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: In Proceedings of 50th Annual ACM SIGACT Symposium on the Theory of Computing (STOC'18)

  40. arXiv:1709.05314  [pdf, ps, other]

    cs.DS

    String Attractors

    Authors: Nicola Prezza

    Abstract: Let $S$ be a string of length $n$. In this paper we introduce the notion of \emph{string attractor}: a subset of the string's positions $[1,n]$ such that every distinct substring of $S$ has an occurrence crossing one of the attractor's elements. We first show that the minimum attractor's size yields upper-bounds to the string's repetitiveness as measured by its linguistic complexity and by the len…

    Submitted 19 September, 2017; v1 submitted 15 September, 2017; originally announced September 2017.

  41. arXiv:1705.10987  [pdf, other]

    cs.DS

    Succinct Partial Sums and Fenwick Trees

    Authors: Philip Bille, Anders Roy Christiansen, Nicola Prezza, Frederik Rye Skjoldjensen

    Abstract: We consider the well-studied partial sums problem in succinct space where one is to maintain an array of n k-bit integers subject to updates such that partial sums queries can be efficiently answered. We present two succinct versions of the Fenwick Tree - which is known for its simplicity and practicality. Our results hold in the encoding model where one is allowed to reuse the space from the input…

    Submitted 31 May, 2017; originally announced May 2017.

  42. arXiv:1705.10382  [pdf, other]

    cs.DS

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Authors: Travis Gagie, Gonzalo Navarro, Nicola Prezza

    Abstract: Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used…

    Submitted 11 July, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

  43. arXiv:1704.08558  [pdf, other]

    cs.DS

    Practical and Effective Re-Pair Compression

    Authors: Philip Bille, Inge Li Gørtz, Nicola Prezza

    Abstract: Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+ε)n+\sqrt n$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $ε>0$; in practice however, this solution uses complex sub-procedures preventin…

    Submitted 27 April, 2017; originally announced April 2017.

  44. arXiv:1702.01340  [pdf, other]

    cs.DS

    From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back

    Authors: Alberto Policriti, Nicola Prezza

    Abstract: The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing, as their sizes $z$ and $r$ are closely related to the amount of text self-repetitiveness. In this paper we consider the problem of converting the two representations into each other within a working space proportional to the input and the output.…

    Submitted 4 February, 2017; originally announced February 2017.

  45. arXiv:1701.07238  [pdf, other]

    cs.DS

    A Framework of Dynamic Data Structures for String Processing

    Authors: Nicola Prezza

    Abstract: In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM-indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our t…

    Submitted 25 January, 2017; originally announced January 2017.

  46. arXiv:1611.01479  [pdf, other]

    cs.DS

    Space-Efficient Re-Pair Compression

    Authors: Philip Bille, Inge Li Gørtz, Nicola Prezza

    Abstract: Re-Pair is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $σ$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and $5n + 4σ^2 + 4d + \sqrt{n}$ words of working space on top of the text. In thi…

    Submitted 4 November, 2016; originally announced November 2016.

  47. arXiv:1608.05100  [pdf, ps, other]

    cs.DS

    In-Place Sparse Suffix Sorting

    Authors: Nicola Prezza

    Abstract: Suffix arrays encode the lexicographical order of all suffixes of a text and are often combined with the Longest Common Prefix array (LCP) to simulate navigational queries on the suffix tree in reduced space. In space-critical applications such as sparse and compressed text indexing, only information regarding the lexicographical order of a size-$b$ subset of all $n$ text suffixes is often needed.…

    Submitted 1 November, 2017; v1 submitted 17 August, 2016; originally announced August 2016.

    Comments: ACM-SIAM Symposium on Discrete Algorithms 2018; arXiv admin note: text overlap with arXiv:1607.06660 Comment: new style (lipics); using Heath-Brown theorem for number of primes in Z; improved bounds for LCP array computation and sparse suffix sorting; added construction of the LCE structure using radix sort; added reference to lower bound for LCE query times; uploaded version accepted at SODA 2018

  48. arXiv:1607.06660  [pdf, ps, other]

    cs.DS

    Fast Longest Common Extensions in Small Space

    Authors: Alberto Policriti, Nicola Prezza

    Abstract: In this paper we address the longest common extension (LCE) problem: to compute the length $\ell$ of the longest common prefix between any two suffixes of $T \in Σ^n$ with $Σ = \{0, \ldots, σ-1\}$. We present two fast and space-efficient solutions based on (Karp-Rabin) \textit{fingerprinting} and \textit{sampling}. Our first data structure exploits properties of Mersenne prime numbers when used as…

    Submitted 22 July, 2016; originally announced July 2016.

  49. arXiv:1604.06002  [pdf, other]

    cs.DS

    Practical combinations of repetition-aware data structures

    Authors: Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

    Abstract: Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct mea…

    Submitted 21 April, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

    Comments: arXiv admin note: text overlap with arXiv:1502.05937

  50. arXiv:1510.06257  [pdf, other]

    cs.DS

    Computing LZ77 in Run-Compressed Space

    Authors: Nicola Prezza, Alberto Policriti

    Abstract: In this paper, we show that the LZ77 factorization of a text $T \in Σ^n$ can be computed in $O(R \log n)$ bits of working space and $O(n \log R)$ time, $R$ being the number of runs in the Burrows-Wheeler transform of $T$ reversed. For extremely repetitive inputs, the working space can be as low as $O(\log n)$ bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a…

    Submitted 21 October, 2015; originally announced October 2015.