-
Generalizing Roberts' characterization of unit interval graphs
Authors:
Virginia Ardévol Martínez,
Romeo Rizzi,
Abdallah Saffidine,
Florian Sikora,
Stéphane Vialette
Abstract:
For any natural number $d$, a graph $G$ is a (disjoint) $d$-interval graph if it is the intersection graph of (disjoint) $d$-intervals, the union of $d$ (disjoint) intervals on the real line. Two important subclasses of $d$-interval graphs are unit and balanced $d$-interval graphs (where every interval has unit length or all the intervals associated with the same vertex have the same length, respectively). A celebrated result by Roberts gives a simple characterization of unit interval graphs as being exactly the claw-free interval graphs. Here, we study the generalization of this characterization for $d$-interval graphs. In particular, we prove that for any $d \geq 2$, if $G$ is a $K_{1,2d+1}$-free interval graph, then $G$ is a unit $d$-interval graph. However, somewhat surprisingly, under the same assumptions, $G$ is not always a \emph{disjoint} unit $d$-interval graph. This implies that the class of disjoint unit $d$-interval graphs is strictly included in the class of unit $d$-interval graphs. Finally, we study the relationships between the classes obtained under disjoint and non-disjoint $d$-intervals in the balanced case and show that the classes of disjoint balanced 2-intervals and balanced 2-intervals coincide, but this is no longer true for $d>2$.
Submitted 27 April, 2024;
originally announced April 2024.
-
Quasi-kernels in split graphs
Authors:
Hélène Langlois,
Frédéric Meunier,
Romeo Rizzi,
Stéphane Vialette,
Yacong Zhou
Abstract:
In a digraph, a quasi-kernel is a subset of vertices that is independent and such that the shortest path from every vertex to this subset is of length at most two. The ``small quasi-kernel conjecture,'' proposed by Erdős and Székely in 1976, postulates that every sink-free digraph has a quasi-kernel whose size is within a fraction of the total number of vertices. The conjecture is in fact more precise, with a ratio of $1/2$, but even with a larger ratio this property is known to hold only for a few classes of graphs. The focus here is on small quasi-kernels in split graphs. This family of graphs has played a special role in the study of the conjecture since it was used to disprove a strengthening that postulated the existence of two disjoint quasi-kernels. The paper proves that every sink-free split digraph $D$ has a quasi-kernel of size at most $\frac{2}{3}|V(D)|$, and even of size at most two when the graph is an orientation of a complete split graph. It is also shown that computing a quasi-kernel of minimal size in a split digraph is W[2]-hard.
Submitted 26 February, 2024; v1 submitted 24 December, 2023;
originally announced December 2023.
-
Recognizing unit multiple intervals is hard
Authors:
Virginia Ardévol Martínez,
Romeo Rizzi,
Florian Sikora,
Stéphane Vialette
Abstract:
Multiple interval graphs are a well-known generalization of interval graphs introduced in the 1970s to deal with situations arising naturally in scheduling and allocation. A $d$-interval is the union of $d$ intervals on the real line, and a graph is a $d$-interval graph if it is the intersection graph of $d$-intervals. In particular, it is a unit $d$-interval graph if it admits a $d$-interval representation where every interval has unit length.
Whereas it has been known for a long time that recognizing 2-interval graphs and other related classes such as 2-track interval graphs is NP-complete, the complexity of recognizing unit 2-interval graphs has remained open. Here, we settle this question by proving that the recognition of unit 2-interval graphs is also NP-complete. Our proof technique uses a completely different approach from the other hardness results for recognizing related classes. Furthermore, we extend the result to unit $d$-interval graphs for any $d\geq 2$, an extension that does not follow directly in graph recognition problems (as an example, it took almost 20 years to close the gap between $d=2$ and $d> 2$ for the recognition of $d$-track interval graphs). Our result has several implications, including that recognizing $(x, \dots, x)$ $d$-interval graphs and depth $r$ unit 2-interval graphs is NP-complete for every $x\geq 11$ and every $r\geq 4$.
Submitted 21 September, 2023;
originally announced September 2023.
-
Minimum Path Cover in Parameterized Linear Time
Authors:
Manuel Cáceres,
Massimo Cairo,
Brendan Mumey,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
A minimum path cover (MPC) of a directed acyclic graph (DAG) $G = (V,E)$ is a minimum-size set of paths that together cover all the vertices of the DAG. Computing an MPC is a basic polynomial problem, dating back to Dilworth's and Fulkerson's results in the 1950s. Since the size $k$ of an MPC (also known as the width) can be small in practical applications, research has also studied algorithms whose running time is parameterized on $k$.
We obtain a new MPC parameterized algorithm for DAGs running in time $O(k^2|V| + |E|)$. Our algorithm is the first solving the problem in parameterized linear time. Additionally, we obtain an edge sparsification algorithm preserving the width of a DAG but reducing $|E|$ to less than $2|V|$. This algorithm runs in time $O(k^2|V|)$ and requires an MPC of a DAG as input, thus its total running time is the same as the running time of our MPC algorithm.
Submitted 17 November, 2022;
originally announced November 2022.
-
Solving the Probabilistic Profitable Tour Problem on a Tree
Authors:
Enrico Angelelli,
Renata Mansini,
Romeo Rizzi
Abstract:
The profitable tour problem (PTP) is a well-known NP-hard routing problem searching for a tour visiting a subset of customers while maximizing profit as the difference between total revenue collected and traveling costs. PTP is known to be solvable in polynomial time when special structures of the underlying graph are considered. However, the computational complexity of the corresponding probabilistic generalizations is still an open issue in many cases. In this paper, we analyze the probabilistic PTP where customers are located on a tree and require, with a known probability, a service provision at a predefined prize. The problem objective is to select a priori a subset of customers with whom to commit the service so as to maximize the expected profit. We provide a polynomial time algorithm computing the optimal solution in $O(n^2)$, where $n$ is the number of nodes in the tree.
Submitted 21 October, 2022;
originally announced October 2022.
-
Cut paths and their remainder structure, with applications
Authors:
Massimo Cairo,
Shahbaz Khan,
Romeo Rizzi,
Sebastian Schmidt,
Alexandru I. Tomescu,
Elia C. Zirondelli
Abstract:
In a strongly connected graph $G = (V,E)$, a cut arc (also called strong bridge) is an arc $e \in E$ whose removal makes the graph no longer strongly connected. Equivalently, there exist $u,v \in V$, such that all $u$-$v$ walks contain $e$. Cut arcs are a fundamental graph-theoretic notion, with countless applications, especially in reachability problems.
In this paper we initiate the study of cut paths, as a generalisation of cut arcs, which we naturally define as those paths $P$ for which there exist $u,v \in V$, such that all $u$-$v$ walks contain $P$ as subwalk. We first prove various properties of cut paths and define their remainder structures, which we use to present a simple $O(m)$-time verification algorithm for a cut path ($|V| = n$, $|E| = m$).
Secondly, we apply cut paths and their remainder structures to improve several reachability problems from bioinformatics. A walk is called safe if it is a subwalk of every node-covering closed walk of a strongly connected graph. Multi-safety is defined analogously, by considering node-covering sets of closed walks instead. We show that cut paths provide simple $O(m)$-time algorithms verifying if a walk is safe or multi-safe. For multi-safety, we present the first linear time algorithm, while for safety, we present a simple algorithm where the state-of-the-art employed complex data structures. Finally we show that the simultaneous computation of remainder structures of all subwalks of a cut path can be performed in linear time. These properties yield an $O(mn)$ algorithm outputting all maximal multi-safe walks, improving over the state-of-the-art algorithm running in time $O(m^2+n^3)$.
The results of this paper only scratch the surface in the study of cut paths, and we believe a rich structure of a graph can be revealed, considering the perspective of a path, instead of just an arc.
Submitted 14 October, 2022;
originally announced October 2022.
-
Width Helps and Hinders Splitting Flows
Authors:
Manuel Cáceres,
Massimo Cairo,
Andreas Grigorjew,
Shahbaz Khan,
Brendan Mumey,
Romeo Rizzi,
Alexandru I. Tomescu,
Lucia Williams
Abstract:
Minimum flow decomposition (MFD) is the NP-hard problem of finding a smallest decomposition of a network flow/circulation $X$ on a directed graph $G$ into weighted source-to-sink paths whose superposition equals $X$. We show that, for acyclic graphs, considering the \emph{width} of the graph (the minimum number of paths needed to cover all of its edges) yields advances in our understanding of its approximability. For the version of the problem that uses only non-negative weights, we identify and characterise a new class of \emph{width-stable} graphs, for which a popular heuristic is a $O(\log |X|)$-approximation ($|X|$ being the total flow of $X$), and strengthen its worst-case approximation ratio from $\Omega(\sqrt{m})$ to $\Omega(m / \log m)$ for sparse graphs, where $m$ is the number of edges in the graph. We also study a new problem on graphs with cycles, Minimum Cost Circulation Decomposition (MCCD), and show that it generalises MFD through a simple reduction. For the version allowing also negative weights, we give a $(\lceil \log \Vert X \Vert \rceil +1)$-approximation ($\Vert X \Vert$ being the maximum absolute value of $X$ on any edge) using a power-of-two approach, combined with parity fixing arguments and a decomposition of unitary circulations ($\Vert X \Vert \leq 1$), using a generalised notion of width for this problem. Finally, we disprove a conjecture about the linear independence of minimum (non-negative) flow decompositions posed by Kloster et al. [ALENEX 2018], but show that its useful implication (polynomial-time assignments of weights to a given set of paths to decompose a flow) holds for the negative version.
Submitted 9 May, 2023; v1 submitted 5 July, 2022;
originally announced July 2022.
-
Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
Authors:
Paola Bonizzoni,
Matteo Costantini,
Clelia De Felice,
Alessia Petescia,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi,
Jens Stoye,
Rocco Zaccagnino,
Rosalba Zizza
Abstract:
Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from them, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
Submitted 2 June, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
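To make the notion of fingerprint in the abstract above concrete, here is a minimal sketch in Python. It assumes the standard Lyndon factorization computed with Duval's algorithm, whereas the abstract speaks of variants of that factorization, so this is only an illustration and not the lyn2vec implementation. The function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical.

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read: str) -> list[int]:
    """Sequence of lengths of consecutive Lyndon factors (assumed plain variant)."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(fp: list[int], k: int) -> list[tuple[int, ...]]:
    """All windows of k consecutive fingerprint entries."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]
```

For instance, `lyndon_factorization("banana")` yields the factors `b`, `an`, `an`, `a`, so the fingerprint under this assumption is `[1, 2, 2, 1]`.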
-
Sparsifying, Shrinking and Splicing for Minimum Path Cover in Parameterized Linear Time
Authors:
Manuel Cáceres,
Massimo Cairo,
Brendan Mumey,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
A minimum path cover (MPC) of a directed acyclic graph (DAG) $G = (V,E)$ is a minimum-size set of paths that together cover all the vertices of the DAG. Computing an MPC is a basic polynomial problem, dating back to Dilworth's and Fulkerson's results in the 1950s. Since the size $k$ of an MPC (also known as the width) can be small in practical applications, research has also studied algorithms whose complexity is parameterized on $k$. We obtain two new MPC parameterized algorithms for DAGs running in time $O(k^2|V|\log{|V|} + |E|)$ and $O(k^3|V| + |E|)$. We also obtain a parallel algorithm running in $O(k^2|V| + |E|)$ parallel steps and using $O(\log{|V|})$ processors (in the PRAM model). Our latter two algorithms are the first solving the problem in parameterized linear time. Finally, we present an algorithm running in time $O(k^2|V|)$ for transforming any MPC to another MPC using less than $2|V|$ distinct edges, which we prove to be asymptotically tight. As such, we also obtain edge sparsification algorithms preserving the width of the DAG with the same running time as our MPC algorithms. At the core of all our algorithms we interleave the usage of three techniques: transitive sparsification, shrinking of a path cover, and the splicing of a set of paths along a given path.
Submitted 12 July, 2021;
originally announced July 2021.
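As a point of comparison for the abstract above, the size $k$ of a minimum path cover can be computed by the classical reduction to bipartite matching on the transitive closure. The sketch below is that textbook baseline, not the parameterized linear-time algorithms of the paper; the function name `min_path_cover_size` and the edge-list input format are assumptions made for illustration.

```python
def min_path_cover_size(n: int, edges: list[tuple[int, int]]) -> int:
    """Size of a minimum path cover of a DAG on vertices 0..n-1.

    Paths may share vertices, so the answer equals the width of the DAG.
    Classical approach: n minus a maximum matching in the bipartite graph
    that links u to every vertex reachable from u.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # reach[u] = vertices reachable from u by a non-empty path (iterative DFS).
    reach = []
    for s in range(n):
        seen = set()
        stack = list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(seen)

    match_right = [-1] * n  # match_right[v] = left vertex matched to v, or -1

    def try_augment(u: int, visited: set[int]) -> bool:
        # Kuhn's augmenting-path search from left vertex u.
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if match_right[v] == -1 or try_augment(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching
```

On the diamond DAG with arcs 0→1, 0→2, 1→3, 2→3, vertices 1 and 2 are mutually unreachable, and the function returns 2. The recursive search is meant for small inputs only.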
-
Algorithmic aspects of quasi-kernels
Authors:
Hélène Langlois,
Frédéric Meunier,
Romeo Rizzi,
Stéphane Vialette
Abstract:
In a digraph, a quasi-kernel is a subset of vertices that is independent and such that every vertex can reach some vertex in that set via a directed path of length at most two. Whereas Chvátal and Lovász proved in 1974 that every digraph has a quasi-kernel, very little is known so far about the complexity of finding small quasi-kernels. In 1976 Erdős and Székely conjectured that every sink-free digraph $D = (V, A)$ has a quasi-kernel of size at most $|V|/2$. Obviously, if $D$ has two disjoint quasi-kernels then it has a quasi-kernel of size at most $|V|/2$, and in 2001 Gutin, Koh, Tay and Yeo conjectured that every sink-free digraph has two disjoint quasi-kernels. Yet, they constructed in 2004 a counterexample, thereby disproving this stronger conjecture. We shall show that not only do sink-free digraphs occasionally fail to contain two disjoint quasi-kernels, but it is also computationally hard to distinguish those that do from those that do not. We also prove that the problem of computing a small quasi-kernel is polynomial time solvable for orientations of trees but is computationally hard in most other cases (and in particular for restricted acyclic digraphs).
Submitted 8 July, 2021;
originally announced July 2021.
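The definition of a quasi-kernel in the abstract above can be checked directly. The following sketch only verifies a given candidate set; it does not search for a small quasi-kernel, which the abstract states is computationally hard in most cases. The function name `is_quasi_kernel` and the arc-list representation are assumptions made for illustration.

```python
def is_quasi_kernel(vertices, arcs, candidate) -> bool:
    """Check that `candidate` is independent and that every vertex reaches it
    by a directed path of length at most two."""
    out = {v: set() for v in vertices}
    for u, v in arcs:
        out[u].add(v)
    q = set(candidate)

    # Independence: no arc joins two vertices of the candidate set.
    if any(u in q and v in q for u, v in arcs):
        return False

    for v in vertices:
        if v in q:
            continue  # path of length zero
        one_step = out[v]
        two_steps = set().union(*(out[w] for w in one_step))
        if not q & (one_step | two_steps):
            return False
    return True
```

On the directed 3-cycle a→b→c→a, the single vertex `a` is a quasi-kernel: `c` reaches it in one step and `b` in two.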
-
The Hydrostructure: a Universal Framework for Safe and Complete Algorithms for Genome Assembly
Authors:
Massimo Cairo,
Shahbaz Khan,
Romeo Rizzi,
Sebastian Schmidt,
Alexandru I. Tomescu,
Elia C. Zirondelli
Abstract:
Genome assembly is a fundamental problem in Bioinformatics, which requires reconstructing a source genome from an assembly graph built from a set of reads (short strings sequenced from the genome). A notion of genome assembly solution is that of an arc-covering walk of the graph. Since assembly graphs admit many solutions, the goal is to find what is definitely present in all solutions, or what is safe. Most practical assemblers are based on heuristics having at their core unitigs, namely paths whose internal nodes have unit in-degree and out-degree, and which are clearly safe. The long-standing open problem of finding all the safe parts of the solutions was recently solved [RECOMB 2016] yielding a 60% increase in contig length. This safe and complete genome assembly algorithm was followed by other works improving the time bounds, as well as extending the results to different notions of assembly solution. But it remained open whether one can be complete also for models of genome assembly of practical applicability.
In this paper we present a universal framework for obtaining safe and complete algorithms which unify the previous results, while also allowing for easy generalisations to assembly problems including many practical aspects. This is based on a novel graph structure, called the hydrostructure of a walk, which highlights the reachability properties of the graph from the perspective of the walk. The hydrostructure allows for simple characterisations of the existing safe walks, and of their new practical versions. Almost all of our characterisations are directly adaptable to optimal verification algorithms, and simple enumeration algorithms. Most of these algorithms are also improved to optimality using an incremental computation procedure and a previous optimal algorithm of a specific model.
Submitted 2 November, 2021; v1 submitted 25 November, 2020;
originally announced November 2020.
-
A linear-time parameterized algorithm for computing the width of a DAG
Authors:
Manuel Cáceres,
Massimo Cairo,
Brendan Mumey,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
The width $k$ of a directed acyclic graph (DAG) $G = (V, E)$ equals the largest number of pairwise non-reachable vertices. Computing the width dates back to Dilworth's and Fulkerson's results in the 1950s, and is doable in quadratic time in the worst case. Since $k$ can be small in practical applications, research has also studied algorithms whose complexity is parameterized on $k$. Despite these efforts, it is still open whether there exists a linear-time $O(f(k)(|V| + |E|))$ parameterized algorithm computing the width. We answer this question affirmatively by presenting an $O(k^24^k|V| + k2^k|E|)$ time algorithm, based on a new notion of frontier antichains. As we process the vertices in a topological order, all frontier antichains can be maintained with the help of several combinatorial properties, paying only $f(k)$ along the way. The fact that the width can be computed by a single $f(k)$-sweep of the DAG is a new surprising insight into this classical problem. Our algorithm also allows deciding whether the DAG has width at most $w$ in time $O(f(\min(w,k))(|V|+|E|))$.
Submitted 24 June, 2021; v1 submitted 15 July, 2020;
originally announced July 2020.
-
Safety in $s$-$t$ Paths, Trails and Walks
Authors:
Massimo Cairo,
Shahbaz Khan,
Romeo Rizzi,
Sebastian Schmidt,
Alexandru I. Tomescu
Abstract:
Given a directed graph $G$ and a pair of nodes $s$ and $t$, an \emph{$s$-$t$ bridge} of $G$ is an edge whose removal breaks all $s$-$t$ paths of $G$ (and thus appears in all $s$-$t$ paths). Computing all $s$-$t$ bridges of $G$ is a basic graph problem, solvable in linear time.
In this paper, we consider a natural generalisation of this problem, with the notion of "safety" from bioinformatics. We say that a walk $W$ is \emph{safe} with respect to a set $\mathcal{W}$ of $s$-$t$ walks, if $W$ is a subwalk of all walks in $\mathcal{W}$. We start by considering the maximal safe walks when $\mathcal{W}$ consists of: all $s$-$t$ paths, all $s$-$t$ trails, or all $s$-$t$ walks of $G$. We show that the first two problems are immediate linear-time generalisations of finding all $s$-$t$ bridges, while the third problem is more involved. In particular, we show that there exists a compact representation computable in linear time, that allows outputting all maximal safe walks in time linear in their length.
We further generalise these problems, by assuming that safety is defined only with respect to a subset of \emph{visible} edges. Here we prove a dichotomy between the $s$-$t$ paths and $s$-$t$ trails cases, and the $s$-$t$ walks case: the former two are NP-hard, while the latter is solvable with the same complexity as when all edges are visible. We also show that the same complexity results hold for the analogous generalisations of \emph{$s$-$t$ articulation points} (nodes appearing in all $s$-$t$ paths).
We thus obtain the best possible results for natural "safety"-generalisations of these two fundamental graph problems. Moreover, our algorithms are simple and do not employ any complex data structures, making them ideal for use in practice.
Submitted 17 July, 2020; v1 submitted 9 July, 2020;
originally announced July 2020.
-
Computing all $s$-$t$ bridges and articulation points simplified
Authors:
Massimo Cairo,
Shahbaz Khan,
Romeo Rizzi,
Sebastian Schmidt,
Alexandru I. Tomescu,
Elia Zirondelli
Abstract:
Given a directed graph $G$ and a pair of nodes $s$ and $t$, an $s$-$t$ bridge of $G$ is an edge whose removal breaks all $s$-$t$ paths of $G$. Similarly, an $s$-$t$ articulation point of $G$ is a node whose removal breaks all $s$-$t$ paths of $G$. Computing the sequence of all $s$-$t$ bridges of $G$ (as well as the $s$-$t$ articulation points) is a basic graph problem, solvable in linear time using the classical min-cut algorithm.
When dealing with cuts of unit size ($s$-$t$ bridges) this algorithm can be simplified to a single graph traversal from $s$ to $t$ avoiding an arbitrary $s$-$t$ path, which is interrupted at the $s$-$t$ bridges. Further, the corresponding proof is also simplified making it independent of the theory of network flows.
Submitted 26 June, 2020;
originally announced June 2020.
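The definition of an $s$-$t$ bridge in the abstract above can be illustrated by a brute-force check: delete each edge in turn and test whether $t$ is still reachable from $s$. This sketch runs in $O(m(n+m))$ time and is deliberately naive; it is not the single-traversal linear-time algorithm the abstract describes. The names `reachable` and `st_bridges` are hypothetical.

```python
from collections import deque


def reachable(n, edges, s, t, skip=None) -> bool:
    """BFS from s to t on vertices 0..n-1, ignoring the edge at index `skip`."""
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        if idx != skip:
            adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def st_bridges(n, edges, s, t):
    """Edges whose removal breaks all s-t paths, listed in input order."""
    if not reachable(n, edges, s, t):
        return []  # no s-t path at all, so there is nothing to break
    return [e for i, e in enumerate(edges)
            if not reachable(n, edges, s, t, skip=i)]
```

On the graph with edges 0→1, 1→2, 1→3, 2→4, 3→4, 4→5 and $s = 0$, $t = 5$, the two parallel routes between 1 and 4 are not bridges, and the function reports only (0, 1) and (4, 5).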
-
Genome assembly, from practice to theory: safe, complete and linear-time
Authors:
Massimo Cairo,
Romeo Rizzi,
Alexandru I. Tomescu,
Elia C. Zirondelli
Abstract:
Genome assembly asks to reconstruct an unknown string from many shorter substrings of it. Even though it is one of the key problems in Bioinformatics, it is generally lacking major theoretical advances. Its hardness stems both from practical issues (size and errors of real data), and from the fact that problem formulations inherently admit multiple solutions. Given these, at their core, most state-of-the-art assemblers are based on finding non-branching paths (unitigs) in an assembly graph. If one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions, being thus safe partial solutions. All such safe walks were recently characterized as omnitigs, leading to the first safe and complete genome assembly algorithm. Even if omnitig finding was improved to quadratic time, it remained open whether the crucial linear-time feature of finding unitigs can be attained with omnitigs.
We describe a surprising $O(m)$-time algorithm to identify all maximal omnitigs of a graph with $n$ nodes and $m$ arcs, notwithstanding the existence of families of graphs with $Θ(mn)$ total maximal omnitig size. This is based on the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig, with two consequences: (1) A linear-time output-sensitive algorithm enumerating all maximal omnitigs. (2) A compact $O(m)$ representation of all maximal omnitigs, which allows, e.g., for $O(m)$-time computation of various statistics on them.
Our results close a long-standing theoretical question inspired by practical genome assemblers, originating with the use of unitigs in 1995. We envision our results to be at the core of a reverse transfer from theory to practical and complete genome assembly programs, as has been the case for other key Bioinformatics problems.
Submitted 8 November, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
-
When a Dollar Makes a BWT
Authors:
Sara Giuliani,
Zsuzsanna Lipták,
Francesco Masillo,
Romeo Rizzi
Abstract:
The Burrows-Wheeler-Transform (BWT) is a reversible string transformation which plays a central role in text compression and is fundamental in many modern bioinformatics applications. The BWT is a permutation of the characters of the string that is in general more compressible than the original string and allows several different query types to be answered more efficiently.
It is easy to see that not every string is a BWT image, and exact characterizations of BWT images are known. We investigate a related combinatorial question. In many applications, a sentinel character dollar is added to mark the end of the string, and thus the BWT of a string ending with dollar contains exactly one dollar-character. Given a string w, we ask in which positions, if any, the dollar-character can be inserted to turn w into the BWT image of a word ending with dollar. We show that this depends only on the standard permutation of w and present an O(n log n)-time algorithm for identifying all such positions, improving on the naive quadratic time algorithm. We also give a combinatorial characterization of such positions and develop bounds on their number and value. This is an extended version of [Giuliani et al. ICTCS 2019].
Submitted 12 March, 2021; v1 submitted 24 August, 2019;
originally announced August 2019.
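The question posed in the abstract above can be explored on small inputs with a brute-force round trip: insert the dollar at each position, invert the candidate with the LF-mapping, and check whether the BWT of the result gives the candidate back. This sketch is far slower than either algorithm mentioned in the abstract and is meant only to make the problem statement concrete. It assumes that `$` compares smaller than every other character of the input (true for ASCII letters and digits), and the names `bwt`, `invert`, and `dollar_positions` are hypothetical.

```python
def bwt(text: str) -> str:
    """BWT via sorted rotations; `text` is expected to end with the sentinel '$'."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)


def invert(b: str) -> str:
    """Candidate preimage of b that ends with '$', read off via the LF-mapping.

    The result is only a genuine preimage if bwt(result) == b.
    """
    n = len(b)
    # Stable sort of positions by character: the standard permutation of b.
    order = sorted(range(n), key=lambda i: b[i])
    lf = [0] * n
    for rank, i in enumerate(order):
        lf[i] = rank
    # Row 0 is the rotation starting with '$'; walk backwards through the text.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(b[row])
        row = lf[row]
    return "".join(reversed(out)) + "$"


def dollar_positions(w: str) -> list[int]:
    """0-based insertion offsets i such that w[:i] + '$' + w[i:] is the BWT
    of some word ending with '$'."""
    positions = []
    for i in range(len(w) + 1):
        candidate = w[:i] + "$" + w[i:]
        if bwt(invert(candidate)) == candidate:
            positions.append(i)
    return positions
```

For example, `bwt("banana$")` is `annb$aa`, so offset 4 is a valid insertion position for `w = "annbaa"`.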
-
On Restricted Disjunctive Temporal Problems: Faster Algorithms and Tractability Frontier
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
In 2005 Kumar studied the Restricted Disjunctive Temporal Problem (RDTP), a restricted but very expressive class of disjunctive temporal problems (DTPs). It was shown that RDTPs are solvable in deterministic strongly-polynomial time by reducing them to the Connected Row-Convex (CRC) constraints problem; in addition, Kumar devised a randomized algorithm whose expected running time is less than that of the deterministic one. By contrast, the most general form of DTPs allows for multi-variable disjunctions of many interval constraints and is NP-complete.
This work offers a deeper understanding of the tractability of RDTPs, leading to an elementary deterministic strongly-polynomial time algorithm for them, significantly improving the asymptotic running times of both the deterministic and randomized algorithms of Kumar. The result is obtained by reducing RDTPs to the Single-Source Shortest-Paths (SSSP) and the 2-SAT problem (jointly), instead of reducing to CRCs. In passing, we obtain a faster (quadratic-time) algorithm for RDTPs having only Type-1 and Type-2 constraints (and no Type-3 constraint). As a second main contribution, we study the tractability frontier of solving RDTPs by considering Hyper Temporal Networks (HTNs), a strict generalization of Simple Temporal Networks (STNs) grounded on hypergraphs: on one side, we prove that solving temporal problems having only Type-2 constraints and either only multi-tail or only multi-head hyperarc constraints lies in both NP and co-NP and admits deterministic pseudo-polynomial time algorithms; on the other side, solving problems with Type-3 constraints and either only multi-tail or only multi-head hyperarc constraints becomes strongly NP-complete.
Submitted 4 August, 2018; v1 submitted 6 May, 2018;
originally announced May 2018.
-
Computing the BWT and LCP array of a Set of Strings in External Memory
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k + m) main memory, where l is the maximum value in the LCP array.
Submitted 4 December, 2020; v1 submitted 19 May, 2017;
originally announced May 2017.
-
Perfect phylogenies via branchings in acyclic digraphs and a generalization of Dilworth's theorem
Authors:
Ademir Hujdurović,
Edin Husić,
Martin Milanič,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
Motivated by applications in cancer genomics and following the work of Hajirasouliha and Raphael (WABI 2014), Hujdurović et al. (IEEE TCBB, to appear) introduced the minimum conflict-free row split (MCRS) problem: split each row of a given binary matrix into a bitwise OR of a set of rows so that the resulting matrix corresponds to a perfect phylogeny and has the minimum possible number of rows among all matrices with this property. Hajirasouliha and Raphael also proposed the study of a similar problem, in which the task is to minimize the number of distinct rows of the resulting matrix. Hujdurović et al. proved that both problems are NP-hard, gave a related characterization of transitively orientable graphs, and proposed a polynomial-time heuristic algorithm for the MCRS problem based on coloring cocomparability graphs.
We give new, more transparent formulations of the two problems, showing that the problems are equivalent to two optimization problems on branchings in a derived directed acyclic graph. Building on these formulations, we obtain new results on the two problems, including: (i) a strengthening of the heuristic by Hujdurović et al. via a new min-max result in digraphs generalizing Dilworth's theorem, which may be of independent interest, (ii) APX-hardness results for both problems, (iii) approximation algorithms, and (iv) exponential-time algorithms solving the two problems to optimality faster than the naïve brute-force approach. Our work relates to several well studied notions in combinatorial optimization: chain partitions in partially ordered sets, laminar hypergraphs, and (classical and weighted) colorings of graphs.
Submitted 27 January, 2018; v1 submitted 19 January, 2017;
originally announced January 2017.
-
Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics
Authors:
Romeo Rizzi,
Massimo Cairo,
Veli Mäkinen,
Alexandru I. Tomescu,
Daniel Valenzuela
Abstract:
Covering alignment problems arise from recent developments in genomics; so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable the full content of diploid genomes to be used as the basis of sequence analysis. In this paper, we show that the computational complexity will change for natural extensions of alignments to pan-genome representations and to diploid genomes. More broadly, our approach can also be seen as a minimal extension of sequence alignment to labeled directed acyclic graphs (labeled DAGs). Namely, we show that finding a \emph{covering alignment} of two labeled DAGs is NP-hard even on binary alphabets. A covering alignment asks for two paths $R_1$ (red) and $G_1$ (green) in DAG $D_1$ and two paths $R_2$ (red) and $G_2$ (green) in DAG $D_2$ that cover the nodes of the graphs and maximize the sum of the global alignment scores: $\mathsf{as}(\mathsf{sp}(R_1),\mathsf{sp}(R_2))+\mathsf{as}(\mathsf{sp}(G_1),\mathsf{sp}(G_2))$, where $\mathsf{sp}(P)$ is the concatenation of labels on the path $P$. Pair-wise alignment of haplotype sequences forming a diploid chromosome can be converted to a two-path coverable labeled DAG, and then the covering alignment models the similarity of two diploids over arbitrary recombinations. We also give a reduction in the other direction, to show that such a recombination-oblivious diploid alignment is NP-hard on alphabets of size $3$.
Submitted 22 May, 2018; v1 submitted 15 November, 2016;
originally announced November 2016.
-
Linear-Time Safe-Alternating DFS and SCCs
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
An alternating graph is a directed graph whose vertex set is partitioned into two classes, existential and universal. This forms the basic arena for a plethora of infinite duration two-player games where Player $\square$ and Player $\ocircle$ alternate in a turn-based sliding of a pebble along the arcs they control. We study alternating strongly-connectedness as a generalization of strongly-connectedness in directed graphs, aiming at providing a linear time decomposition and a sound structural graph characterization. For this a refined notion of alternating reachability is introduced: Player $\square$ attempts to reach vertices without leaving a prescribed subset of the vertices, while Player $\ocircle$ works against this. This is named \emph{safe alternating reachability}. It is shown that every arena uniquely decomposes into safe alternating strongly-connected components where Player $\square$ can visit each vertex within a given component infinitely often, without ever having to leave the component itself. Our main result is a linear time algorithm for computing this alternating graph decomposition. Both the underlying graph structures and the algorithm generalize the classical decomposition of a directed graph into strongly-connected components. The algorithm builds on a linear time generalization of the depth-first search on alternation, taking inspiration from Tarjan's 1972 machinery. Our theory has direct applications in solving well-known infinite duration pebble games faster. Dinneen and Khoussainov showed in 1999 that deciding a given Update Game costs $O(mn)$ time, where $n$ is the number of vertices and $m$ is that of arcs. We solve the task in $Θ(m+n)$ linear time. The complexity of Explicit McNaughton-Müller Games also improves from cubic to quadratic.
Submitted 13 February, 2022; v1 submitted 30 October, 2016;
originally announced October 2016.
-
Faster O(|V|^2|E|W)-Time Energy Algorithms for Optimal Strategy Synthesis in Mean Payoff Games
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
This study strengthens the links between Mean Payoff Games (MPGs) and Energy Games (EGs). Firstly, we offer a faster $O(|V|^2|E|W)$ pseudo-polynomial time and $Θ(|V|+|E|)$ space deterministic algorithm for solving the Value Problem and Optimal Strategy Synthesis in MPGs. This improves the best previously known estimates on the pseudo-polynomial time complexity to: \[ O(|E|\log |V|) + Θ\Big(\sum_{v\in V}\texttt{deg}_Γ(v)\cdot\ell_Γ(v)\Big) = O(|V|^2|E|W), \] where $\ell_Γ(v)$ counts the number of times that a certain energy-lifting operator $δ(\cdot, v)$ is applied to any $v\in V$, along a certain sequence of Value-Iterations on reweighted EGs; and $\texttt{deg}_Γ(v)$ is the degree of $v$. This improves significantly over a previously known pseudo-polynomial time estimate, i.e. $Θ\big(|V|^2|E|W + \sum_{v\in V}\texttt{deg}_Γ(v)\cdot\ell_Γ(v)\big)$ [CR15, CR16], as the pseudo-polynomiality is now confined to depend solely on $\ell_Γ$. Secondly, we further explore the relationship between Optimal Positional Strategies (OPSs) in MPGs and Small Energy-Progress Measures (SEPMs) in reweighted EGs. It is observed that the space of all OPSs, $\texttt{opt}_ΓΣ^M_0$, admits a unique complete decomposition in terms of extremal-SEPMs in reweighted EGs. This points out what we called the "Energy-Lattice $\mathcal{X}^*_Γ$ associated to $\texttt{opt}_ΓΣ^M_0$". Finally, we offer a pseudo-polynomial total-time recursive procedure for enumerating (without repetitions) all the elements of $\mathcal{X}^*_Γ$, and for computing the corresponding partitioning of $\texttt{opt}_ΓΣ^M_0$.
Submitted 6 September, 2016;
originally announced September 2016.
-
Dynamic Controllability of Conditional Simple Temporal Networks is PSPACE-complete
Authors:
Massimo Cairo,
Romeo Rizzi
Abstract:
Even after the proposal of various solution algorithms, the precise computational complexity of checking whether a Conditional Temporal Network is Dynamically Controllable had remained wide open. This paper settles the issue by providing constructions, algorithms, bridging lemmas, and arguments to formally prove that: (1) the problem is PSPACE-hard, and (2) the problem lies in PSPACE.
Submitted 30 August, 2016;
originally announced August 2016.
-
Instantaneous Reaction-Time in Dynamic-Consistency Checking of Conditional Simple Temporal Networks -- Extended version with an Improved Upper Bound --
Authors:
Massimo Cairo,
Carlo Comin,
Romeo Rizzi
Abstract:
Conditional Simple Temporal Networks (CSTNs) are a constraint-based graph-formalism for conditional temporal planning. In order to address the DC-Checking problem, in [Comin and Rizzi, TIME 2015] we introduced epsilon-DC (a refined, more realistic, notion of DC), and provided an algorithmic solution to it. The epsilon-DC notion is interesting per se, and the epsilon-DC-Checking algorithm in [Comin and Rizzi, TIME 2015] rests on the assumption that the reaction-time satisfies epsilon > 0, leaving unsolved the question of what happens when epsilon = 0. In this work, we introduce and study pi-DC, a sound notion of DC with an instantaneous reaction-time (i.e. one in which the planner can react to any observation at the same instant of time in which the observation is made). Firstly, we demonstrate by a counter-example that pi-DC is not equivalent to 0-DC, and that 0-DC is actually inadequate for modeling DC with an instantaneous reaction-time. This shows that the main results obtained in our previous work do not apply directly, as they were formulated, to the case of epsilon = 0. Motivated by this observation, as a second contribution, our previous tools are extended in order to handle pi-DC, and the notion of ps-tree is introduced, also pointing out a relationship between pi-DC and HyTN-Consistency. Thirdly, a simple reduction from pi-DC-Checking to DC-Checking is identified. This allows us to design and to analyze the first sound-and-complete pi-DC-Checking procedure. Remarkably, the time complexity of the proposed algorithm remains (pseudo) singly-exponential in the number of propositional letters. Finally, it is observed that the technique can be leveraged to actually reduce from pi-DC to 1-DC; this allows us to further improve the exponents in the time complexity of pi-DC-Checking.
Submitted 9 December, 2018; v1 submitted 14 August, 2016;
originally announced August 2016.
-
A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Serena Nicosia,
Marco Previtali,
Raffaella Rizzi
Abstract:
Indexing of very large collections of strings, such as those produced by the widespread sequencing technologies, heavily relies on multi-string generalizations of the Burrows-Wheeler Transform (BWT), and for this problem various in-memory algorithms have been proposed. The rapid growth of routinely processed data, such as in bioinformatics, requires a large amount of main memory, and this has motivated the development of algorithms for computing the BWT that work almost entirely in external memory. On the other hand, the related problem of computing the Longest Common Prefix (LCP) array is often instrumental in several algorithms on collections of strings, such as those that compute the suffix-prefix overlap among strings, which is an essential step for many genome assembly algorithms. The best current lightweight approach to compute BWT and LCP array on a set of $m$ strings, each one $k$ characters long, has I/O complexity that is $O(mk^2 \log |Σ|)$ (where $|Σ|$ is the size of the alphabet), thus it is not optimal. In this paper we propose a novel approach to build BWT and LCP array (simultaneously) with $O(kmL(\log k +\log σ))$ I/O complexity, where $L$ is the length of the longest substring that appears at least twice in the input strings.
Submitted 28 July, 2016;
originally announced July 2016.
-
The Complexity of Simulation and Matrix Multiplication
Authors:
Massimo Cairo,
Romeo Rizzi
Abstract:
Computing the simulation preorder of a given Kripke structure (i.e., a directed graph with $n$ labeled vertices) has crucial applications in model checking of temporal logic. It amounts to solving a specific two-players reachability game, called simulation game. We offer the first conditional lower bounds for this problem, and we relate its complexity (for computation, verification, and certification) to some variants of $n\times n$ matrix multiplication.
We show that any $O(n^α)$-time algorithm for simulation games, even restricting to acyclic games/structures, can be used to compute $n\times n$ boolean matrix multiplication (BMM) in $O(n^α)$ time. This is the first evidence that improving the existing $O(n^{3})$-time solutions may be difficult, without resorting to fast matrix multiplication. In the acyclic case, we match this lower bound presenting the first subcubic algorithm, based on fast BMM, and running in $n^{ω+o(1)}$ time (where $ω<2.376$ is the exponent of matrix multiplication).
For both acyclic and cyclic structures, we point out the existence of natural and canonical $O(n^{2})$-size certificates, that can be verified in truly subcubic time. In the acyclic case, $O(n^{2})$ time is sufficient, employing standard matrix product verification. In the cyclic case, a $\max$-semi-boolean matrix multiplication (MSBMM) is used, i.e., a matrix multiplication on the semi-ring $(\max,\times)$ where one matrix contains only $0$'s and $1$'s. This MSBMM is computable (hence verifiable) in truly subcubic $n^{(3+ω)/2+o(1)}$ time by reduction to $(\max,\min)$-multiplication.
Finally, we show a reduction from MSBMM to cyclic simulation games which implies a separation between the cyclic and the acyclic cases, unless MSBMM can be verified in $n^{ω+o(1)}$ time.
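To make the object of study concrete, here is a minimal Python sketch of the textbook fixpoint computation of the simulation preorder, assuming the usual definition: v simulates u when both carry the same label and every successor of u is simulated by some successor of v. The function name simulation_preorder and the input encoding (a label list plus successor lists) are assumptions made for illustration; this naive refinement is neither the subcubic algorithm nor the certificate verification described in the abstract.

```python
def simulation_preorder(labels, succ):
    """Return the set of pairs (u, v) such that vertex v simulates vertex u.

    labels[u] is the label of vertex u; succ[u] lists the successors of u.
    Starts from all equally labeled pairs and deletes violating pairs until
    nothing changes, which yields the largest simulation relation.
    """
    n = len(labels)
    sim = {(u, v) for u in range(n) for v in range(n) if labels[u] == labels[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(sim):
            # Every move u -> u2 must be matched by some move v -> v2.
            matched = all(
                any((u2, v2) in sim for v2 in succ[v]) for u2 in succ[u]
            )
            if not matched:
                sim.discard((u, v))
                changed = True
    return sim
```

On the three-vertex structure with labels ['a', 'a', 'b'] and a single arc from vertex 0 to vertex 2, vertex 0 simulates vertex 1 (which has no moves to match), but vertex 1 does not simulate vertex 0.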
Submitted 30 August, 2016; v1 submitted 7 May, 2016;
originally announced May 2016.
-
Decomposing Cubic Graphs into Connected Subgraphs of Size Three
Authors:
Laurent Bulteau,
Guillaume Fertin,
Anthony Labarre,
Romeo Rizzi,
Irena Rusu
Abstract:
Let $S=\{K_{1,3},K_3,P_4\}$ be the set of connected graphs of size 3. We study the problem of partitioning the edge set of a graph $G$ into graphs taken from any non-empty $S'\subseteq S$. The problem is known to be NP-complete for any possible choice of $S'$ in general graphs. In this paper, we assume that the input graph is cubic, and study the computational complexity of the problem of partitioning its edge set for any choice of $S'$. We identify all polynomial and NP-complete problems in that setting, and give graph-theoretic characterisations of $S'$-decomposable cubic graphs in some cases.
Submitted 28 April, 2016;
originally announced April 2016.
-
FSG: Fast String Graph Construction for De Novo Assembly of Reads Data
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this paper, we explore a novel approach to compute the string graph, based on the FM-index and Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the SGA assembler as a standalone module to construct the string graph.
The new integrated assembler has been assessed on a standard benchmark, showing that FSG is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads.
Submitted 29 May, 2017; v1 submitted 12 April, 2016;
originally announced April 2016.
-
Sorting With Forbidden Intermediates
Authors:
Carlo Comin,
Anthony Labarre,
Romeo Rizzi,
Stéphane Vialette
Abstract:
A wide range of applications, most notably in comparative genomics, involve the computation of a shortest sorting sequence of operations for a given permutation, where the set of allowed operations is fixed beforehand. Such sequences are useful for instance when reconstructing potential scenarios of evolution between species, or when trying to assess their similarity. We revisit those problems by adding a new constraint on the sequences to be computed: they must \emph{avoid} a given set of \emph{forbidden intermediates}, which correspond to species that cannot exist because the mutations that would be involved in their creation are lethal. We initiate this study by focusing on the case where the only mutations that can occur are exchanges of any two elements in the permutations, and give a polynomial time algorithm for solving that problem when the permutation to sort is an involution.
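The problem statement can be sketched as a breadth-first search over permutations, where each step exchanges two elements and forbidden permutations are never visited. The function name shortest_sorting_sequence and its interface are hypothetical, and the search is exponential in the permutation length, so it only illustrates the problem on tiny inputs and is not the polynomial-time algorithm for involutions announced in the abstract. Note that it returns a shortest sequence among those avoiding the forbidden set, which may be longer than the unconstrained optimum.

```python
from collections import deque
from itertools import combinations


def shortest_sorting_sequence(perm, forbidden=()):
    """Return a shortest list of permutations from perm to the sorted order.

    Consecutive permutations differ by one exchange of two elements, and no
    permutation in `forbidden` is ever visited. Returns None if the sorted
    permutation cannot be reached under these constraints.
    """
    start = tuple(perm)
    target = tuple(sorted(start))
    banned = {tuple(p) for p in forbidden}
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for i, j in combinations(range(len(current)), 2):
            nxt = list(current)
            nxt[i], nxt[j] = nxt[j], nxt[i]
            nxt = tuple(nxt)
            if nxt not in parent and nxt not in banned:
                parent[nxt] = current
                queue.append(nxt)
    return None
```

For the involution (2, 1, 4, 3), two exchanges suffice without constraints. If both natural intermediates (1, 2, 4, 3) and (2, 1, 3, 4) are forbidden, the search has to take a detour of four exchanges, since every exchange flips the parity of the permutation.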
Submitted 23 March, 2017; v1 submitted 19 February, 2016;
originally announced February 2016.
-
Checking Dynamic Consistency of Conditional Hyper Temporal Networks via Mean Payoff Games (Hardness and (pseudo) Singly-Exponential Time Algorithm)
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
In this work we introduce the \emph{Conditional Hyper Temporal Network (CHyTN)} model, which is a natural extension and generalization of both the CSTN and the HTN model. Our contribution goes as follows. We show that deciding whether a given CSTN or CHyTN is dynamically consistent is coNP-hard. Then, we offer a proof that deciding whether a given CHyTN is dynamically consistent is PSPACE-hard, provided that the input instances are allowed to include both multi-head and multi-tail hyperarcs. In light of this, we continue our study by focusing on CHyTNs that allow only multi-head or only multi-tail hyperarcs, and we offer the first deterministic (pseudo) singly-exponential time algorithm for the problem of checking the dynamic-consistency of such CHyTNs, also producing a dynamic execution strategy whenever the input CHyTN is dynamically consistent. Since CSTNs are a special case of CHyTNs, this provides as a byproduct the first sound-and-complete (pseudo) singly-exponential time algorithm for checking dynamic-consistency in CSTNs. The proposed algorithm is based on a novel connection between CSTNs/CHyTNs and Mean Payoff Games (MPGs). The presentation of the connection between CSTNs/CHyTNs and MPGs is mediated by the HTN model. In order to analyze the algorithm, we introduce a refined notion of dynamic-consistency, named $ε$-dynamic-consistency, and present a sharp lower bounding analysis on the critical value of the reaction time $\hat{\varepsilon}$ where a CSTN/CHyTN transits from being, to not being, dynamically consistent. The proof technique introduced in this analysis of $\hat{\varepsilon}$ is applicable more generally when dealing with linear difference constraints which include strict inequalities.
Submitted 15 April, 2017; v1 submitted 19 February, 2016;
originally announced February 2016.
-
Decoding Hidden Markov Models Faster Than Viterbi Via Online Matrix-Vector (max, +)-Multiplication
Authors:
Massimo Cairo,
Gabriele Farina,
Romeo Rizzi
Abstract:
In this paper, we present a novel algorithm for the maximum a posteriori decoding (MAPD) of time-homogeneous Hidden Markov Models (HMM), improving the worst-case running time of the classical Viterbi algorithm by a logarithmic factor. In our approach, we interpret the Viterbi algorithm as a repeated computation of matrix-vector $(\max, +)$-multiplications. On time-homogeneous HMMs, this computation is online: a matrix, known in advance, has to be multiplied with several vectors revealed one at a time. Our main contribution is an algorithm solving this version of matrix-vector $(\max,+)$-multiplication in subquadratic time, by performing a polynomial preprocessing of the matrix. Employing this fast multiplication algorithm, we solve the MAPD problem in $O(mn^2/ \log n)$ time for any time-homogeneous HMM of size $n$ and observation sequence of length $m$, with an extra polynomial preprocessing cost negligible for $m > n$. To the best of our knowledge, this is the first algorithm for the MAPD problem requiring subquadratic time per observation, under the only assumption -- usually verified in practice -- that the transition probability matrix does not change with time.
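The interpretation of Viterbi decoding as repeated matrix-vector $(\max,+)$-multiplication can be sketched in a few lines of Python. The sketch below works with log-probabilities and uses the direct quadratic-time product at every step, so it is the classical $O(mn^2)$ method and not the subquadratic algorithm of the paper. The function names maxplus_matvec and viterbi, and the list-of-lists encoding of the model, are assumptions made for illustration.

```python
import math


def maxplus_matvec(A, x):
    """(max,+) product: y[j] = max_i (x[i] + A[i][j]), with the maximizing i."""
    n = len(x)
    y, arg = [], []
    for j in range(n):
        best_i = max(range(n), key=lambda i: x[i] + A[i][j])
        arg.append(best_i)
        y.append(x[best_i] + A[best_i][j])
    return y, arg


def viterbi(log_init, log_trans, log_emit, obs):
    """Return (best log-probability, most likely state path) for obs.

    log_trans[i][j] is the log-probability of moving from state i to j and
    log_emit[s][o] that of emitting symbol o in state s. The transition
    matrix is the same at every step (time-homogeneous model).
    """
    n = len(log_init)
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        y, arg = maxplus_matvec(log_trans, score)  # one online product per observation
        score = [y[s] + log_emit[s][o] for s in range(n)]
        back.append(arg)
    state = max(range(n), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for arg in reversed(back):
        state = arg[state]
        path.append(state)
    path.reverse()
    return best, path
```

Because log_trans is fixed while the score vectors arrive one at a time, the call to maxplus_matvec is exactly the online operation that the abstract proposes to speed up after preprocessing the matrix.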
Submitted 11 December, 2015; v1 submitted 30 November, 2015;
originally announced December 2015.
-
Pattern matching in $(213,231)$-avoiding permutations
Authors:
Both Emerite Neou,
Romeo Rizzi,
Stéphane Vialette
Abstract:
Given permutations $σ\in S_k$ and $π\in S_n$ with $k<n$, the \emph{pattern matching} problem is to decide whether $π$ matches $σ$ as an order-isomorphic subsequence. We give a linear-time algorithm in case both $π$ and $σ$ avoid the two size-$3$ permutations $213$ and $231$. For the special case where only $σ$ avoids $213$ and $231$, we present an $O(\max(kn^2,n^2\log(\log(n))))$ time algorithm. We extend our research to bivincular patterns that avoid $213$ and $231$ and present an $O(kn^4)$ time algorithm. Finally, we look at the related problem of the longest subsequence which avoids $213$ and $231$.
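The definition of pattern matching used here can be checked directly, if inefficiently, by trying every subsequence of the right length. The following Python sketch does exactly that; the function names order_isomorphic and matches are hypothetical, and the running time is exponential in general, in contrast with the polynomial-time algorithms of the paper for $(213,231)$-avoiding inputs.

```python
from itertools import combinations


def order_isomorphic(a, b):
    """True if sequences a and b of equal length have the same relative order."""
    k = len(a)
    return all(
        (a[i] < a[j]) == (b[i] < b[j]) for i in range(k) for j in range(i + 1, k)
    )


def matches(pi, sigma):
    """True if pi contains a subsequence that is order-isomorphic to sigma."""
    k = len(sigma)
    return any(
        order_isomorphic([pi[i] for i in idx], sigma)
        for idx in combinations(range(len(pi)), k)
    )
```

A permutation avoids a pattern precisely when matches returns False; for instance, the identity permutation avoids both $213$ and $231$, whereas 3142 contains both.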
Submitted 5 November, 2015;
originally announced November 2015.
-
Enumerating Cyclic Orientations of a Graph
Authors:
Alessio Conte,
Roberto Grossi,
Andrea Marino,
Romeo Rizzi
Abstract:
Acyclic and cyclic orientations of an undirected graph have been widely studied for their importance: an orientation is acyclic if it assigns a direction to each edge so as to obtain a directed acyclic graph (DAG) with the same vertex set; it is cyclic otherwise. As far as we know, only the enumeration of acyclic orientations has been addressed in the literature. In this paper, we pose the problem of efficiently enumerating all the \emph{cyclic} orientations of an undirected connected graph with $n$ vertices and $m$ edges, observing that it cannot be solved using algorithmic techniques previously employed for enumerating acyclic orientations. We show that the problem is of independent interest from both combinatorial and algorithmic points of view, and that each cyclic orientation can be listed with $\tilde{O}(m)$ delay time. Space usage is $O(m)$ with an additional setup cost of $O(n^2)$ time before the enumeration begins, or $O(mn)$ with a setup cost of $\tilde{O}(m)$ time.
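For small graphs the objects being enumerated can be listed by brute force: try all $2^m$ orientations and keep those containing a directed cycle. The Python sketch below does this with a topological-order test; the names has_directed_cycle and cyclic_orientations are hypothetical, vertices are assumed to be numbered 0 to n-1, and the approach has none of the delay guarantees claimed in the abstract, since arbitrarily many acyclic orientations may be inspected between two outputs.

```python
from itertools import product


def has_directed_cycle(n, arcs):
    """True if the digraph on vertices 0..n-1 with the given arcs has a cycle."""
    indeg = [0] * n
    out = [[] for _ in range(n)]
    for u, v in arcs:
        out[u].append(v)
        indeg[v] += 1
    # Repeatedly remove vertices of in-degree zero (Kahn's procedure).
    stack = [v for v in range(n) if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed < n  # leftover vertices lie on or behind a cycle


def cyclic_orientations(n, edges):
    """Yield every orientation of the undirected edges that contains a cycle."""
    for flips in product((False, True), repeat=len(edges)):
        arcs = [(v, u) if flip else (u, v) for (u, v), flip in zip(edges, flips)]
        if has_directed_cycle(n, arcs):
            yield arcs
```

On a triangle, 6 of the 8 orientations are acyclic, so exactly 2 cyclic orientations are produced; on the complete graph with 4 vertices, 24 of the 64 orientations are acyclic, leaving 40 cyclic ones.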
Submitted 19 June, 2015;
originally announced June 2015.
-
An Improved Upper Bound on Maximal Clique Listing via Rectangular Fast Matrix Multiplication
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
The first output-sensitive algorithm for the Maximal Clique Listing problem was given by Tsukiyama et al. in 1977. Like any algorithm falling within the Reverse Search paradigm, it performs a DFS visit of a directed tree (the RS-tree) having the objects to be listed (i.e. maximal cliques) as its nodes. In a recursive implementation, the RS-tree corresponds to the recursion tree of the algorithm. The time delay is given by the cost of generating the next child of a node, and Tsukiyama et al. showed it is $O(mn)$. In 2004, Makino and Uno sharpened the time delay to $O(n^ω)$ by generating all the children of a node in one single shot performed by computing a \emph{square} fast matrix multiplication. In this paper, we further improve the asymptotics for the exploration of the same RS-tree by grouping the offsprings' computation even further. Our idea is to rely on rectangular fast matrix multiplication in order to compute all children of $n^2$ nodes in one shot. According to the current upper bounds on fast matrix multiplication, with this the time delay improves from $O(n^{2.3728639})$ to $O(n^{2.093362})$.
Submitted 24 August, 2015; v1 submitted 2 June, 2015;
originally announced June 2015.
-
Dynamic Consistency of Conditional Simple Temporal Networks via Mean Payoff Games: a Singly-Exponential Time DC-Checking
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
Conditional Simple Temporal Network (CSTN) is a constraint-based graph-formalism for conditional temporal planning. It offers a more flexible formalism than the equivalent CSTP model of Tsamardinos, Vidal and Pollack, from which it was derived mainly as a sound formalization. Three notions of consistency arise for CSTNs and CSTPs: weak, strong, and dynamic. Dynamic consistency is the most interesting notion, but it is also the most challenging and it was conjectured to be hard to assess. Tsamardinos, Vidal and Pollack gave a doubly-exponential time algorithm for deciding whether a CSTN is dynamically-consistent and to produce, in the positive case, a dynamic execution strategy of exponential size. In the present work we offer a proof that deciding whether a CSTN is dynamically-consistent is coNP-hard and provide the first singly-exponential time algorithm for this problem, also producing a dynamic execution strategy whenever the input CSTN is dynamically-consistent. The algorithm is based on a novel connection with Mean Payoff Games, a family of two-player combinatorial games on graphs well known for having applications in model-checking and formal verification. The presentation of such connection is mediated by the Hyper Temporal Network model, a tractable generalization of Simple Temporal Networks whose consistency checking is equivalent to determining Mean Payoff Games. In order to analyze the algorithm we introduce a refined notion of dynamic-consistency, named ε-dynamic-consistency, and present a sharp lower bounding analysis on the critical value of the reaction time $\hat{\varepsilon}$ where the CSTN transits from being, to not being, dynamically-consistent. The proof technique introduced in this analysis of $\hat{\varepsilon}$ is applicable more generally when dealing with linear difference constraints which include strict inequalities.
Submitted 17 July, 2015; v1 submitted 4 May, 2015;
originally announced May 2015.
-
An Improved Pseudo-Polynomial Upper Bound for the Value Problem and Optimal Strategy Synthesis in Mean Payoff Games
Authors:
Carlo Comin,
Romeo Rizzi
Abstract:
In this work we offer an $O(|V|^2 |E|\, W)$ pseudo-polynomial time deterministic algorithm for solving the Value Problem and Optimal Strategy Synthesis in Mean Payoff Games. This improves by a factor $\log(|V|\, W)$ the best previously known pseudo-polynomial time upper bound due to Brim et al. The improvement hinges on a suitable characterization of values, and a description of optimal positional strategies, in terms of reweighted Energy Games and Small Energy-Progress Measures.
Submitted 24 April, 2016; v1 submitted 15 March, 2015;
originally announced March 2015.
-
Hyper Temporal Networks
Authors:
Carlo Comin,
Roberto Posenato,
Romeo Rizzi
Abstract:
Simple Temporal Networks (STNs) provide a powerful and general tool for representing conjunctions of maximum delay constraints over ordered pairs of temporal variables. In this paper we introduce Hyper Temporal Networks (HyTNs), a strict generalization of STNs, to overcome the limitation of considering only conjunctions of constraints while maintaining a practical efficiency in the consistency check of the instances. In a Hyper Temporal Network a single temporal hyperarc constraint may be defined as a set of two or more maximum delay constraints which is satisfied when at least one of these delay constraints is satisfied. HyTNs are meant as a light generalization of STNs offering an interesting compromise. On one side, there exist practical pseudo-polynomial time algorithms for checking consistency and computing feasible schedules for HyTNs. On the other side, HyTNs offer a more powerful model accommodating natural constraints that cannot be expressed by STNs, like "Trigger off exactly delta min before (after) the occurrence of the first (last) event in a set", which are used to represent synchronization events in some process aware information systems/workflow models proposed in the literature.
Submitted 22 March, 2017; v1 submitted 13 March, 2015;
originally announced March 2015.
-
On the complexity of the vector connectivity problem
Authors:
Ferdinando Cicalese,
Martin Milanič,
Romeo Rizzi
Abstract:
We study a relaxation of the Vector Domination problem called Vector Connectivity (VecCon). Given a graph $G$ with a requirement $r(v)$ for each vertex $v$, VecCon asks for a minimum cardinality set $S$ of vertices such that every vertex $v\in V\setminus S$ is connected to $S$ via $r(v)$ disjoint paths. In the paper introducing the problem, Boros et al. [Networks, 2014] gave polynomial-time solutions for VecCon in trees, cographs, and split graphs, and showed that the problem can be approximated in polynomial time on $n$-vertex graphs to within a factor of $\log n+2$, leaving open the question of whether the problem is NP-hard on general graphs. We show that VecCon is APX-hard in general graphs, and NP-hard in planar bipartite graphs and in planar line graphs. We also generalize the polynomial result for trees by solving the problem for block graphs.
Submitted 8 December, 2014;
originally announced December 2014.
-
Efficiently listing bounded length st-paths
Authors:
Romeo Rizzi,
Gustavo Sacomoto,
Marie-France Sagot
Abstract:
The problem of listing the $K$ shortest simple (loopless) $st$-paths in a graph has been studied since the early 1960s. For a non-negatively weighted graph with $n$ vertices and $m$ edges, the most efficient solution is an $O(K(mn + n^2 \log n))$ algorithm for directed graphs by Yen and Lawler [Management Science, 1971 and 1972], and an $O(K(m+n \log n))$ algorithm for the undirected version by Katoh et al. [Networks, 1982], both using $O(Kn + m)$ space. In this work, we consider a different parameterization for this problem: instead of bounding the number of $st$-paths output, we bound their length. For the bounded length parameterization, we propose new non-trivial algorithms matching the time complexity of the classic algorithms but using only $O(m+n)$ space. Moreover, we provide a unified framework such that the solutions to both parameterizations -- the classic $K$-shortest and the new length-bounded paths -- can be seen as two different traversals of the same tree, a Dijkstra-like and a DFS-like traversal, respectively.
Submitted 25 November, 2014;
originally announced November 2014.
-
Amortized $\tilde{O}(|V|)$-Delay Algorithm for Listing Chordless Cycles in Undirected Graphs
Authors:
Rui Ferreira,
Roberto Grossi,
Romeo Rizzi,
Gustavo Sacomoto,
Marie-France Sagot
Abstract:
Chordless cycles are very natural structures in undirected graphs, with an important history and distinguished role in graph theory. Motivated also by previous work on the classical problem of listing cycles, we study how to list chordless cycles. The best known solution to list all the $C$ chordless cycles contained in an undirected graph $G = (V,E)$ takes $O(|E|^2 +|E|\cdot C)$ time. In this paper we provide an algorithm taking $\tilde{O}(|E| + |V|\cdot C)$ time. We also show how to obtain the same complexity for listing all the $P$ chordless $st$-paths in $G$ (where $C$ is replaced by $P$).
Submitted 6 August, 2014;
originally announced August 2014.
-
An External-Memory Algorithm for String Graph Construction
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Yuri Pirola,
Marco Previtali,
Raffaella Rizzi
Abstract:
Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from Bioinformatics, where a large number of short strings (called reads) are routinely produced and analyzed. In that field, a fundamental problem is to assemble a genome from a large set of much shorter samples extracted from the unknown genome. The approaches that are currently used to tackle this problem are memory-intensive. This fact does not bode well with the ongoing increase in the availability of genomic data. A data structure that is used in genome assembly is the string graph, where vertices correspond to samples and arcs represent two overlapping samples. In this paper we address an open problem: to design an external-memory algorithm to compute the string graph.
Submitted 11 June, 2015; v1 submitted 29 May, 2014;
originally announced May 2014.
-
A Novel Combinatorial Method for Estimating Transcript Expression with RNA-Seq: Bounding the Number of Paths
Authors:
Alexandru I. Tomescu,
Anna Kuosmanen,
Romeo Rizzi,
Veli Mäkinen
Abstract:
RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. The problem is usually modeled by a weighted splicing graph whose nodes stand for exons and whose edges stand for split alignments to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the coverages of the graph under various fitness functions, such as least sum of squares. In (Tomescu et al. RECOMB-seq 2013) we showed that under general fitness functions, if we allow a polynomially bounded number of paths in an optimal solution, this problem can be solved in polynomial time by a reduction to a min-cost flow program. In this paper we further refine this problem by asking for a bounded number k of paths that optimally explain the splicing graph. This problem becomes NP-hard in the strong sense, but we give a fast combinatorial algorithm based on dynamic programming for it. In order to obtain a practical tool, we implement three optimizations and heuristics, which achieve better performance on real data, and similar or better performance on simulated data, than state-of-the-art tools Cufflinks, IsoLasso and SLIDE. Our tool, called Traph, is available at http://www.cs.helsinki.fi/gsa/traph/
Submitted 30 July, 2013;
originally announced July 2013.
-
Combinatorial decomposition approaches for efficient counting and random generation FPTASes
Authors:
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
Given a combinatorial decomposition for a counting problem, we resort to the simple scheme of approximating large numbers by floating-point representations in order to obtain efficient Fully Polynomial Time Approximation Schemes (FPTASes) for it. The number of bits employed for the exponent and the mantissa will depend on the error parameter $0 < \varepsilon \leq 1$ and on the characteristics of the problem. Accordingly, we propose the first FPTASes with $1 \pm \varepsilon$ relative error for counting and generating uniformly at random a labeled DAG with a given number of vertices. This is accomplished starting from a classical recurrence for counting DAGs, whose values we approximate by floating-point numbers.
After extending these results to other families of DAGs, we show how the same approach works also with problems where we are given a compact representation of a combinatorial ensemble and we are asked to count and sample elements from it. We employ here the floating-point approximation method to transform the classic pseudo-polynomial algorithm for counting 0/1 Knapsack solutions into a very simple FPTAS with $1 - \varepsilon$ relative error. Its complexity improves upon the recent result (Štefankovič et al., SIAM J. Comput., 2012), and, when $\varepsilon^{-1} = Ω(n)$, also upon the best-known randomized algorithm (Dyer, STOC, 2003). To show the versatility of this technique, we also apply it to a recent generalization of the problem of counting 0/1 Knapsack solutions in an arc-weighted DAG, obtaining a faster and simpler FPTAS than the existing one.
Submitted 15 November, 2013; v1 submitted 9 July, 2013;
originally announced July 2013.
-
Indexes for Jumbled Pattern Matching in Strings, Trees and Graphs
Authors:
Ferdinando Cicalese,
Travis Gagie,
Emanuele Giaquinta,
Eduardo Sany Laber,
Zsuzsanna Lipták,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
We consider how to index strings, trees and graphs for jumbled pattern matching when we are asked to return a match if one exists. For example, we show how, given a tree containing two colours, we can build a quadratic-space index with which we can find a match in time proportional to the size of the match. We also show how we need only linear space if we are content with approximate matches.
Submitted 19 April, 2013;
originally announced April 2013.
-
Odd 2-factored snarks
Authors:
M. Abreu,
D. Labbate,
R. Rizzi,
J. Sheehan
Abstract:
A {\em snark} is a cubic cyclically 4-edge connected graph with edge chromatic number four and girth at least five. We say that a graph $G$ is {\em odd 2-factored} if for each 2-factor F of G each cycle of F is odd.
In this paper, we present a method for constructing odd 2-factored snarks. In particular, we construct two families of odd 2-factored snarks that disprove a conjecture by some of the authors. Moreover, we approach the problem of characterizing odd 2-factored snarks, furnishing a partial characterization of cyclically 4-edge connected odd 2-factored snarks. Finally, we pose a new conjecture regarding odd 2-factored snarks.
Submitted 21 January, 2013; v1 submitted 30 October, 2012;
originally announced October 2012.
-
Set graphs. II. Complexity of set graph recognition and similar problems
Authors:
Martin Milanič,
Romeo Rizzi,
Alexandru I. Tomescu
Abstract:
A graph $G$ is said to be a `set graph' if it admits an acyclic orientation that is also `extensional', in the sense that the out-neighborhoods of its vertices are pairwise distinct. Equivalently, a set graph is the underlying graph of the digraph representation of a hereditarily finite set. In this paper, we continue the study of set graphs and related topics, focusing on computational complexity aspects. We prove that set graph recognition is NP-complete, even when the input is restricted to bipartite graphs with exactly two leaves. The problem remains NP-complete if, in addition, we require that the extensional acyclic orientation be also `slim', that is, that the digraph obtained by removing any arc from it is not extensional. We also show that the counting variants of the above problems are #P-complete, and prove similar complexity results for problems related to a generalization of extensional acyclic digraphs, the so-called `hyper-extensional digraphs', which were proposed by Aczel to describe hypersets. Our proofs are based on reductions from variants of the Hamiltonian Path problem. We also consider a variant of the well-known notion of a separating code in a digraph, the so-called `open-out-separating code', and show that it is NP-complete to determine whether an input extensional acyclic digraph contains an open-out-separating code of given size.
Submitted 31 July, 2012;
originally announced July 2012.
-
Optimal Listing of Cycles and st-Paths in Undirected Graphs
Authors:
Rui Ferreira,
Roberto Grossi,
Andrea Marino,
Nadia Pisanti,
Romeo Rizzi,
Gustavo Sacomoto
Abstract:
We present the first optimal algorithm for the classical problem of listing all the cycles in an undirected graph. We exploit their properties so that the total cost is the time taken to read the input graph plus the time to list the output, namely, the edges in each of the cycles. The algorithm uses a reduction to the problem of listing all the paths from a vertex s to a vertex t which we also solve optimally.
Submitted 5 July, 2012; v1 submitted 12 May, 2012;
originally announced May 2012.
-
Some results on more flexible versions of Graph Motif
Authors:
Romeo Rizzi,
Florian Sikora
Abstract:
The problems studied in this paper originate from Graph Motif, a problem introduced in 2006 in the context of biological networks. Informally speaking, it consists in deciding if a multiset of colors occurs in a connected subgraph of a vertex-colored graph. Due to the high rate of noise in the biological data, more flexible definitions of the problem have been outlined. We present in this paper two inapproximability results for two different optimization variants of Graph Motif: one where the size of the solution is maximized, the other where the number of substitutions of colors to obtain the motif from the solution is minimized. We also study a decision version of Graph Motif where the connectivity constraint is replaced by the well known notion of graph modularity. While the problem remains NP-complete, it allows algorithms in FPT for biologically relevant parameterizations.
Submitted 10 September, 2014; v1 submitted 23 February, 2012;
originally announced February 2012.
-
Reconstructing Isoform Graphs from RNA-Seq data
Authors:
Stefano Beretta,
Paola Bonizzoni,
Gianluca Della Vedova,
Raffaella Rizzi
Abstract:
Next-generation sequencing (NGS) technologies allow new methodologies for alternative splicing (AS) analysis. Current computational methods for AS from NGS data are mainly focused on predicting splice site junctions or de novo assembly of full-length transcripts. These methods are computationally expensive and produce a huge number of full-length transcripts or splice junctions, spanning the whole genome of organisms. Thus summarizing such data into the different gene structures and AS events of the expressed genes is a hard task.
To face this issue, in this paper we investigate the computational problem of reconstructing from NGS data, in the absence of the genome, a gene structure for each gene that is represented by the isoform graph: we introduce such a graph and we show that it uniquely summarizes the gene transcripts. We define the computational problem of reconstructing the isoform graph and provide some conditions that must be met to allow such reconstruction.
Finally, we describe an efficient algorithmic approach to solve this problem, validating our approach with both a theoretical and an experimental analysis.
Submitted 14 August, 2012; v1 submitted 30 July, 2011;
originally announced August 2011.
-
Pure Parsimony Xor Haplotyping
Authors:
Paola Bonizzoni,
Gianluca Della Vedova,
Riccardo Dondi,
Yuri Pirola,
Romeo Rizzi
Abstract:
The haplotype resolution from xor-genotype data has been recently formulated as a new model for genetic studies. The xor-genotype data is a cheaply obtainable type of data distinguishing heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Furthermore, we show that the problem has a polynomial time k-approximation, where k is the maximum number of xor-genotypes containing a given SNP. Finally, we propose a heuristic and produce an experimental analysis showing that it scales to real-world large instances taken from the HapMap project.
Submitted 8 January, 2010;
originally announced January 2010.