Motif inference represents one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favour of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to gain a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs s...
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP-array-based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNP discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool that finds SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.
An elastic-degenerate (ED) string is a sequence of $n$ sets of strings of total length $N$, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length $m$ in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an $\mathcal{O}(nm^{1.5}\sqrt{\log m} + N)$-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that $N$ is substantially larger than both $n$ and $m$, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatori...
Haplotype Assembly is an essential step in human genome analysis. Since the well-known MEC model for its solution is NP-hard, the problem is currently addressed with algorithms whose cost grows exponentially with the length of the DNA fragments obtained by the sequencing process. Technological improvements will reduce fragmentation, increase fragment length and make such computational costs worse. WHATSHAP is a recently proposed approach which moves complexity from fragment length to fragment overlap, improving the outlook for computational costs, but Haplotype Assembly still remains a demanding computational problem. Directions towards high-performance computing for Haplotype Assembly in future sequencing, based on a parallel WHATSHAP, are discussed in this paper. 1 Scientific Background. The human genome is diploid, i.e. each chromosome comes in two copies, each of which is a haploid chromosome coming from one of the two parents (one allele per parent). Single Nucleotide Polymor...
A generalised degenerate string (GD string) Ŝ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1, ..., k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings...
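To make the definitions above concrete, the following is a minimal Python sketch that models a GD string as a list of sets of equal-length strings and computes N and W. The function names (`gd_total_size`, `gd_width`, `gd_language`, `intersects_bruteforce`) are hypothetical, chosen for illustration only. The intersection test shown here is a brute-force enumeration that may take exponential time; it is not the O(N+M)-time algorithm of the paper and only illustrates what "non-empty intersection" means.

```python
from itertools import product


def gd_total_size(gd):
    """N: the total length of all strings in all sets."""
    return sum(len(s) for position in gd for s in position)


def gd_width(gd):
    """W: the sum of the per-set string lengths k_0, ..., k_{n-1}."""
    width = 0
    for position in gd:
        lengths = {len(s) for s in position}
        if len(lengths) != 1:
            raise ValueError("each set must be non-empty with strings of equal length")
        width += lengths.pop()
    return width


def gd_language(gd):
    """All strings spelled by picking one string from each set (may be exponential)."""
    return {"".join(choice) for choice in product(*gd)}


def intersects_bruteforce(a, b):
    """Brute-force check: do the two GD strings share at least one string?"""
    return bool(gd_language(a) & gd_language(b))


a = [{"AC", "AG"}, {"T"}, {"GA", "CA"}]
b = [{"A"}, {"CT", "GA"}, {"GA", "TT"}]
print(gd_total_size(a), gd_width(a), intersects_bruteforce(a, b))
```

In this toy input both GD strings have width 5, and the only string they share is ACTGA.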
Motif inference is at the heart of several time-demanding computational tasks, such as in molecular biology, data mining and identification of structured motifs in sequences, and in data compression, to name a few. In this scenario, a motif is a pattern that must appear at least a certain number of times (the quorum) to be of interest. The pattern can be approximated in that some of its characters can be left unspecified (the don’t cares). Motif inference is not aimed at searching for a given pattern but, rather, at discovering all the possible patterns that appear as motifs in the given input string. The combinatorial explosion of these patterns makes their discovery an exponential-time computation. For this reason, the notion of basis has been recently introduced to succinctly represent all of them within reasonable time and space bounds. The goal of the paper is to shed light on the state of the art for this emerging field and to add further properties to what is currently known.
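The notions of don't care and quorum described above can be illustrated with a small Python sketch. It only checks whether one given pattern is a motif; it does not perform motif inference or compute a basis. The function names and the choice of `.` as the don't-care symbol are assumptions made for this example.

```python
def occurrences(pattern, text, dont_care="."):
    """Start positions where pattern matches text; dont_care matches any character."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == dont_care or p == c for p, c in zip(pattern, text[i:i + m]))
    ]


def is_motif(pattern, text, quorum, dont_care="."):
    """A pattern is a motif if it occurs at least `quorum` times in text."""
    return len(occurrences(pattern, text, dont_care)) >= quorum


text = "ABCAXCAC"
print(occurrences("A.C", text))   # positions 0 and 3
print(is_motif("A.C", text, 2))   # quorum 2 is met
```

Enumerating every candidate pattern and testing it this way is what leads to the combinatorial explosion mentioned in the abstract.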
Identifying local similarity between two or more sequences, or identifying repetitions occurring at least twice in a sequence, is an essential part of the analysis of biological sequences and of their phylogenetic relationship. Finding fragments that are conserved among several given sequences, or inside a unique sequence, while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods usually cannot be applied in practice. The filter we introduce in this paper, called Ed'Nimbus, provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment method, eliminating an important fraction of the input that is guaranteed not to contain any approximate repetition. It consists in the verification of a strong necessary condition. This condition concerns the number and order of exactly repeated words shared by the approximate repetitions....
We define a new problem in comparative genomics, denoted PQ-Tree Search, that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer parameters d_T and d_S, and a new genome S. The objective is to identify in S approximate new instances of the gene cluster that could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by d_T and d_S, respectively. We prove that the PQ-Tree Search problem is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-Tree Search in O*(2^γ) time, where γ is the maximum degree of a node in T and O* is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, w...
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential ti...
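For reference, the classic dynamic programming algorithm for edit distance that the abstract mentions can be sketched in Python as follows. This is the textbook unit-cost recurrence, not the modified algorithm that solves ETFS; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions) between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has (|a|+1)(|b|+1) cells, each filled in constant time, which gives the quadratic running time that the conditional lower bound in the abstract relates to.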
Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. Here we attempt to close this gap by proposing two efficient algorithms for this problem. Notably, one of the algorithms requires time linear in the size of the texts' representation, for short patterns. Furthermore, experimental results confirm our theoretical findings in practical terms.
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensit...
Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string Ŝ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote by W the sum of these lengths k_0, k_1, ..., k_{n-1}. Our main result is an 𝒪(N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing a...
Papers by Nadia Pisanti