This software is an implementation of the algorithms described in [1] to compute all-pairs of Burrows-Wheeler similarity distributions (BWSD) for a string collection.
Given a collection of d strings, bwsd computes a matrix Mdxd with all pairs of BWSD-based distances.
The Burrows-Wheeler transform (BWT) and the Document array (DA) are computed using gsaca-k [2].
git clone https://github.com/felipelouza/bwsd
cd bwsd
make compile
Note: our implementaion needs sdsl-lite.
Given a collection of d strings in a single file FILE.
./bwsd [options] FILE d
Available options:
-h this help message
-A a preferred algorithm to use (default is alg. 1 BIT_sd)
-B b BWSD-based distance to compute, options: 1. expectation (default), 2. Shannon entropy
-T t use t parallel threads (default 1)
-o write output matrix to FILE.output.bin
-p print the output matrix (for debug)
-v verbose output
Notes:
- d=0 gives all strings as input.
- Supported extensions are .txt, .fasta and .fastq.
- If the input is changed, please run
make remove DIR=dataset/
, to rebuild the BWTs.
To run a test with d=10 strings from dataset/input.100.txt using Alg. 1 -A 1
, writing the output to disk option -o
, type:
./bwsd dataset/input.100.txt 10 -A 1 -o
Note: output matrix is written to FILE.output.bin
To see the resulting Mdxd, computing distances based on the Shannon entropy of BWSD option -B 2
, use option -p
:
./bwsd dataset/input.100.txt 10 -A 1 -o -B 2 -p
Result:
## BWSD_BIT_sd ##
writing 360 bytes to: input.100.txt.output.bin
0.000 0.000 0.684 1.459 2.156 2.128 1.750 2.156 1.896 1.299
0.000 0.000 0.684 1.459 2.156 2.128 1.750 2.156 1.896 1.299
0.684 0.684 0.000 1.614 1.811 1.224 1.750 1.750 1.614 1.299
1.459 1.459 1.614 0.000 2.250 2.059 2.156 2.156 0.959 1.750
2.156 2.156 1.811 2.250 0.000 2.322 0.944 2.020 2.522 1.686
2.128 2.128 1.224 2.059 2.322 0.000 1.792 2.252 1.961 1.459
1.750 1.750 1.750 2.156 0.944 1.792 0.000 1.627 2.522 1.392
2.156 2.156 1.750 2.156 2.020 2.252 1.627 0.000 1.753 1.506
1.896 1.896 1.614 0.959 2.522 1.961 2.522 1.753 0.000 2.000
1.299 1.299 1.299 1.750 1.686 1.459 1.392 1.506 2.000 0.000
Notes:
- We compute only the (d2-d)/2 entries of Mdxd (upper triangular matrix), which can be accessed as here
Using uncompressed bitvectors (bit_vector):
make SD_VECTOR=0
./bwsd [options] FILE d -A 1
Using wavelet tree:
make WT=1
./bwsd [options] FILE d -A 1
Note: Alg. 1 uses compressed bitvectors (BIT_sd) as default.
To see a more detailed execution use:
make clean
make DEBUG=1
Please, if you use this tool in an academic setting cite the following paper [1]:
@article{LouzaTGZ19,
author = {Louza, Felipe A. and Telles, Guilherme P. and Gog, Simon and Zhao, Liang},
title = {Algorithms to compute the Burrows-Wheeler Similarity Distribution},
journal = {Theor. Comput. Sci.},
volume = {782},
pages = {145-156},
year = {2019},
url = {https://www.sciencedirect.com/science/article/pii/S0304397519301653},
doi = {10.1016/j.tcs.2019.03.012},
}
We have used the following datasets in our experiments: https://github.com/felipelouza/bwsd/blob/master/experiments/dataset.tar.gz
[1] Louza, Felipe A., Telles, Guilherme P., Gog, Simon, Liang Zhao (2019): Algorithms to compute the Burrows-Wheeler Similarity Distribution. Theor. Comput. Sci. 782: 145-156 [ElsevierLink]
[2] Louza, Felipe A., Gog, Simon, Telles, Guilherme P. (2017). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github.