Blast User Manual
BLAST Help
Christiam Camacho
NCBI
camacho@ncbi.nlm.nih.gov
Thomas Madden
NCBI
madden@ncbi.nlm.nih.gov
George Coulouris
NCBI
coulouri@ncbi.nlm.nih.gov
Vahram Avagyan
NCBI
avagyanv@ncbi.nlm.nih.gov
Ning Ma
NCBI
maning@ncbi.nlm.nih.gov
Tao Tao
NCBI
tao@ncbi.nlm.nih.gov
Richa Agarwala
NCBI
richa@ncbi.nlm.nih.gov
1. Introduction
This manual documents the BLAST (Basic Local Alignment Search Tool) command line
applications developed at the National Center for Biotechnology Information (NCBI). These
applications have been revamped to provide an improved user interface, new features, and
performance improvements compared to their counterparts in the NCBI C Toolkit. Hereafter we
shall distinguish the C Toolkit BLAST command line applications from these command line
applications by referring to the latter as the BLAST+ applications, which have been developed
using the NCBI C++ Toolkit
(http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=toolkit.TOC&depth=2).
Please feel free to contact us with any questions, feedback, or bug reports at
blast-help@ncbi.nlm.nih.gov.
2. Installation
The BLAST+ applications are distributed in executable and source code format. For the
executable formats we provide installers as well as tarballs; the source code is only provided
as a tarball. These are freely available at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/.
Please be sure to use the most recent available version; this will be indicated in the file name
(for instance, in the sections below, version 2.2.18 is listed, but this should be replaced
accordingly).
2.1 Windows
Download the executable installer ncbi-blast-2.2.18+.exe and double click on it. After
accepting the license agreement, select the install location and click “Install” and then “Close”.
2.2 MacOSX
For users without administrator privileges: Download the
ncbi-blast-2.2.18+-universal-macosx.tar.gz tarball and follow the procedure described in Other
Unix platforms.
For users with administrator privileges and machines running MacOSX version 10.5 or higher:
Download the ncbi-blast-2.2.18+.dmg installer and double click on it. Double click the newly
mounted ncbi-blast-2.2.18+ volume, double click on ncbi-blast-2.2.18+.pkg and follow the
instructions in the installer. By default the BLAST+ applications are installed in
/usr/local/ncbi/blast, overwriting its previous contents (an uninstaller is provided, and using it
is recommended when upgrading a BLAST+ installation).
Install:
rpm -ivh ncbi-blast-2.2.18-1.x86_64.rpm
Upgrade:
rpm -Uvh ncbi-blast-2.2.18-1.x86_64.rpm
Note: one must have root privileges to run these commands. If you do not have root privileges,
please use the procedure described in Other Unix platforms.
cd c++
./configure --without-debug --with-mt --with-build-root=ReleaseMT
cd ReleaseMT/build
make all_r
In Windows, extract the tarball and open the appropriate MSVC solution or project file (e.g.:
c++\compilers\msvc800_prj\static\build), build the -CONFIGURE- project, click on “Reload”
when prompted by the development environment, and then build the -BUILD-ALL- project.
The compiled executables will be found in the directory corresponding to the build
configuration selected (e.g.: c++\compilers\msvc800_prj\static\bin\debugdll).
3. Quick start
3.1 For users of NCBI C Toolkit BLAST
The easiest way to get started using these command line applications is by means of the
legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize
this script, simply prefix it to the invocation of the C toolkit BLAST command line application
and append the --path option pointing to the installation directory of the BLAST+ applications.
use
For more details, refer to the section titled Backwards compatibility script.
This script will download multiple tar files for each BLAST database volume if necessary,
without having to designate each volume. For example:
./update_blastdb.pl htgs
will download all the relevant HTGs tar files (htgs.00.tar.gz, …, htgs.N.tar.gz).
The script can also compare your local copy of the database tar file(s) and only download tar
files if the date stamp has changed, reflecting a newer version of the database. This allows
the script to run on a schedule and only download tar files when needed. Documentation for the
update_blastdb.pl script can be obtained by running the script without any arguments (Perl is
required).
4. User manual
4.1 Functionality offered by BLAST+ applications
The functionality offered by the BLAST+ applications has been organized by program type,
so as to more closely resemble Web BLAST. The following graph depicts a correspondence
between the NCBI C Toolkit BLAST command line applications and the BLAST+
applications:
As an example, to run a search of a nucleotide query (translated “on the fly” by BLAST) against
a protein database one would use the blastx application instead of blastall. The blastx
application will also work in “Blast2Sequences” mode (i.e.: accept FASTA sequences instead
of a BLAST database as targets) and can also send BLAST searches over the network to the
public NCBI server if desired.
The blastn, blastp, blastx, tblastx, tblastn, psiblast, rpsblast, and rpstblastn applications are
considered search applications, as they execute a BLAST search, whereas makeblastdb,
blastdb_aliastool, and blastdbcmd are considered BLAST database applications, as they either
create or examine BLAST databases.
There is also a new set of sequence filtering applications described in the section Sequence
filtering applications and an application to build database indices that greatly speed up
Please note that the NCBI C Toolkit applications seedtop and blastclust are not available in
this release.
4.2.1 best_hit_overhang: Overhang value for Best-Hit algorithm. For more details, see the
section Best-Hits filtering algorithm.
4.2.2 best_hit_score_edge: Score edge value for Best-Hit algorithm. For more details, see the
section Best-Hits filtering algorithm.
4.2.3 db: File name of BLAST database to search the query against. Unless an absolute path
is used, the database will be searched relative to the current working directory first, then relative
to the value specified by the BLASTDB environment variable, then relative to the BLASTDB
configuration value specified in the configuration file.
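The lookup order described for -db can be sketched in a few lines of Python. This is only an illustration of the order of precedence, not the code BLAST+ uses: the function name and the directories are hypothetical, and each of BLASTDB (environment) and BLASTDB (configuration file) is assumed to hold a single directory.

```python
from pathlib import PurePosixPath


def candidate_db_paths(db_name, cwd, blastdb_env=None, blastdb_config=None):
    """List the paths to try for db_name, in the documented order."""
    db = PurePosixPath(db_name)
    if db.is_absolute():
        # An absolute path is used as given; no search takes place.
        return [db]
    # Relative names: current working directory first, then the BLASTDB
    # environment variable, then the BLASTDB configuration value.
    search_dirs = [cwd, blastdb_env, blastdb_config]
    return [PurePosixPath(d) / db for d in search_dirs if d]


paths = candidate_db_paths(
    "nt",
    cwd="/work/run1",               # hypothetical directories
    blastdb_env="/data/blastdb",
    blastdb_config="/opt/ncbi/db",
)
for path in paths:
    print(path)
```

The first candidate at which a database is actually found would be the one searched; later entries only matter when the earlier ones are missing.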
4.2.6 db_soft_mask: Filtering algorithm ID to apply to the database as soft masking for subject
sequences. The algorithm IDs for a given BLAST database can be obtained by invoking
blastdbcmd with its -info flag (only shown if such filtering in the BLAST database is available).
For more details see the section Masking in BLAST databases.
4.2.7 culling_limit: Ensures that no more than the specified number of HSPs are aligned to
the same part of the query. This option was designed for searches with a lot of repetitive
matches, but if possible it is probably more efficient to mask the query to remove the repetitive
sequences.
4.2.8 entrez_query: Restrict the search of the BLAST database to the results of the Entrez
query provided.
4.2.10 export_search_strategy: Name of the file in which to save the search strategy (see section
titled BLAST search strategies).
4.2.13 gilist: File containing a list of GIs to restrict the BLAST database to search. The expect
values in the BLAST results are based upon the sequences actually searched and not on the
underlying database.
4.2.16 html: Enables the generation of HTML output suitable for viewing in a web browser.
4.2.17 import_search_strategy: Name of the file from which to read the search strategy to execute
(see section titled BLAST search strategies).
4.2.20 max_target_seqs: Maximum number of aligned sequences to keep from the BLAST
database.
4.2.21 negative_gilist: File containing a list of GIs to exclude from the BLAST database.
4.2.25 out: Name of the file to write the application’s output. Defaults to stdout.
4.2.26 outfmt: Allows for the specification of the search application’s output format. A listing
of the possible format types is available via the search application’s -help option. If a custom
output format is desired, this can be specified by providing a quoted string composed of the
desired output format (tabular, tabular with comments, or comma-separated value), a space,
and a space delimited list of output specifiers. The list of supported output specifiers is available
via the -help command line option. Unsupported output specifiers will be ignored. The argument
should be enclosed in double quotes if there are spaces in the output format specification
(e.g.: -outfmt "7 sseqid sacc qstart qend sstart send qseq evalue bitscore").
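When the search is launched from a script, the quoting requirement can be handled by building the format string once and passing it as a single argument. The sketch below assumes Python's standard shlex module; the helper name custom_outfmt is hypothetical, and the format number and specifiers are the ones used in the tabular example in the cookbook.

```python
import shlex


def custom_outfmt(format_type, specifiers):
    """Join the format number and the output specifiers with single spaces."""
    return " ".join([str(format_type), *specifiers])


fmt = custom_outfmt(7, ["qacc", "sacc", "evalue", "qstart", "qend", "sstart", "send"])

# Each list element is one argument, so the spaces inside fmt need no extra
# quoting when the list is handed to subprocess.run().
argv = ["blastn", "-db", "ecoli", "-outfmt", fmt]

# shlex.join() shows a shell-safe rendering of the same command line.
print(shlex.join(argv))
```

shlex.join renders the quoted argument with single quotes, which a POSIX shell treats the same way as the double quotes used elsewhere in this manual.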
4.2.28 query: Name of the file containing the query sequence(s), or ‘-’ if these are provided
on standard input.
4.2.29 query_loc: Location of the first query sequence to search in 1-based offsets (Format:
start-stop).
4.2.30 remote: Instructs the application to submit the search to NCBI for remote execution.
4.2.34 soft_masking: Apply filtering locations as soft masks (i.e.: only when finding alignment
seeds).
4.2.35 subject: Name of the file containing the subject sequence(s) to search.
4.2.36 subject_loc: Location of the first subject sequence to search in 1-based offsets (Format:
start-stop).
4.2.38 threshold: Minimum word score such that the word is added to the BLAST lookup
table.
4.2.41 window_size: Size of the window for multiple hits algorithm, use 0 to specify 1-hit
algorithm.
4.2.43 xdrop_gap: X-dropoff value (in bits) for preliminary gapped extensions.
4.2.44 xdrop_gap_final: X-dropoff value (in bits) for final gapped alignment.
The legacy_blast.pl script supports two modes of operation, one in which the C Toolkit BLAST
command line invocation is converted and executed on behalf of the user and another which
solely displays the BLAST+ application equivalent to what was provided, without executing
the command.
The first mode of operation is achieved by specifying the C Toolkit BLAST command line
application invocation and optionally providing the --path argument after the command line to
convert if the installation path for the BLAST+ applications differs from the default (available
by invoking the script without arguments). See example in the first section of the Quick start.
The second mode of operation is achieved by specifying the C Toolkit BLAST command line
application invocation and appending the --print_only command line option as follows:
All BLAST+ applications have consistent exit codes to signify the exit status of the application.
The possible exit codes along with their meaning are detailed in the table below:
0 Success
4 Out of memory
In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success)
and 1 (indicating failure).
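A wrapper script can act on these exit codes. The following is a minimal sketch, assuming Python's standard subprocess module; only the two codes listed above (0 and 4) are mapped, and the child process is a stand-in for a search application so that the sketch runs without BLAST+ installed. The function names are hypothetical.

```python
import subprocess
import sys

# Only the exit codes listed in this section; any other code is reported as is.
SEARCH_EXIT_CODES = {0: "Success", 4: "Out of memory"}


def describe_exit(code):
    """Translate an exit code into the meaning given in the table above."""
    return SEARCH_EXIT_CODES.get(code, f"Exit code {code} (not listed here)")


def run_and_report(argv):
    """Run a command and return its exit code with a short description."""
    completed = subprocess.run(argv)
    return completed.returncode, describe_exit(completed.returncode)


# Stand-in for a search application: a child process that exits with code 4.
demo = [sys.executable, "-c", "import sys; sys.exit(4)"]
print(run_and_report(demo))
```

In real use, demo would be replaced by the argument list of the BLAST+ search application being run.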
4.5.2 Tasks
The concept of tasks has been added to support the notion of commonly performed tasks via
the -task command line option in blastn and blastp. The following tasks are currently available:
megablast      Traditional megablast used to find very similar (e.g., intraspecies or closely
               related species) sequences
dc-megablast   Discontiguous megablast used to find more distant (e.g., interspecies) sequences
Improvements to the BLAST database reading module allow it to fetch only the relevant
portions of the subject sequence that are needed in the gapped alignment stage, providing a
substantial improvement in runtime. The following example compares 103 mouse EST
sequences against the human genome (example run in July 2008 after the database had
already been loaded into memory):
Similar gains in performance should be expected when searching BLAST databases which contain
very large sequences with many very short queries.
BLAST search strategies are files which encode the inputs necessary to perform a BLAST
search. The purpose of these files is to be able to seamlessly reproduce a BLAST search in
various environments (Web BLAST, command line applications, etc.).
Click on "download" next to the RID/saved strategy in the "Recent Results" or "Saved
Strategies" tabs.
Add the -export_search_strategy option along with a file name to the command line options.
Go to the "Saved Strategies" tab, click on "Browse" to select your search strategy file, then
click on "View" to load it into the submission page.
Add the -import_search_strategy option along with the name of the file containing the search
strategy.
Note that if provided, the -query, -db, -use_index, and -index_name command line options
will override the specifications of the search strategy file provided (no other command line
options will override the contents of the search strategy file).
cookbook.
The BLAST+ applications are available via Windows and MacOSX installers as well as RPMs
(source and binary) and unix tarballs. For more details about these, refer to the installation
section.
information).
Additional requirements that must also be met in order to filter A on account of B are:
one per line) in a file as the input to the -query and -subject command line options.
Upon encountering this type of input, by default the BLAST+ search applications will try to
resolve these sequence identifiers in locally available BLAST databases first, then in the
BLAST databases at NCBI, and finally in Genbank (the latter two data sources require a
properly configured internet connection). These data sources can be configured via the
DATA_LOADERS configuration option and the BLAST databases to search can be configured
via the BLASTDB_PROT_DATA_LOADER and BLASTDB_NUCL_DATA_LOADER
configuration options (see the section on Configuring BLAST).
4.6.1.3 use_sw_tback: Instead of using the X-dropoff gapped alignment algorithm, use Smith-
Waterman to compute locally optimal alignments
4.6.2 blastn
4.6.2.1 task: Specify the task to execute. For more details, refer to the section on tasks.
4.6.2.8 filtering_db: Name of BLAST database containing filtering elements (i.e.: repeats)
4.6.2.15 off_diagonal_range: Maximum number of diagonals separating two hits used to initiate
an extension. Increasing values of this parameter lead to a longer run time, but more sensitive
results. If this parameter is set, a value of five is suggested. Only discontiguous megablast uses
two hits by default.
4.6.3 blastx
4.6.3.1 query_gencode: Genetic code to use to translate the query sequence(s).
4.6.3.2 frame_shift_penalty: Frame shift penalty for use with out-of-frame gapped alignments
4.6.4 tblastx
4.6.4.1 db_gencode: Genetic code to use to translate database/subjects.
4.6.5 tblastn
4.6.5.1 db_gencode: Identical to tblastx.
4.6.6 psiblast
4.6.6.1 comp_based_stats: Identical to blastp with the exception that only composition based
statistics mode 1 is valid when a PSSM is the input (either when restarting from a checkpoint
file or when performing multiple PSI-BLAST iterations).
4.6.6.5 out_pssm: Name of the checkpoint file in which to store the PSSM.
4.6.6.7 in_msa: Name of the file containing multiple sequence alignment to restart PSI-BLAST.
4.6.7 rpstblastn
4.6.7.1 query_gencode: Identical to blastx.
4.6.8 makeblastdb
This application serves as a replacement for formatdb.
4.6.8.1 in: Input file or BLAST database name to use as source; the data type is automatically
detected. Note that multiple input files/BLAST databases can be provided; each must be
separated by white space in a string quoted with single quotation marks. Names of input files/
BLAST databases which contain white space should be quoted with double quotation
marks inside the white space-separated, single quoted string
(e.g.: -in ‘“C:\My Documents\seqs.fsa” “E:\Users\Joe Smith\myfasta.fsa”’).
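The nested quoting described for -in can be produced programmatically. The sketch below is an illustration under stated assumptions: the helper name is hypothetical, and the value is meant to be passed as one element of an argument list (for example to subprocess.run), which plays the role of the outer single quotation marks needed on a shell command line.

```python
def makeblastdb_in_value(paths):
    """Build the value for -in from several input file or database names."""
    parts = []
    for path in paths:
        if any(ch.isspace() for ch in path):
            # Names containing white space get inner double quotation marks.
            parts.append(f'"{path}"')
        else:
            parts.append(path)
    # The individual names are separated by white space.
    return " ".join(parts)


value = makeblastdb_in_value(
    [r"C:\My Documents\seqs.fsa", r"E:\Users\Joe Smith\myfasta.fsa"]
)
# One list element per argument; the remaining makeblastdb options are omitted.
argv = ["makeblastdb", "-in", value]
print(value)
```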
4.6.8.3 parse_seqids: Parse the Seq-id(s) in the FASTA input provided. Please note that this
option should be provided consistently among the various applications involved in creating
BLAST databases. For instance, the filtering applications as well as convert2blastmask should
use this option if makeblastdb uses it also.
4.6.8.4 hash_index: Enables the creation of sequence hash values. These hash values can then
be used to quickly determine whether given sequence data exists in this BLAST database.
4.6.8.5 mask_data: Comma-separated list of input files containing masking data to apply to
the sequences being added to the BLAST database being created. For more information, see
4.6.8.7 max_file_sz: Maximum file size for any of the BLAST database files created.
4.6.8.8 logfile: Name of the file to which the program log should be redirected (stdout by
default).
4.6.8.10 taxid_map: Name of file which provides a mapping of sequence IDs to taxonomy IDs.
4.6.9 blastdb_aliastool
This application replaces part of the functionality offered by formatdb. When formatting a large
input FASTA sequence file into a BLAST database, makeblastdb breaks up the resulting
database into optimal sized volumes and links the volumes into a large virtual database through
an automatically created BLAST database alias file.
We can use BLAST database alias files under different scenarios to manage the collection of
BLAST databases and facilitate BLAST searches. For example, we can create an alias file to
combine an existing BLAST database with newly generated ones while leaving the original
one undisturbed. Also, for an existing BLAST database, we can create a BLAST database alias
file based on a GI list so we can search a subset of it, eliminating the need to create a new
database. For examples of how to use this application, please see the cookbook section.
1) Gi file conversion:
Converts a text file containing GIs (one per line) to a more efficient
this program to create an alias file for a BLAST database (see below).
Creates an alias for a BLAST database and a GI list which restricts this
(e.g., based on organism or a curated list). The alias file makes the
the same molecule type (no validation is done). The relevant options are
4.6.9.1 gi_file_in: Text file to convert, should contain one GI per line.
4.6.9.4 gilist: Name of the file containing the GIs to restrict the database provided in -db.
Please note that when using GI lists, the expect values in the BLAST results are based upon
the sequences actually searched and not on the underlying database.
4.6.10 blastdbcmd
This application is the successor to fastacmd. The following are its supported options:
4.6.10.1 entry: A comma-delimited search string of sequence identifiers, or the keyword ‘all’
to select all sequences in the database.
4.6.10.2 entry_batch: Input file for batch processing, entries must be provided one per line. If
4.6.10.4 info: Print BLAST database information (overrides all other options).
4.6.10.5 range: Selects the range of a sequence to extract in 1-based offsets (Format:
start-stop).
4.6.10.7 outfmt: Output format string. For a list of available format specifiers, invoke the
application with its -help option. Note that for all format specifiers except %f, each line of
output will correspond to a single sequence. This should be specified using double quotes if
there are spaces in the output format specification (e.g.: -outfmt "%g %t").
4.6.10.8 target_only: The definition line of the sequence should contain target GI only.
4.6.10.10 line_length: Line length for output (applicable only with FASTA output format).
4.6.10.11 ctrl_a: Use Ctrl-A as the non-redundant defline separator (applicable only with
FASTA output format).
4.6.10.13 list: Display BLAST databases available in the directory provided as an argument to
this option.
4.6.10.14 list_outfmt: Allows for the specification of the output format for the -list option; a
listing of the possible format types is available via the application’s -help option. Unsupported
output specifiers will be ignored. This option’s argument should be specified using double
quotes if there are spaces in the output format specification.
4.6.10.15 recursive: Recursively traverse the directory provided to the -list option to find and
display available BLAST databases.
4.6.11 convert2blastmask
This application extracts the lower-case masks from its FASTA input and converts them to a
file format suitable for specifying masking information to makeblastdb. The following are its
supported options:
4.6.11.1 masking_algorithm: The name of the masking algorithm used to create the masks
(e.g.: dust, seg, windowmasker, repeat).
4.6.12 blastdbcheck
This application performs tests on BLAST databases to check their integrity. The following
are its supported options:
4.6.12.1 dir: Name of the directory in which to look for BLAST databases.
4.6.12.2 recursive: Flag to specify whether to recursively search for BLAST databases in the
directory specified above.
4.6.12.6 ends: Check the beginning and ending N sequences in the database.
4.6.12.7 isam: Set to true to perform ISAM file checking on each of the selected sequences.
4.6.13 blast_formatter
This application formats both local and remote BLAST results. An RID is required to format
remote BLAST results. The RID may be obtained either from a search submitted to the NCBI
BLAST web page or by using the -remote switch with one of the applications mentioned above.
The blast_formatter accepts the BLAST archive format for stand-alone formatting. The
BLAST archive format can be produced by using “-outfmt 11” argument with the stand-alone
applications. The following are its supported options:
a semi-colon are considered comments. This file will be searched in the following order and
locations:
1 Current working directory
2 User's HOME directory
3 Directory specified by the NCBI environment variable
The search for this file will stop at the first location where it is found and the configuration
settings from that file will be applied. If the configuration file is not found, default values will
apply. The following are the possible configuration parameters that impact the BLAST+
applications:
DATA_LOADERS
    Data loaders to use for automatic sequence identifier resolution. This is a
    comma separated list of the following keywords: blastdb, genbank, and
    none. The none keyword disables this feature and takes precedence over
    any other keywords specified.
    Default value: blastdb,genbank
BLASTDB_PROT_DATA_LOADER
    Locally available BLAST database name to search when resolving protein
    sequences using BLAST databases. Ignored if DATA_LOADERS does
    not include the blastdb keyword.
    Default value: nr
GENE_INFO_PATH
    Path to gene information files (NCBI only).
    Default value: Current working directory
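The DATA_LOADERS rules above (a comma separated list in which none overrides everything else) can be sketched as follows. This is an illustration only: the function name is hypothetical, and tolerating white space around the commas and rejecting unknown keywords are assumptions not stated in this manual.

```python
VALID_KEYWORDS = ("blastdb", "genbank", "none")


def parse_data_loaders(value):
    """Return the data loaders enabled by a DATA_LOADERS value."""
    keywords = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in keywords if item not in VALID_KEYWORDS]
    if unknown:
        raise ValueError(f"unknown DATA_LOADERS keyword(s): {unknown}")
    if "none" in keywords:
        # "none" disables the feature and takes precedence over the rest.
        return []
    enabled = []
    for item in keywords:
        if item not in enabled:  # keep the given order, drop repeats
            enabled.append(item)
    return enabled


print(parse_data_loaders("blastdb,genbank"))   # the default value
print(parse_data_loaders("blastdb,none"))
```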
The following is an example with comments describing the available parameters for
configuration:
platforms).
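The lookup order for the configuration file given earlier in this section (current working directory, then the user's HOME directory, then the directory named by the NCBI environment variable, stopping at the first hit) can be sketched as below. The file name .ncbirc is an assumption made for this illustration, as are the function and parameter names; temporary directories stand in for the real locations.

```python
import tempfile
from pathlib import Path

CONFIG_NAME = ".ncbirc"  # assumed file name, used here only for illustration


def find_config(cwd, home=None, ncbi_dir=None, name=CONFIG_NAME):
    """Return the first configuration file found, or None to use defaults."""
    for directory in (cwd, home, ncbi_dir):
        if not directory:
            continue
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate  # the search stops at the first location found
    return None


with tempfile.TemporaryDirectory() as cwd, tempfile.TemporaryDirectory() as home:
    missing = find_config(cwd, home=home)            # no file anywhere yet
    (Path(home) / CONFIG_NAME).write_text("; demo configuration file\n")
    found_in_home = find_config(cwd, home=home).parent == Path(home)
    (Path(cwd) / CONFIG_NAME).write_text("; demo configuration file\n")
    cwd_wins = find_config(cwd, home=home).parent == Path(cwd)

print(missing, found_in_home, cwd_wins)
```

In a real script the three arguments would typically come from os.getcwd(), the HOME environment variable, and the NCBI environment variable.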
5. Cookbook
5.1 Query a BLAST database with a GI, but exclude that GI from the results
Extract a GI from the ecoli database:
$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \
tee exclude_me
1786181
Run the restricted database search, which shows there are no self-hits:
$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \
a Generate the masking data using a sequence filtering utility like windowmasker or
dustmasker
b Generate the actual BLAST database using makeblastdb
For both steps, the input file can be a text file containing sequences in FASTA format, or an
existing BLAST database created using makeblastdb. We will provide examples for both
scenarios.
built-in dust algorithm (through the -dust option). To mask low-complexity sequences only,
we will need to use dustmasker.
For protein sequence data in FASTA files or BLAST database format, we need to use
segmasker to generate the mask information file.
The following examples assume that BLAST databases, listed in 5.2.3, are available in the
current working directory. Note that you should use the sequence id parsing consistently. In
all our examples, we enable this function by including the “-parse_seqids” in the command
line arguments.
We can generate the masking information with dustmasker using a single command line:
$ dustmasker -in hs_chr -infmt blastdb -parse_seqids -outfmt maskinfo_asn1_bin \
-out hs_chr_dust.asnb
Here we specify the input is a BLAST database named hs_chr (-in hs_chr -infmt blastdb),
enable the sequence id parsing (-parse_seqids), request the mask data in binary asn.1 format
(-outfmt maskinfo_asn1_bin), and name the output file as hs_chr_dust.asnb (-out
hs_chr_dust.asnb).
If the input is the original FASTA file, hs_chr.fa, we need to change the -in and -infmt
options as follows:
To generate the masking information using windowmasker from the BLAST database hs_chr,
we first need to generate a counts file:
$ windowmasker -in hs_chr -infmt blastdb -mk_counts -parse_seqids \
-out hs_chr_mask.counts
Here we specify the input BLAST database (-in hs_chr -infmt blastdb), request it to generate
the counts (-mk_counts) with sequence id parsing (-parse_seqids), and save the output to a file
named hs_chr_mask.counts (-out hs_chr_mask.counts).
To use the FASTA file hs_chr.fa to generate the counts, we need to change the input file name
and format:
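With the counts file we can then proceed to create the file containing the masking information
as follows:
$ windowmasker -in hs_chr -infmt blastdb -ustat hs_chr_mask.counts \
-outfmt maskinfo_asn1_bin -parse_seqids -out hs_chr_mask.asnb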
Here we need to use the same input (-in hs_chr -infmt blastdb) and the output of step 1 (-ustat
hs_chr_mask.counts). We set the mask file format to binary asn.1 (-outfmt
maskinfo_asn1_bin), enable the sequence ids parsing (-parse_seqids), and save the masking
data to hs_chr_mask.asnb (-out hs_chr_mask.asnb).
To use the FASTA file hs_chr.fa, we change the input file name and file type:
We can generate the masking information with segmasker using a single command line:
$ segmasker -in refseq_protein -infmt blastdb -parse_seqids \
-outfmt maskinfo_asn1_bin -out refseq_seg.asnb
Here we specify the refseq_protein BLAST database (-in refseq_protein -infmt blastdb), enable
sequence ids parsing (-parse_seqids), request the mask data in binary asn.1 format (-outfmt
maskinfo_asn1_bin), and name the out file as refseq_seg.asnb (-out refseq_seg.asnb).
If the input format is the FASTA file, we need to change the command line to specify the input
format:
We can also extract the masking information from a FASTA sequence file with lowercase
masking (generated by various means) using the convert2blastmask utility. An example command
line follows:
$ convert2blastmask -in hs_chr.mfa -parse_seqids -masking_algorithm repeat \
-masking_options "repeatmasker, default" -outfmt maskinfo_asn1_bin \
-out hs_chr_mfa.asnb
Here we specify the input as hs_chr.mfa (-in hs_chr.mfa), enable parsing of sequence ids
(-parse_seqids), specify the masking algorithm name (-masking_algorithm repeat) and its
parameter (-masking_options "repeatmasker, default"), and ask for asn.1 output
(-outfmt maskinfo_asn1_bin) to be saved in the specified file (-out hs_chr_mfa.asnb).
Note: we should use “-parse_seqids” in a consistent manner – either use it in both steps or not
use it at all.
For example, we can use the following command line to apply the masking information, created
in step 5.2.1.2, to the existing BLAST database generated in 5.2.3:
Here, we use the existing BLAST database as input file (-in hs_chr), specify its type (-dbtype
nucl), enable parsing of sequence ids (-parse_seqids), provide the masking data from step
5.2.1.2 (-mask_data hs_chr_mask.asnb), and name the output database with the same base
name (-out hs_chr) overwriting the existing one.
To use the original FASTA sequence file (hs_chr.fa) as the input, we need to use “-in hs_chr.fa”
to instruct makeblastdb to use that FASTA file instead.
We can check the “re-created” database to find out if the masking information was added
properly, using blastdbcmd with the following command line:
Volumes:
/export/home/tao/blast_test/hs_chr
Extra lines under the “Available filtering algorithms …” describe the masking algorithms
available. The “Algorithm ID” field, 30 in our case, is what we need to use if we want to invoke
database soft masking during an actual search through the “-db_soft_mask” parameter.
We can apply additional masking data to an existing BLAST database with one type of masking
information already added. For example, to apply the dust masking generated in step
5.2.1.1 to the database generated in step 5.2.2.1, we can use this command line:
Here, we use the existing database as input file (-in hs_chr), specify its type (-dbtype nucl),
enable parsing of sequence ids (-parse_seqids), provide the masking data from step 5.2.1.1
(-mask_data hs_chr_dust.asnb), and name the database with the same base name (-out hs_chr),
overwriting the existing one.
Volumes:
/net/gizmo4/export/home/tao/blast_test/hs_chr
makeblastdb run by providing multiple set of masking data files in a comma delimited list:
We can use the masking data file generated in step 5.2.1.3 to create a protein BLAST database:
This produces the following summary, which includes the masking information:
Volumes:
/export/home/tao/blast_test/refseq_protein2.00
/export/home/tao/blast_test/refseq_protein2.01
/export/home/tao/blast_test/refseq_protein2.02
We use the following command line, which is very similar to that given in 5.2.2.1.
Here we use the lowercase masked FASTA sequence file as input (-in hs_chr.mfa), specify the
database as nucleotide (-dbtype nucl), enable parsing of sequence ids (-parse_seqids), provide
the masking data (-mask_data hs_chr_mfa.asnb), and name the resulting database as
hs_chr_mfa (-out hs_chr_mfa).
Volumes:
/export/home/tao/hs_chr_mfa
The algorithm name and algorithm options are the values we provided in step 5.2.1.4.
ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
We use this command line to create the BLAST database from the input nucleotide sequences:
For input nucleotide sequences with lowercase masking, we use the FASTA file hs_chr.mfa,
containing the complete human chromosomes from BUILD37.1, generated by inflating and
combining the hs_ref_*.mfa.gz files located in the same ftp directory.
For input protein sequences, we use the preformatted refseq_protein database from the NCBI
blast/db/ ftp directory:
ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.00.tar.gz
ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.01.tar.gz
ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.02.tar.gz
$ blastn -query HTT_gene -task megablast -db hs_chr -db_soft_mask 30 -outfmt 7 \
-out HTT_megablast_mask.tab -num_threads 4
Here, we use the blastn program to search a nucleotide query HTT_gene* (-query HTT_gene)
with the megablast algorithm (-task megablast) against the database created in step 5.2.2.1
(-db hs_chr). We invoke the soft database masking (-db_soft_mask 30), set the result format to
tabular output (-outfmt 7), and save the result to a file named HTT_megablast_mask.tab (-out
HTT_megablast_mask.tab). We also activate the multi-thread feature of blastn to speed up
the search by using 4 CPUs$ (-num_threads 4).
*This is a genomic fragment containing the HTT gene from human, including 5 kb up- and
down-stream of the transcribed region. It is represented by NG_009378.
$ The number to use in your run will depend on the number of CPUs your system has.
In a test run on a 64-bit Linux machine, the above search takes 9.828 seconds real time,
while the same run without database soft masking invoked takes 31 minutes 44.651 seconds.
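To put the two timings on the same scale, both can be converted to seconds and divided. The short calculation below uses only the figures quoted above; the variable names are arbitrary.

```python
# Wall-clock times quoted above, converted to seconds.
masked_seconds = 9.828
unmasked_seconds = 31 * 60 + 44.651

speedup = unmasked_seconds / masked_seconds
print(f"{unmasked_seconds:.3f} s / {masked_seconds:.3f} s = {speedup:.1f}x")
```

For this particular query and database, the run with database soft masking is therefore close to 200 times faster; other searches should not be expected to show the same ratio.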
The first blastdbcmd invocation produces 2 entries per sequence (GI and taxonomy ID), the
awk command selects from the output of that command those sequences which have a
taxonomy ID of 9606 (human) and prints their GIs, and finally the second blastdbcmd invocation
uses those GIs to print the sequence data for the human sequences in the nr database.
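The middle (awk) step of this pipeline is a simple two-column filter. A Python equivalent is sketched below for readers who prefer to do the selection in a script; the function name is hypothetical and the sample lines are made-up values in the "GI taxonomy-ID" layout described above, not real database entries.

```python
def gis_with_taxid(lines, taxid):
    """Return the GIs (first column) whose taxonomy ID (second column) matches."""
    selected = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[1] == str(taxid):
            selected.append(fields[0])
    return selected


# Made-up sample in the two-column layout: GI, then taxonomy ID.
sample = [
    "1001 9606",
    "1002 562",
    "1003 10090",
    "1004 9606",
]
print(gis_with_taxid(sample, 9606))
```

The resulting GIs, written one per line, are what the second blastdbcmd invocation consumes.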
EREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY
ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE
YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK
ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD
EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA
KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH
EDPIDAPAKRADEDGAK
output format. The tabular output format with comments is used, but only the query accession,
subject accession, evalue, query start, query stop, subject start, and subject stop are requested.
For brevity, only the first 10 lines of output are shown:
$ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send"
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found
AE000111 AE000111 0.0 1 10596 1 10596
AE000111 AE000174 8e-30 5565 5671 6928 6821
AE000111 AE000394 1e-27 5587 5671 135 219
AE000111 AE000425 6e-26 5587 5671 8552 8468
AE000111 AE000171 3e-24 5587 5671 2214 2130
$
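Because comment lines in this format start with "#", the hit lines are easy to post-process with standard tools. As a minimal sketch, the hypothetical function filter_hits below keeps hits at or below an E-value cutoff and prints their subject accessions. It assumes the seven-column layout requested above, with the subject accession in column 2 and the E-value in column 3.

```shell
# Usage: filter_hits CUTOFF < tabular_output
# Skips "#" comment lines and blank lines, then prints the subject accession
# (column 2) of every hit whose E-value (column 3) is <= CUTOFF.
filter_hits() {
    awk -v cutoff="$1" '/^#/ || NF == 0 { next } ($3 + 0) <= (cutoff + 0) { print $2 }'
}

# Sample input copied from the output shown above.
filter_hits 1e-25 <<'EOF'
# 85 hits found
AE000111 AE000111 0.0 1 10596 1 10596
AE000111 AE000174 8e-30 5565 5671 6928 6821
AE000111 AE000171 3e-24 5587 5671 2214 2130
EOF
```

With a different set of -outfmt fields the column numbers would need to be adjusted accordingly.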
matches and mismatches. BTOP operations consist of 1.) a number with a count of matching
letters, 2.) two letters showing a mismatch (e.g., “AG” means A was replaced by G), or 3.) a
dash (“-”) and a letter showing a gap. Note also that BTOP always shows the query on the plus
strand, whereas the CIGAR string always has the subject on the plus strand.
The box below shows a blastn run first with BTOP output and then the same run with the
BLAST report showing the alignments.
Query= query1
Length=47
Subject=
Length=142
Strand=Plus/Plus
Query 1 ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 47
||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct 47 ACGTCCGGGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 93
Query 1 ACGTCCGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 47
||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct 1 ACGTCCG-GACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 46
Query 1 ACGTCC--GAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 47
|||||| |||||||||||||||||||||||||||||||||||||||||
Sbjct 94 ACGTCCGAGAGACGCGAGCAGCGAGCAGCAGAGCGACGAGCAGCGACGA 142
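The three kinds of BTOP operations described above can be separated with ordinary text tools. The sketch below is illustrative only: the function name btop_ops is hypothetical, the BTOP strings 7AG39 and 6-G-A41 were derived by hand from the first and third alignments in the box rather than taken from program output, and the code assumes that the first character of each two-character operation refers to the query and the second to the subject.

```shell
# Split a BTOP string into operations, one per output line.
btop_ops() {
    printf '%s\n' "$1" | grep -oE '[0-9]+|[A-Za-z-]{2}' | while IFS= read -r tok; do
        case "$tok" in
            [0-9]*) echo "match $tok" ;;                    # run of matching letters
            -?)     echo "gap_in_query ${tok#-}" ;;         # dash, then subject letter
            ?-)     echo "gap_in_subject ${tok%-}" ;;       # query letter, then dash
            *)      echo "mismatch ${tok%?} ${tok#?}" ;;    # query letter, subject letter
        esac
    done
}

btop_ops 7AG39     # hand-derived from the first alignment (one mismatch)
btop_ops 6-G-A41   # hand-derived from the third alignment (two gap positions in the query)
```

The tokenizer relies on grep -o, which is available in GNU and BSD grep but is not required by POSIX.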
sequences within an existing database. At the BLAST search level, we can provide multiple
database names to the “-db” parameter, or provide a GI file specifying the desired subset to
the “-gilist” parameter. However, for these types of searches it is more convenient to
create virtual BLAST databases. Note: When combining BLAST
databases, all the databases must be of the same molecule type. The following examples assume
that the two databases as well as the GI file are in the current working directory.
Note: one can also specify multiple databases using the -db parameter of blastdb_aliastool.
the tabular format looking for matches meeting certain criteria, then go back and examine
the relevant alignments in the full BLAST report. The user may also first look at pair-wise
alignments, then decide to use a query-anchored view. Viewing a BLAST report in different
formats has been possible on the NCBI BLAST web site since 2000, but has not been possible
with stand-alone BLAST runs. The blast_formatter allows this, if the original search produced
BLAST archive format using the -outfmt 11 switch. The query sequence, the BLAST options,
the masking information, the name of the database, and the alignment are written out as
ASN.1 (a structured format similar to XML). The blast_formatter reads this information and formats
a report. The BLAST database used for the original search must be available, or the sequences
need to be fetched from the NCBI, assuming the database contains sequences in the public
dataset. The box below illustrates the procedure. A blastn run first produces the BLAST archive
format, and the blast_formatter then reads the file and produces tabular output.
5.9 Extract lowercase masked FASTA from a BLAST database with masking information
If a BLAST database contains masking information, this can be extracted using the blastdbcmd
options -db_mask and -mask_sequence_with as follows:
Volumes:
mask-data-db
$ blastdbcmd -db mask-data-db -mask_sequence_with 20 -entry 71022837
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 [Ustilago maydis 521]
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPPPSTHG
GPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavQklslANEaawvsIGsaaetm
EdydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdTWGSMGHCYLMMDDLQRAYTAYQQ
ALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQQNKFPASLECFRYILDNPPR
PLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAGFNNQERAIQFLTKSLESDPNDAQSW
YLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDALDAYSRAIRLNPYISEVWFDLGSLYEACNN
QISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVPQDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLV
GPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGTDDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPG
PPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPPHVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAG
PGGPPPPLDHYGRPMGGPMSEREREMEWEREREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVY
GRDEGRDYYNNSHPGSGPGGPRGGYERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFR
GPPPGLGRPTPDWERTRAGEYGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSP
WSAKGGVAAPGKNGKASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLA
ARAPSSRMVDEDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPED
SYKRAKSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIHEDPID
APAKRADEDGAK
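Masked regions appear as lowercase letters in this output, so their extent can be measured with standard text tools. The following is a minimal sketch that assumes FASTA on standard input; the function name count_masked and the short sample record are made up for illustration.

```shell
# Count lowercase (masked) residues in FASTA read from stdin.
count_masked() {
    # Drop definition lines, delete everything except a-z, count what is left.
    grep -v '^>' | LC_ALL=C tr -cd 'a-z' | wc -c | tr -d ' '
}

# Made-up two-line record with eight lowercase residues.
printf '%s\n' '>sample made-up record' 'MPPSaggp' 'QklslAN' | count_masked
```

The same function could be placed at the end of a pipe that starts with the blastdbcmd command shown above.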
5.10 Display the locations where BLAST will search for BLAST databases
This is accomplished by using the -show_blastdb_search_path option in blastdbcmd:
$ blastdbcmd -show_blastdb_search_path
:/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:
$
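The search path is printed as a single colon-separated string. To inspect it one directory per line, it can be split as sketched below. The function name split_search_path is hypothetical, and dropping the empty fields at the start and end of the string is a simplification made only for readability.

```shell
# Print each non-empty entry of a colon-separated path on its own line.
split_search_path() {
    tr ':' '\n' | grep -v '^$'
}

# Sample input copied from the output shown above.
echo ':/net/nabl000/vol/blast/db/blast1:/net/nabl000/vol/blast/db/blast2:' | split_search_path
```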
repeat/repeat_7165 Nucleotide
repeat/repeat_7227 Nucleotide
repeat/repeat_7719 Nucleotide
repeat/repeat_7955 Nucleotide
repeat/repeat_9606 Nucleotide
repeat/repeat_9989 Nucleotide
$
The first column of the default output is the file name of the BLAST database (usually provided
as the -db argument to other BLAST+ applications); the second column is the molecule
type of the BLAST database. This output is configurable via the -list_outfmt command line
option.
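The two-column default listing lends itself to simple filtering. As a sketch, the hypothetical function nucleotide_dbs below prints only the names of nucleotide databases. It assumes the default layout described above and database names without embedded spaces; the second sample line is made up.

```shell
# Read the default two-column listing on stdin and print the names of
# databases whose molecule type (column 2) is "Nucleotide".
nucleotide_dbs() {
    awk '$2 == "Nucleotide" { print $1 }'
}

# The first line is copied from the listing above; the second is made up.
printf '%s\n' 'repeat/repeat_9606 Nucleotide' 'example/made_up_db Protein' | nucleotide_dbs
```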