Biopython - Quick Guide
Biopython - Introduction
https://www.tutorialspoint.com/biopython/biopython_quick_guide.htm 1/79
06/10/2022, 18:18 Biopython - Quick Guide
Biopython is one of the largest and most widely used bioinformatics packages for Python. It contains
a number of sub-modules for common bioinformatics tasks. It was originally developed by Chapman
and Chang and is written mainly in Python, with some C code to speed up the computationally
intensive parts. It runs on Windows, Linux, Mac OS X, etc.
Basically, Biopython is a collection of Python modules that provide functions to deal with DNA,
RNA and protein sequence operations such as reverse complementing a DNA string, finding
motifs in protein sequences, etc. It provides a lot of parsers to read all major sequence file formats
and databases like GenBank, SwissProt, FASTA, etc., as well as wrappers/interfaces to run other popular
bioinformatics software/tools like NCBI BLASTN, Entrez, etc., inside the Python environment. It
has sibling projects like BioPerl, BioJava and BioRuby.
Features
Biopython is portable and clear, and has an easy-to-learn syntax. Some of the salient features are listed
below −
BioSQL − Standard set of SQL tables for storing sequences plus features and annotations.
Access to online services and databases, including NCBI services (Blast, Entrez, PubMed)
and ExPASy services (SwissProt, Prosite).
Goals
The goal of Biopython is to provide simple, standard and extensive access to bioinformatics
through the Python language. The specific goals of Biopython are listed below −
Advantages
Biopython requires very little code and comes with the following advantages −
Supports structure data used for PDB parsing, representation and analysis.
Supports journal data used in Medline applications.
Supports the BioSQL database, a standard database schema widely used among
bioinformatics projects.
Population Genetics
Population genetics is the study of genetic variation within a population, and involves the
examination and modeling of changes in the frequencies of genes and alleles in populations over
space and time.
Biopython provides the Bio.PopGen module for population genetics. This module contains the
functions needed to gather information about classic population genetics.
RNA Structure
Three major biological macromolecules that are essential for our life are DNA, RNA and Protein.
Proteins are the workhorses of the cell and play an important role as enzymes. DNA
(deoxyribonucleic acid) is considered as the “blueprint” of the cell. It carries all the genetic
information required for the cell to grow, take in nutrients, and propagate. RNA (Ribonucleic acid)
acts as “DNA photocopy” in the cell.
Biopython provides the Seq class, in the Bio.Seq module, to represent sequences of nucleotides, the
building blocks of DNA and RNA.
Biopython - Installation
This section explains how to install Biopython on your machine. Installation is straightforward and
usually takes no more than five minutes.
Biopython was designed to work with Python 2.5 or higher, and current releases require Python 3. So, it
is mandatory that Python be installed first. To check, run the command python --version in your
command prompt −
If Python is installed properly, the command shows its version. Otherwise, download the latest version of
Python, install it and then run the command again.
It is easy to install Biopython using pip from the command line on all platforms, with the command
pip install biopython.
When an existing installation is upgraded, the older version of Biopython (and, where required, of
NumPy, on which Biopython depends) is removed before the recent version is installed.
Now, you have successfully installed Biopython on your machine. To verify that Biopython is
installed properly, type the below command on your python console −
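A minimal check, assuming Biopython is installed in the active Python environment, is to import the package and print its version string:

```python
# Assumes Biopython is already installed in the active Python environment.
import Bio

# Every Biopython release exposes its version string as Bio.__version__.
version = Bio.__version__
print("Biopython version:", version)
```

If the import fails with ModuleNotFoundError, the installation did not succeed for this interpreter.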
Download the file and unpack the compressed archive file, move into the source code folder and
type the below command −
This will build Biopython from the source code as given below −
Let us create a simple Biopython application to parse a bioinformatics file and print the content.
This will help us understand the general concept of Biopython and how it helps in the field of
bioinformatics.
Step 1 − First, create a sample sequence file, “example.fasta” and put the below content into it.
The extension, fasta, refers to the file format of the sequence file. The format takes its name from the
FASTA bioinformatics software, where it originated. A FASTA file holds one or more sequences
arranged one after another, and each sequence has its own id, name, description and the actual
sequence data.
Step 2 − Create a new Python script, “simple_example.py”, enter the below code and save it.
file = open("example.fasta")
Line 1 imports the parse function available in the Bio.SeqIO module. The Bio.SeqIO module is used to
read and write sequence files in different formats, and the parse function is used to parse the content
of a sequence file.
Line 2 imports the SeqRecord class available in the Bio.SeqRecord module. This module is used
to manipulate sequence records, and the SeqRecord class represents a particular sequence
available in the sequence file.
Line 3 imports the Seq class available in the Bio.Seq module. This module is used to manipulate
sequence data, and the Seq class represents the sequence data of a particular sequence
record available in the sequence file.
Line 5 opens the “example.fasta” file using the regular Python function, open.
Line 7 parses the content of the sequence file and returns it as an iterator of SeqRecord
objects.
Lines 9-15 loop over the records using a Python for loop and print the attributes of each sequence
record (SeqRecord) such as id, name, description, sequence data, etc.
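A script in the spirit of the one described above might look like the following sketch. It is not the original listing: to stay runnable on its own it reads two shortened, made-up records from an in-memory string instead of example.fasta, and its line numbers do not match the walkthrough.

```python
from io import StringIO

from Bio.SeqIO import parse

# Stand-in for the contents of example.fasta (hypothetical, shortened records).
FASTA_TEXT = """>seq1 first example protein
MKLKKTIGAMALATLFATMGASA
>seq2 second example protein
MLKIKYLLIGLSLSAMSSYSLAA
"""

handle = StringIO(FASTA_TEXT)      # with a real file: open("example.fasta")
records = parse(handle, "fasta")   # iterator that yields SeqRecord objects

parsed = []
for record in records:
    print("Id:", record.id)
    print("Name:", record.name)
    print("Description:", record.description)
    print("Annotations:", record.annotations)
    print("Sequence Data:", record.seq)
    parsed.append(record)

handle.close()
```

With a real file, replacing the StringIO handle with open("example.fasta") is the only change needed.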
Step 3 − Open a command prompt and go to the folder containing sequence file, “example.fasta”
and run the below command −
Step 4 − Python runs the script and prints all the sequence data available in the sample file,
“example.fasta”. The output will be similar to the following content.
Id: sp|P25730|FMS1_ECOLI
Name: sp|P25730|FMS1_ECOLI
Decription: sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pili
Annotations: {}
Sequence Data: MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
KGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITIDSNDLNFASSGVNKVSSTQKLSIHADATRVTGGA
GQYQGLVSIILTKSTTTTTTTKGT
Sequence Alphabet: SingleLetterAlphabet()
Id: sp|P15488|FMS3_ECOLI
Name: sp|P15488|FMS3_ECOLI
Decription: sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pili
Annotations: {}
Sequence Data: MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
IASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDKNAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLV
YRANITITSTIKGGGTKKGTTDKK
Sequence Alphabet: SingleLetterAlphabet()
We have seen the parse function and two classes, SeqRecord and Seq, in this example. Together they
provide most of the functionality, and we will learn about them in the coming sections.
Biopython - Sequence
Here, we have created a simple protein sequence AGCT and each letter represents Alanine,
Glycine, Cysteine and Threonine.
alphabet − used to represent the type of sequence. e.g. DNA sequence, RNA sequence, etc.
By default, it does not represent any sequence and is generic in nature.
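Creating the sequence described above can be sketched as follows. The alphabet argument exists only in Biopython releases before 1.78, so this sketch leaves it out and should run on both older and newer releases.

```python
from Bio.Seq import Seq

# A Seq wraps the letters of a sequence; here the four letters A, G, C, T.
seq = Seq("AGCT")

print(seq)        # the sequence letters
print(len(seq))   # number of letters in the sequence
```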
Alphabet Module
In Biopython releases before 1.78, Seq objects contain an alphabet attribute to specify the sequence
type, letters and possible operations. It is defined in the Bio.Alphabet module, which later releases
removed. Alphabet can be defined as below −
The Alphabet module provides the below classes to represent different types of sequences.
Alphabet − base class for all types of alphabets.
SingleLetterAlphabet − generic alphabet with letters of size one. It derives from Alphabet, and all
other alphabet types derive from it.
Also, Biopython exposes all the bioinformatics related configuration data through Bio.Data
module. For example, IUPACData.protein_letters has the possible letters of IUPACProtein
alphabet.
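As a small illustration of the Bio.Data module mentioned above, the letter sets can be read directly from IUPACData:

```python
from Bio.Data import IUPACData

# The twenty standard amino acid letters of the IUPAC protein alphabet.
protein_letters = IUPACData.protein_letters
# The four unambiguous DNA letters.
dna_letters = IUPACData.unambiguous_dna_letters

print(protein_letters)
print(dna_letters)
```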
Basic Operations
This section briefly explains the basic operations available in the Seq class. Sequences
are similar to Python strings. We can perform Python string operations like slicing, counting,
concatenation, find, split and strip on sequences.
>>> seq_string[0:2]
Seq('AG')
>>> seq_string[ : ]
Seq('AGCTAGCT')
>>> len(seq_string)
8
>>> seq_string.count('A')
2
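The fragments above assume a sequence named seq_string. A self-contained version, with the sequence value taken from the output shown, could look like this:

```python
from Bio.Seq import Seq

seq_string = Seq("AGCTAGCT")

first_two = seq_string[0:2]        # slicing returns a new Seq
whole = seq_string[:]              # a full slice copies the sequence
length = len(seq_string)           # number of letters
a_count = seq_string.count("A")    # non-overlapping occurrences of "A"
ct_index = seq_string.find("CT")   # index of the first "CT", or -1 if absent

print(first_two, whole, length, a_count, ct_index)
```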
Here, the above two sequence objects, seq1 and seq2, are generic DNA sequences, so you can
add them to produce a new sequence. In releases that still have alphabets (before 1.78), you can’t add
sequences with incompatible alphabets, such as a protein sequence and a DNA sequence, as specified below −
To add more than two sequences, first store them in a Python list, then iterate over it with a for loop and
add them together as shown below −
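Adding several sequences held in a list can be sketched as below. The sequence values are made up for illustration.

```python
from Bio.Seq import Seq

# Hypothetical pieces to join together.
chunks = [Seq("AGCT"), Seq("TCGA"), Seq("AAA")]

combined = Seq("")        # start from an empty sequence
for chunk in chunks:
    combined += chunk     # Seq supports + and += like Python strings

print(combined)
```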
The snippets in the following section show how to obtain various outputs as required.
>>> strip_seq.strip()
Seq('AGCT')
Here, the complement() method complements a DNA or RNA sequence. The
reverse_complement() method complements the sequence and then reverses the result from left to
right. It is shown below −
>>> nucleotide.reverse_complement()
Seq('GACTGACTTCGA', IUPACAmbiguousDNA())
'S': 'S',
'T': 'A',
'V': 'B',
'W': 'W',
'X': 'X',
'Y': 'R'}
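The complement operations and the lookup table shown above can be tried together in one sketch. The input sequence is an assumption, chosen so that its reverse complement matches the output shown earlier.

```python
from Bio.Data import IUPACData
from Bio.Seq import Seq

nucleotide = Seq("TCGAAGTCAGTC")

complemented = nucleotide.complement()                  # swap A<->T and C<->G
reverse_complemented = nucleotide.reverse_complement()  # complement, then reverse

# The table used for ambiguous DNA letters, e.g. Y (C or T) pairs with R (G or A).
table = IUPACData.ambiguous_dna_complement

print(complemented)
print(reverse_complemented)
print(table["Y"])
```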
>>>
GC Content
Genomic DNA base composition (GC content) is predicted to significantly affect genome
functioning and species ecology. The GC content is the number of GC nucleotides divided by the
total nucleotides.
To get the GC nucleotide content, import the following module and perform the following steps −
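Bio.SeqUtils offers a ready-made helper for this, but its name has changed between Biopython releases. The sketch below therefore computes the fraction directly from the definition above; gc_content is a hypothetical helper name and the input sequence is made up.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C letters in a nucleotide string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


fraction = gc_content("ACTGACTGACTG")  # 3 G + 3 C out of 12 letters
print(f"GC content: {fraction:.1%}")
```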
Transcription
Transcription is the process of changing DNA sequence into RNA sequence. The actual
biological transcription process is performing a reverse complement (TCAG → CUGA) to get the
mRNA considering the DNA as template strand. However, in bioinformatics and so in Biopython,
we typically work directly with the coding strand and we can get the mRNA sequence by
changing the letter T to U.
To get the DNA template strand, call reverse_complement() on the back-transcribed RNA as given below −
>>> rna_seq.back_transcribe().reverse_complement()
Seq('ATACGATCGGCAT', IUPACUnambiguousDNA())
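The whole round trip can be sketched as follows. The coding strand used here is an assumption, derived from the template strand shown in the output above.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGCCGATCGTAT")

rna_seq = coding_dna.transcribe()        # coding strand -> mRNA (T becomes U)
back = rna_seq.back_transcribe()         # mRNA -> coding strand (U becomes T)
template = back.reverse_complement()     # coding strand -> template strand

print(rna_seq)
print(back)
print(template)
```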
Translation
Translation is the process of translating an RNA sequence into a protein sequence. Consider an RNA
sequence as shown below −
>>> rna_seq.translate()
Seq('MAIV', IUPACProtein())
The translate() method can stop at the first stop codon. To do this, pass
to_stop=True to translate() as follows −
Here, translation ends at the first stop codon, and the stop codon itself is not included in the resulting sequence.
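The effect of to_stop can be seen in a short sketch. The RNA sequence is an illustrative assumption, not necessarily the one used in the original listing; it contains two stop codons.

```python
from Bio.Seq import Seq

rna_seq = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")

full = rna_seq.translate()                    # stop codons appear as "*"
until_stop = rna_seq.translate(to_stop=True)  # ends before the first stop codon

print(full)
print(until_stop)
```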
Translation Table
The Genetic Codes page of the NCBI provides the full list of translation tables used by Biopython. Let
us look at the standard table to visualize the code −
Biopython uses this table to translate the DNA to protein as well as to find the Stop codon.
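One way to inspect the standard table from Python is through Bio.Data.CodonTable, as in this sketch:

```python
from Bio.Data import CodonTable

# Look up the standard genetic code by name.
standard = CodonTable.unambiguous_dna_by_name["Standard"]

print(standard)                         # text rendering of the whole table
stops = standard.stop_codons            # codons that end translation
starts = standard.start_codons          # codons that may start translation
met = standard.forward_table["ATG"]     # amino acid encoded by ATG

print(stops, starts, met)
```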
SeqRecord
Bio.SeqRecord module provides SeqRecord to hold meta information of the sequence as well as
the sequence data itself as given below −
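A SeqRecord can also be built by hand, which makes its fields easy to see. All values in this sketch are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("AGCTAGCT"),                    # the sequence data itself
    id="example_1",                     # primary identifier
    name="example",                     # short name
    description="hypothetical record",  # free-text description
)
# annotations is a plain dictionary for extra information about the sequence.
record.annotations["source"] = "made up for illustration"

print(record)
```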
Let us understand the nuances of parsing a sequence file using real sequence files in the
coming sections.
FASTA
FASTA is the most basic file format for storing sequence data. Originally, FASTA was a software
package for sequence alignment of DNA and protein, developed in the early days of
bioinformatics and used mostly to search for sequence similarity.
Download and save this file into your Biopython sample directory as ‘orchid.fasta’.
The Bio.SeqIO module provides the parse() function to process sequence files. It can be imported as
follows −
The parse() function takes two arguments: the first is the file handle and the second is the file format.
Here, the parse() function returns an iterable object which yields a SeqRecord on every iteration.
Being iterable, it works with many convenient Python constructs. Let us see some of
them.
next()
The built-in next() function returns the next item available in the iterable object, which we can use to get
the first sequence as given below −
Here, seq_record.annotations is empty because the FASTA format does not support sequence
annotations.
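Fetching records one at a time with next() can be sketched as follows. The two FASTA records are made up and held in a string, so the example does not depend on orchid.fasta.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA content.
FASTA_TEXT = ">rec1 short\nACGT\n>rec2 longer\nACGTACGT\n"

records = SeqIO.parse(StringIO(FASTA_TEXT), "fasta")

first = next(records)            # first SeqRecord in the file
second = next(records)           # second SeqRecord
remaining = next(records, None)  # None once the iterator is exhausted

print(first.id, first.seq)
print(first.annotations)
```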
list comprehension
We can convert the iterable object into a list using a list comprehension as given below −
Here, we have used len method to get the total count. We can get sequence with maximum
length as follows −
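Both steps, collecting the records into a list and picking the longest one, can be sketched on made-up data:

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA content with sequences of different lengths.
FASTA_TEXT = ">rec1\nACGT\n>rec2\nACGTACGT\n>rec3\nAC\n"

all_records = [rec for rec in SeqIO.parse(StringIO(FASTA_TEXT), "fasta")]
count = len(all_records)

# The record whose sequence has the most letters.
longest = max(all_records, key=lambda rec: len(rec.seq))

print(count, longest.id, len(longest.seq))
```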
Writing a collection of SeqRecord objects (parsed data) into a file is as simple as calling the
SeqIO.write method as below −
This method can also be used to convert between formats as specified below −
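Writing and converting can be sketched with in-memory handles. The records are made up, and the target format used here, the simple tab-separated "tab" format, is only one illustrative choice.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical input records.
FASTA_TEXT = ">rec1 short\nACGT\n>rec2 longer\nACGTACGT\n"
records = list(SeqIO.parse(StringIO(FASTA_TEXT), "fasta"))

# Write the records back out as FASTA; write() returns the number written.
fasta_out = StringIO()
written = SeqIO.write(records, fasta_out, "fasta")

# Write the same records in another format, which converts them.
tab_out = StringIO()
SeqIO.write(records, tab_out, "tab")

print(written)
print(tab_out.getvalue())
```

With real files, the handles can be replaced by file names such as "converted.tab" (a hypothetical name).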
GenBank
It is a richer sequence format for genes and includes fields for various kinds of annotations.
Biopython provides an example GenBank file and it can be accessed at
https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.gbk.
Download and save the file into your Biopython sample directory as ‘orchid.gbk’.
Since Biopython provides a single function, parse, to parse all bioinformatics formats, parsing
the GenBank format is as simple as changing the format option in the parse call.
>>> seq_record.name
'Z78533'
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCG
>>> seq_record.description
'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> seq_record.annotations
{
'molecule_type': 'DNA',
'topology': 'linear',
'data_file_division': 'PLN',
'date': '30-NOV-2006',
'accessions': ['Z78533'],
'sequence_version': 1,
'gi': '2765658',
'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcrib
'source': 'Cypripedium irapeanum',
'organism': 'Cypripedium irapeanum',
'taxonomy': [
'Eukaryota',
'Viridiplantae',
'Streptophyta',
'Embryophyta',
'Tracheophyta',
'Spermatophyta',
'Magnoliophyta',
'Liliopsida',
'Asparagales',
'Orchidaceae',
'Cypripedioideae',
'Cypripedium'],
'references': [
Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioid
Orchidaceae): nuclear rDNA ITS sequences', ...),
Reference(title = 'Direct Submission', ...)
]
}
Identifying the similar region enables us to infer a lot of information like what traits are conserved
between species, how close different species genetically are, how species evolve, etc. Biopython
Let us learn some of the important features provided by Biopython in this chapter −
Before starting to learn, let us download a sample sequence alignment file from the Internet.
Step 2 − Choose any one family with a small number of seed sequences. It contains minimal data and
enables us to work easily with the alignment. Here, we have selected PF18225; clicking it
opens http://pfam.xfam.org/family/PF18225 and shows complete details about it,
including sequence alignments.
Step 3 − Go to the alignment section and download the sequence alignment file in Stockholm format
(PF18225_seed.txt).
Let us try to read the downloaded sequence alignment file using Bio.AlignIO as below −
Read the alignment using the read method. The read method reads the single alignment available
in the given file. If the file contains many alignments, we can use the parse method, which
returns an iterable of alignment objects, similar to the parse method in the Bio.SeqIO module.
We can also check the sequences (SeqRecord) available in the alignment as below −
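Reading an alignment and listing its sequences can be sketched without the downloaded file. The tiny Stockholm text below is made up for illustration and stands in for PF18225_seed.txt.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical, minimal Stockholm-format alignment with three sequences.
STOCKHOLM_TEXT = """# STOCKHOLM 1.0
seqA ACCGGT
seqB AC--GT
seqC ACCG-T
//
"""

# read() expects exactly one alignment in the input.
alignment = AlignIO.read(StringIO(STOCKHOLM_TEXT), "stockholm")
print(alignment)

# Each row of the alignment is a SeqRecord.
row_ids = []
for record in alignment:
    row_ids.append(record.id)
    print(record.id, record.seq)
```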
Multiple Alignments
In general, most sequence alignment files contain a single alignment, and it is enough to
use the read method to parse it. In multiple sequence alignment, two or more sequences are
compared for the best subsequence matches between them, and a single file may hold several such
alignments.
If the input file contains more than one sequence alignment, then we
need to use the parse method instead of the read method as specified below −
Here, the parse method returns an iterable of alignment objects, which can be iterated to get the actual
alignments.
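Iterating over several alignments can be sketched as below. The input is made up: two small Stockholm blocks concatenated in one string.

```python
from io import StringIO

from Bio import AlignIO

# Hypothetical input holding two alignments, one after the other.
TWO_ALIGNMENTS = """# STOCKHOLM 1.0
seqA ACCGGT
seqB AC--GT
//
# STOCKHOLM 1.0
seqC TTGA
seqD TT-A
seqE TTGA
//
"""

# parse() yields one alignment object per block in the input.
alignments = list(AlignIO.parse(StringIO(TWO_ALIGNMENTS), "stockholm"))

for alignment in alignments:
    print(len(alignment), "sequences of length", alignment.get_alignment_length())
```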
Biopython provides a special module, Bio.pairwise2, to identify the alignment of two sequences using
the pairwise method. It uses a dynamic programming algorithm to find the alignment, and its results are
on par with other software.
Let us write an example to find the sequence alignment of two simple and hypothetical
sequences using pairwise module. This will help us understand the concept of sequence
alignment and how to program it using Biopython.
Step 1
Import the module pairwise2 with the command given below −
Step 2
Create two sequences, seq1 and seq2 −
Step 3
Call method pairwise2.align.globalxx along with seq1 and seq2 to find the alignments using the
below line of code −
Here, globalxx method performs the actual work and finds all the best possible alignments in the
given sequences. Actually, Bio.pairwise2 provides quite a set of methods which follows the below
convention to find alignments in different scenarios.
Here, the sequence alignment type refers to the alignment type, which may be global or local.
The global type finds the sequence alignment by taking the entire sequences into consideration. The local type
finds the sequence alignment by looking into subsets of the given sequences as well. This is
more tedious but gives a better idea of the similarity between the given sequences.
X refers to matching score. The possible values are x (exact match), m (score based on
identical chars), d (user provided dictionary with character and match score) and finally c
(user defined function to provide custom scoring algorithm).
Y refers to gap penalty. The possible values are x (no gap penalties), s (same penalties for
both sequences), d (different penalties for each sequence) and finally c (user defined
function to provide custom gap penalties)
So, localds is also a valid method, which finds the sequence alignment using local alignment
technique, user provided dictionary for matches and user provided gap penalty for both
sequences.
Here, blosum62 refers to a substitution matrix, supplied as a dictionary, that provides the match scores.
-10 refers to the gap open penalty and -1 refers to the gap extension penalty.
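The naming convention described above can be made concrete with a small decoder. This is plain Python rather than Biopython code, and describe_align_function is a hypothetical helper written only to illustrate how a name such as localds breaks down.

```python
# Meaning of each part of a pairwise2 function name, as described in the text.
MATCH_CODES = {
    "x": "exact match",
    "m": "score based on identical chars",
    "d": "user provided dictionary",
    "c": "user defined function",
}
GAP_CODES = {
    "x": "no gap penalties",
    "s": "same penalties for both sequences",
    "d": "different penalties for each sequence",
    "c": "user defined function",
}


def describe_align_function(name):
    """Split a name like 'localds' into (alignment type, match rule, gap rule)."""
    for scope in ("global", "local"):
        if name.startswith(scope) and len(name) == len(scope) + 2:
            match_code, gap_code = name[len(scope):]
            if match_code in MATCH_CODES and gap_code in GAP_CODES:
                return scope, MATCH_CODES[match_code], GAP_CODES[gap_code]
    raise ValueError(f"not a recognised alignment function name: {name!r}")


print(describe_align_function("globalxx"))
print(describe_align_function("localds"))
```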
Step 4
Loop over the iterable alignments object and get each individual alignment object and print it.
Step 5
Bio.pairwise2 module provides a formatting method, format_alignment to better visualize the
result −
...
ACCGGT
| | ||
A-C-GT
Score=4
ACCGGT
|| ||
AC--GT
Score=4
ACCGGT
| || |
A-CG-T
Score=4
ACCGGT
|| | |
AC-G-T
Score=4
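Steps 1 to 5 can be put together as in the sketch below. The two sequences are inferred from the output shown above. Note that Bio.pairwise2 is deprecated in recent Biopython releases in favour of Bio.Align.PairwiseAligner, so importing it may print a deprecation warning.

```python
# Step 1: import the module and its formatting helper.
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Step 2: two simple sequences (plain strings are accepted as well as Seq objects).
seq1 = "ACCGGT"
seq2 = "ACGT"

# Step 3: global alignment, 1 point per identical letter, no gap penalties.
alignments = pairwise2.align.globalxx(seq1, seq2)

# Steps 4 and 5: loop over the alignments and print each one formatted.
for alignment in alignments:
    print(format_alignment(*alignment))

# Each alignment unpacks as (aligned seq1, aligned seq2, score, begin, end).
best_score = alignments[0][2]
```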
>>>
Biopython also provides another module to do sequence alignment, Align. This module provides
a different API that simplifies the setting of parameters like algorithm, mode, match score, gap
penalties, etc. A simple look into the Align object is as follows −
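A minimal sketch of the Align API follows. Default scores differ between Biopython releases, so the parameters are set explicitly here; the values are chosen to mirror globalxx (1 per match, no penalties) and are an assumption for illustration.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # "global" or "local"
aligner.match_score = 1.0        # score for identical letters
aligner.mismatch_score = 0.0     # score for differing letters
aligner.open_gap_score = 0.0     # penalty for opening a gap
aligner.extend_gap_score = 0.0   # penalty for extending a gap

print(aligner)                   # shows the current parameters

score = aligner.score("ACCGGT", "ACGT")        # best score only
alignments = aligner.align("ACCGGT", "ACGT")   # the alignments themselves
first = next(iter(alignments))
print(first)
```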
ClustalW
MUSCLE
EMBOSS needle and water
Let us write a simple example in Biopython to create sequence alignment through the most
popular alignment tool, ClustalW.
Step 3 − Set cmd by calling ClustalwCommandline with the input file, opuntia.fasta, available in
the Biopython package.
https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/opuntia.fasta
Step 4 − Calling cmd() will run the clustalw command and give an output of the resultant
alignment file, opuntia.aln.
The NCBIWWW module provides the qblast function to query the online version of BLAST,
https://blast.ncbi.nlm.nih.gov/Blast.cgi. qblast supports all the parameters supported by the
online version.
To get help on this function, use the below command −
>>> help(NCBIWWW.qblast)
Help on function qblast in module Bio.Blast.NCBIWWW:
qblast(
program, database, sequence,
url_base = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi',
auto_format = None,
composition_based_statistics = None,
db_genetic_code = None,
endpoints = None,
entrez_query = '(none)',
expect = 10.0,
filter = None,
gapcosts = None,
genetic_code = None,
hitlist_size = 50,
i_thresh = None,
layout = None,
lcase_mask = None,
matrix_name = None,
nucl_penalty = None,
nucl_reward = None,
other_advanced = None,
perc_ident = None,
phi_pattern = None,
query_file = None,
query_believe_defline = None,
query_from = None,
query_to = None,
searchsp_eff = None,
service = None,
threshold = None,
ungapped_alignment = None,
word_size = None,
alignments = 500,
alignment_view = None,
descriptions = 500,
entrez_links_new_window = None,
expect_low = None,
expect_high = None,
format_entrez_query = None,
format_object = None,
format_type = 'XML',
ncbi_gi = None,
results_file = None,
show_overview = None,
megablast = None,
template_type = None,
template_length = None
)
Supports all parameters of the qblast API for Put and Get.
Please note that BLAST on the cloud supports the NCBI-BLAST Common
URL API (http://ncbi.github.io/blast-cloud/dev/api.html).
To use this feature, please set url_base to 'http://host.my.cloud.servic
format_object = 'Alignment'. For more details, please see 8. Biopython –
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE = BlastDocs&DOC_TYPE = C
The arguments of the qblast function are analogous to the parameters that
you can set on the BLAST web page. This makes the qblast function easy to understand and
reduces the learning curve.
Step 1 − Create a file named blast_example.fasta in the Biopython directory and give the below
sequence information as input
>sequence B
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattca
tattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
Step 4 − Now, call the qblast function, passing the sequence data as the main parameter. The other
parameters represent the database (nt) and the internal program (blastn).
blast_results holds the result of our search. It can be saved to a file for later use and also
parsed to get the details. We will learn how to do it in the coming section.
Step 5 − The same functionality can be done using Seq object as well rather than using the
whole fasta file as shown below −
Now, call the qblast function passing Seq object, record.seq as main parameter.
Step 6 − result_handle object will have the entire result and can be saved into a file for later
usage.
We will see how to parse the result file in the later section.
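The online search and the save step can be sketched as two small functions. The function names and the default file name are hypothetical. The search needs network access and can take minutes, so the sketch only defines the functions and does not call the search.

```python
from Bio.Blast import NCBIWWW


def blast_fasta_online(fasta_text, program="blastn", database="nt"):
    """Submit FASTA text to the NCBI BLAST service and return the XML result.

    Needs network access; not invoked in this sketch.
    """
    result_handle = NCBIWWW.qblast(program, database, fasta_text)
    try:
        data = result_handle.read()
    finally:
        result_handle.close()
    # Depending on the release the handle may yield bytes or text.
    if isinstance(data, bytes):
        data = data.decode()
    return data


def save_blast_result(xml_text, path="results.xml"):
    """Store the XML result in a file for later parsing."""
    with open(path, "w") as out_handle:
        out_handle.write(xml_text)
```

A typical call would be save_blast_result(blast_fasta_online(open("blast_example.fasta").read())).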
Connecting BLAST
In general, running BLAST locally is not recommended due to its large size, the extra effort needed to
run the software, and the cost involved. Online BLAST is sufficient for basic and advanced
purposes. Of course, sometimes you may be required to install it locally.
If you run frequent online searches that take a lot of time and
network volume, or if you have proprietary sequence data or IP related issues, then installing it
locally is recommended.
Step 1 − Download and install the latest blast binary using the given link −
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Step 2 − Download and unpack the latest and necessary database using the below link −
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
The BLAST site provides a lot of databases. Let us download the alu.n.gz file from the
BLAST database site and unpack it into the alu folder. This file is in FASTA format. To use this file in our
BLAST application, we first need to convert it from FASTA format into BLAST database format.
BLAST provides the makeblastdb application to do this conversion.
cd /path/to/alu
makeblastdb -in alu.n -parse_seqids -dbtype nucl -out alun
Running the above code will parse the input file, alu.n and create BLAST database as multiple
files alun.nsq, alun.nsi, etc. Now, we can query this database to find the sequence.
We have installed BLAST on our local server and also have a sample BLAST database, alun, to
query against.
Step 3 − Let us create a sample sequence file to query the database. Create a file search.fsa
and put the below data into it.
>gnl|alu|Z15030_HSAL001056 (Alu-J)
AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAGATTGCT
TGAGCCTAGGAGTTTGCGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAA
AGAAAAAAAAAATAGCTCTGCTGGTGGTGCATGCCTATAGTCTCAGCTATCAGGAGGCTG
GGACAGGAGGATCACTTGGGCCCGGGAGTTGAGGCTGTGGTGAGCCACGATCACACCACT
GCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTCAAAACAAACAAATAA
>gnl|alu|D00596_HSAL003180 (Alu-Sx)
AGCCAGGTGTGGTGGCTCACGCCTGTAATCCCACCGCTTTGGGAGGCTGAGTCAGATCAC
CTGAGGTTAGGAATTTGGGACCAGCCTGGCCAACATGGCGACACCCCAGTCTCTACTAAT
AACACAAAAAATTAGCCAGGTGTGCTGGTGCATGTCTGTAATCCCAGCTACTCAGGAGGC
TGAGGCATGAGAATTGCTCACGAGGCGGAGGTTGTAGTGAGCTGAGATCGTGGCACTGTA
CTCCAGCCTGGCGACAGAGGGAGAACCCATGTCAAAAACAAAAAAAGACACCACCAAAGG
TCAAAGCATA
>gnl|alu|X55502_HSAL000745 (Alu-J)
TGCCTTCCCCATCTGTAATTCTGGCACTTGGGGAGTCCAAGGCAGGATGATCACTTATGC
CCAAGGAATTTGAGTACCAAGCCTGGGCAATATAACAAGGCCCTGTTTCTACAAAAACTT
TAAACAATTAGCCAGGTGTGGTGGTGCGTGCCTGTGTCCAGCTACTCAGGAAGCTGAGGC
AAGAGCTTGAGGCTACAGTGAGCTGTGTTCCACCATGGTGCTCCAGCCTGGGTGACAGGG
CAAGACCCTGTCAAAAGAAAGGAAGAAAGAACGGAAGGAAAGAAGGAAAGAAACAAGGAG
AG
The sequence data are gathered from the alu.n file; hence, it matches with our database.
Step 4 − BLAST software provides many applications to search the database, and we use blastn.
The blastn application requires a minimum of three arguments: db, query and out. db refers to the
database to search against, query is the sequence to match and out is the file to store the results.
Now, run the below command to perform this simple query −
Running the above command will search and write the output to the results.xml file as given below
(partial data) −
<BlastOutput_db>alun</BlastOutput_db>
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>gnl|alu|Z15030_HSAL001056 (Alu-J)</BlastOutput_qu
<BlastOutput_query-len>292</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>1</Parameters_sc-match>
<Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
<Parameters_gap-open>0</Parameters_gap-open>
<Parameters_gap-extend>0</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num><Iteration_query-ID>Quer
<Iteration_query-def>gnl|alu|Z15030_HSAL001056 (Alu-J)</Iteration_
<Iteration_query-len>292</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|alu|Z15030_HSAL001056</Hit_id>
<Hit_def>(Alu-J)</Hit_def>
<Hit_accession>Z15030_HSAL001056</Hit_accession>
<Hit_len>292</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>540.342</Hsp_bit-score>
<Hsp_score>292</Hsp_score>
<Hsp_evalue>4.55414e-156</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>292</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>292</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>292</Hsp_identity>
<Hsp_positive>292</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>292</Hsp_align-len>
<Hsp_qseq>
AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAG
CGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAAAGAAAAAAAA
CCTATAGTCTCAGCTATCAGGAGGCTGGGACAGGAGGATCACTTGGGCCCGGGA
ACGATCACACCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTCAAA
</Hsp_qseq>
<Hsp_hseq>
AGGCTGGCACTGTGGCTCATGCTGAAATCCCAGCACGGCGGAGGACGGCGGAAG
GTTTGCGACCAGCCTGGGTGACATAGGGAGATGCCTGTCTCTACGCAAAAGAAA
GGTGGTGCATGCCTATAGTCTCAGCTATCAGGAGGCTGGGACAGGAGGATCACT
CTGTGGTGAGCCACGATCACACCACTGCACTCCAGCCTGGGTGACAGAGCAAGA
AAATAA
</Hsp_hseq>
<Hsp_midline>
|||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||||||||||||||||||||||||||||
|||||||||||||||||||||||||||
</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
.........................
.........................
.........................
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>327</Statistics_db-num>
<Statistics_db-len>80506</Statistics_db-len>
<Statistics_hsp-len>16</Statistics_hsp-len>
<Statistics_eff-space>21528364</Statistics_eff-space>
<Statistics_kappa>0.46</Statistics_kappa>
<Statistics_lambda>1.28</Statistics_lambda>
<Statistics_entropy>0.85</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
The above command can be run inside Python using the below code −
Here, the first one is a handle to the blast output and the second one is the possible error output
generated by the blast command.
Since we have provided the output file as a command line argument (out = “results.xml”) and set
the output format to XML (outfmt = 5), the output file will be saved in the current working
directory.
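One way to run the same query from Python is through the standard subprocess module, sketched below. This swaps in subprocess for whatever wrapper the original listing used; the helper names are hypothetical, and the file and database names are the ones from the text. Running it requires blastn to be installed and on the PATH, so the sketch only builds and prints the command.

```python
import subprocess


def build_blastn_command(db, query, out, outfmt=5):
    """Return the blastn argument list; outfmt 5 selects XML output."""
    return ["blastn", "-db", db, "-query", query, "-out", out, "-outfmt", str(outfmt)]


def run_blastn(db="alun", query="search.fsa", out="results.xml"):
    """Run blastn and return (stdout, stderr). Not invoked in this sketch."""
    completed = subprocess.run(
        build_blastn_command(db, query, out),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if blastn reports a failure
    )
    return completed.stdout, completed.stderr


command = build_blastn_command("alun", "search.fsa", "results.xml")
print(" ".join(command))
```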
Now, open the file directly using python open method and use NCBIXML parse method as
given below −
Some of the popular databases which can be accessed through Entrez are listed below −
Pubmed
Pubmed Central
Nucleotide (GenBank Sequence Database)
Protein (Sequence Database)
Genome (Whole Genome Database)
Structure (Three Dimensional Macromolecular Structure)
Taxonomy (Organisms in GenBank)
SNP (Single Nucleotide Polymorphism)
UniGene (Gene Oriented Clusters of Transcript Sequences)
CDD (Conserved Protein Domain Database)
3D Domains (Domains from Entrez Structure)
In addition to the above databases, Entrez provides many more databases to perform the field
search.
Biopython provides an Entrez specific module, Bio.Entrez, to access the Entrez databases. Let us
learn how to access Entrez using Biopython in this chapter −
Next, set your email to identify who is connected, with the code given below −
Now, call einfo function to find index term counts, last update, and available links for each
database as defined below −
The einfo method returns an object, which provides access to the information through its read
method as shown below −
<DbName>gds</DbName>
<DbName>geoprofiles</DbName>
<DbName>homologene</DbName>
<DbName>medgen</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omim</DbName>
<DbName>orgtrack</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>biosystems</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>pubmedhealth</DbName>
<DbName>seqannot</DbName>
<DbName>snp</DbName>
<DbName>sra</DbName>
<DbName>taxonomy</DbName>
<DbName>biocollections</DbName>
<DbName>unigene</DbName>
<DbName>gencoll</DbName>
<DbName>gtr</DbName>
</DbList>
</eInfoResult>
The data is in XML format. To get the data as a Python object, pass the handle returned by
Entrez.einfo() to the Entrez.read method.
Here, record is a dictionary which has one key, DbList as shown below −
>>> record.keys()
dict_keys(['DbList'])
Accessing the DbList key returns the list of database names shown below −
>>> record['DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss',
'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly',
'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar',
'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles',
'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim',
'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay',
'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot',
'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
Basically, the Entrez module parses the XML returned by the Entrez search system and provides it
as Python dictionaries and lists.
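The mapping from XML to Python objects can be sketched with the standard library alone. This is not how Entrez.read is implemented; einfo_to_record is a hypothetical helper and the XML is shortened to three of the database names listed above −

```python
import xml.etree.ElementTree as ET

# Shortened eInfoResult document with three of the database names shown above.
EINFO_XML = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
  </DbList>
</eInfoResult>"""


def einfo_to_record(xml_text):
    """Turn the DbList element into a dictionary holding a list of names."""
    root = ET.fromstring(xml_text)
    return {"DbList": [name.text for name in root.iter("DbName")]}


record = einfo_to_record(EINFO_XML)
print(list(record.keys()))
print(record["DbList"])
```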
Search Database
To search any one of the Entrez databases, we can use the Bio.Entrez.esearch() function, which
takes the database name and the search term.
If you want to search across databases, then you can use Entrez.egquery. This is similar to
Entrez.esearch except that it is enough to specify the keyword and skip the database parameter.
Fetch Records
Entrez provides a special method, efetch, to download the full details of a record from Entrez,
given the database and the record id.
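Both esearch and efetch are wrappers around NCBI's E-utilities web service, so a request boils down to a URL with query parameters. The sketch below only builds such URLs and does not contact NCBI; eutils_url is a hypothetical helper, and the search term, email address and record id are placeholder values −

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Build the URL for one E-utilities tool such as esearch or efetch."""
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"


# Placeholder values (assumptions): term, email address and record id.
search_url = eutils_url("esearch", db="pubmed", term="biopython", email="you@example.org")
fetch_url = eutils_url("efetch", db="nucleotide", id="EU490707", rettype="fasta", retmode="text")

print(search_url)
print(fetch_url)
```

The search returns record ids, and one of those ids is then passed on to the fetch request.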
PDB files distributed by the Protein Data Bank may contain formatting errors that make them
ambiguous or difficult to parse. The Bio.PDB module attempts to deal with these errors
automatically.
The Bio.PDB module implements two different parsers, one for the mmCIF format and one for the
PDB format.
mmCIF Parser
Let us download an example structure in mmCIF format from the PDB server using the
retrieve_pdb_file method of a PDBList object.
This will download the specified file (2fat.cif) from the server and store it in the current working
directory.
Here, PDBList provides options to list and download files from the online PDB FTP server. The
retrieve_pdb_file method needs the name of the file to be downloaded without its extension. It
also has options to specify the download directory, pdir, and the format of the file,
file_format. The possible values of file_format include −
"mmCif" (default, PDBx/mmCIF file)
"pdb" (PDB format)
"xml" (PDBML/XML format)
"mmtf" (highly compressed)
"bundle" (PDB formatted archive for large structures)
To parse the downloaded file, create an MMCIFParser with the QUIET option and call its
get_structure method. Here, QUIET suppresses the warnings raised while parsing the file.
get_structure parses the file and returns the structure with id 2FAT (the first argument).
After running the command, it parses the file and prints any warnings, if available.
>>> data
<Structure id = 2FAT>
>>> print(type(data))
<class 'Bio.PDB.Structure.Structure'>
We have successfully parsed the file and got the structure of the protein. We will learn the details
of the protein structure and how to access them later in this chapter.
PDB Parser
Let us download an example structure in PDB format from the PDB server by passing
file_format = "pdb" to retrieve_pdb_file.
This will download the specified file (pdb2fat.ent) from the server and store it in the current
working directory.
To parse it, create a PDBParser and call its get_structure method. Here, get_structure works as
it does for MMCIFParser. The PERMISSIVE option tells the parser to be as tolerant as possible of
problems in the protein data.
Now, check the structure and its type with the code snippet given below −
>>> data
<Structure id = 2fat>
>>> print(type(data))
<class 'Bio.PDB.Structure.Structure'>
The header attribute of the structure is a dictionary of metadata. To read an entry from it, type
the below command −
>>> print(data.header["name"])
an anti-urokinase plasminogen activator receptor (upar) antibody: crystal
structure and binding epitope
>>>
You can also check the date and resolution through the release_date and resolution keys of the
same header dictionary.
PDB Structure
The PDB structure parsed here is composed of a single model, containing two chains.
Each chain is composed of residues, and each residue is composed of multiple atoms, each having
a 3D position represented by (x, y, z) coordinates.
Let us learn how to access each level of the structure in detail in the below sections −
Model
The Structure.get_models() method returns an iterator over the models.
Here, a Model describes exactly one 3D conformation. It contains one or more chains.
Chain
The Model.get_chains() method returns an iterator over the chains.
Here, a Chain describes a proper polypeptide structure, i.e., a consecutive sequence of bound
residues.
Residue
The Chain.get_residues() method returns an iterator over the residues.
Atoms
The Residue.get_atoms() method returns an iterator over the atoms.
An Atom holds the 3D coordinates of an atom, which get_vector() returns as a Vector. It is shown
below −
>>> atoms[0].get_vector()
<Vector 18.49, 73.26, 44.16>
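The model, chain, residue and atom nesting can be illustrated without Bio.PDB by grouping the fixed-width ATOM records of a PDB file. This is a toy sketch, not the Bio.PDB parser: build_hierarchy is a hypothetical helper, and the three ATOM lines are made-up records (the first one reuses the coordinates printed above) −

```python
# Three made-up ATOM records in fixed-width PDB layout (assumed example data).
PDB_TEXT = """\
ATOM      1  N   MET A   1      18.490  73.260  44.160
ATOM      2  CA  MET A   1      19.000  74.000  45.000
ATOM      3  N   GLY B   1       1.000   2.000   3.000
"""


def build_hierarchy(pdb_text):
    """Group ATOM records as chain -> residue -> atom name -> (x, y, z)."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()                     # columns 13-16
        residue = (line[17:20].strip(), int(line[22:26]))   # name, number
        chain_id = line[21]                                 # column 22
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains.setdefault(chain_id, {}).setdefault(residue, {})[atom_name] = xyz
    return chains


structure = {0: build_hierarchy(PDB_TEXT)}  # a single model, numbered 0

for model_id, chains in structure.items():
    for chain_id, residues in chains.items():
        for (res_name, res_seq), atoms in residues.items():
            for atom_name, xyz in atoms.items():
                print(model_id, chain_id, res_name, res_seq, atom_name, xyz)
```

The nested loops mirror the order in which get_models, get_chains, get_residues and get_atoms are walked.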
>>> print(seq.counts)
        0      1      2      3
A:   2.00   1.00   0.00   1.00
C:   0.00   1.00   2.00   0.00
G:   0.00   1.00   1.00   0.00
T:   1.00   0.00   0.00   2.00
>>> seq.counts["A", :]
(2, 1, 0, 1)
If you want to access the columns of counts, use the below command −
>>> seq.counts[:, 3]
{'A': 1, 'C': 0, 'T': 2, 'G': 0}
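The counts matrix is simply a tally of each letter at each position. The pure-Python sketch below rebuilds it for three example instances, chosen here so that the tallies agree with the matrix printed above; count_matrix is a hypothetical helper, not part of Bio.motifs −

```python
# Example instances (an assumption) whose tallies match the matrix above.
instances = ["AGCT", "TCGA", "AACT"]


def count_matrix(sequences, alphabet="ACGT"):
    """Count how often each letter occurs at each position."""
    length = len(sequences[0])
    counts = {letter: [0] * length for letter in alphabet}
    for sequence in sequences:
        for position, letter in enumerate(sequence):
            counts[letter][position] += 1
    return counts


counts = count_matrix(instances)
row_a = tuple(counts["A"])                                       # one row
column_3 = {letter: row[3] for letter, row in counts.items()}    # one column

print(row_a)
print(column_3)
```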
AGCTTACG
ATCGTACC
TTCCGAAT
GGTACGTA
AAGCTTGG
You can create your own logo using the following link − http://weblogo.berkeley.edu/
Add the above sequences, create a new logo and save the image as seq.png in your biopython
folder.
[Figure: seq.png, the sequence logo created from the sequences above]
>>> seq.weblogo("seq.png")
This DNA sequence motif is represented as a sequence logo for the LexA-binding motif.
JASPAR Database
JASPAR is one of the most popular databases of sequence motifs. Motifs in any of its formats can
be read, written and used to scan sequences. It stores meta-information for each motif.
Let us create a file in JASPAR sites format named sample.sites in the biopython folder. Its
content is given below −
sample.sites
>MA0001 ARNT 1
AACGTGatgtccta
>MA0001 ARNT 2
CAGGTGggatgtac
>MA0001 ARNT 3
TACGTAgctcatgc
>MA0001 ARNT 4
AACGTGacagcgct
>MA0001 ARNT 5
CACGTGcacgtcgt
>MA0001 ARNT 6
cggcctCGCGTGc
In the above file, we have created motif instances: in each record, the uppercase letters are the
binding site and the lowercase letters are flanking sequence. Now, let us create a motif object
from the above instances by reading the file in "sites" format.
Here, data reads all the motif instances from the sample.sites file.
To print the position counts computed from the instances in data, use the below command −
>>> print(data.counts)
        0      1      2      3      4      5
A:   2.00   5.00   0.00   0.00   0.00   1.00
C:   3.00   0.00   5.00   0.00   0.00   0.00
G:   0.00   1.00   1.00   6.00   0.00   5.00
T:   1.00   0.00   0.00   0.00   6.00   0.00
>>>
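The counts above can be checked by hand: keep only the uppercase letters of each site line and tally them column by column. The sketch below does this with the standard library on the sample.sites content shown earlier; read_sites is a hypothetical helper, not the Bio.motifs reader −

```python
from collections import Counter

# Content of sample.sites as listed above.
SITES = """>MA0001 ARNT 1
AACGTGatgtccta
>MA0001 ARNT 2
CAGGTGggatgtac
>MA0001 ARNT 3
TACGTAgctcatgc
>MA0001 ARNT 4
AACGTGacagcgct
>MA0001 ARNT 5
CACGTGcacgtcgt
>MA0001 ARNT 6
cggcctCGCGTGc
"""


def read_sites(text):
    """Return the uppercase part of every non-header line."""
    instances = []
    for line in text.splitlines():
        if line and not line.startswith(">"):
            instances.append("".join(ch for ch in line if ch.isupper()))
    return instances


instances = read_sites(SITES)
columns = [Counter(column) for column in zip(*instances)]

print(instances)
for position, column in enumerate(columns):
    print(position, dict(column))
```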
MySQL (biosqldb-mysql.sql)
PostgreSQL (biosqldb-pg.sql)
Oracle (biosqldb-ora/*.sql)
SQLite (biosqldb-sqlite.sql)
It also provides minimal support for Java-based HSQLDB and Derby databases.
Biopython provides simple ORM-like capabilities to work with a BioSQL-based database. It
provides a module, BioSQL, which offers the following functionality −
Parse a sequence database like GenBank, SwissProt, BLAST result, Entrez result, etc., and
directly load it into the BioSQL database
Fetch the sequence data from the BioSQL database
Fetch taxonomy data from the NCBI taxonomy database and store it in the BioSQL database
Run any SQL query against the BioSQL database
biodatabase
bioentry
biosequence
seqfeature
taxon
taxon_name
ontology
term
dbxref
Here, we shall create a SQLite based BioSQL database using the below steps.
Step 2 − Download the BioSQL project from the GitHub URL. https://github.com/biosql/biosql
Step 3 − Open a console and create a directory using mkdir and enter into it.
cd /path/to/your/biopython/sample
mkdir sqlite-biosql
cd sqlite-biosql
Step 5 − Copy the biosqldb-sqlite.sql file from the BioSQL project (/sql/biosqldb-sqlite.sql) and
store it in the current directory.
Step 7 − Run the below command to see all the new tables in our database.
sqlite> .headers on
sqlite> .mode column
sqlite> .separator ROW "\n"
sqlite> SELECT name FROM sqlite_master WHERE type = 'table';
biodatabase
taxon
taxon_name
ontology
term
term_synonym
term_dbxref
term_relationship
term_relationship_term
term_path
bioentry
bioentry_relationship
bioentry_path
biosequence
dbxref
dbxref_qualifier_value
bioentry_dbxref
reference
bioentry_reference
comment
bioentry_qualifier_value
seqfeature
seqfeature_relationship
seqfeature_path
seqfeature_qualifier_value
seqfeature_dbxref
location
location_qualifier_value
sqlite>
The first three commands are configuration commands to configure SQLite to show the result in a
formatted manner.
Step 8 − Copy the sample GenBank file, ls_orchid.gbk, provided by the Biopython team at
https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk
into the current directory and save it as orchid.gbk.
Step 9 − Create a python script, load_orchid.py using the below code and execute it.
from Bio import SeqIO
from BioSQL import BioSeqDatabase
server = BioSeqDatabase.open_database(driver = "sqlite3", db = "orchid.db")  # db file name assumed
db = server.new_database("orchid")
count = db.load(SeqIO.parse("orchid.gbk", "gb"), True)
server.commit()
server.close()
The above code parses the records in the file, converts them into Python objects and inserts them
into the BioSQL database. We will analyze the code in a later section.
We have now created a new BioSQL database and loaded some sample data into it. We shall
discuss the important tables in the next section.
Simple ER Diagram
The biodatabase table is at the top of the hierarchy and its main purpose is to organize a set of
sequence data into a single group/virtual database. Every entry in biodatabase refers to a
separate database and it does not mingle with another database. All the related tables in the
BioSQL database have references to a biodatabase entry.
The bioentry table holds all the details about a sequence except the sequence data. The sequence
data of a particular bioentry is stored in the biosequence table.
The taxon and taxon_name tables hold taxonomy details, and every entry refers to them to specify
its taxon information.
After understanding the schema, let us look into some queries in the next section.
BioSQL Queries
Let us delve into some SQL queries to better understand how the data are organized and how the
tables are related to each other. Before proceeding, let us open the database in the sqlite3
shell and set some formatting options.
.header and .mode are formatting options to better visualize the data. You can also use any
SQLite editor to run the queries.
List the virtual sequence databases available in the system as given below −
select
*
from
biodatabase;
*** Result ***
sqlite> .width 15 15 15 15
sqlite> select * from biodatabase;
biodatabase_id name authority description
--------------- --------------- --------------- ---------------
1 orchid
sqlite>
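List three entries available in the database orchid with the below given code. Note that
limit 1, 3 skips the first row and returns the next three, so the result starts at bioentry_id 2 −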
select
be.*,
bd.name
from
bioentry be
inner join
biodatabase bd
on bd.biodatabase_id = be.biodatabase_id
where
bd.name = 'orchid'
limit 1, 3;
*** Result ***
sqlite> .width 15 15 10 10 10 10 10 50 10 10
sqlite> select be.*, bd.name from bioentry be inner join biodatabase bd on
bd.biodatabase_id = be.biodatabase_id where bd.name = 'orchid' Limit 1,3;
bioentry_id  biodatabase_id  taxon_id  name    accession  identifier  division  description                                         version  name
-----------  --------------  --------  ------  ---------  ----------  --------  --------------------------------------------------  -------  ------
2            1               19        Z78532  Z78532     2765657     PL        C.californicum 5.8S rRNA gene and ITS1 and ITS2 DN  1        orchid
3            1               20        Z78531  Z78531     2765656     PL        C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DN  1        orchid
4            1               21        Z78530  Z78530     2765655     PL        C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 D  1        orchid
sqlite>
List the sequence details associated with an entry (accession − Z78532, name − C. californicum
5.8S rRNA gene and ITS1 and ITS2 DNA) with the given code −
select
substr(cast(bs.seq as varchar), 0, 10) || '...' as seq,
bs.length,
be.accession,
be.description,
bd.name
from
biosequence bs
inner join
bioentry be
on be.bioentry_id = bs.bioentry_id
inner join
biodatabase bd
on bd.biodatabase_id = be.biodatabase_id
where
bd.name = 'orchid'
and be.accession = 'Z78532';
sqlite> .width 15 5 10 50 10
sqlite> select substr(cast(bs.seq as varchar), 0, 10) || '...' as seq,
bs.length, be.accession, be.description, bd.name from biosequence bs inner
join bioentry be on be.bioentry_id = bs.bioentry_id inner join biodatabase bd
on bd.biodatabase_id = be.biodatabase_id where bd.name = 'orchid' and
be.accession = 'Z78532';
seq              length  accession   description                                      name
---------------  ------  ----------  -----------------------------------------------  ----------
CGTAACAAG...     753     Z78532      C.californicum 5.8S rRNA gene and ITS1 and ITS2
sqlite>
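The same join can be tried from Python with the standard library sqlite3 module. The sketch below uses an in-memory database with a heavily reduced schema (only the columns the query touches, which is an assumption and not the real BioSQL schema) and one row taken from the result above; the stored sequence is a shortened stand-in padded with N −

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Reduced stand-in for the BioSQL schema: only the columns used below.
connection.executescript(
    """
    CREATE TABLE biodatabase (biodatabase_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        biodatabase_id INTEGER REFERENCES biodatabase(biodatabase_id),
        accession TEXT
    );
    CREATE TABLE biosequence (
        bioentry_id INTEGER REFERENCES bioentry(bioentry_id),
        length INTEGER,
        seq TEXT
    );
    INSERT INTO biodatabase VALUES (1, 'orchid');
    INSERT INTO bioentry VALUES (2, 1, 'Z78532');
    INSERT INTO biosequence VALUES (2, 753, 'CGTAACAAGNNNNNN');
    """
)

query = """
SELECT substr(bs.seq, 1, 9) || '...' AS seq, bs.length, be.accession, bd.name
FROM biosequence bs
INNER JOIN bioentry be ON be.bioentry_id = bs.bioentry_id
INNER JOIN biodatabase bd ON bd.biodatabase_id = be.biodatabase_id
WHERE bd.name = ? AND be.accession = ?
"""
# Parameters are bound with ? placeholders instead of being pasted into the SQL.
rows = connection.execute(query, ("orchid", "Z78532")).fetchall()
print(rows)
connection.close()
```

Against the real orchid database, the same query text can be run by connecting to the database file instead of ":memory:" and skipping the setup script.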
Get the complete sequence associated with an entry (accession − Z78532, name − C.
californicum 5.8S rRNA gene and ITS1 and ITS2 DNA) using the below code −
select
bs.seq
from
biosequence bs
inner join
bioentry be
on be.bioentry_id = bs.bioentry_id
inner join
biodatabase bd
on bd.biodatabase_id = be.biodatabase_id
where
bd.name = 'orchid'
and be.accession = 'Z78532';
*** Result ***
select distinct
tn.name
from
biodatabase d
inner join
bioentry e
on e.biodatabase_id = d.biodatabase_id
inner join
taxon t
on t.taxon_id = e.taxon_id
inner join
taxon_name tn
on tn.taxon_id = t.taxon_id
where
d.name = 'orchid' limit 10;
*** Result ***
server.load_database_sql(SQL_FILE)
server.commit()
db = server.new_database("orchid")
count = db.load(SeqIO.parse("orchid.gbk", "gb"), True)
server.commit()
server.close()
We will have a deeper look at every line of the code and its purpose −
Line 2 − Loads the BioSeqDatabase module. This module provides all the functionality to interact
with a BioSQL database.
Line 5 − open_database opens the specified database (db) with the configured driver (driver) and
returns a handle to the BioSQL database (server). Biopython supports sqlite, mysql, postgresql
and oracle databases.
Line 6-10 − The load_database_sql method loads the SQL from the external file and executes it.
The commit method commits the transaction. We can skip this step because we already created the
database with the schema.
Line 12 − The new_database method creates a new virtual database, orchid, and returns a handle,
db, to execute commands against the orchid database.
Line 13 − The load method loads the sequence entries (iterable SeqRecord) into the orchid
database. SeqIO.parse parses the GenBank file and returns all the sequences in it as iterable
SeqRecord objects. The second parameter (True) of the load method instructs it to fetch the
taxonomy details of the sequence data from the NCBI taxonomy database, if they are not already
available in the system.
Line 15 − close closes the database connection and destroys the server handle.
db = server["orchid"]
seq_record = db.lookup(gi = 2765658)
print(seq_record.id, seq_record.description[:50] + "...")
print("Sequence length %i," % len(seq_record.seq))
Here, server["orchid"] returns the handle to fetch data from the virtual database orchid. The
lookup method provides an option to select sequences based on criteria, and we have selected the
sequence with identifier 2765658. lookup returns the sequence information as a SeqRecord
object. Since we already know how to work with SeqRecord, it is easy to get data from it.
Remove a Database
Removing a database is as simple as calling the remove_database method with the proper database
name and then committing the change.
Biopython provides the Bio.PopGen module for population genetics. It mainly supports GenePop,
a popular population genetics package developed by Michel Raymond and Francois Rousset.
A simple parser
Let us write a simple application to parse the GenePop format and understand the concept.
Download the GenePop file provided by the Biopython team from the link given below −
https://raw.githubusercontent.com/biopython/biopython/master/Tests/PopGen/c3line.gen
from Bio.PopGen import GenePop
record = GenePop.read(open("c3line.gen"))
>>> record.loci_list
['136255903', '136257048', '136257636']
>>> record.pop_list
['4', 'b3', '5']
>>> record.populations
[[('1', [(3, 3), (4, 4), (2, 2)]), ('2', [(3, 3), (3, 4), (2, 2)]),
('3', [(3, 3), (4, 4), (2, 2)]), ('4', [(3, 3), (4, 3), (None, None)])],
[('b1', [(None, None), (4, 4), (2, 2)]), ('b2', [(None, None), (4, 4), (2,
('b3', [(None, None), (4, 4), (2, 2)])],
[('1', [(3, 3), (4, 4), (2, 2)]), ('2', [(3, 3), (1, 4), (2, 2)]),
('3', [(3, 2), (1, 1), (2, 2)]), ('4',
[(None, None), (4, 4), (2, 2)]), ('5', [(3, 3), (4, 4), (2, 2)])]]
>>>
Here, there are three loci available in the file and three sets of population: First population has 4
records, second population has 3 records and third population has 5 records. record.populations
shows all sets of population with alleles data for each locus.
>>> record.remove_population(0)
>>> record.populations
[[('b1', [(None, None), (4, 4), (2, 2)]),
('b2', [(None, None), (4, 4), (2, 2)]),
('b3', [(None, None), (4, 4), (2, 2)])],
[('1', [(3, 3), (4, 4), (2, 2)]),
('2', [(3, 3), (1, 4), (2, 2)]),
('3', [(3, 2), (1, 1), (2, 2)]),
('4', [(None, None), (4, 4), (2, 2)]),
('5', [(3, 3), (4, 4), (2, 2)])]]
>>>
>>> record.remove_locus_by_position(0)
>>> record.loci_list
['136257048', '136257636']
>>> record.populations
[[('b1', [(4, 4), (2, 2)]), ('b2', [(4, 4), (2, 2)]), ('b3', [(4, 4), (2, 2
[('1', [(4, 4), (2, 2)]), ('2', [(1, 4), (2, 2)]),
('3', [(1, 1), (2, 2)]), ('4', [(4, 4), (2, 2)]), ('5', [(4, 4), (2, 2)]
>>>
First, install the GenePop software and place the installation folder in the system path. To get
basic information about a GenePop file, create an EasyController object and then call its
get_basic_info method.
Here, the first item returned is the population list and the second item is the loci list.
To get the list of all alleles of a particular locus, call the get_alleles_all_pops method,
passing the locus name.
To get the allele list for a specific population and locus, call get_alleles, passing the
population position and the locus name.
Genome Diagram
A genome diagram represents genetic information as charts. Biopython provides the
Bio.Graphics.GenomeDiagram module to draw such diagrams. The GenomeDiagram
module requires ReportLab to be installed.
Create a FeatureSet for each separate set of features you want to display, and add
Bio.SeqFeature objects to them.
Create a GraphSet for each graph you want to display, and add graph data to them.
Create a Track for each track you want on the diagram, and add GraphSets and
FeatureSets to the tracks you require.
https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk
and read records from SeqRecord object then finally draw a genome diagram. It is explained
below,
Now, we can apply color theme changes using alternative colors from green to grey as defined
below −
>>> diagram.draw(
... format = "linear", orientation = "landscape", pagesize = 'A4',
... fragments = 4, start = 0, end = len(record))
>>> diagram.write("orchid.pdf", "PDF")
>>> diagram.write("orchid.eps", "EPS")
>>> diagram.write("orchid.svg", "SVG")
>>> diagram.write("orchid.png", "PNG")
After executing the above commands, the diagram is saved in your Biopython directory in each of
the four formats.
[Figure: genome.png, the linear genome diagram]
You can also draw the image in circular format by making the below changes (cm is imported from
reportlab.lib.units) −
>>> diagram.draw(
... format = "circular", circular = True, pagesize = (20*cm,20*cm),
... start = 0, end = len(record), circle_core = 0.7)
>>> diagram.write("circular.pdf", "PDF")
Chromosomes Overview
The DNA molecule is packaged into thread-like structures called chromosomes. Each chromosome
is made up of DNA tightly coiled many times around proteins called histones that support its
structure.
Chromosomes are not visible in the cell's nucleus, not even under a microscope, when the
cell is not dividing. However, the DNA that makes up chromosomes becomes more tightly packed
during cell division and is then visible under a microscope.
In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. Twenty-two of
these pairs, called autosomes, look the same in both males and females. The 23rd pair, the sex
chromosomes, differ between males and females. Females have two copies of the X
chromosome, while males have one X and one Y chromosome.
Biopython provides an excellent module, Bio.Phenotype to analyze phenotypic data. Let us learn
how to parse, interpolate, extract and analyze the phenotype microarray data in this chapter.
Parsing
Phenotype microarray data can be in two formats: CSV and JSON. Biopython supports both
formats. The Biopython parser reads the phenotype microarray data and returns it as a collection
of PlateRecord objects. Each PlateRecord object contains a collection of WellRecord objects,
arranged as a plate of 8 rows and 12 columns. The eight rows are labeled A to H and the 12
columns are labeled 01 to 12. For example, the well in the 4th row and 6th column is identified
as D06.
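The well naming scheme is easy to reproduce in plain Python. The sketch below generates the identifier for any row and column and lists all wells of a plate; well_id is a hypothetical helper, not part of Bio.phenotype −

```python
import string

ROWS = string.ascii_uppercase[:8]   # "ABCDEFGH"
COLUMNS = range(1, 13)              # 1 to 12


def well_id(row, column):
    """Return the identifier for a 1-based row and column, e.g. (4, 6) -> D06."""
    return f"{ROWS[row - 1]}{column:02d}"


all_wells = [well_id(row, column) for row in range(1, 9) for column in COLUMNS]

print(well_id(4, 6))
print(len(all_wells), all_wells[0], all_wells[-1])
```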
Let us understand the format and the concept of parsing with the following example −
Step 3 − Invoke the phenotype.parse method, passing the data file and the format option
("pm-csv"). It returns an iterable of PlateRecord objects.
Step 5 − As discussed earlier, a plate contains 8 rows each having 12 items. A WellRecord can be
accessed in two ways.
Step 6 − Each well has a series of measurements at different time points, which can be
accessed using a for loop.
Interpolation
Interpolation gives more insight into the data. Biopython provides methods to interpolate
WellRecord data to get information for intermediate time points. The syntax is similar to list
indexing and so, easy to learn.
To get the data at 20.1 hours, just pass the time as the index value as specified below −
>>> well[20.10]
69.40000000000003
>>>
We can pass start time point and end time point as well as specified below −
>>> well[20:30]
[67.0, 84.0, 102.0, 119.0, 135.0, 147.0, 158.0, 168.0, 179.0, 186.0]
>>>
The above command interpolates data from 20 hours to 30 hours at a 1 hour interval. By default,
the interval is 1 hour and we can change it to any value. For example, let us give a 15 minute
(0.25 hour) interval as specified below −
>>> well[20:21:0.25]
[67.0, 73.0, 75.0, 81.0]
>>>
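The value returned for 20.1 hours is consistent with straight-line interpolation between neighbouring readings. The sketch below assumes that the values printed for the 0.25 hour step are the underlying measurements (together with the reading at 21 hours from the earlier slice); interpolate is a hypothetical helper, not the WellRecord implementation −

```python
from bisect import bisect_right

# Assumed measurements: the 0.25 hour values above plus the reading at 21 hours.
times = [20.0, 20.25, 20.5, 20.75, 21.0]
signals = [67.0, 73.0, 75.0, 81.0, 84.0]


def interpolate(time, times, signals):
    """Linearly interpolate the signal at a time inside the measured range."""
    if not times[0] <= time <= times[-1]:
        raise ValueError("time is outside the measured range")
    right = min(bisect_right(times, time), len(times) - 1)
    left = right - 1
    fraction = (time - times[left]) / (times[right] - times[left])
    return signals[left] + fraction * (signals[right] - signals[left])


print(interpolate(20.1, times, signals))
print([interpolate(t, times, signals) for t in (20.0, 20.25, 20.5, 20.75)])
```

At 20.1 hours the result lies four tenths of the way from 67.0 to 73.0, which is about 69.4.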
>>> well.fit()
Traceback (most recent call last):
...
Bio.MissingPythonDependencyError: Install scipy to extract curve parameters
>>> well.model
>>> getattr(well, 'min')
0.0
>>> getattr(well, 'max')
388.0
>>> getattr(well, 'average_height')
205.42708333333334
>>>
Biopython depends on the scipy module for curve fitting, which is why well.fit() fails above when
scipy is not installed. The min, max and average_height values are still calculated without scipy.
Biopython - Plotting
This chapter explains how to plot sequences. Before moving to this topic, let us understand
the basics of plotting.
Plotting
Matplotlib is a Python plotting library which produces quality figures in a variety of formats. We
can create different types of plots like line charts, histograms, bar charts, pie charts, scatter
charts, etc.
pylab is a module that belongs to matplotlib and combines the numerical module numpy with the
graphical plotting module pyplot. Biopython uses the pylab module for plotting sequences. To do
this, we need to import it with the below code −
import pylab
Before importing, we need to install the matplotlib package using the pip command
(pip install matplotlib).
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDV
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
Line Plot
Now, let us create a simple line plot for the above fasta file.
>>> pylab.ylabel("count")
Text(0, 0.5, 'count')
>>>
>>> pylab.grid()
Step 6 − Draw simple line chart by calling plot method and supplying records as input.
>>> pylab.plot(records)
[<matplotlib.lines.Line2D object at 0x10b6869d0>]
>>> pylab.savefig("lines.png")
Result
After executing the above command, the plot is saved as lines.png in your Biopython directory.
Histogram Chart
A histogram is used for continuous data, where the bins represent ranges of data. Drawing a
histogram is the same as drawing a line chart, except that instead of pylab.plot we call the hist
method of the pylab module with records and a custom value for bins (5).
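To make the binning concrete, the sketch below counts the sequence lengths in five equal-width bins using plain Python. The lengths were counted from the eleven sequences listed earlier; histogram is a hypothetical helper that imitates equal-width binning and is not the pylab implementation −

```python
# Lengths of seq0 to seq10, counted from the FASTA listing above.
lengths = [62, 72, 67, 57, 62, 66, 70, 57, 68, 65, 61]


def histogram(values, bins):
    """Count values in equal-width bins; the last bin includes the maximum."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    edges = [low + index * width for index in range(bins + 1)]
    counts = [0] * bins
    for value in values:
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges


counts, edges = histogram(lengths, bins=5)
print(counts)
print(edges)
```

The lengths run from 57 to 72, so each of the five bins is 3 residues wide.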
>>> pylab.ylabel("count")
Text(0, 0.5, 'count')
>>>
>>> pylab.grid()
Step 6 − Draw a histogram by calling the hist method and supplying records and the number of
bins as input.
>>> pylab.hist(records,bins=5)
(array([2., 3., 1., 3., 2.]), array([57., 60., 63., 66., 69., 72.]), <a list
of 5 Patch objects>)
>>>
>>> pylab.savefig("hist.png")
Result
After executing the above command, the plot is saved as hist.png in your Biopython directory.
GC Percentage in Sequence
GC percentage, the share of G and C among all letters of a sequence, is one of the most commonly
used measures for comparing different sequences. We can draw a simple line chart of the GC
percentage of a set of sequences and compare them at a glance. Here, we just change the plotted
data from sequence length to GC percentage.
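The calculation itself can be written in a few lines of plain Python. The sketch below is a stand-in for the Biopython utility, with a hypothetical function name and made-up DNA strings as input −

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


# Made-up DNA sequences used only for illustration.
sequences = ["ATGC", "GGCC", "ATAT"]
gc = [gc_percent(sequence) for sequence in sequences]
print(gc)
```

The resulting list of values is what gets passed to the plot call in place of the sequence lengths.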
>>> pylab.xlabel("Genes")
Text(0.5, 0, 'Genes')
>>> pylab.grid()
Step 6 − Draw a simple line chart by calling the plot method and supplying the gc values as input.
>>> pylab.plot(gc)
[<matplotlib.lines.Line2D object at 0x10b6869d0>]
>>> pylab.savefig("gc.png")
Result
After executing the above command, the plot is saved as gc.png in your Biopython directory.
analysis, bioinformatics, etc. It can be achieved by various algorithms to understand how the
cluster is widely used in different analysis.
In bioinformatics, cluster analysis is mainly used in gene expression data analysis to
find groups of genes with similar gene expression.
In this chapter, we will check out important algorithms in Biopython to understand the
fundamentals of clustering on a real dataset.
Biopython uses Bio.Cluster module for implementing all the algorithms. It supports the following
algorithms −
Hierarchical Clustering
K - Clustering
Self-Organizing Maps
Principal Component Analysis
Hierarchical Clustering
Hierarchical clustering links each node to its nearest neighbor by a distance measure and so
builds up a cluster. A Bio.Cluster Node has three attributes: left, right and distance.
If you want to construct tree-based clustering, use the treecluster function.
The above function returns a Tree cluster object. This object contains the nodes that describe
how the items, clustered as rows or columns, were joined.
K - Clustering
It is a type of partitioning algorithm and is classified into k-means, k-medians and k-medoids
clustering. Let us understand each of them in brief.
K-means Clustering
This approach is popular in data mining. The goal of this algorithm is to find groups in the data,
with the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of the K groups based on the
features that are provided. Data points are clustered based on feature similarity.
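The iteration described above can be sketched in plain Python for one-dimensional data. This is a teaching sketch with fixed starting centers, not the Bio.Cluster implementation; the function names and the data points are made up for the illustration −

```python
def nearest_center(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda index: abs(point - centers[index]))


def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points and moving centers to cluster means."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            clusters[nearest_center(point, centers)].append(point)
        # An empty cluster keeps its previous center.
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    labels = [nearest_center(point, centers) for point in points]
    return labels, centers


# Made-up data with two obvious groups and K = 2 starting centers.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels, centers = kmeans_1d(points, centers=[0.0, 5.0])
print(labels)
print(centers)
```

Replacing the mean with the median in the center update would turn this sketch into k-medians.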
K-medians Clustering
It is another type of clustering algorithm, which calculates the median (rather than the mean) of
each cluster to determine its centroid.
K-medoids Clustering
This approach works on a given set of items, using the distance matrix and the number of
clusters passed by the user.
The kcluster function takes a data matrix as input, not Seq instances. You need to convert your
sequences to a numerical matrix and pass that to the kcluster function.
One way of converting the data to a matrix containing numerical elements only is the
numpy.fromstring function, which translates each letter in a sequence to its ASCII code. Recent
NumPy releases deprecate this binary use of fromstring in favour of numpy.frombuffer applied to
the encoded bytes.
This creates a 2D array of encoded sequences, provided all sequences have the same length, that
the kcluster function recognizes and uses to cluster your sequences.
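The conversion described above can be sketched as follows. This version uses numpy.frombuffer on the ASCII-encoded bytes, and the four short sequences are made up for illustration. Clustering on raw ASCII codes is a crude encoding and is shown only to demonstrate the data flow.

```python
import numpy as np
from Bio.Cluster import kcluster

# Equal-length sequences (illustrative data).
sequences = ["AGCT", "CGTA", "AAGT", "TCCG"]

# Encode every letter as its ASCII code: one row per sequence.
matrix = np.array(
    [np.frombuffer(seq.encode("ascii"), dtype=np.uint8) for seq in sequences]
)
print(matrix)

# Cluster the rows of the encoded matrix into two groups.
clusterid, error, nfound = kcluster(matrix, nclusters=2, npass=10)
print(clusterid)
```

For Seq objects, str(seq) gives the plain string that can be encoded in the same way.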
Self-Organizing Maps
This approach is a type of artificial neural network. It was developed by Kohonen and is often
called a Kohonen map. It organizes items into clusters based on a rectangular topology.
Let us create a simple cluster using a small data array as shown below −
>>> from Bio.Cluster import somcluster
>>> from numpy import array
>>> data = array([[1, 2], [3, 4], [5, 6]])
>>> clusterid, map = somcluster(data)
>>> print(map)
[[[-1.36032469 0.38667395]]
[[-0.41170578 1.35295911]]]
>>> print(clusterid)
[[1 0]
[1 0]
[1 0]]
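Here, clusterid is an array with two columns, where the number of rows is equal to the number
of items that were clustered; each row holds the x and y coordinates of the grid cell to which
the item was assigned. The map array printed above has dimensions (nxgrid, nygrid, number of
columns) if rows are clustered, or (nxgrid, nygrid, number of rows) if columns are clustered.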
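Principal Component Analysis
Principal Component Analysis projects data onto the directions of greatest variance. The following
NumPy session works through the calculation by hand −
>>> from numpy import array, mean, cov
>>> from numpy.linalg import eig
# define a matrix
>>> A = array([[1, 2], [3, 4], [5, 6]])
>>> print(A)
[[1 2]
[3 4]
[5 6]]
# center the columns by subtracting the column means
>>> M = mean(A, axis=0)
>>> C = A - M
>>> print(C)
[[-2. -2.]
[ 0. 0.]
[ 2. 2.]]
# calculate the covariance matrix of the centered data
>>> V = cov(C.T)
>>> print(V)
[[ 4. 4.]
[ 4. 4.]]
# eigendecomposition of the covariance matrix
>>> values, vectors = eig(V)
>>> print(vectors)
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
>>> print(values)
[ 8. 0.]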
The same rectangular matrix can be passed to the pca function of the Bio.Cluster module, which
returns a tuple of columnmean, coordinates, components and eigenvalues.
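A minimal sketch of the Bio.Cluster call, reusing the 3 x 2 matrix from the NumPy session. The variable names on the left-hand side are a naming choice; the printed values are not reproduced here because signs and scaling conventions can differ from the hand calculation.

```python
import numpy as np
from Bio.Cluster import pca

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

columnmean, coordinates, components, eigenvalues = pca(data)

print(columnmean)    # mean of each column
print(coordinates)   # each row expressed in terms of the principal components
print(components)    # the principal components themselves
print(eigenvalues)   # one value per principal component

# The original matrix can be rebuilt from the returned pieces.
rebuilt = columnmean + np.dot(coordinates, components)
```

Checking that rebuilt matches data is a quick way to confirm that the four return values have been unpacked in the right order.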
Supervised learning is based on an input variable (X) and an output variable (Y). It uses an
algorithm to learn the mapping function from the input to the output. The mapping is written as −
Y = f(X)
The main objective of this approach is to approximate the mapping function well enough that, when
you have new input data (X), you can predict the output variable (Y) for that data.
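As a toy illustration of learning Y = f(X), the sketch below fits a straight line to a few (X, Y) pairs by least squares and then predicts Y for a new X. It uses plain Python only; the function names fit_line and predict and the data are hypothetical, and a straight line is just one possible form for f.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned mapping to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Training data generated by y = 2x + 1 (illustrative).
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = predict(model, 5)
print(model, prediction)
```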
k-Nearest Neighbors
k-Nearest neighbors is a supervised machine learning algorithm. It classifies a data point
according to the classes of its nearest neighbors. Biopython provides the Bio.kNN module to
predict variables based on the k-nearest neighbors algorithm.
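The idea can be sketched in a few lines of plain Python. This is not the Bio.kNN implementation; it is written against the standard library so that it does not depend on which Biopython release is installed, and the function name knn_predict and the data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two groups of labelled points (illustrative data).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(train_x, train_y, (0.5, 0.5))
near_b = knn_predict(train_x, train_y, (5.5, 5.5))
print(near_a, near_b)
```

An odd value of k avoids tied votes when there are two classes.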
Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem.
They are not a single algorithm but a family of algorithms that share a common principle, i.e.
every feature being classified is assumed to be independent of every other feature, given the
class. Biopython provides the Bio.NaiveBayes module to work with the Naive Bayes algorithm.
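The independence assumption means a class score is simply the class prior multiplied by one probability per feature. The sketch below shows this for categorical features in plain Python with add-one smoothing. It is not the Bio.NaiveBayes API; the function names, the feature values and the labels are all hypothetical toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for features, label in zip(samples, labels):
        for index, value in enumerate(features):
            value_counts[(label, index)][value] += 1
            vocab[index].add(value)
    return class_counts, value_counts, vocab


def classify_naive_bayes(model, features):
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for index, value in enumerate(features):
            seen = value_counts[(label, index)][value]
            # Add-one smoothing keeps unseen values from zeroing the score.
            score += math.log((seen + 1) / (count + len(vocab[index])))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy features: (GC content, length).
samples = [("high", "long"), ("high", "long"), ("high", "short"),
           ("low", "short"), ("low", "short"), ("low", "long")]
labels = ["coding", "coding", "coding", "noncoding", "noncoding", "noncoding"]

model = train_naive_bayes(samples, labels)
print(classify_naive_bayes(model, ("high", "long")))
```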
Markov Model
A Markov model is a mathematical system, defined as a collection of random variables, that
moves from one state to another according to certain probabilistic rules. Biopython provides
the Bio.MarkovModel and Bio.HMM.MarkovModel modules to work with Markov models.
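To make the "probabilistic rules" concrete, the following plain-Python sketch estimates the transition probabilities of a first-order Markov chain from a DNA string. It does not use the Biopython modules named above; the function name transition_probabilities and the sequence are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next state | current state) from adjacent letters."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


probs = transition_probabilities("AACGTAACGA")
for state, row in sorted(probs.items()):
    print(state, row)
```

Each row of the result sums to 1, because from any state the chain must move to some next state.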
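To run the full test suite, execute the following command from the Tests directory of the
Biopython source distribution −
python run_tests.py
This runs all the test scripts and reports the outcome of each one.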
An individual test script can also be run on its own −
python test_AlignIO.py
Conclusion
As we have learned, Biopython is one of the important software packages in the field of
bioinformatics. Being written in Python (easy to learn and write), it provides extensive
functionality for computation and data handling in bioinformatics. It also provides an easy and
flexible interface to almost all the popular bioinformatics software, so that their functionality
can be used from Python as well.