Bio Python 202111
2021/11/11
Learning Objectives
• Biopython as a toolkit
• Seq objects and their methods
• SeqRecord objects have data fields
• SeqIO to read and write sequence
objects
• BLAST & Multiple sequence
alignment
• Direct access to GenBank with
Entrez.efetch
Modules
• Python functions are divided into 3 sets
1. A small core set that is always available
2. Built-in modules, such as math, that ship with the basic install
and only need to be imported (e.g. >>> import math)
3. An extremely large number of optional modules that must be
downloaded and installed before you can import them
• Code that uses optional modules is said to have “dependencies”
• Biopython belongs to the third set, so code that uses it has a dependency
• The code for dependencies is located in different
places such as SourceForge, GitHub, and developers’
own websites (Perl and R are better organized)
• Trouble? Ask the TAs: each person’s problem is
mostly unique and there is no general solution
Install Biopython
• Website for installation instruction:
– http://biopython.org/wiki/Download
• Required Software
– Python (version above 2.6)
– NumPy (Numerical Python)
• Optional Software
– ReportLab – used for pdf graphics code
– psycopg – used for BioSQL with a PostgreSQL database
– mysql-connector – used for BioSQL with a MySQL database
– MySQLdb – An alternative MySQL library used by BioSQL
Information Source: http://biopython.org/wiki/Download
Is your Biopython installed correctly?
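One quick way to answer this question is to try the import and look at the version string. This is a minimal sketch: the helper name biopython_version is made up for illustration, while Bio.__version__ is the version attribute that the Biopython package exposes.

```python
def biopython_version():
    """Return the installed Biopython version string, or None if the import fails."""
    try:
        import Bio  # top-level Biopython package
    except ImportError:
        return None
    return Bio.__version__


version = biopython_version()
if version is None:
    print("Biopython is not installed (import Bio failed)")
else:
    print("Biopython version:", version)
```

If the import fails, the installation instructions on the download page listed above are the place to start.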
Turn a Seq object into a string
• Sometimes you will need to work with just the sequence
string in a Seq object using a tool that is not aware of the Seq
object methods
• Turn a Seq object into a string with str()
• You will lose the alphabet and just get back the string.
• You can input it into other programs and work with it
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',
IUPACUnambiguousDNA())
>>> seq_string=str(my_seq)
>>> seq_string
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Seq objects have special methods:
MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",
IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
TypeError: 'Seq' object does not support item assignment
>>> my_seq.reverse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Seq' object has no attribute 'reverse'
• A Seq object is “read only” (immutable); to change it, first convert it
to a mutable sequence
• A dot method works on the object preceding the dot
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Seq objects have special methods:
MutableSeq
• Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",
...                          IUPAC.unambiguous_dna)
• Either way will give you a sequence object which can be changed:
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA',
IUPACUnambiguousDNA())
• Note:
– You can’t use a MutableSeq object as a dictionary key.
– You can use a Python string or a Seq object as a
dictionary key.
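The dictionary-key rule follows the general Python convention that mutable objects are not hashable. The sketch below shows the same behaviour with built-in types only. The stand-ins are an assumption made for illustration: str plays the role of the immutable Seq, and bytearray plays the role of MutableSeq.

```python
immutable_seq = "GCCATTGTAATGG"            # str: immutable and hashable
mutable_seq = bytearray(b"GCCATTGTAATGG")  # bytearray: can be edited in place

# An immutable sequence works as a dictionary key.
annotations = {immutable_seq: "example fragment"}

# A mutable sequence supports item assignment...
mutable_seq[5:6] = b"G"

# ...but it cannot be hashed, so it is rejected as a dictionary key.
try:
    annotations[mutable_seq] = "edited fragment"
    key_accepted = True
except TypeError:
    key_accepted = False

print(mutable_seq.decode(), key_accepted)
```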
Seq objects have special methods:
Changing case
• A dot method works on the object preceding the dot
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()  # only works on an RNA alphabet
Seq('ATGGCCATTGTAATGG', IUPACUnambiguousDNA())
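What reverse_complement(), transcribe() and back_transcribe() do to the letters can be sketched with plain Python strings. This is not Biopython's implementation: it handles only the four unambiguous bases, and the value of template_dna is an assumption, chosen so that the result matches the output shown above.

```python
# Lookup table mapping each unambiguous DNA base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Turn a coding-strand DNA string into RNA by replacing T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """Turn an RNA string back into DNA by replacing U with T."""
    return rna.replace("U", "T").replace("u", "t")


template_dna = "CCATTACAATGGCCAT"  # assumed template strand
messenger_rna = transcribe(reverse_complement(template_dna))
print(messenger_rna)
print(back_transcribe(messenger_rna))
```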
Seq objects have special methods: translate()
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
...                  generic_dna)
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
Seq objects have special methods:
translate()
>>> coding_dna = Seq("ATGGCCATTGTAATGG",
...                  IUPAC.unambiguous_dna)
>>> coding_dna.translate()
//lib/python3.7/site-packages/Bio/Seq.py:2715:
BiopythonWarning: Partial codon, len(sequence) not a multiple of
three. Explicitly trim the sequence or add trailing N before
translation. This may become an error in future.
BiopythonWarning)
Seq('MAIVM', IUPACProtein())
>>> from Bio.Data import CodonTable
>>> coding_dna.translate()
Seq('MAIVM', IUPACProtein())
Seq objects have special methods: translate()
• Codon table
– By default, Biopython uses NCBI table id 1, Standard Code
>>> from Bio.Data import CodonTable
>>> print(CodonTable.unambiguous_dna_by_id[1])
>>> print(CodonTable.unambiguous_dna_by_name["Standard"])
>>> CodonTable.unambiguous_dna_by_name["Standard"].stop_codons
['TAA', 'TAG', 'TGA']
>>> CodonTable.unambiguous_dna_by_id[2].stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
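To see how the table number changes a translation, here is a small library-free sketch. It is not how Biopython stores its tables: the two amino-acid strings follow the NCBI layout for table 1 (Standard) and table 2 (Vertebrate Mitochondrial), and the helper names translate and stop_codons are chosen here only to mirror the calls above. A trailing partial codon is silently dropped, whereas Biopython issues a warning.

```python
from itertools import product

BASES = "TCAG"  # NCBI base order; the first base of the codon varies slowest
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# One amino-acid letter per codon, in the same order as CODONS ("*" = stop).
STANDARD_AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
VERT_MITO_AAS = ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
                 "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG")

CODON_TABLES = {
    1: dict(zip(CODONS, STANDARD_AAS)),
    2: dict(zip(CODONS, VERT_MITO_AAS)),
}


def stop_codons(table_id):
    """List the codons that the given table reads as stop."""
    return [codon for codon, aa in CODON_TABLES[table_id].items() if aa == "*"]


def translate(dna, table=1, to_stop=False):
    """Translate complete codons; a trailing partial codon is ignored."""
    codon_table = CODON_TABLES[table]
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = codon_table[dna[i:i + 3].upper()]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))                         # standard code
print(translate(coding_dna, table=2, to_stop=True))  # vertebrate mitochondrial
print(stop_codons(1), stop_codons(2))
```

The difference comes from TGA, which is a stop codon in table 1 but codes for tryptophan (W) in table 2.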
Types and usage of codon table in
Biopython
>>> help(coding_dna.translate)
• NCBI genetic code number and name:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Seq Objects have special methods
• The Bio.SeqUtils module has some useful methods, such as
GC() to calculate % of G+C bases in a DNA sequence.
• Let's assume you have already downloaded a FASTA file from GenBank, such as NC_005816.fna, and
saved it as a text file in your current directory
Extension            Meaning
fasta, fa, fas, fsa  generic FASTA
fna                  FASTA nucleic acid
ffn                  FASTA nucleotide of gene regions (specific to coding regions)
faa                  FASTA amino acid
frn                  FASTA non-coding RNA
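The GC percentage itself is simple to compute. The sketch below uses a plain Python string and a made-up helper name, gc_percent. It counts only the letters G and C, so ambiguity codes are ignored, which the Bio.SeqUtils function may handle differently.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a DNA string (0.0 if empty)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
```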
Multiple FASTA Records in one file
• The FASTA format can store many sequences in one
text file
• SeqIO.parse() reads the records one by one
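• SeqIO.parse() returns an iterator over the records, SeqIO.read() returns exactly one record, and SeqIO.write() takes an iterable of records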
• This code creates a list of SeqRecord objects:
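With Biopython this is typically a one-liner such as records = list(SeqIO.parse("NC_005816.fna", "fasta")). As a rough, library-free sketch of what reading records one by one involves, the hypothetical helper iter_fasta below yields plain (header, sequence) tuples instead of SeqRecord objects. The two-record input is made up for illustration.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) tuples from FASTA text, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example, standing in for a real FASTA file.
example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2 second record\nGGCC\n")
records = list(iter_fasta(example))
print(len(records), records[0])
```

Because the helper is a generator, wrapping it in list() is what turns the one-by-one stream into a list, just as with SeqIO.parse().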
CODE  DESCRIPTION
x     No gap penalties.
s     Same open and extend gap penalties for both sequences.
d     The sequences have different open and extend gap penalties.
c     A callback function returns the gap penalties.
Global alignment
>>> from Bio import pairwise2
>>> from Bio.pairwise2 import format_alignment
>>> alignments = pairwise2.align.globalxx("ACCGT", "ACG")
>>> print(format_alignment(*alignments[0]))
ACCGT
| ||
A-CG-
Score=3
Local alignment: All possible alignments
ACCGT
| ||
A-CG-
Score=6
ACCGT
|| |
AC-G-
Score=6
Global alignment: Match & mismatch
AND Gap penalties
Gap open: -0.5 points, and gap extend: -0.1 points.
ACCGT
|| |
AC-G-
Score=5
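The three scores on these slides (3, 6 and 5) can be reproduced with a short dynamic-programming sketch. It is not pairwise2's algorithm: it returns only the best global score, and it uses a single linear gap penalty instead of separate open and extend values. For ACCGT against ACG every gap in the best alignment has length 1, so charging the gap-open value per gap gives the same result. The match and mismatch values 2 and -1 are an assumption that is consistent with Score=6.

```python
def global_score(a, b, match=1, mismatch=0, gap=0):
    """Best global alignment score with a linear gap penalty (Needleman-Wunsch)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, base_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, base_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if base_a == base_b else mismatch)
            gap_in_b = prev[j] + gap      # base_a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # base_b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_score("ACCGT", "ACG"))                                # like globalxx
print(global_score("ACCGT", "ACG", match=2, mismatch=-1))          # no gap penalty
print(global_score("ACCGT", "ACG", match=2, mismatch=-1, gap=-0.5))
```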
SeqIO for FASTQ
• FASTQ is a format for Next Generation
DNA sequence data (FASTA + Quality)
• SeqIO can read (and write) FASTQ format
files
from Bio import SeqIO
count = 0
for rec in SeqIO.parse("example.fastq", "fastq"):
    count += 1
print(count)
Everything you can do with FASTA files through SeqIO also works with FASTQ files
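To make the "FASTA + Quality" idea concrete, here is a library-free sketch of the same record count. It assumes the common four-line record layout and Phred+33 quality encoding. The helper name iter_fastq and the two reads are made up for illustration.

```python
import io


def iter_fastq(handle):
    """Yield (name, sequence, scores) from FASTQ text with 4-line records."""
    while True:
        title = handle.readline().strip()
        if not title:
            break  # end of input (a blank line also stops this simple reader)
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        # Phred+33 encoding: quality score = ASCII code of the character - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield title[1:], sequence, scores


# Made-up two-read example, standing in for example.fastq.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
count = 0
for name, sequence, scores in iter_fastq(example):
    count += 1
print(count)
```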
Direct Access to GenBank, PubMed etc
• BioPython has modules that can directly access databases over the
Internet
• The Entrez module uses the NCBI Efetch service
• Entrez.efetch, part of the Entrez module, works on many NCBI
databases including protein and PubMed literature citations
• The ‘gb’ data type contains much more annotation information, but
rettype=‘fasta’ also works
• With a few tweaks, this script could be used to download a list of
GenBank IDs and save them as FASTA or GenBank files:
• Please zip and upload your answer with your name (including
the python scripts, .txt file) to Dropbox
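The EFetch request described above boils down to a URL with a database, one or more IDs and a return type. The sketch below only builds such URLs with the standard library and makes no network request. The helper name efetch_url is made up, and the accession NC_005816 is reused from the FASTA example earlier. With Biopython, the corresponding call has the form Entrez.efetch(db=..., id=..., rettype=..., retmode=...) after setting Entrez.email.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nucleotide", rettype="gb", retmode="text"):
    """Build an EFetch URL for a list of accession IDs (no request is sent)."""
    query = urlencode({
        "db": db,
        "id": ",".join(accessions),  # several IDs go in one comma-separated value
        "rettype": rettype,          # "gb" for GenBank records, "fasta" for FASTA
        "retmode": retmode,
    })
    return EFETCH_URL + "?" + query


for rettype in ("gb", "fasta"):
    print(efetch_url(["NC_005816"], rettype=rettype))
```

Looping over a list of IDs and choosing the output file extension from rettype is the kind of tweak the slide refers to.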
Get a file by FTP in Python
>>> from ftplib import FTP
>>> host="ftp.sra.ebi.ac.uk"
>>> ftp=FTP(host)
>>> ftp.login()
'230 Login successful.'
>>> ftp.cwd('vol1/fastq/SRR020/SRR020192')
'250 Directory successfully changed.'
>>> ftp.retrlines('LIST')
-r--r--r-- 1 ftp ftp 1777817 Jun 24 20:12 SRR020192.fastq.gz
'226 Directory send OK.'
>>> ftp.retrbinary('RETR SRR020192.fastq.gz', \
open('SRR020192.fastq.gz', 'wb').write)
'226 Transfer complete.'
>>> ftp.quit()
'221 Goodbye.'
Multiple Sequence Alignment
• ClustalW2/Muscle: popular command-line
tools for multiple sequence alignment
• Input: Three sequences
– S1: TATACATTAAA
– S2: TAGGATTCCAC
– S3: TATACATTAAG
• S1 and S3 are highly similar.
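The similarity claim can be checked by counting identical positions. This is only an ungapped, position-by-position comparison of equal-length strings, not what ClustalW or Muscle compute. The helper name identical_positions is made up for illustration.

```python
from itertools import combinations

sequences = {
    "S1": "TATACATTAAA",
    "S2": "TAGGATTCCAC",
    "S3": "TATACATTAAG",
}


def identical_positions(a, b):
    """Count positions where two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


# Compare every pair of sequences once.
identities = {
    (n1, n2): identical_positions(sequences[n1], sequences[n2])
    for n1, n2 in combinations(sequences, 2)
}
for pair, matches in identities.items():
    print(pair, matches, "of", len(sequences[pair[0]]), "positions identical")
```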
ClustalW
• Step 1: Download ClustalW2 (http://www.clustal.org/clustal2/)
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 495 columns
MTVMSGENVDEASAAPGHPQDGSYPRPAEHDDHECCERVVINIS...TDV A.mel
MTVMSGENVDEASAAPGHPQEGSYPRPAEHEDHECCERVVINIS...TDV P.tig
MTVMSGENVDEASAAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV H.sap
MTVMSGENADEASTAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV M.mus
MTVMSGENVEEASAAQGHPQDISYPRPADHDDHDCCERVVINIS...TDV S.har
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV E.gar
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV M.uni
BLAST
• BioPython has several methods to work with the popular
NCBI BLAST software
• NCBIWWW.qblast() sends queries directly to the NCBI
BLAST server. The query can be a sequence (such as a Seq object),
FASTA-formatted text, or a GenBank ID.
Generating the query
>>> from Bio import SeqIO
>>> from Bio.Blast import NCBIWWW
>>> query = SeqIO.read("test.fasta", format="fasta")
>>> # send the query as the seq part of the SeqRecord
>>> result_handle = NCBIWWW.qblast("blastn", "nt", query.seq)
>>> # create an XML output file
>>> blast_file = open("my_blast.xml", "w")
>>> # read the result and write it to the XML file
>>> blast_file.write(result_handle.read())
>>> blast_file.close()
>>> result_handle.close()
Parse BLAST Results
• It is often useful to obtain a BLAST result directly
(local BLAST server or via Web browser) and
then parse the result file with Python.
• Save the BLAST result in XML format
– NCBIXML.read() for a file with a single BLAST result (single
query)
– NCBIXML.parse() for a file with multiple BLAST results
(multiple queries)
>>> from Bio.Blast import NCBIXML
>>> handle = open("my_blast.xml")
>>> blast_record = NCBIXML.read(handle)
>>> for hit in blast_record.descriptions:
...     print(hit.title)
...     print(hit.e)
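To show what kind of information sits in the XML file, here is a standard-library sketch that pulls hit titles and E-values out of a tiny hand-written document. The tag names Hit, Hit_def and Hsp_evalue are those of NCBI's BLAST XML output, but the document is heavily trimmed and its values are made up for illustration. The helper name best_evalues and the 1e-3 cutoff are also assumptions. NCBIXML remains the more convenient route for real result files.

```python
import xml.etree.ElementTree as ET

# Heavily trimmed, made-up stand-in for a BLAST XML result file.
TOY_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>made-up hit A</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>made-up hit B</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>0.002</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def best_evalues(xml_text):
    """Return (hit title, smallest HSP E-value) for every hit in the XML text."""
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("Hit"):
        evalues = [float(node.text) for node in hit.iter("Hsp_evalue")]
        results.append((hit.findtext("Hit_def"), min(evalues)))
    return results


# Keep only hits below an (assumed) E-value cutoff.
significant = [hit for hit in best_evalues(TOY_XML) if hit[1] < 1e-3]
for title, evalue in significant:
    print(title, evalue)
```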
BLAST Record Object