
Bio Python 202111

1. Biopython is a collection of modules for biological computation, including tools for working with DNA/protein sequences, sequence alignments, and accessing common biological databases.
2. The Seq object class represents biological sequences, with data attributes for the sequence string and its alphabet. Seq objects can be indexed, sliced, and have string methods like len() and count().
3. MutableSeq objects allow sequences to be modified, while regular Seq objects are immutable. Methods like upper(), lower(), and tomutable() allow changing between cases and mutability.

Uploaded by

Rohan Ray

Introduction to Biopython

2021/11/11
Learning Objectives

• Biopython as a toolkit
• Seq objects and their methods
• SeqRecord objects have data fields
• SeqIO to read and write sequence
objects
• BLAST & Multiple sequence
alignment
• Direct access to GenBank with
Entrez.efetch
Modules
• Python functions are divided into 3 sets
1. A small core set that is always available
2. Some built-in modules, such as math, that can be imported from
the basic install (e.g. >>> import math)
3. An extremely large number of optional modules that must be
downloaded and installed before you can import them
• Code that uses such optional modules is said to have “dependencies”
• Biopython belongs to the third set, so code that imports it has a dependency
• The code for dependencies is located in different
places such as SourceForge, GitHub, and developers’
own websites (Perl and R are better organized)
• Trouble? Ask the TAs, as each person’s problem is
mostly unique and there is no general solution
Install Biopython
• Website for installation instruction:
– http://biopython.org/wiki/Download
• Required Software
– Python (version above 2.6)
– NumPy (Numerical Python)

• Optional Software
– ReportLab – used for pdf graphics code
– psycopg – used for BioSQL with a PostgreSQL database
– mysql-connector – used for BioSQL with a MySQL database
– MySQLdb – An alternative MySQL library used by BioSQL

Information Source: http://biopython.org/wiki/Download
Is your Biopython installed correctly?

• Type the following on your terminal or interpreter:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna
>>> new_seq = Seq('GATCAGAAG', unambiguous_dna)
>>> new_seq[0:2]
Seq('GA', IUPACUnambiguousDNA())
>>> new_seq.translate()
Seq('DQK', IUPACProtein())
• Biopython is an integrated collection of modules for
“biological computation” including tools for working
with DNA/protein sequences, sequence alignments,
population genetics, and molecular structures
• It also provides interfaces to common biological
databases (eg. GenBank) and to some common
locally installed software (eg. BLAST).
• Loosely based on BioPerl
• Relatively fewer protein specific functions in
Biopython
Biopython Tutorial
• Biopython has a “Tutorial & Cookbook”:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

by: Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck,
Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek
Wilczyński

Most of the examples in this class are drawn from the above link
Python is an Object-Oriented language
• Composed of data structures (known as classes)
– objects can contain complex and well-defined forms of data,
and they can also have built-in methods
• Complex objects are built from other objects
– E.g. strings, lists and other data types each have certain
methods
• Many classes of objects support the same operation, so the
same call works on all of them
– E.g. print() works on strings, lists and Seq objects
• An object created from a class belongs to that class
and inherits all of its properties
The Seq object

• The Seq object class is simple and fundamental for
a lot of Biopython work. A Seq object can contain
DNA, RNA, or protein.

1. Data: this is the actual sequence data string of the
sequence.
2. Alphabet: an object describing what the individual
characters making up the string “mean” and how
they should be interpreted.
The Seq object: {Data, Alphabet}
• It is a complex object with a string sub-object (the
sequence)
– Behaves much like the Python string object
– Also defines an alphabet for that string
– This constrains the allowed properties of the
string object
• The alphabets are actually Biopython objects such
as IUPACAmbiguousDNA or IUPACProtein (IUPAC: International Union of Pure and
Applied Chemistry)
• These are defined in the Bio.Alphabet module (removed in Biopython 1.78,
so the examples here need Biopython 1.77 or earlier)
• A Seq object with a DNA alphabet is different from one with an amino
acid alphabet
The Seq class in Bio.Seq lets you create the Seq object
>>> from Bio.Seq import Seq               # library
>>> from Bio.Alphabet import IUPAC        # importing the alphabets from their module
>>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)
# Function call that creates the Seq object. Minimum: the data attribute
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> print(my_seq)    # e.g. of print() working on different objects, here the Seq object
AGTACACTGGT

You can create an ambiguous sequence with the default generic alphabet like this:
>>> my_seq = Seq('MRTAVACTKGT')
>>> my_seq
Seq('MRTAVACTKGT', Alphabet())
Seq objects have string methods
• Seq objects have methods that work just like string objects
• You can get the len() of a Seq, slice it, and count() specific letters in it
• Get single characters / count sequence length:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
...
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
>>> print(my_seq[0])   # first letter
G
>>> print(my_seq[2])   # third letter
T
Seq objects have string methods
• A Seq object works with len() and has a count() method, just like a string. Like a Python string,
count() gives a non-overlapping count:

>>> from Bio.Seq import Seq
>>> "ATGCATAT".count("AT")
3
>>> Seq("AAAAA").count("AA")
2
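To make "non-overlapping" concrete, here is a minimal plain-Python sketch (no Biopython needed). The helper `count_overlapping` is a hypothetical name introduced only for this illustration; it is not a Biopython function.

```python
def count_overlapping(sequence, motif):
    """Count every occurrence of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Move on by one character (not len(motif)) so overlapping hits are found.
        start = sequence.find(motif, start + 1)
    return count


non_overlapping = "AAAAA".count("AA")          # str.count skips past each match
overlapping = count_overlapping("AAAAA", "AA")  # matches start at 0, 1, 2 and 3
print(non_overlapping, overlapping)
```

When the motif cannot overlap itself (such as "AT"), both approaches give the same number.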
E.g. Determining GC content

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGS", IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
43.75
Seq objects have string methods: slice
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Turn a Seq object into a string
• Sometimes you will need to work with just the sequence
string in a Seq object using a tool that is not aware of the Seq
object methods
• Turn a Seq object into a string with str()
• You will lose the alphabet and just get back the string.
• You can input it into other programs and work with it

>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> seq_string = str(my_seq)
>>> seq_string
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Seq objects have special methods: MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
TypeError: 'Seq' object does not support item assignment
>>> my_seq.reverse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Seq' object has no attribute 'reverse'

• A Seq object is “read only” (immutable); to change it, first convert it to a
MutableSeq
• The dot method works on the object preceding the dot
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Seq objects have special methods: MutableSeq
• Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

• Either way will give you a sequence object which can be changed:
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

• Note:
– You can’t use a MutableSeq object as a dictionary key.
– You can use a Python string or a Seq object as a
dictionary key.
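The dictionary-key rule follows from how Python hashes keys: a key must not change after it is stored. The sketch below is a plain-Python analogy and an assumption of this note, not Biopython code: `str` stands in for the immutable Seq and `bytearray` for a MutableSeq.

```python
counts = {}
counts["GATC"] = 1                    # str is immutable and hashable: fine as a key

mutable = bytearray(b"GATC")          # stand-in for a mutable sequence
mutable[0] = ord("C")                 # in-place edit is allowed: now b"CATC"

try:
    counts[mutable] = 1               # mutable objects are unhashable
    key_accepted = True
except TypeError:
    key_accepted = False

# Freeze the mutable value into an immutable string first, then use it as a key.
counts[mutable.decode("ascii")] = 2
print(key_accepted, counts)
```

The same idea applies to sequences: convert the mutable object to an immutable one (or to a string) before using it as a key.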
Seq objects have special methods: Changing case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()    # the dot method works on the object preceding the dot
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

• Strictly speaking, the IUPAC alphabets are for upper case sequences only,
so lower() falls back to the generic DNA alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())

• Note: You can also use MutableSeq to change case


Seq objects have special methods: transcribe()

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CCATTACAATGGCCAT', IUPACUnambiguousDNA())

>>> messenger_rna = coding_dna.transcribe()    # only works on a DNA alphabet
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGG', IUPACUnambiguousRNA())

>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()    # only works on an RNA alphabet
Seq('ATGGCCATTGTAATGG', IUPACUnambiguousDNA())
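What these methods do to the letters can be sketched in plain Python. This is only an illustration on ordinary strings, assuming unambiguous A/C/G/T input; the function names mirror the Biopython method names but are defined here from scratch.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """mRNA back to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding_dna = "ATGGCCATTGTAATGG"
template_dna = reverse_complement(coding_dna)
messenger_rna = transcribe(coding_dna)
print(template_dna, messenger_rna)
```

Note that transcription here starts from the coding strand, so it is a simple T to U substitution, exactly as in the session above.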
Seq objects have special methods: translate()

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, generic_dna
>>> from Bio.Data import CodonTable    # optional
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> messenger_rna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())

>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
>>> coding_dna.translate(table=2, to_stop=True)    # table 2: vertebrate mitochondrial code
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
Seq objects have special methods: translate()
>>> coding_dna = Seq("ATGGCCATTGTAATGG", IUPAC.unambiguous_dna)
>>> coding_dna.translate()
//lib/python3.7/site-packages/Bio/Seq.py:2715:
BiopythonWarning: Partial codon, len(sequence) not a multiple of
three. Explicitly trim the sequence or add trailing N before
translation. This may become an error in future.
  BiopythonWarning)
Seq('MAIVM', IUPACProtein())
>>> from Bio.Data import CodonTable
>>> coding_dna.translate()
Seq('MAIVM', IUPACProtein())
Seq objects have special methods: translate()

• Codon table
– By default, Biopython uses NCBI table id 1, Standard Code
>>> from Bio.Data import CodonTable
>>> print(CodonTable.unambiguous_dna_by_id[1])
>>> print(CodonTable.unambiguous_dna_by_name["Standard"])
>>> CodonTable.unambiguous_dna_by_name["Standard"].stop_codons
['TAA', 'TAG', 'TGA']
>>> CodonTable.unambiguous_dna_by_id[2].stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
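How the table choice and `to_stop` change a translation can be sketched without Biopython. The two 64-letter strings below are the NCBI amino-acid strings for tables 1 and 2 written out from memory, with codons in TCAG order; treat them, and the helper names `codon_table`, `translate` and `stop_codons`, as assumptions of this sketch rather than the Biopython implementation.

```python
from itertools import product

BASES = "TCAG"
# One amino-acid letter per codon, codons ordered TTT, TTC, TTA, TTG, TCT, ...
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",  # Standard
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",  # Vertebrate mitochondrial
}


def codon_table(table_id):
    """Build a {codon: amino acid} dictionary for the given table id."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, TABLES[table_id]))


def stop_codons(table_id):
    return [codon for codon, aa in codon_table(table_id).items() if aa == "*"]


def translate(dna, table=1, to_stop=False):
    """Translate complete codons; a trailing partial codon is ignored."""
    lookup = codon_table(table)
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = lookup[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, table=2, to_stop=True))
```

In table 2, TGA codes for W instead of a stop, which is why the third translation reads through to the final TAG.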
Types and usage of codon table in Biopython

• >>> help(coding_dna.translate)
• NCBI genetic code number and name:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Seq objects have special methods
• The Bio.SeqUtils module has some useful functions, such as
GC() to calculate % of G+C bases in a DNA sequence.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGS', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
43.75
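The two numbers above differ by exactly one base out of 32. The likely reason is the final S in the sequence: S is the IUPAC ambiguity code for "G or C", and GC() appears to count it while the manual formula does not. The plain-Python sketch below reproduces both results under that assumption; `gc_percent` and its `count_ambiguous` flag are hypothetical names, not part of Bio.SeqUtils.

```python
def gc_percent(sequence, count_ambiguous=True):
    """Percentage of G and C (optionally also the ambiguity code S) in a sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    letters = "GCS" if count_ambiguous else "GC"
    gc = sum(seq.count(letter) for letter in letters)
    return 100.0 * gc / len(seq)


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGS"
print(gc_percent(my_seq))                          # S counted as G or C
print(gc_percent(my_seq, count_ambiguous=False))   # only literal G and C
```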
Protein Alphabet
• You could re-define my_seq as a protein by changing the alphabet,
which will totally change the methods that will work on it.
• (‘G’, ‘A’, ‘T’, ‘C’ are valid protein letters)

>>> from Bio.SeqUtils import molecular_weight
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> print(molecular_weight(my_seq))
3436.1957

>>> my_seq.alphabet = IUPAC.protein
>>> my_seq
Seq('AGTACACTGGT', IUPACProtein())
>>> print(molecular_weight(my_seq))
912.0004
• Try back_transcribe() and see what you get

• Relatively few protein-specific functions in Biopython
• Those that exist, e.g. isoelectric point and hydropathy (GRAVY), are in Bio.SeqUtils.ProtParam
SeqRecord Object
• The SeqRecord object is like a database record (such as
GenBank). It is a complex object that contains a Seq
object, and also annotation fields, known as “attributes”.
.seq
.id
.name
.description
.letter_annotations
.annotations
.features
.dbxrefs
• You can think of attributes as slots with names inside
the SeqRecord object. Each one may contain data or
be empty.
• You can reference a particular part of the object.
GenBank format: https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Creating a SeqRecord – Example
>>> from Bio.Seq import Seq                  # creating the Seq object
>>> test_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord      # this import enables SeqRecord objects
>>> test_seq_r = SeqRecord(test_seq)
>>> test_seq_r.id
'<unknown id>'
>>> test_seq_r.id = "xyz"
>>> test_seq_r.name = "something"
>>> test_seq_r.description = "Made up"
>>> test_seq_r.seq
Seq('GATC', Alphabet())
>>> test_seq_r.annotations["evidence"] = "None"    # annotations: dictionary attribute
>>> print(test_seq_r.annotations["evidence"])
None
>>> test_seq_r.seq = 'ATGC'    # note: this stores a plain string, not a Seq object
>>> test_seq_r.seq
'ATGC'
• Specify fields in the SeqRecord object with a . (dot) syntax
SeqIO and file formats
• SeqIO is the all-purpose file read/write tool for SeqRecords
• SeqIO can read many file types: http://biopython.org/wiki/SeqIO
• SeqIO has read() and write() functions
• (you do not need to “open” the file first; a filename is enough)
• E.g. read a text file in FASTA format
• Each FASTA entry is read into a SeqRecord: the header line fills .id, .name and .description, and the sequence fills .seq

• Let's assume you have already downloaded a FASTA file from GenBank, such as NC_005816.fna, and
saved it as a text file

>>> from Bio import SeqIO
>>> gene = SeqIO.read("Path_to_file/NC_005816.fna", "fasta")    # read() expects exactly one record
>>> gene.id
'gi|45478711|ref|NC_005816.1|'
>>> gene.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
>>> print(gene.seq)    # or print(str(gene.seq)): you will get the whole sequence
>>> len(gene.seq)
9609
>>> record = SeqIO.read("Path_to_file/file.gb", "genbank")
Fasta format

Extension             Meaning
fasta, fa, fas, fsa   generic FASTA
fna                   FASTA nucleic acid
ffn                   FASTA nucleotide of gene regions (specific to coding regions)
faa                   FASTA amino acid
frn                   FASTA non-coding RNA
Multiple FASTA Records in one file
• The FASTA format can store many sequences in one
text file
• SeqIO.parse() reads the records one by one
• SeqIO.parse() returns an iterator of SeqRecord objects; SeqIO.read() returns a single record
• This code creates a list of SeqRecord objects:

>>> from Bio import SeqIO
>>> handle = open("KCNA1_aves_mammals.fasta", "r")    # plain "r": the old "rU" mode was removed in Python 3.11
# “handle” is a pointer to the file
>>> seq_list = list(SeqIO.parse(handle, "fasta"))    # dumping the whole file into a list
>>> handle.close()
>>> print(seq_list[0].seq)    # shows the first sequence in the list
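Reading "one by one" can be pictured as a generator that yields a record each time it reaches the next ">" header. The sketch below is a plain-Python stand-in, not the SeqIO implementation; `parse_fasta` and the two made-up records are hypothetical, and it returns simple tuples instead of SeqRecord objects.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, description, sequence) one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)          # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    record_id = header.split(None, 1)[0] if header else ""
    return record_id, header, "".join(chunks)


example = io.StringIO(
    ">seq1 first made-up record\nGATC\nGATC\n"
    ">seq2 second made-up record\nACGT\n"
)
records = parse_fasta(example)
first = next(records)      # only the first record has been read so far
rest = list(records)       # reading the remainder into a list
print(first, rest)
```

Because nothing is read until it is asked for, the iterator style works for files that are too large to hold in a list.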
Reading Sequence Files: Next
>>> from Bio import SeqIO
>>> record_iterator = SeqIO.parse("PATH/KCNA1_aves_mammals.fas", "fasta")
>>> first_record = next(record_iterator)
>>> print(first_record.id)
gi|2765658|emb|Z78533.1|CIZ78533
>>> print(first_record.description)
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

>>> second_record = next(record_iterator)
>>> print(second_record.id)
gi|2765657|emb|Z78532.1|CCZ78532
>>> print(second_record.description)
gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Grab sequence from FASTA file
• If you have a large local FASTA file, and a list of
sequence IDs (‘file.txt') that you want to grab:
>>> from Bio import SeqIO
>>> output = open('path/selected_seqs.fasta', 'w')
>>> # splitlines() splits on \n; change according to the delimiter
>>> id_list = open('path/KCNA1_aves_mammals_id_list.txt').read().splitlines()
>>> # looking up each sequence and not dumping the file into a list
>>> for test in SeqIO.parse('KCNA1_aves_mammals.fas', 'fasta'):
...     for seqname in id_list:
...         name = seqname.strip()
...         if test.id == name:
...             SeqIO.write(test, output, 'fasta')
...
>>> output.close()

strip() = chomp in Perl
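For long ID lists, the inner loop over every name can be replaced by a set lookup. The sketch below shows the idea on in-memory text with plain Python; `select_records` and the toy records are hypothetical, and it assumes the ID is the first word after ">" on each header line.

```python
import io


def select_records(fasta_handle, wanted_ids, out_handle):
    """Copy FASTA records whose ID is in wanted_ids; return how many were written."""
    wanted = {name.strip() for name in wanted_ids if name.strip()}
    written = 0
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted   # set lookup, no inner loop
            if keep:
                written += 1
        if keep:
            out_handle.write(line)
    return written


fasta = io.StringIO(">a one\nAAAA\n>b two\nCCCC\n>c three\nGGGG\n")
id_lines = "c\na\n\n".splitlines()        # made-up ID list, one ID per line
output = io.StringIO()
n_written = select_records(fasta, id_lines, output)
print(n_written)
print(output.getvalue())
```

Records are written in the order they appear in the FASTA file, not in the order of the ID list.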


Multiple GenBank Records in one file
>>> from Bio import SeqIO
>>> records = list(SeqIO.parse("PATH/NC_005816.gb", "genbank"))
>>> all_species = []    # new, empty list
>>> for rec in records:
...     all_species.append(rec.annotations["organism"])
...
>>> len(all_species)
1
>>> print(all_species[0])
Yersinia pestis biovar Microtus str. 91001
>>> print(all_species[93])
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    print(all_species[93])
……………….
• Why is there an error message?
Pairwise sequence alignment using a
dynamic programming algorithm
• The Bio.pairwise2 module provides functions to get global and local
alignments between two sequences.
• When doing alignments, you can specify the match
score and gap penalties.
• The function naming convention is <alignment type>XX
– where <alignment type> is either “global” or
“local” and XX is a 2 character code indicating
the parameters it takes.

<alignment type>XX
The 1st character indicates the parameters for matches
(and mismatches), and the 2nd indicates the parameters for
gap penalties.

Matches and Mismatches (1st):
CODE   DESCRIPTION
x      No parameters. Identical characters have a score of 1, otherwise 0.
m      A match score is the score of identical chars, otherwise the mismatch score.
d      A dictionary returns the score of any pair of characters.
c      A callback function returns scores.

Gap Penalties (2nd):
CODE   DESCRIPTION
x      No gap penalties.
s      Same open and extend gap penalties for both sequences.
d      The sequences have different open and extend gap penalties.
c      A callback function returns the gap penalties.
Global alignment
>>> from Bio import pairwise2
>>> from Bio.pairwise2 import format_alignment
>>> alignments = pairwise2.align.globalxx("ACCGT", "ACG")
>>> print(format_alignment(*alignments[0]))
ACCGT
| ||
A-CG-
  Score=3
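Where the score comes from can be shown with a small dynamic-programming sketch in plain Python. It is a simplified stand-in for the idea behind the global functions, not the pairwise2 code: `global_score` is a hypothetical helper that uses a single linear gap penalty and returns only the best score, not the alignments.

```python
def global_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Best global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # previous[j] = best score aligning the processed prefix of seq_a with seq_b[:j]
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            gap_in_b = previous[j] + gap        # a aligned to a gap
            gap_in_a = current[j - 1] + gap     # b aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


xx_score = global_score("ACCGT", "ACG")                          # like the "xx" codes
mx_score = global_score("ACCGT", "ACG", match=2, mismatch=-1)    # like the "mx" codes
print(xx_score, mx_score)
```

Separate gap-open and gap-extend penalties (the "s" code) need an affine-gap variant of this recurrence, which the sketch leaves out.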
Local alignment: All possible alignments

>>> for a in pairwise2.align.localxx("ACCGT", "ACG"):
...     print(format_alignment(*a))
...
1 ACCG
  | ||
1 A-CG
  Score=3

1 ACCG
  || |
1 AC-G
  Score=3

• *a unpacks the elements of a into separate arguments, so format_alignment(*a) works
like format_alignment(a[0], a[1], a[2], a[3], a[4]). Each alignment a is a tuple of five elements:
the two aligned sequences, the score, and the begin and end positions.
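Argument unpacking with * is a general Python feature and can be tried without Biopython. In this sketch `describe` is a hypothetical function and the tuple is a made-up stand-in for one alignment result.

```python
def describe(seq_a, seq_b, score, begin, end):
    """Format the five fields of one alignment into a single line."""
    return "%s / %s score=%s span=%d:%d" % (seq_a, seq_b, score, begin, end)


a = ("ACCGT", "A-CG-", 3.0, 0, 5)     # made-up alignment tuple with five elements

unpacked = describe(*a)                            # * spreads the tuple over the parameters
explicit = describe(a[0], a[1], a[2], a[3], a[4])  # the same call written out by hand
print(unpacked)
```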
Global alignment: Match and mismatch score
Match: 2 points, mismatch: -1 point, and gaps are not penalized.

>>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
...     print(format_alignment(*a))
...
ACCGT
| ||
A-CG-
  Score=6

ACCGT
|| |
AC-G-
  Score=6
Global alignment: Match & mismatch AND Gap penalties
Gap open: -0.5 points, and gap extend: -0.1 points.

>>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
...     print(format_alignment(*a))
...
ACCGT
| ||
A-CG-
  Score=5

ACCGT
|| |
AC-G-
  Score=5
SeqIO for FASTQ
• FASTQ is a format for Next Generation
DNA sequence data (FASTA + Quality)
• SeqIO can read (and write) FASTQ format
files
from Bio import SeqIO
count = 0
for rec in SeqIO.parse("example.fastq", "fastq"):
    count += 1
print(count)

• Everything you can do with FASTA records also works with FASTQ records
Direct Access to GenBank, PubMed etc
• Biopython has modules that can directly access databases over the
Internet
• The Entrez module uses the NCBI EFetch service
• Entrez.efetch is part of the Entrez module: it works on many NCBI
databases including protein and PubMed literature citations
• The ‘gb’ data type contains much more annotation information, but
rettype=‘fasta’ also works
• With a few tweaks, this script could be used to download a list of
GenBank IDs and save them as FASTA or GenBank files:

>>> from Bio import Entrez, SeqIO
>>> Entrez.email = "ks@xxx.edu.com"    # NCBI requires your valid identity
>>> # parameters you want to search
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")    # handle: temporary variable
• You could run this in a loop, swapping in a new value (e.g. a new id) each time
>>> print(record)
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
/sequence_version=1
/source=chloroplast Selenipedium aequinoctiale
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta',
'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae',
'Cypripedioideae', 'Selenipedium']
/keywords=['']
/references=[Reference(title='Phylogenetic utility of ycf1 in orchids: a plastid gene more
variable than matK', ...), Reference(title='Direct Submission', ...)]
/accessions=['EU490707']
/data_file_division=PLN
/date=15-JAN-2009
/organism=Selenipedium aequinoctiale
/gi=186972394
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())
• The lines starting with / are sub-fields of the .annotations field
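Behind the scenes, an EFetch request is an HTTP query to NCBI's E-utilities. The sketch below only builds such a URL with the standard library and sends nothing; `efetch_url` is a hypothetical helper, and the endpoint is the public E-utilities address as I understand it, so check it against the NCBI documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumed public E-utilities endpoint for EFetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="nucleotide", rettype="gb", retmode="text", email=None):
    """Build (but do not send) an EFetch query for one or more record IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    if email:
        params["email"] = email       # NCBI asks callers to identify themselves
    return EFETCH_URL + "?" + urlencode(params)


# Same parameters as in the session above, but asking for FASTA instead of GenBank.
url = efetch_url(["186972394"], rettype="fasta")
print(url)
```

Looping over a list of IDs and changing `rettype` is the kind of tweak the slide refers to; with Biopython the loop would call Entrez.efetch once per ID (or once with several IDs) instead of building URLs by hand.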
Illumina Sequences
• Illumina sequence files are usually stored in the FASTQ
format. Similar to FASTA, but with an additional pair of
lines for the quality annotation of each base.

@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152


NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAAC
CTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGG
GGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
@SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
NATTTTTACTAGTTTATTCTAGAACAGAGCATAAACTACTATTCAATAAACGTATGAAGCACTACTCACCTCCATTAACATGACGTTTT
TCCCTAATCTGATGGGTCATTATGACCAGAGTATTGCCGCGGTGGAAATGGAGGTGAGTAGTG
+SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
+83355@@@CC@C22@@C@@CC@@C@@@CC@@@@@@@@@@@@C?
C22@@C@:::::@@@@@@C@@@@@@@@CIGIHIIDGIGIIIIHHIIHGHHIIHHIFIIIIIHIIIIIIBIIIEIFGIIIFGFIBGDGGGGGGF
IGDIFGADGAE
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGG
AAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/
@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHI
HIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG
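The quality line stores one character per base. A minimal plain-Python sketch of reading such records and decoding the characters is given below. It assumes unwrapped four-line records and the Phred+33 encoding (score = ASCII code minus 33); `read_fastq` and the toy read are hypothetical and are not the SeqIO parser.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality_scores) from four-line FASTQ records."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break                                  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()                          # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Phred+33: each character encodes one score as chr(score + 33).
        scores = [ord(char) - 33 for char in quality]
        yield title[1:], sequence, scores          # drop the leading '@'


example = io.StringIO("@read1 made-up\nGATC\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

A score Q corresponds to an error probability of 10 ** (-Q / 10), so 'I' (Q40) is a very confident base call and '!' (Q0) carries no confidence at all.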
Assignments
• Please use a for-loop to parse DNA records in a given .gb file (filename:
NC_005816.gb), and print out the id, sequence, and length of each DNA
record in the .gb file.
• Please count the nucleotides in each DNA sequence. For
example, given two DNA sequences, ATGATAAA and TTCCGGA, the
numbers of A, T, C, and G in the first DNA sequence are 5, 2, 0, and 1,
respectively, and the numbers of A, T, C, and G in the second DNA
sequence are 1, 2, 2, and 2, respectively.
• Please transcribe and translate each DNA sequence into a protein
sequence
– (a) Without considering stop codons
– (b) Considering stop codons
– (c) Using the appropriate codon table
• Do all of the above for ciliate_ortholog.fasta
Assignment
• Use Biopython to answer the following assignments (you have
to write your own Python scripts)
1. Download two arbitrary protein sequences and a DNA sequence from
NCBI in FASTA format using Biopython. For the two protein sequences,
please individually compute their amino acid and nucleotide
compositions and output your results in a .txt file.
2. Download the DNA sequence from the file “Ciliate_ortholog.fasta”,
then perform transcription and translation using Biopython, using
the correct codon table for ciliates, and report the final output (i.e.,
protein sequence) in the same .txt file. Note that the protein sequence
should be reported in FASTA format.
3. For each ortholog set in the file “Ciliate_ortholog.fasta”: Translate the
sequence and find the frequency of all the two amino acids. Please
write your results into a .txt file

• Please zip and upload your answer with your name (including
the Python scripts and .txt file) to Dropbox
Get a file by FTP in Python
>>> from ftplib import FTP
>>> host = "ftp.sra.ebi.ac.uk"
>>> ftp = FTP(host)
>>> ftp.login()
'230 Login successful.'
>>> ftp.cwd('vol1/fastq/SRR020/SRR020192')
'250 Directory successfully changed.'
>>> ftp.retrlines('LIST')
-r--r--r-- 1 ftp ftp 1777817 Jun 24 20:12 SRR020192.fastq.gz
'226 Directory send OK.'
>>> ftp.retrbinary('RETR SRR020192.fastq.gz',
...                open('SRR020192.fastq.gz', 'wb').write)
'226 Transfer complete.'
>>> ftp.quit()
'221 Goodbye.'
Multiple Sequence Alignment
• ClustalW2/MUSCLE: popular command
line tools for multiple sequence alignment
• Input: Three sequences
– S1: TATACATTAAA
– S2: TAGGATTCCAC
– S3: TATACATTAAG
• S1 and S3 are highly similar.

• Output: Aligned sequences in a graph
ClustalW
• Step 1: Download ClustalW2 (http://www.clustal.org/clustal2/)
• Step 2: Install ClustalW2 (example file: opuntia.fasta)
• Step 3: Run the following command on the command line
– clustalw2 -infile=opuntia.fasta
• Step 4: Run the following in Python

>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 3 rows and 11 columns
TAGGATTCCAC gi|6273290|gb|AF191664.1|AF191
TATACATTAAA gi|6273291|gb|AF191665.1|AF191
TAAGGTCTTTG gi|6273289|gb|AF191663.1|AF191
>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)
Biopython wrapper: command line tool
• Command line tool, will often print text output directly to
screen.
• This text can be captured or redirected, via two “pipes”:
standard output (the normal results) and standard error
(for error messages and debug messages).
• There is also standard input, which is any text fed into
the tool.
• These names get shortened to stdin, stdout and stderr.
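The three streams can be seen in action with Python's standard subprocess module. In this sketch the "tool" is just a second Python process running a one-line made-up program, chosen so the example works on any machine; it is an illustration, not how the Biopython wrappers are implemented.

```python
import subprocess
import sys

# A tiny stand-in for a command line tool: it reads stdin, writes the
# upper-cased text to stdout and a status message to stderr.
child_code = (
    "import sys; "
    "sys.stdout.write(sys.stdin.read().upper()); "
    "sys.stderr.write('done')"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="gatc",            # text fed into the tool (stdin)
    capture_output=True,     # capture stdout and stderr instead of printing them
    text=True,               # work with str rather than bytes
    check=True,              # raise an error if the tool exits with a failure code
)
print(result.stdout)         # the normal results
print(result.stderr)         # error and debug messages
```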
MSA with ClustalW
>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)
>>> in_file = "/path/KCNA1_aves_mammals.fas"
>>> # on Windows, use a raw string instead: in_file = r"C:\path\file"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file)
>>> print(clustalw_cline)
clustalw2 -infile=/path/KCNA1_aves_mammals.fas
>>> out_file = "/path/aligned.fasta"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file, outfile=out_file)
>>> print(clustalw_cline)
clustalw2 -infile=/path/KCNA1_aves_mammals.fas -outfile=/path/aligned.fasta
>>> stdout, stderr = clustalw_cline()    # only when you run this will you get output
>>> stdout
>>> from Bio import AlignIO
>>> align = AlignIO.read("/path/aligned.fasta", "clustal")
MSA with ClustalW
>>> align
<<class 'Bio.Align.MultipleSeqAlignment'> instance (7 records of length 495, SingleLetterAlphabet()) at 107170b38>

>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 495 columns
MTVMSGENVDEASAAPGHPQDGSYPRPAEHDDHECCERVVINIS...TDV A.mel
MTVMSGENVDEASAAPGHPQEGSYPRPAEHEDHECCERVVINIS...TDV P.tig
MTVMSGENVDEASAAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV H.sap
MTVMSGENADEASTAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV M.mus
MTVMSGENVEEASAAQGHPQDISYPRPADHDDHDCCERVVINIS...TDV S.har
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV E.gar
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV M.uni
BLAST
• Biopython has several methods to work with the popular
NCBI BLAST software
• NCBIWWW.qblast() sends queries directly to the NCBI
BLAST server. The query can be a Seq object, FASTA
file, or a GenBank ID.
>>> from Bio import SeqIO
>>> from Bio.Blast import NCBIWWW
>>> query = SeqIO.read("test.fasta", format="fasta")    # generating the query
>>> # sending the query as the seq part of the SeqRecord
>>> result_handle = NCBIWWW.qblast("blastn", "nt", query.seq)
>>> blast_file = open("my_blast.xml", "w")    # create an XML output file
>>> blast_file.write(result_handle.read())    # reading the result and writing it to the XML file
>>> blast_file.close()
>>> result_handle.close()
Parse BLAST Results
• It is often useful to obtain a BLAST result directly
(local BLAST server or via Web browser) and
then parse the result file with Python.
• Save the BLAST result in XML format
– NCBIXML.read() for a file with a single BLAST result (single
query)
– NCBIXML.parse() for a file with multiple BLAST results
(multiple queries)
>>> from Bio.Blast import NCBIXML
>>> handle = open("my_blast.xml")
>>> blast_record = NCBIXML.read(handle)
>>> for hit in blast_record.descriptions:
...     print(hit.title)
...     print(hit.e)
BLAST Record Object

HSP: High Scoring Pairs

View Aligned Sequence
>>> from Bio.Blast import NCBIXML
>>> handle = open("my_blast.xml")
>>> blast_record = NCBIXML.read(handle)
>>> for hit in blast_record.alignments:
...     for hsp in hit.hsps:
...         print(hit.title)
...         print(hsp.expect)
...         print(hsp.query[0:75] + '...')
...         print(hsp.match[0:75] + '...')
...         print(hsp.sbjct[0:75] + '...')
...
gi|731383573|ref|XM_002284686.2| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
2.5739e-53
ATGCTAGTATGCTCGGTCATTACGGGTTTGGCACT-CATTTCCTCAAATGGCTCGCCTGCCTTGCGGCTATTTAC...
|||| | || ||| ||| | || ||||||||| |||||| | | ||| | || | |||| || ||||| ...
ATGCCATTAAGCTTGGTGGTCTGGGCTTTGGCACTACATTTCTTGAG-TGGTTGGCTTCTTTTGCTGCCATTTAT...
Many Matches
• Often a BLAST search will return many matches
for a single query (save as an XML format file)
• NCBIXML.parse() returns an iterator of BLAST
record objects; put them in a list, or deal with them directly
in a for loop.

from Bio.Blast import NCBIXML

E_VALUE_THRESH = 1e-20
for record in NCBIXML.parse(open("my_blast.xml")):
    if record.alignments:    # skip queries with no matches
        print("QUERY: %s" % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print("MATCH: %s " % align.title[:60])
                    print(hsp.expect)
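The same E-value filter can be tried without a BLAST run by applying it to a tiny hand-written XML fragment with the standard library. Everything in this sketch is an assumption made for illustration: the fragment is made-up data that only imitates the Hit / Hsp nesting and element names of BLAST XML output as I recall them, and `significant_hits` is a hypothetical helper, not part of NCBIXML.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the hit/HSP nesting of a BLAST XML file.
TOY_XML = """<Iteration_hits>
  <Hit><Hit_def>made-up hit A</Hit_def><Hit_hsps>
    <Hsp><Hsp_evalue>2.5e-53</Hsp_evalue></Hsp>
  </Hit_hsps></Hit>
  <Hit><Hit_def>made-up hit B</Hit_def><Hit_hsps>
    <Hsp><Hsp_evalue>0.003</Hsp_evalue></Hsp>
  </Hit_hsps></Hit>
</Iteration_hits>"""


def significant_hits(xml_text, threshold=1e-20):
    """Return (title, e-value) for every HSP whose E-value is below threshold."""
    root = ET.fromstring(xml_text)
    kept = []
    for hit in root.iter("Hit"):
        title = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            evalue = float(hsp.findtext("Hsp_evalue"))
            if evalue < threshold:       # smaller E-value = more significant match
                kept.append((title, evalue))
    return kept


hits = significant_hits(TOY_XML)
print(hits)
```

For real result files, NCBIXML handles the full format and is the safer choice; the sketch only shows what the threshold test keeps and discards.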
Thank you