Bio Python 202111

1. Biopython is a collection of modules for biological computation, including tools for working with DNA/protein sequences, sequence alignments, and accessing common biological databases.
2. The Seq object class represents biological sequences, with data attributes for the sequence string and its alphabet. Seq objects can be indexed and sliced, and support string-like operations such as len() and count().
3. MutableSeq objects allow sequences to be modified, while regular Seq objects are immutable. Methods like upper(), lower(), and tomutable() allow changing between cases and mutability.

Uploaded by

Rohan Ray

Introduction to Biopython

2021/11/11
Learning Objectives

• Biopython as a toolkit
• Seq objects and their methods
• SeqRecord objects have data fields
• SeqIO to read and write sequence objects
• BLAST & multiple sequence alignment
• Direct access to GenBank with Entrez.efetch
Modules
• Python functionality comes in 3 sets
1. A small core set that is always available
2. Built-in modules, such as math, that can be imported from the basic install (e.g. >>> import math)
3. An extremely large number of optional modules that must be downloaded and installed before you can import them
• Code that uses such optional modules is said to have “dependencies”
• Biopython belongs to the third set, so code that imports it has a dependency
• The code for dependencies is hosted in different places such as SourceForge, GitHub, and developers’ own websites (Perl and R are better organized)
• Trouble? Ask the TAs: each person’s problem is mostly unique, so there is no general solution
Install Biopython
• Website for installation instruction:
– http://biopython.org/wiki/Download
• Required Software
– Python (version above 2.6)
– NumPy (Numerical Python)

• Optional Software
– ReportLab – used for pdf graphics code
– psycopg – used for BioSQL with a PostgreSQL database
– mysql-connector – used for BioSQL with a MySQL database
– MySQLdb – An alternative MySQL library used by BioSQL

Information Source: http://biopython.org/wiki/Download
Is your Biopython installed correctly?

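• Type the following in your terminal or interpreter:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna
>>> new_seq = Seq('GATCAGAAG', unambiguous_dna)
>>> new_seq[0:2]
Seq('GA', IUPACUnambiguousDNA())
>>> new_seq.translate()
Seq('DQK', IUPACProtein())
• Note: Bio.Alphabet was removed in Biopython 1.78. With newer releases, drop the alphabet import and argument, e.g. Seq('GATCAGAAG'); the printed objects then show no alphabet.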
• Biopython is an integrated collection of modules for
“biological computation” including tools for working
with DNA/protein sequences, sequence alignments,
population genetics, and molecular structures
• It also provides interfaces to common biological
databases (eg. GenBank) and to some common
locally installed software (eg. BLAST).
• Loosely based on BioPerl
• Relatively fewer protein specific functions in
Biopython
Biopython Tutorial
• Biopython has a “Tutorial & Cookbook”: http://biopython.org/DIST/docs/tutorial/Tutorial.html
• By: Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński
• Most of the examples in this class are drawn from the above link
Python is an Object-Oriented language
• Programs are composed of objects, whose types are known as classes
– a class can contain complex and well-defined forms of data, and it can also have built-in methods
• Complex objects are built from other objects
– E.g. strings, lists and other data types each have certain methods
• Many classes of objects support the same operation, which can be used without knowing the exact class
– E.g. print() works on many kinds of objects
• When you specify that a data type belongs to a class, it inherits all the properties of that class
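The points on this slide can be sketched with a toy class. The names Sequence, DnaSequence and gc_count are made up for illustration and are not Biopython classes; the sketch only shows data plus methods, a shared operation (print, len) and inheritance.

```python
class Sequence:
    """A toy class: data (a string) plus methods that act on it."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        # Lets the built-in len() work on Sequence objects.
        return len(self.data)

    def __str__(self):
        # Lets print() work on Sequence objects.
        return self.data


class DnaSequence(Sequence):
    """Inherits everything from Sequence and adds one DNA-specific method."""

    def gc_count(self):
        return self.data.count("G") + self.data.count("C")


dna = DnaSequence("GATCG")
print(dna)             # uses the inherited __str__
print(len(dna))        # uses the inherited __len__
print(dna.gc_count())  # uses the method added by the subclass
```

The subclass never defines __len__ or __str__ itself, yet len() and print() still work on it.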
The Seq object

• The Seq object class is simple and fundamental for a lot of Biopython work. A Seq object can contain DNA, RNA, or protein.
1. Data: this is the actual sequence data string of the sequence.
2. Alphabet: an object describing what the individual characters making up the string “mean” and how they should be interpreted.
The Seq object: {Data, Alphabet}
• It is a complex object with a string sub-object (the sequence)
– Supports many of the same operations as the Python string object
– Also defines an alphabet for that string
– This constrains the allowed properties of the string object
• The alphabets are actually Biopython objects such as IUPACAmbiguousDNA or IUPACProtein (IUPAC: International Union of Pure and Applied Chemistry)
• They are defined in the Bio.Alphabet module
• A Seq object with a DNA alphabet is different from one with an amino acid alphabet
The Seq object: {Data, Alphabet}
The Seq class from the Biopython library lets you create the Seq object
>>> from Bio.Seq import Seq              # import the Seq class from the library
>>> from Bio.Alphabet import IUPAC       # import the alphabets from their module
>>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)    # call that creates the Seq object; minimum: the data
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> print(my_seq)                        # print works on different objects, here the Seq object
AGTACACTGGT

You can create an ambiguous sequence with the default generic alphabet like this:
>>> my_seq = Seq('MRTAVACTKGT')
>>> my_seq
Seq('MRTAVACTKGT', Alphabet())
Seq objects have string methods
• Seq objects have methods that work just like string objects
• You can get the len() of a Seq, slice it, and count() specific letters in it:

• Get single characters/Count sequence length

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
...
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
>>> print(my_seq[0]) #first letter
G
>>> print(my_seq[2]) #third letter
T
Seq objects have string methods
• A Seq object works with len() and has a count() method, just like a string. Like a Python string, count() gives a non-overlapping count:

>>> from Bio.Seq import Seq
>>> "ATGCATAT".count("AT")
3
>>> Seq("AAAAA").count("AA")
2
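The difference between a non-overlapping and an overlapping count can be sketched in plain Python. The helper name count_overlapping is made up for this example; it is not part of Biopython.

```python
import re


def count_overlapping(sequence, motif):
    """Count every start position of motif, allowing matches to overlap."""
    # A lookahead matches at each position without consuming characters,
    # so the next search can start one letter later.
    return len(re.findall("(?=" + re.escape(motif) + ")", sequence))


print("AAAAA".count("AA"))               # non-overlapping count
print(count_overlapping("AAAAA", "AA"))  # overlapping count
```

For AAAAA the motif AA starts at positions 0, 1, 2 and 3, but only two copies fit side by side without sharing letters.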
Eg. Determining GC content

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGS", IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
43.75
Seq objects have string methods:
slice
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Turn a Seq object into a string
• Sometimes you will need to work with just the sequence
string in a Seq object using a tool that is not aware of the Seq
object methods
• Turn a Seq object into a string with str()
• You will lose the alphabet and just get back the string.
• You can input it into other programs and work with it

>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> seq_string = str(my_seq)
>>> seq_string
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Seq objects have special methods:
MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Seq' object does not support item assignment
>>> my_seq.reverse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Seq' object has no attribute 'reverse'

• A Seq object is “read only” (immutable); to edit it you have to convert it to a mutable sequence
• A dot method works on the object preceding the dot
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Seq objects have special methods:
MutableSeq
• Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

• Either way will give you a sequence object which can be changed:
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

• Note:
– You can’t use a MutableSeq object as a dictionary key.
– You can use a Python string or a Seq object as a
dictionary key.
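The link between mutability and dictionary keys is a general Python rule, not something specific to Biopython. As an analogy only, the sketch below uses the built-in str (immutable) and bytearray (mutable) types in place of Seq and MutableSeq.

```python
immutable = "GATC"              # str: immutable, so it is hashable
mutable = bytearray(b"GATC")    # bytearray: mutable, so it is not hashable

lengths = {immutable: len(immutable)}   # fine: a str can be a dictionary key

try:
    lengths[mutable] = len(mutable)     # a mutable object cannot be a key
except TypeError as error:
    print("cannot use as key:", error)

mutable[1:2] = b"G"                     # in-place edit works on the mutable type
print(mutable.decode())
```

A key must keep the same hash for as long as it sits in the dictionary, which is why objects that can change in place are rejected.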
Seq objects have special methods:
Changing case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()        # a dot method works on the object preceding the dot
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

• Strictly speaking the IUPAC alphabets are for upper case sequences only, so lower() falls back to the generic DNA alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())

• Note: You can also use MutableSeq to change case

Seq objects have special methods: transcribe()

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CCATTACAATGGCCAT', IUPACUnambiguousDNA())

>>> messenger_rna = coding_dna.transcribe()    # only works on a DNA alphabet
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGG', IUPACUnambiguousRNA())

>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()            # only works on an RNA alphabet
Seq('ATGGCCATTGTAATGG', IUPACUnambiguousDNA())
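What these three methods do to the letters can be sketched on plain strings. The functions below are stand-ins written for this example (upper case, unambiguous A/C/G/T only); they are not the Biopython implementations.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then read the strand backwards."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna):
    """Coding strand to mRNA: the letters stay the same except T becomes U."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna):
    """mRNA back to the coding strand: U becomes T."""
    return rna.replace("U", "T")


coding = "ATGGCCATTGTAATGG"
template = reverse_complement(coding)
mrna = transcribe(coding)
print(template)
print(mrna)
print(back_transcribe(mrna) == coding)
```

Note that transcribe() works from the coding strand, so it is a simple T to U swap and not a complement.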
Seq objects have special methods: translate()

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC, generic_dna
>>> from Bio.Data import CodonTable    # optional
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> messenger_rna.translate(to_stop=True)    # stop at the first in-frame stop codon
Seq('MAIVMGR', IUPACProtein())

>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
>>> coding_dna.translate(table=2, to_stop=True)    # table 2: Vertebrate Mitochondrial, where TGA codes for W
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
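The effect of to_stop can be sketched with a hand-built copy of the standard genetic code (NCBI table 1). The function below is a simplified stand-in for illustration: it handles only unambiguous codons, silently drops a trailing partial codon, and supports no other tables.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, to_stop=False):
    """Translate DNA or RNA codon by codon; optionally stop at the first stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Ignore any trailing letters that do not form a complete codon.
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = STANDARD_TABLE[dna[start:start + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
print(translate(rna))
print(translate(rna, to_stop=True))
```

Without to_stop each stop codon is shown as * and translation continues; with to_stop=True the protein ends just before the first stop.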
Seq objects have special methods:
translate()
>>> coding_dna = Seq("ATGGCCATTGTAATGG", IUPAC.unambiguous_dna)
>>> coding_dna.translate()
//lib/python3.7/site-packages/Bio/Seq.py:2715: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning)
Seq('MAIVM', IUPACProtein())
>>> from Bio.Data import CodonTable
>>> coding_dna.translate()
Seq('MAIVM', IUPACProtein())
Seq objects have special methods: translate()

• Codon table
– By default, Biopython uses NCBI table id 1, Standard Code
>>> from Bio.Data import CodonTable
>>> print(CodonTable.unambiguous_dna_by_id[1])
>>> print(CodonTable.unambiguous_dna_by_name["Standard"])
>>> CodonTable.unambiguous_dna_by_name["Standard"].stop_codons
['TAA', 'TAG', 'TGA']
>>> CodonTable.unambiguous_dna_by_id[2].stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
Types and usage of codon table in
Biopython

• >>> help(coding_dna.translate)
• NCBI genetic code number and name: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Seq Objects have special methods
• The Bio.SeqUtils module has some useful functions, such as GC() to calculate the % of G+C bases in a DNA sequence.
• GC() also counts the ambiguous letter S (G or C), so it can differ from a manual count of G and C:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGS', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
43.75
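The two numbers above differ by exactly one letter, the trailing S. A plain-Python sketch of both calculations follows; gc_percent and its count_ambiguous_s flag are made-up names for this example, not the Bio.SeqUtils API.

```python
def gc_percent(sequence, count_ambiguous_s=True):
    """Percentage of G+C letters; optionally count S (G or C) as well."""
    seq = sequence.upper()
    letters = "GCS" if count_ambiguous_s else "GC"
    gc_total = sum(seq.count(letter) for letter in letters)
    return 100.0 * gc_total / len(seq) if seq else 0.0


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGS"
print(gc_percent(my_seq))                           # S counted: 15 of 32 letters
print(gc_percent(my_seq, count_ambiguous_s=False))  # G and C only: 14 of 32 letters
```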
Protein Alphabet
• You could re-define my_seq as a protein by changing the alphabet,
which will totally change the methods that will work on it.
• (‘G’,’A’,’T’,’C’ are valid protein letters)

>>> from Bio.SeqUtils import molecular_weight
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> print(molecular_weight(my_seq))
3436.1957
>>> my_seq.alphabet = IUPAC.protein
>>> my_seq
Seq('AGTACACTGGT', IUPACProtein())
>>> print(molecular_weight(my_seq))
912.0004
• Try back_transcribe() on the protein version and see what you get

• Relatively few protein specific functions in Biopython

• Eg. hydropathy plot, isoelectric point etc. are missing
SeqRecord Object
• The SeqRecord object is like a database record (such as
GenBank). It is a complex object that contains a Seq
object, and also annotation fields, known as “attributes”.
.seq
.id
.name
.description
.letter_annotations
.annotations
.features
.dbxrefs
• You can think of attributes as slots with names inside
the SeqRecord object. Each one may contain data or
be empty.
• You can reference a particular part of the object.
GenBank format: https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Creating a SeqRecord– Example
>>> from Bio.Seq import Seq                  # creating a Seq object
>>> test_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord      # this import enables SeqRecord objects
>>> test_seq_r = SeqRecord(test_seq)
>>> test_seq_r.id
'<unknown id>'
>>> test_seq_r.id = "xyz"
>>> test_seq_r.name = "something"
>>> test_seq_r.description = "Made up"
>>> test_seq_r.seq
Seq('GATC', Alphabet())
>>> test_seq_r.annotations["evidence"] = "None"    # annotations is a dictionary attribute
>>> print(test_seq_r.annotations["evidence"])
None
>>> test_seq_r.seq = 'ATGC'
>>> test_seq_r.seq
'ATGC'
• Specify fields in the SeqRecord object with the . (dot) syntax
SeqIO and file formats
• SeqIO is the all purpose file read/write tool for SeqRecords
• SeqIO can read many file types: http://biopython.org/wiki/SeqIO
• SeqIO has read(), parse() and write() functions
• (they accept a filename, so you do not need to “open” the file first)
• E.g. read a text file in FASTA format
• In Biopython, each FASTA entry is read into a SeqRecord with specific fields filled in

• Let's assume you have already downloaded a FASTA file from GenBank, such as NC_005816.fna, and saved it as a text file in your current directory

>>> from Bio import SeqIO
>>> gene = SeqIO.read("Path_to_file/NC_005816.fna", "fasta")
>>> gene.id
'gi|45478711|ref|NC_005816.1|'
>>> gene.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
>>> print(gene.seq)    # or print(str(gene.seq)): you will get the whole sequence
>>> len(gene.seq)
9609
>>> record = SeqIO.read("Path_to_file/file.gb", "genbank")
Fasta format

Extension             Meaning
fasta, fa, fas, fsa   generic FASTA
fna                   FASTA nucleic acid
ffn                   FASTA nucleotide of gene regions (specific to coding regions)
faa                   FASTA amino acid
frn                   FASTA non-coding RNA
Multiple FASTA Records in one file
• The FASTA format can store many sequences in one
text file
• SeqIO.parse() reads the records one by one
• SeqIO.parse() returns an iterator of SeqRecord objects; SeqIO.read() returns a single record
• This code creates a list of SeqRecord objects:

>>> from Bio import SeqIO
>>> handle = open("KCNA1_aves_mammals.fasta", "r")    # “handle” is a pointer to the file
>>> seq_list = list(SeqIO.parse(handle, "fasta"))     # dumps the whole file into a list
>>> handle.close()
>>> print(seq_list[0].seq)                            # shows the first sequence in the list
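Reading records one by one can be sketched with a small generator. This is a simplified stand-in for illustration, not how SeqIO is implemented: parse_fasta is a made-up name, it yields plain (header, sequence) tuples instead of SeqRecord objects, and the two records are invented example data.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 first record\nGATC\nGATC\n>seq2 second record\nTTAA\n")
records = parse_fasta(example)
first = next(records)       # only the first record has been read so far
remaining = list(records)   # list() pulls in everything that is left
print(first)
print(remaining)
```

Calling next() reads one record and leaves the rest of the file unread, while list() loads all remaining records into memory at once.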
Reading Sequence Files: Next
>>> from Bio import SeqIO
>>> record_iterator = SeqIO.parse("PATH/KCNA1_aves_mammals.fas", "fasta")
>>> first_record = next(record_iterator)
>>> print(first_record.id)
gi|2765658|emb|Z78533.1|CIZ78533
>>> print(first_record.description)
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA

>>> second_record = next(record_iterator)
>>> print(second_record.id)
gi|2765657|emb|Z78532.1|CCZ78532
>>> print(second_record.description)
gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
Grab sequence from FASTA file
• If you have a large local FASTA file, and a list of sequence IDs (‘file.txt') that you want to grab:
>>> from Bio import SeqIO
>>> output = open('path/selected_seqs.fasta', 'w')
>>> # splitlines() splits on \n; change according to the delimiter
>>> id_list = open('path/KCNA1_aves_mammals_id_list.txt').read().splitlines()
>>> # look up each sequence in turn instead of dumping the file into a list
>>> for test in SeqIO.parse('KCNA1_aves_mammals.fas', 'fasta'):
...     for seqname in id_list:
...         name = seqname.strip()    # strip() = chomp in Perl
...         if test.id == name:
...             SeqIO.write(test, output, 'fasta')
...
>>> output.close()
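The inner loop over the ID list runs once for every record in the file. One possible variation, sketched here on plain (id, sequence) tuples with invented data, is to put the wanted IDs in a set so each record needs a single membership test. The name select_records is made up for this example.

```python
def select_records(records, wanted_ids):
    """Keep only the (id, sequence) pairs whose id is in wanted_ids."""
    # strip() removes stray whitespace and newlines; blank lines are skipped.
    wanted = {name.strip() for name in wanted_ids if name.strip()}
    return [(rec_id, seq) for rec_id, seq in records if rec_id in wanted]


records = [("seqA", "GATC"), ("seqB", "TTAA"), ("seqC", "CCGG")]
id_lines = ["seqC\n", "  seqA ", ""]
selected = select_records(records, id_lines)
print(selected)
```

The output keeps the order of the records in the file, not the order of the ID list.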

Multiple GenBank Records in one file
>>> from Bio import SeqIO
>>> records = list(SeqIO.parse("PATH/NC_005816.gb", "genbank"))
>>> all_species = []    # new, empty list
>>> for rec in records:
...     all_species.append(rec.annotations["organism"])
...
>>> len(all_species)
1
>>> print(all_species[0])
Yersinia pestis biovar Microtus str. 91001
>>> print(all_species[93])
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    print(all_species[93])
...
• Why is there an error message?
Pairwise sequence alignment using a
dynamic programming algorithm.
• This provides functions to get global and local
alignments between two sequences.
• When doing alignments, you can specify the match
score and gap penalties.
• The convention is <alignment type>XX
– where <alignment type> is either “global” or
“local” and XX is a 2 character code indicating
the parameters it takes.
<alignment type>XX
The 1st character indicates the parameters for matches (and mismatches), and the 2nd indicates the parameters for gap penalties.
Matches and mismatches (1st):
CODE   DESCRIPTION
x      No parameters. Identical characters have a score of 1, otherwise 0.
m      A match score is the score of identical chars, otherwise the mismatch score.
d      A dictionary returns the score of any pair of characters.
c      A callback function returns scores.
<alignment type>XX
The 1st character indicates the parameters for matches (and mismatches), and the 2nd indicates the parameters for gap penalties.
Gap penalties (2nd):
CODE   DESCRIPTION
x      No gap penalties.
s      Same open and extend gap penalties for both sequences.
d      The sequences have different open and extend gap penalties.
c      A callback function returns the gap penalties.
Global alignment
>>> from Bio import pairwise2
>>> from Bio.pairwise2 import format_alignment
>>> alignments = pairwise2.align.globalxx("ACCGT", "ACG")
>>> print(format_alignment(*alignments[0]))
ACCGT
| ||
A-CG-
  Score=3
Local alignment: All possible alignments

>>> for a in pairwise2.align.localxx("ACCGT", "ACG"):
...     print(format_alignment(*a))
...
1 ACCG
  | ||
1 A-CG
  Score=3

1 ACCG
  || |
1 AC-G
  Score=3

*a means the list containing the elements of a will be unpacked, so .format(*a) works similarly to .format(a[0], a[1], a[2]) (assuming a is a list with only three elements).
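Unpacking with * is ordinary Python and can be tried on its own. The function describe and the list a below are invented for this sketch; they only mimic the idea of passing each element as a separate argument.

```python
def describe(seq_a, seq_b, score):
    """Takes three separate arguments, like a function fed with *a."""
    return "{} vs {}: score {}".format(seq_a, seq_b, score)


a = ["ACCGT", "A-CG-", 3]

unpacked = describe(*a)                 # the list is spread over the three parameters
explicit = describe(a[0], a[1], a[2])   # the same call written out by hand

print(unpacked)
print(unpacked == explicit)
```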
Global alignment: Match and mismatch
score
Match: 2 points, mismatch: -1 point, and don’t penalize gaps.

>>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
...     print(format_alignment(*a))
...
ACCGT
| ||
A-CG-
  Score=6

ACCGT
|| |
AC-G-
  Score=6
Global alignment: Match & mismatch
AND gap penalties
Gap open: -0.5 points, and gap extend: -0.1 points.

>>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
...     print(format_alignment(*a))
...
ACCGT
| ||
A-CG-
  Score=5

ACCGT
|| |
AC-G-
  Score=5
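The scores reported above can be reproduced with a small dynamic programming table. This is a simplified sketch, not pairwise2's implementation: global_score is a made-up name, it returns only the best score (no alignments), and it charges one flat cost per gap position instead of separate open and extend penalties. For ACCGT vs ACG every gap is one position long, so a flat cost of -0.5 gives the same total as open -0.5 / extend -0.1.

```python
def global_score(a, b, match=1, mismatch=0, gap=0):
    """Best global alignment score with a flat cost per gap position."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: one sequence aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two letters
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]


print(global_score("ACCGT", "ACG"))                                # like globalxx
print(global_score("ACCGT", "ACG", match=2, mismatch=-1))          # like globalmx
print(global_score("ACCGT", "ACG", match=2, mismatch=-1, gap=-0.5))
```

Each cell holds the best score for aligning the first i letters of one sequence with the first j letters of the other, so the bottom-right cell is the score of the full alignment.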
SeqIO for FASTQ
• FASTQ is a format for Next Generation
DNA sequence data (FASTA + Quality)
• SeqIO can read (and write) FASTQ format
files
from Bio import SeqIO
count = 0
for rec in SeqIO.parse("example.fastq", "fastq"):
    count += 1
print(count)

Everything you can do with the FASTA format also works with the FASTQ format
Direct Access to GenBank, PubMed etc
• BioPython has modules that can directly access databases over the
Internet
• The Entrez module uses the NCBI EFetch service
• Entrez.efetch, part of the Entrez module, works on many NCBI databases including protein and PubMed literature citations
• The ‘gb’ data type contains much more annotation information, but
rettype=‘fasta’ also works
• With a few tweaks, this script could be used to download a list of
GenBank ID’s and save them as FASTA or GenBank files:

>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "ks@xxx.edu.com"    # NCBI requires your valid identity
>>> # db, id, rettype and retmode are the parameters you want to search with
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")    # handle: temporary variable
>>> record = SeqIO.read(handle, "genbank")
You could run this in a loop, swapping in a new value (e.g. a new id) each time

>>> print(record)
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
/sequence_version=1
/source=chloroplast Selenipedium aequinoctiale
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Selenipedium']
/keywords=['']
/references=[Reference(title='Phylogenetic utility of ycf1 in orchids: a plastid gene more variable than matK', ...), Reference(title='Direct Submission', ...)]
/accessions=['EU490707']
/data_file_division=PLN
/date=15-JAN-2009
/organism=Selenipedium aequinoctiale
/gi=186972394
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())
• The lines starting with / are sub-fields of the .annotations field
Illumina Sequences
• Illumina sequence files are usually stored in the FASTQ
format. Similar to FASTA, but with an additional pair of
lines for the quality annotation of each base.

@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152

NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAAC
CTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGG
GGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
@SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
NATTTTTACTAGTTTATTCTAGAACAGAGCATAAACTACTATTCAATAAACGTATGAAGCACTACTCACCTCCATTAACATGACGTTTT
TCCCTAATCTGATGGGTCATTATGACCAGAGTATTGCCGCGGTGGAAATGGAGGTGAGTAGTG
+SRR350953.6 MENDEL_0047_FC62MN8AAXX:1:1:1686:935 length=152
+83355@@@CC@C22@@C@@CC@@C@@@CC@@@@@@@@@@@@C?
C22@@C@:::::@@@@@@C@@@@@@@@CIGIHIIDGIGIIIIHHIIHGHHIIHHIFIIIIIHIIIIIIBIIIEIFGIIIFGFIBGDGGGGGGF
IGDIFGADGAE
@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGG
AAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/
@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHI
HIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG
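How the quality line maps to per-base scores can be sketched in plain Python. Two assumptions are made here and may not hold for every file: each record is exactly four unwrapped lines, and qualities use the Phred+33 encoding (score = character code minus 33). The name parse_fastq and the one-read example are invented for illustration.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, scores) for each four-line FASTQ record."""
    while True:
        title = handle.readline().strip()
        if not title:
            break
        sequence = handle.readline().strip()
        handle.readline()                       # the "+" separator line
        quality = handle.readline().strip()
        # Assumption: Phred+33 encoding, one quality character per base.
        scores = [ord(char) - 33 for char in quality]
        yield title[1:], sequence, scores


example = io.StringIO("@read1\nGATC\n+\nII5#\n")
reads = list(parse_fastq(example))
for title, sequence, scores in reads:
    print(title, sequence, scores)
```

The quality string always has the same length as the sequence, one character per base.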
Assignments
• Please use a for-loop to parse the DNA records in a given .gb file (filename: NC_005816.gb), and print out the id, sequence, and length of each DNA record in the file.
• Please count each nucleotide in each DNA sequence. For example, given two DNA sequences, ATGATAAA and TTCCGGA, the numbers of A, T, C, and G in the first DNA sequence are 5, 2, 0, and 1, respectively, and in the second DNA sequence 1, 2, 2, and 2, respectively.
• Please transcribe and translate each DNA sequence into a protein sequence
– (a) Without considering stop codons
– (b) Considering stop codons
– (c) Using the appropriate codon table
• Do all of the above for ciliate_ortholog.fasta
Assignment
• Use Biopython to answer the following assignments (you have to write your own Python scripts)
1. Download two arbitrary protein sequences and a DNA sequence from NCBI in FASTA format using Biopython. For the two protein sequences, please individually compute their amino acid and nucleotide compositions and output your results in a .txt file.
2. For the DNA sequence in the file “Ciliate_ortholog.fasta”, please perform transcription and translation using Biopython with the correct codon table for ciliates, and report the final output (i.e., the protein sequence) in the same .txt file. Note that the protein sequence should be reported in FASTA format.
3. For each ortholog set in the file “Ciliate_ortholog.fasta”: translate the sequence and find the frequency of all the two amino acids. Please write your results into a .txt file

• Please zip and upload your answer with your name (including the Python scripts and .txt file) to Dropbox
Get a file by FTP in Python
>>> from ftplib import FTP
>>> host="ftp.sra.ebi.ac.uk"
>>> ftp=FTP(host)
>>> ftp.login()
'230 Login successful.'
>>> ftp.cwd('vol1/fastq/SRR020/SRR020192')
'250 Directory successfully changed.'
>>> ftp.retrlines('LIST')
-r--r--r-- 1 ftp ftp 1777817 Jun 24 20:12 SRR020192.fastq.gz
'226 Directory send OK.'
>>> ftp.retrbinary('RETR SRR020192.fastq.gz',
...                open('SRR020192.fastq.gz', 'wb').write)
'226 Transfer complete.'
>>> ftp.quit()
'221 Goodbye.'
Multiple Sequence Alignment
• ClustalW2/Muscle: popular command line tools for multiple sequence alignment
• Input: Three sequences
– S1: TATACATTAAA
– S2: TAGGATTCCAC
– S3: TATACATTAAG
• S1 and S3 are highly similar.
• Output: Aligned sequences in a graph
ClustalW
• Step 1: Download ClustalW2 (http://www.clustal.org/clustal2/)

• Step 2: Install ClustalW2 (example file: opuntia.fasta)

• Step 3: Run the following scripts in Command line

– clustalw2 -infile=opuntia.fasta

• Step 4: Run the following scripts in Python

>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 3 rows and 11 columns
TAGGATTCCAC gi|6273290|gb|AF191664.1|AF191
TATACATTAAA gi|6273291|gb|AF191665.1|AF191
TAAGGTCTTTG gi|6273289|gb|AF191663.1|AF191
>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)
Biopython wrapper: command line tool
• Command line tool, will often print text output directly to
screen.
• This text can be captured or redirected, via two “pipes”:
standard output (the normal results) and standard error
(for error messages and debug messages).
• There is also standard input, which is any text fed into
the tool.
• These names get shortened to stdin, stdout and stderr.
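Capturing stdout and stderr can be tried without installing any alignment tool. In this sketch the "command line tool" is a stand-in: a one-line Python program started through the standard subprocess module. It is an assumption for illustration and is not how the Biopython wrappers are implemented.

```python
import subprocess
import sys

# Stand-in for a command line tool: a tiny program that writes one line
# to standard output and one line to standard error.
command = [
    sys.executable, "-c",
    "import sys; print('alignment done'); print('warning: demo', file=sys.stderr)",
]

# capture_output=True collects both pipes instead of printing them to the screen.
completed = subprocess.run(command, capture_output=True, text=True)
stdout, stderr = completed.stdout, completed.stderr

print("stdout:", stdout.strip())
print("stderr:", stderr.strip())
print("exit code:", completed.returncode)
```

The normal result and the warning arrive separately, so a script can save one and log the other.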
MSA with ClustalW
>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)
>>> in_file = "/path/KCNA1_aves_mammals.fas"
>>> # on Windows, use a raw string instead: in_file = r"C:\path\file"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file)
>>> print(clustalw_cline)
clustalw2 -infile=/path/KCNA1_aves_mammals.fas
>>> out_file = "/path/aligned.fasta"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file, outfile=out_file)
>>> print(clustalw_cline)
clustalw2 -infile=/path/KCNA1_aves_mammals.fas -outfile=/path/aligned.fasta
>>> stdout, stderr = clustalw_cline()    # only when you run this will you get output
>>> stdout
>>> from Bio import AlignIO
>>> align = AlignIO.read(out_file, "clustal")
MSA with ClustalW
>>> align
<<class 'Bio.Align.MultipleSeqAlignment'> instance (7 records of length 495,
SingleLetterAlphabet()) at 107170b38>

>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 495 columns
MTVMSGENVDEASAAPGHPQDGSYPRPAEHDDHECCERVVINIS...TDV A.mel
MTVMSGENVDEASAAPGHPQEGSYPRPAEHEDHECCERVVINIS...TDV P.tig
MTVMSGENVDEASAAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV H.sap
MTVMSGENADEASTAPGHPQDGSYPRQADHDDHECCERVVINIS...TDV M.mus
MTVMSGENVEEASAAQGHPQDISYPRPADHDDHDCCERVVINIS...TDV S.har
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV E.gar
MTVMAGENMDETSALPGHPQD-SY-QPAAHDDHECCERVVINIA...TDV M.uni
BLAST
• BioPython has several methods to work with the popular
NCBI BLAST software
• NCBIWWW.qblast() sends queries directly to the NCBI
BLAST server. The query can be a Seq object, FASTA
file, or a GenBank ID.
>>> from Bio import SeqIO
>>> from Bio.Blast import NCBIWWW
>>> query = SeqIO.read("test.fasta", format="fasta")    # generating the query
>>> # send the query as the seq part of the SeqRecord
>>> result_handle = NCBIWWW.qblast("blastn", "nt", query.seq)
>>> blast_file = open("my_blast.xml", "w")    # create an XML output file
>>> blast_file.write(result_handle.read())    # read the result and write it to the XML file
>>> blast_file.close()
>>> result_handle.close()
Parse BLAST Results
• It is often useful to obtain a BLAST result directly
(local BLAST server or via Web browser) and
then parse the result file with Python.
• Save the BLAST result in XML format
– NCBIXML.read() for a file with a single BLAST result (single
query)
– NCBIXML.parse() for a file with multiple BLAST results
(multiple queries)
>>> from Bio.Blast import NCBIXML
>>> handle = open("my_blast.xml")
>>> blast_record = NCBIXML.read(handle)
>>> for hit in blast_record.descriptions:
...     print(hit.title)
...     print(hit.e)
...
BLAST Record Object

HSP: High Scoring Pairs

View Aligned Sequence
>>> from Bio.Blast import NCBIXML
>>> handle = open("my_blast.xml")
>>> blast_record = NCBIXML.read(handle)
>>> for hit in blast_record.alignments:
...     for hsp in hit.hsps:
...         print(hit.title)
...         print(hsp.expect)
...         print(hsp.query[0:75] + '...')
...         print(hsp.match[0:75] + '...')
...         print(hsp.sbjct[0:75] + '...')
...
gi|731383573|ref|XM_002284686.2| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100248690), mRNA
2.5739e-53
ATGCTAGTATGCTCGGTCATTACGGGTTTGGCACT-CATTTCCTCAAATGGCTCGCCTGCCTTGCGGCTATTTAC...
|||| | || ||| ||| | || ||||||||| |||||| | | ||| | || | |||| || ||||| ...
ATGCCATTAAGCTTGGTGGTCTGGGCTTTGGCACTACATTTCTTGAG-TGGTTGGCTTCTTTTGCTGCCATTTAT...
Many Matches
• Often a BLAST search will return many matches
for a single query (save as an XML format file)
• NCBIXML.parse() returns these as an iterator of BLAST record objects, which you can collect in a list or deal with directly in a for loop.

from Bio.Blast import NCBIXML

E_VALUE_THRESH = 1e-20
for record in NCBIXML.parse(open("my_blast.xml")):
    if record.alignments:    # skip queries with no matches
        print("QUERY: %s" % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print("MATCH: %s " % align.title[:60])
                    print(hsp.expect)
Thank you
MSA with ClustalW
>>> from Bio.Align.Applications import ClustalwCommandline
>>> in_file = "/Users/mahavishnu/Documents/Ahmedabad_Univerisity/Courses_Teaching_material/BioInformatics_course/bioinfo_class/KCNA1_aves_mammals.fas"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file)
>>> print(clustalw_cline)
clustalw2 -infile=/Users/mahavishnu/Documents/Ahmedabad_Univerisity/Courses_Teaching_material/BioInformatics_course/bioinfo_class/KCNA1_aves_mammals.fas
>>> out_file = "/Users/mahavishnu/Documents/Ahmedabad_Univerisity/Courses_Teaching_material/BioInformatics_course/bioinfo_class/aligned.fasta"
>>> clustalw_cline = ClustalwCommandline("clustalw2", infile=in_file, outfile=out_file)
>>> print(clustalw_cline)
clustalw2 -infile=/Users/mahavishnu/Documents/Ahmedabad_Univerisity/Courses_Teaching_material/BioInformatics_course/bioinfo_class/KCNA1_aves_mammals.fas -outfile=/Users/mahavishnu/Documents/Ahmedabad_Univerisity/Courses_Teaching_material/BioInformatics_course/bioinfo_class/aligned.fasta
>>> stdout, stderr = clustalw_cline()
>>> from Bio import AlignIO
>>> # ClustalW writes Clustal format by default, so the file is read as "clustal" despite its .fasta name
>>> align = AlignIO.read(out_file, "clustal")
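
Recent Biopython releases mark the command-line wrappers in Bio.Align.Applications as deprecated and suggest calling the tool through `subprocess` instead. The sketch below builds the same `-infile=`/`-outfile=` command that the wrapper prints above. The helper names `build_clustalw_command` and `run_clustalw` and the short file names are assumptions for illustration, and `run_clustalw` only works where the `clustalw2` executable is installed.

```python
import subprocess


def build_clustalw_command(in_file, out_file, exe="clustalw2"):
    """Build the same command line that the wrapper prints above."""
    return [exe, "-infile=" + in_file, "-outfile=" + out_file]


def run_clustalw(in_file, out_file, exe="clustalw2"):
    """Run ClustalW and return its standard output; needs ClustalW installed."""
    cmd = build_clustalw_command(in_file, out_file, exe)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical short file names, used only to show the resulting command.
cmd = build_clustalw_command("KCNA1_aves_mammals.fas", "aligned.aln")
print(" ".join(cmd))
```

After `run_clustalw(in_file, out_file)` finishes, the alignment can be loaded with `AlignIO.read(out_file, "clustal")` exactly as before.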
