Python 13 Jul 2016
What is the Yale Center for Research Computing?
>>> radius=2
>>> pi=3.14
>>> diam=radius*2
>>> area=pi*(radius**2)
>>> title="fun with strings"
>>> pi="cherry"
>>> longnum=31415926535897932384626433832795028841971693993751058\
... 2097494459230781640628620899862803482534211706798214808651
>>> delicious=True
>>> l=[1,2,3,4,5,6,7,8,9]
>>> l+[11,12,13]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13]
>>> l[3:6]=['three to six']
>>> l
[1, 2, 3, 'three to six', 7, 8, 9]
>>> t=(1,2,3,4,5,6,7,8,9)
>>> t[4:6]
(5, 6)
>>> t[6]="changeme"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> t.append('more')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
While statements execute one or more statements repeatedly until the test becomes false.
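As a minimal sketch (the variable names are illustrative, not from the original slides), a while loop that counts down could look like this. It uses print() with a single argument so that it runs under both Python 2 and Python 3.

```python
# Count down from 3; the loop body repeats as long as the test is true.
count = 3
seen = []
while count > 0:
    seen.append(count)   # remember each value we visited
    print(count)
    count = count - 1    # without this, the test would never become false
```

Forgetting to change the variable used in the test is the usual cause of a loop that never ends.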
For statements take some sort of iterable object and loop once for every value.
If you loop over a dict, you’ll get just the keys. Use iteritems() to get keys and values (in Python 3, use items() instead).
>>> for denom in coins: print denom
...
quarter
nickel
penny
dime
>>> for denom, value in coins.iteritems(): print denom, value
...
quarter 25
nickel 5
penny 1
dime 10
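The session above assumes that a dict named coins already exists. One way it might be defined is sketched below; the definition is an assumption reconstructed from the printed output. The keys are sorted before printing because a dict does not promise any particular order in Python 2.

```python
# Hypothetical definition of the coins dict used in the session above.
coins = {"quarter": 25, "nickel": 5, "penny": 1, "dime": 10}

# Looping over the dict itself yields only the keys.
names = sorted(coins)            # sorted() gives a predictable order
for denom in names:
    print(denom)

# items() yields (key, value) pairs; it exists in Python 2 and Python 3.
total = 0
for denom, value in coins.items():
    total += value
print(total)
```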
While and for loops can skip the rest of the current iteration (continue) or terminate early (break).
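A small sketch of both statements in one loop, using made-up values: it adds up the odd numbers in a list and stops at the first negative number.

```python
# Sum the odd numbers in a list, stopping at the first negative value.
values = [1, 2, 3, 4, -1, 5, 7]
total = 0
for v in values:
    if v < 0:
        break        # terminate the loop early
    if v % 2 == 0:
        continue     # skip even numbers and go on to the next value
    total += v
print(total)
```

Only 1 and 3 are added; the 5 and 7 are never reached because break leaves the loop at -1.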
Functions allow you to write code once and use it many times.
Functions also hide details so code is more understandable.
>>> def area(w, h):
...     return w*h
...
>>> area(3, 4)
12
>>> area(5, 10)
50
Some languages differentiate between functions and procedures. In Python,
both are written as functions. A procedure is simply a function that returns no value (it returns None).
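To illustrate the point about procedures, here is a short sketch. The report function is a hypothetical example, not part of the original slides; it has no return statement, so calling it gives back None.

```python
def area(w, h):
    # A function that returns a value.
    return w * h

def report(w, h):
    # A "procedure": it does some work but has no return statement.
    print("area is %d" % area(w, h))

result = report(3, 4)    # prints: area is 12
print(result is None)    # a function without a return value gives back None
```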
FCID,Lane,Sample_ID,SampleRef,index,Description,Control,Recipe,...
160212,1,A1,human,TAAGGCGA-TAGAT,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A2,human,CGTACTAG-CTCTC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A3,human,AGGCAGAA-TATCC,None,N,Eland-rna,Mei,Jon_mix10
160212,1,A4,human,TCCTGAGC-AGAGT,None,N,Eland-rna,Mei,Jon_mix10
...
import sys
fp=open(sys.argv[1])
print fp.readline().strip()
We call readline() on the file pointer to get a single line from the file (the
header line).
strip() removes the newline at the end of the line.
Then we print it.
for l in fp:
    ...
    flds=l.strip().split(',')
    flds[4]=flds[4][:-3]
    print ','.join(flds)
join() takes a list of strings and combines them into one string, placing the
string it is called on (here a comma) between them. Then we just print that string.
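To see what each step does, the same three operations can be applied to a single line taken from the sample sheet above. This is only a sketch; the variable names line and fixed are illustrative.

```python
# One data line from the sample sheet above, including its trailing newline.
line = "160212,1,A1,human,TAAGGCGA-TAGAT,None,N,Eland-rna,Mei,Jon_mix10\n"

flds = line.strip().split(",")   # remove the newline, then split on commas
flds[4] = flds[4][:-3]           # drop the last 3 characters of the index
fixed = ",".join(flds)           # glue the fields back together with commas
print(fixed)
```

The index field TAAGGCGA-TAGAT becomes TAAGGCGA-TA, and every other field is unchanged.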
Reviewing:
import sys
fp=open(sys.argv[1])
print fp.readline().strip()
for l in fp:
    flds=l.strip().split(',')
    flds[4]=flds[4][:-3]
    print ','.join(flds)
We could skip certain lines (those with anything other than human in the 4th column).
We could also specify the output file on the command line.
import sys
fp=open(sys.argv[1])
ofp=open(sys.argv[2], 'w')
print >> ofp, fp.readline().strip()
for l in fp:
    flds=l.strip().split(',')
    if flds[3] != 'human': continue
    flds[4]=flds[4][:-3]
    print >> ofp, ','.join(flds)
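The same filter can also be wrapped in a function, which makes it easy to reuse and to try on small test files. The sketch below is one possible arrangement, not the slides' own code: the name filter_sheet and its keep parameter are hypothetical, and it uses with blocks so both files are closed automatically.

```python
def filter_sheet(inpath, outpath, keep="human"):
    """Copy the header, keep rows whose 4th column equals `keep`,
    and trim the last 3 characters of the index column."""
    with open(inpath) as fp, open(outpath, "w") as ofp:
        ofp.write(fp.readline().strip() + "\n")      # header line
        for line in fp:
            flds = line.strip().split(",")
            if flds[3] != keep:
                continue                             # skip non-matching rows
            flds[4] = flds[4][:-3]
            ofp.write(",".join(flds) + "\n")
```

In a script it could be called as filter_sheet(sys.argv[1], sys.argv[2]) after importing sys.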
import sys
wrotehdr=False
for f in sys.argv[1:]:
    fp=open(f)
    hdr=fp.readline().strip()
    if not wrotehdr:
        print hdr
        wrotehdr=True
    for l in fp:
        flds=l.strip().split(',')
        flds[4]=flds[4][:-3]
        print ','.join(flds)
We need a way to traverse all the files and directories. os.walk(dir) starts at dir
and visits every subdirectory below it. For each directory it visits, it yields the
directory's path, a list of its subdirectories, and a list of its files.
For example, imagine we have the following dirs and files:
d1
d1/d2
d1/d2/f2.txt
d1/f1.txt
>>> import os
>>> for d, dirs, files in os.walk('d1'):
...     print d, dirs, files
...
d1 ['d2'] ['f1.txt']
d1/d2 [] ['f2.txt']
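The file names that os.walk reports are relative to the directory being visited, so they usually need to be joined with that directory before they can be opened. The sketch below rebuilds the example tree in a temporary directory (an assumption made so the example is self-contained) and collects the path of every file.

```python
import os
import tempfile

# Build the example tree (d1/f1.txt and d1/d2/f2.txt) in a scratch directory.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "d1", "d2"))
for rel in (("d1", "f1.txt"), ("d1", "d2", "f2.txt")):
    open(os.path.join(top, *rel), "w").close()

# os.walk yields (directory, subdirectory names, file names) for each
# directory; os.path.join turns each file name into a usable path.
found = []
for d, dirs, files in os.walk(os.path.join(top, "d1")):
    for name in files:
        found.append(os.path.join(d, name))

for path in sorted(found):
    print(path)
```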
import subprocess
ret=subprocess.call(cmd, shell=True)
ret=subprocess.call('quip -c myfile.fastq > myfile.fastq.qp', shell=True)
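subprocess.call returns the exit status of the command, which is worth checking before relying on its output. The sketch below uses the running Python interpreter as a stand-in for a real tool such as quip, so that it works without any extra software. It passes the command as a list rather than a shell string, which is a different choice from the calls above.

```python
import subprocess
import sys

# Run two tiny child processes; sys.executable is the running Python
# interpreter, used here as a stand-in for a real command-line tool.
ok = subprocess.call([sys.executable, "-c", "pass"])
bad = subprocess.call([sys.executable, "-c", "import sys; sys.exit(3)"])

# By convention 0 means success and anything else signals a failure.
for name, ret in (("ok", ok), ("bad", bad)):
    if ret != 0:
        print("%s failed with exit status %d" % (name, ret))
```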
Dictionaries associate names with data, and allow quick retrieval by name.
By nesting dictionaries, powerful lookups are easy.
In this example, we’ll:
create a dict containing objects
load the objects with search data
use the dict to retrieve the appropriate object for a search
perform the search
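Before moving to the real data, a tiny sketch of a nested dict may help. The records here are made up for illustration: the outer dict is keyed by chromosome, and each inner dict is keyed by gene name.

```python
# A dict of dicts: the outer key is a chromosome, the inner key a gene name.
# The records are hypothetical (chromosome, gene name, start position).
records = [("chr1", "GENE_A", 100), ("chr1", "GENE_B", 500), ("chr2", "GENE_C", 50)]

genes = {}
for chrom, name, start in records:
    if chrom not in genes:
        genes[chrom] = {}        # create the inner dict on first use
    genes[chrom][name] = start

# Two lookups by name get straight to the value.
print(genes["chr1"]["GENE_B"])
```

The "create on first use" test is the same pattern used when building one object per chromosome.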
We have another file with DNA sequences and the positions where they mapped:
HWI-ST830:206:D2411ACXX:1:1114:6515:89952 401 chr1 10536 0 76M = 222691803 222681343 TACCACCGAAATCTGTGCAG
GCTCTCCGGGTCTGTGCTGAGGAGAACGC ##B<2DDDDDDDCCDCC@CC@C@282BBCCDDBDDFHIJJJIGJIIGIGFIGJJIJJJJJJJJHGGHHFFFFDCC@ XA:i:1 MD:Z:24C51 NM:i:1 XP:Z
HWI-ST830:206:D2411ACXX:1:1114:6515:89952 177 chr1 10536 0 76M chr3 197908818 0 TACCACCGAAATCTGTGCAGAGGAGAAC
GGTCTGTGCTGAGGAGAACGC ##B<2DDDDDDDCCDCC@CC@C@282BBCCDDBDDFHIJJJIGJIIGIGFIGJJIJJJJJJJJHGGHHFFFFDCC@ XA:i:1 MD:Z:24C51 NM:i:1 XP:Z:chr3 19
HWI-ST830:206:D2411ACXX:1:1114:6515:89952 401 chr1 10536 0 76M chr4 120370019 0 TACCACCGAAATCTGTGCAGAGGAGAAC
GGTCTGTGCTGAGGAGAACGC ##B<2DDDDDDDCCDCC@CC@C@282BBCCDDBDDFHIJJJIGJIIGIGFIGJJIJJJJJJJJHGGHHFFFFDCC@ XA:i:1 MD:Z:24C51 NM:i:1 XP:Z:chr4 12
HWI-ST830:206:D2411ACXX:1:1114:6515:89952 433 chr1 10536 0 76M chr9 141135264 0 TACCACCGAAATCTGTGCAGAGGAGAAC
GGTCTGTGCTGAGGAGAACGC ##B<2DDDDDDDCCDCC@CC@C@282BBCCDDBDDFHIJJJIGJIIGIGFIGJJIJJJJJJJJHGGHHFFFFDCC@ XA:i:1 MD:Z:24C51 NM:i:1 XP:Z:chr9 14
...
We’d like to be able to quickly determine the genes overlapped by a DNA sequence.
We’ll use interval trees, one for each chromosome, to store an interval for each gene.
Then we’ll find the overlaps for mapped dna sequences.
Again, in pseudocode:
# create the interval trees
create empty dict
open the gene file
for each line in the file:
    get gene name, chrom, start, end
    initialize an intervaltree for the chrom, if needed, and add to dict
    add the interval and gene name to the interval tree
# use the interval trees to find overlapped genes
open the dna sequence file
for each line in the file:
    get chrom, mapped position, and dna seq
    look up the interval tree for that chrom in the dict
    search the interval tree for overlaps [pos, pos+len]
    print out the gene names
import sys
from intervaltree import IntervalTree

print "initializing"
genefinder={}
for line in open(sys.argv[1]):
    genename, chrm, strand, start, end = line.split()[0:5]
    if chrm not in genefinder:
        genefinder[chrm]=IntervalTree()
    genefinder[chrm][int(start):int(end)]=genename

print "reading sequences"
for line in open(sys.argv[2]):
    tag, flag, chrm, pos, mapq, cigar, rnext, \
        pnext, tlen, seq, qual = line.split()[0:11]
    if chrm not in genefinder: continue   # no genes known on this chromosome
    genes=genefinder[chrm][int(pos):int(pos)+len(seq)]
    if genes:
        print tag
        for gene in genes:
            print '\t',gene.data
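To experiment with the lookup logic without installing intervaltree, a plain list of (start, end, name) tuples per chromosome can stand in for the tree. This is only a sketch under stated assumptions: the gene table is made up, the function name overlaps is hypothetical, and the linear scan is far slower than a real interval tree on large gene sets. It does use the same dict-per-chromosome structure and the same half-open range [pos, pos+len).

```python
# Hypothetical gene table: (name, chrom, start, end), half-open intervals.
gene_rows = [
    ("GENE_A", "chr1", 10000, 12000),
    ("GENE_B", "chr1", 11500, 15000),
    ("GENE_C", "chr2", 500, 900),
]

# One list of intervals per chromosome, stored in a dict keyed by chromosome.
genefinder = {}
for name, chrom, start, end in gene_rows:
    genefinder.setdefault(chrom, []).append((start, end, name))

def overlaps(chrom, pos, length):
    """Return names of genes overlapping [pos, pos+length) on chrom."""
    qstart, qend = pos, pos + length
    hits = []
    for start, end, name in genefinder.get(chrom, []):
        if start < qend and qstart < end:   # the two ranges share a position
            hits.append(name)
    return sorted(hits)

print(overlaps("chr1", 11900, 76))
```

A chromosome that is missing from the dict simply yields an empty list, which mirrors the guard needed before indexing the dict of trees.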