Perl Bioinf 0411 PDF
Science and Technology Support Group High Performance Computing Ohio Supercomputer Center 1224 Kinnear Road Columbus, OH 43212-1163
Table of Contents
Section 1
  Concatenate sequences
  Transcribe DNA to RNA
  Reverse complement of sequences
  Read sequence data from files
  Searching for motifs in DNA or proteins
  Exercises 1
Section 2
  Subroutines
  Mutations and Randomization
  Translating DNA into Proteins using Libraries of Subroutines
  BioPerl Modules
  Exercises 2
Section 3
  Read FASTA Files
  Exercises 3
Section 4
  GenBank Files and Libraries
  Exercises 4
Section 5
  PDB
  Exercises 5
Section 6
  Blast Files
  Exercises 6
Using Perl for Bioinformatics
Note that the different forms of the assignment to $DNA3 achieve the same result:
1. $DNA3 = "$DNA1$DNA2";
2. $DNA3 = $DNA1 . $DNA2;
Results of running example 1-1:
Here are the original two DNA fragments:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 1):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 2):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Here is the concatenation of the first two fragments (version 3):
ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
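As a minimal sketch of the concatenation forms discussed above (not the workshop's original listing), the following uses the two fragments from the results. The third form, join(), is an assumption added for comparison; the original example may have used a different third version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two fragments shown in the results above
my $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Version 1: interpolate both variables inside a double-quoted string
my $v1 = "$DNA1$DNA2";

# Version 2: the dot (string concatenation) operator
my $v2 = $DNA1 . $DNA2;

# Version 3 (assumed alternative): join with an empty separator
my $v3 = join('', $DNA1, $DNA2);

print "$v1\n$v2\n$v3\n";
```

All three variables should hold the same 58-character string.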
1. Assign the variable $RNA to the string $DNA.
2. $RNA =~ s/T/U/g; is evaluated as: substitute all uppercase Ts with uppercase Us.
Results of running example 1-2:
Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the result of transcribing the DNA to RNA:
ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
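The two steps can be sketched as a small subroutine. The name transcribe() is hypothetical and is not taken from the workshop code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# transcribe (hypothetical name): copy the DNA, then replace every
# uppercase T with an uppercase U in the copy.
sub transcribe {
    my ($dna) = @_;
    my $rna = $dna;       # step 1: assign $RNA the string $DNA
    $rna =~ s/T/U/g;      # step 2: substitute all Ts with Us
    return $rna;
}

print transcribe('ACGGGAGGACGGGAAAATTACTACGGCATTAGC'), "\n";
```

Because the substitution is applied to a copy, the original DNA string is left unchanged.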
This would result in an error! Fortunately, Perl has an operator that is used alongside regular expressions: tr, the translate (transliteration) operator.
Note that the tr operator replaces each character in the first list with the corresponding character in the second list. In this example both the uppercase and the lowercase bases are translated.
Results of running example 1-3:
Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the reverse complement DNA:
GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT
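A minimal sketch of the reverse complement using reverse and tr follows. It is written as a subroutine named revcom(), the name the later sections use; the body here is an illustration and not necessarily the workshop's own listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# revcom: reverse the string, then swap each base for its complement.
sub revcom {
    my ($dna) = @_;
    # Assigning to a scalar puts reverse in scalar context,
    # so it reverses the characters of the string.
    my $revcom = reverse $dna;
    # tr maps each character in the first list to the one in the
    # same position in the second list, for both cases.
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

print revcom('ACGGGAGGACGGGAAAATTACTACGGCATTAGC'), "\n";
```

Unlike a chain of s/// substitutions, tr performs all the replacements in a single pass, so a base that has already been complemented is never changed a second time.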
Compare the motif entered from the terminal to the protein string, using a regular expression match. Exit the program when the motif contains only whitespace.
Results from running example1-5.pl:
Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!
Enter a motif to search for: sqlv
I couldn't find it.
Enter a motif to search for: QDSV
I found it!
Enter a motif to search for: HERLPQGLQ
I found it!
Enter a motif to search for:
I couldn't find it.
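The matching step can be sketched without the file and terminal handling. The subroutine name find_motif() and the short protein string below are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# find_motif (hypothetical name): return the text that matched the
# motif, or undef when the motif is empty or does not occur.
sub find_motif {
    my ($protein, $motif) = @_;
    return undef if $motif =~ /^\s*$/;    # whitespace only: nothing to search
    # The motif is used as a regular expression; the match is case-sensitive.
    return $protein =~ /($motif)/ ? $1 : undef;
}

my $protein = 'MKQDSVLQAEETVKNDEEL';     # made-up sequence
foreach my $motif ('SVLQ', 'sqlv', 'EE.*EE') {
    my $hit = find_motif($protein, $motif);
    print defined $hit ? "I found it: $hit\n" : "I couldn't find it.\n";
}
```

A lowercase motif such as sqlv fails against an uppercase sequence because the match is case-sensitive; adding the /i modifier would be one way to relax that.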
Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr functions. Use the substr function, and examine each base one at a time in the original while you build up the reverse complement. (Hint: you might find it easier to examine the original right to left, rather than left to right, although either is possible.)
Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C in the DNA.)
Modify Example 1-5 to not only find motifs by regular expressions but to print out the motif that was found. For example, if you search, using regular expressions, for the motif EE.*EE, your program should print EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set to hold the pattern that was matched.
Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl functions substr or slice.)
This is the main program, which seeds the random number generator and calls the subroutine mutate(). The call to srand() uses the seed time|$$, which bitwise-ORs the current time with the process ID to create a seed that differs from run to run. This is not a very secure method, but it will do for our purposes. The argument to mutate() is the current DNA string.
The subroutine mutate() takes its argument from the special array @_ and assigns it to the variable $dna. The array @nucleotides is initialized with the values of our nucleotides. The subroutine randomposition() takes the current DNA string and returns a position within the string. The subroutine randomnucleotide() takes our array of bases and returns a randomly selected value. Finally, the Perl function substr() takes the DNA string, the random position, the length of the region to replace (here 1), and the replacement string, leaving the mutated string in the variable $dna.
randomnucleotide() passes our array of bases to the function randomelement() and, in turn, returns the randomly chosen nucleotide. randomelement() is given an array and returns a randomly selected element from it. How is this done? rand() expects a scalar value, so the array @array is evaluated in scalar context, which yields the size of @array; rand() then returns a floating-point number from 0 up to, but not including, that size. Perl was designed to take the integer part of a floating-point value as an array subscript. So $array[rand @array] returns the element of the array whose subscript is randomly chosen from 0 to n-1, where n is the length of the array.
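The pieces described above can be put together in a short sketch. The subroutine names follow the text; the bodies are a reconstruction under that description, not a copy of the workshop listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a randomly selected element of the array passed in.
# rand @array yields a float in [0, n); the subscript keeps the integer part.
sub randomelement {
    my (@array) = @_;
    return $array[rand @array];
}

# Return a random position (0 .. length-1) within the string.
sub randomposition {
    my ($string) = @_;
    return int rand length $string;
}

# Replace one randomly chosen base with a randomly chosen nucleotide.
# The new base may equal the old one, so the string can come back unchanged.
sub mutate {
    my ($dna) = @_;
    my @nucleotides = ('A', 'C', 'G', 'T');
    my $position = randomposition($dna);
    my $newbase  = randomelement(@nucleotides);
    substr($dna, $position, 1, $newbase);   # four-argument substr edits $dna in place
    return $dna;
}

srand(time | $$);    # seed from the current time ORed with the process ID
print mutate('AAAAAAAAAAAAAAAAAAAA'), "\n";
```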
The subroutine codon2aa() returns a single-character amino acid from the 3-character codon input.
We need to write a loop which will grab 3 characters at a time while stepping through the sequence.
    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    } else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}
This subroutine takes, as an argument, a three-character DNA sequence and returns the single-character representation of the amino acid. The data type used is a hash lookup. The condition if (exists $genetic_code{$codon}) searches for a match between the 3 characters of the codon and the list of keys in the hash. The value associated with the key, if found, is returned. Otherwise an error is reported and the program terminates. This subroutine is included in the module BeginPerlBioinfo.pm, which will be used, with other subroutines, throughout the rest of the workshop.
This is the Perl code which, with only a few lines, translates DNA into a protein sequence. The command use lib instructs the Perl compiler to add a directory to the search path for necessary libraries, like BeginPerlBioinfo.pm. BeginPerlBioinfo.pm is a part of the book Beginning Perl for Bioinformatics, by James Tisdall.
The for loop steps through the DNA string by threes, starting at index 0 (indices 0, 3, 6, 9, ...):
CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC
The 3-character substring is assigned to the $codon variable by the Perl function substr. Then the amino acid returned by the subroutine codon2aa() is appended to the end of the current protein string, $protein.
Results from running example2-3.pl:
I translated the DNA
CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC
into the protein
RRLRTGLARVGR
exit;
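The translation loop can be sketched in a self-contained form. To keep it short, the hash is built from the standard genetic code written as a 64-character string in T, C, A, G codon order, with _ for stop codons; this compact construction is an assumption made here and is not how BeginPerlBioinfo.pm writes out its table. Unlike the listing above, this codon2aa() dies on a bad codon instead of printing and exiting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the codon table: codons enumerated in T, C, A, G order,
# amino acids read off the standard genetic code ('_' marks a stop).
my %genetic_code;
{
    my @bases = qw(T C A G);
    my $amino = 'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
    my $i = 0;
    for my $b1 (@bases) {
        for my $b2 (@bases) {
            for my $b3 (@bases) {
                $genetic_code{"$b1$b2$b3"} = substr($amino, $i++, 1);
            }
        }
    }
}

# Look up one codon; stop with an error message on anything unknown.
sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    die "Bad codon \"$codon\"!!\n" unless exists $genetic_code{$codon};
    return $genetic_code{$codon};
}

# Step through the DNA by threes (indices 0, 3, 6, ...) and append
# each amino acid to the growing protein string.
sub dna2peptide {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i < length($dna) - 2; $i += 3) {
        $protein .= codon2aa(substr($dna, $i, 3));
    }
    return $protein;
}

print dna2peptide('CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC'), "\n";
```

The loop bound length($dna) - 2 means that one or two leftover bases at the end of the string are ignored rather than passed to codon2aa().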
Section 2 : BioPerl and CPAN
Example 2-4 : Installing and testing bioperl
My own experiences were slightly different:
Download the core bioperl install file (version 1.4 was the most recent).
Follow the make instructions included in the INSTALL documentation.
Carefully follow the make test instruction.
  Make sure you have an internet connection.
  I noticed that the LWP and IO::String modules were involved in quite a few failures.
Here is where I installed the missing modules using the CPAN shell:
  perl -MCPAN -e shell
At the CPAN prompt, install the missing module:
  cpan> install LWP
After exiting the CPAN shell, try make test again to see if it reduces the number of failures.
After concluding that the remaining failures won't impede using bioperl, run make install. On Linux systems this usually puts the modules in /usr/lib/perl5/5.x.x/site_perl.
The last Perl script uses NCBI to BLAST a sequence and saves the results to a file. This should be used judiciously, as we don't want to abuse the computing cycles of NCBI. Requests like these should be made only for individual searches. To run large numbers of BLAST searches, download the BLAST package and run it locally.
In this section we will focus on reading the FASTA format.
Sample of FASTA format:
> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
get_file_data:
  Read in the data
  Return an array which contains each line of the file, @data
extract_sequence_from_fasta_data:
  Read in the array of file data in FASTA format
  Discard all header, blank and comment lines
  If the first character of the first line is >, discard the line
  Read in the rest of the file, joined in a scalar, and edit out non-sequence data (whitespace)
  Return the sequence
print_sequence:
  More often than not, the sequence to print is longer than most page widths
  Need to specify a length parameter to control the output
extract_sequence_from_fasta_data() takes the array that holds the contents of the FASTA file. The foreach loop takes each element of the array, a complete line of the file, and assigns it to the variable $line. The different conditions help us ignore the blank, comment and header lines:
/^\s*$/ matches lines that contain only whitespace from beginning to end
/^\s*#/ matches comment lines, which have the pound character preceded only by optional whitespace
/^>/ matches the FASTA header line, which has the greater-than symbol at the beginning of the line
All other lines are concatenated together into the $sequence variable. When all is done, all whitespace characters are removed: $sequence =~ s/\s//g; The sequence is returned to the calling routine.
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
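For reference, the whole subroutine can be sketched from the description above. The slide shows only its last lines, so everything before the whitespace cleanup is a reconstruction under that description, and the input lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given the lines of a FASTA file, return the sequence as one string.
sub extract_sequence_from_fasta_data {
    my (@fasta_file_data) = @_;
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        next if $line =~ /^\s*$/;    # blank line
        next if $line =~ /^\s*#/;    # comment line
        next if $line =~ /^>/;       # FASTA header line
        $sequence .= $line;          # everything else is sequence data
    }
    # remove non-sequence data (in this case, whitespace)
    $sequence =~ s/\s//g;
    return $sequence;
}

my @lines = (
    "> sample dna | toy header\n",
    "agatggcggc gctgag\n",
    "\n",
    "# a comment line\n",
    "gggtcttggg\n",
);
print extract_sequence_from_fasta_data(@lines), "\n";
```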
Finally, the print_sequence() routine takes the cleaned string and an integer specifying the number of characters to print per line. Again notice that the variables are assigned from the special array, @_. The printing is accomplished by the for loop and the substr function: on each pass, the print command prints the next substring of the complete string on a new line.
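A sketch of that loop follows. The helper split_sequence() is hypothetical; it is separated out here only so that the line-breaking logic can be checked without capturing printed output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_sequence (hypothetical helper): cut the sequence into pieces
# of at most $length characters, using substr in a for loop.
sub split_sequence {
    my ($sequence, $length) = @_;
    my @lines;
    for (my $pos = 0; $pos < length($sequence); $pos += $length) {
        push @lines, substr($sequence, $pos, $length);
    }
    return @lines;
}

# print_sequence: print each piece on its own line.
sub print_sequence {
    my ($sequence, $length) = @_;
    print "$_\n" for split_sequence($sequence, $length);
}

print_sequence('agatggcggcgctgaggggtcttgggggctctagg', 25);
```

The last piece is simply shorter when the sequence length is not a multiple of the line length, since substr returns whatever characters remain.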
Now that we have produced the subroutines needed for our program, they have been installed in the BeginPerlBioinfo.pm module. Our program may be succinctly written as in Example 3-1. The final command prints the sequence, passing the character string and the line length to the print_sequence subroutine.
Output from example3-1:
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
Example 3-1
#!/usr/bin/perl
# Read a fasta file and extract the sequence data

use lib '../ModLib/';   # Must point to where BeginPerlBioinfo.pm resides
use strict;
use warnings;
use BeginPerlBioinfo;

# Declare and initialize variables
my @file_data = ( );
my $dna = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Print the sequence in lines 25 characters long
print_sequence($dna, 25);

exit;
We are going to reuse our old code from Section 1, revcom(), which we have to rewrite as a subroutine. Now we need to design the subroutine which will break the DNA string into our reading frames and translate each one into protein. Our old Perl function substr() should do the trick for taking apart our frames. The unless($end) condition checks for a value in the variable $end; if there is no value, it sets the end value to the length of the sequence. Positions are counted from 1 while substr() counts from 0, but the length of the desired sequence doesn't change with the change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1
For translating to peptides we revisit our codon2aa() subroutine from Section 2. It is called from a subroutine, dna2peptide(), which is already in BeginPerlBioinfo.pm.
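The index arithmetic can be sketched on its own. The subroutine name frame_dna() is hypothetical; it covers only the slicing step, before any translation, with start and end given as 1-based positions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# frame_dna (hypothetical name): return the part of $seq from 1-based
# position $start through $end; $end defaults to the sequence length.
sub frame_dna {
    my ($seq, $start, $end) = @_;
    $end = length($seq) unless $end;
    # substr counts from 0, so the offset is $start - 1;
    # the length is (end - 1) - (start - 1) + 1 = end - start + 1.
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'CGACGTCTTCGT';
# The three forward frames start at positions 1, 2 and 3.
print "frame $_: ", frame_dna($dna, $_), "\n" for 1 .. 3;
```

The three reverse frames would be taken the same way from the reverse complement of the sequence.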
Now that we have done all that work, and it appears that our subroutines will provide the function we need, these routines are provided in BeginPerlBioinfo.pm. So the Perl program is a short exercise and is very modular.
Output from example 3-2:
-------Reading Frame 1-------
RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAE
WSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKP
DINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKS
RERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSP
QPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRI
REDEGAVASSTVKEPPEATATPEPLSDEDL
-------Reading Frame 5-------
RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDS
EGVTGESEEGKYLYDSRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRH
ASHSPHMRADRLICCCCCW_CWLGVATKGCGEDLWGEAEPRASMAPTPVPDPARR
CRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFSLHSRQYHSRM
ALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGG
SGSEPSPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAY
SYCAGPMRRLRCKPVGGRPRAPKTPQRRH
-------Reading Frame 6-------
GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTL
RASLVRARKGSTCTIPGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADM
PHTHHTCGLTV_SAAAAAGDAGWVWPPRAAERICGAKQSPEQAWPQPLSLTLPGA
AGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSLCTPDSTTPGW
PWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEA
LGLNHLPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRT
PIAQAPCVVSAANQ_VAGLEPPRPLSAAI
Section 4 : GenBank (Genetic Sequence Data Bank) Files
International repository of known genetic sequences from a variety of organisms
GenBank is a flat file, an ASCII text file, that is easily readable
GenBank is referred to as a databank or data store
  Databases have a relational structure that includes associated indices, links, and a query language
Perl modules and constructs are ideal for processing flat files
For additional bioinformatics software, refer to these web sites:
  National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), http://proxy.lib.ohio-state.edu:2224
  European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
  European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de/
For a view of the complete file and its format, look at record.gb in Section 4 of the exercises. A typical GenBank entry is packed with information. With Perl we will be able to separate the different parts. For instance, by extracting the sequence, we can search for motifs, calculate statistics on the sequence, or compare it with other sequences. Also, by separating the various parts of the data annotation, we have access to ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation includes specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on. The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
We need a subroutine which will separate the annotation part of the file from the DNA sequence. In the main Perl code, parse1() is that subroutine. But there is a different twist this time. The variables @annotation and $sequence are included as arguments in the call, each preceded by the backslash character. The backslash creates a reference to the variable, so the argument is passed by reference rather than by value. The actual location in memory is passed to the subroutine, and any changes made there remain after the subroutine returns to the calling routine.
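The effect of the backslash can be seen in a very small sketch. The subroutine fill() and the values it stores are hypothetical, chosen only to show that the caller's variables are changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# fill (hypothetical name): receives a reference to an array and a
# reference to a scalar, and changes both through the references.
sub fill {
    my ($annotation_ref, $sequence_ref) = @_;
    push @$annotation_ref, "LOCUS       DEMO\n";   # @$ref dereferences the array
    $$sequence_ref = 'acgt';                       # $$ref dereferences the scalar
}

my @annotation = ( );
my $sequence   = '';

# The backslash passes references, not copies of the values
fill(\@annotation, \$sequence);

print "annotation lines: ", scalar(@annotation), "\n";
print "sequence: $sequence\n";
```

Had the variables been passed without the backslash, fill() would have received copies of their values in @_, and the assignments inside it would not have reached @annotation or $sequence.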
The parse1() routine takes references to the annotation array and to the sequence string. Again notice that the variables are assigned from the special array, @_. The two variables accept the references to memory for @annotation and $dna. The array @GenBankFile is parsed in the foreach loop. Each line is extracted and evaluated. There are two parts to the file, and because the annotation part appears first, the flag $in_sequence is initially set to false. The conditional checks are:
1. Check for the end-of-record line, //; last leaves the loop
2. If $in_sequence is true, append the line to the DNA sequence
3. If the word ORIGIN is at the beginning of the line, set the $in_sequence variable to true
4. Otherwise append the annotation line to the array
The last command deletes all whitespace and digits from the DNA sequence.
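The four checks can be sketched as follows. To keep the example self-contained, the subroutine takes the record lines directly instead of reading a file, and it is named parse_lines(), a hypothetical name; the toy record is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_lines (hypothetical name): split GenBank record lines into
# annotation (array reference) and sequence (scalar reference).
sub parse_lines {
    my ($annotation, $dna, @GenBankFile) = @_;
    my $in_sequence = 0;          # the annotation comes first
    $$dna = '';
    foreach my $line (@GenBankFile) {
        if ($line =~ m{^//}) {            # 1. end-of-record line
            last;
        } elsif ($in_sequence) {          # 2. inside the sequence
            $$dna .= $line;
        } elsif ($line =~ /^ORIGIN/) {    # 3. sequence starts on the next line
            $in_sequence = 1;
        } else {                          # 4. still in the annotation
            push @$annotation, $line;
        }
    }
    $$dna =~ s/[\s0-9]//g;        # delete whitespace and digits
}

my @record = (
    "LOCUS       DEMO\n",
    "DEFINITION  toy record\n",
    "ORIGIN\n",
    "        1 acgtac gtacgt\n",
    "       13 ttgg\n",
    "//\n",
);
my (@annotation, $dna);
parse_lines(\@annotation, \$dna, @record);
print @annotation, "$dna\n";
```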
Section 4 : GenBank (Genetic Sequence Data Bank) Files
Example 4-2 : Parsing GenBank file using scalars
A second way to separate annotations from sequences in GenBank records is to read the entire record into a scalar variable
Then operate on it with regular expressions
For some kinds of data, this can be a more convenient way to parse the input
There is a problem with the multiple newlines, \n, in the record
Previous regular expressions have used the caret (^), dot (.), and dollar sign ($) metacharacters
The following two pattern modifiers affect these three metacharacters:
  The /s modifier assumes you want to treat the whole string as a single line, even with embedded newlines; it makes the dot metacharacter match any character, including newlines.
  The /m modifier assumes you want to treat the whole string as multiple lines, with embedded newlines; it extends ^ and $ to match just after, or just before, a newline embedded in the string.
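The difference between the two modifiers shows up on a two-line string. This is a small illustration with made-up text; the variable names are chosen only for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s the dot stops at a newline, so the match cannot cross lines.
my $dot_plain = ($text =~ /first.*second/)  ? 1 : 0;
# With /s the dot also matches the embedded newline.
my $dot_s     = ($text =~ /first.*second/s) ? 1 : 0;

# Without /m the caret matches only at the very start of the string.
my $caret_plain = ($text =~ /^second/)  ? 1 : 0;
# With /m the caret also matches just after an embedded newline.
my $caret_m     = ($text =~ /^second/m) ? 1 : 0;

print "dot: plain=$dot_plain /s=$dot_s\n";
print "caret: plain=$caret_plain /m=$caret_m\n";
```

The two modifiers are independent and can be combined as /ms when a pattern needs both behaviors.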
The output of this program is just about the same as that of Example 4-1. There is the annotation section and the DNA section. The input separator is initially set to the two forward slashes followed by a newline, //\n, which marks the end of a GenBank record. The contents of the record are assigned to the one string variable, $record. Now we will use the regular expression that parses the annotation and sequence out of the $record variable: $record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s. There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) and (.*). The parentheses are metacharacters which remember the parts of the data that match the pattern within the parentheses, here the annotation and the sequence. Also note that the pattern match returns a list whose elements are the matched parenthetical patterns. After matching the annotation and the sequence within the pairs of parentheses in the regular expression, the matched patterns are assigned to the two variables $annotation and $dna:
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
Notice that at the end of the pattern we've added the /s pattern matching modifier, which, as you've seen earlier, allows a dot to match any character including an embedded newline.
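The capture can be tried on a toy record held in a string. The record text is made up, and the wrapper subroutine split_record() is a hypothetical name; the regular expression is the one quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_record (hypothetical name): return the annotation and the
# cleaned sequence of one GenBank record held in a single string.
sub split_record {
    my ($record) = @_;
    my ($annotation, $dna) =
        ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
    $dna =~ s/[\s0-9]//g;     # strip whitespace and position numbers
    return ($annotation, $dna);
}

my $record = "LOCUS       DEMO\n"
           . "DEFINITION  toy record\n"
           . "ORIGIN\n"
           . "        1 acgt acgt\n"
           . "//\n";

my ($annotation, $dna) = split_record($record);
print $annotation, "$dna\n";
```

When reading from a file, setting $/ = "//\n" before the read makes each read return one whole record, which is the form this pattern expects.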
Section 5 : Protein Data Bank (PDB)
PDB is the main source for information about 3D structures of macromolecules:
  proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, carbohydrates
PDB files are like GenBank records, human-readable ASCII flat files
There may be problems with the consistency of PDB files
Routines and programs which work well with newer PDB files may have problems with older files
We will look at tools which operate on large numbers of files and folders
Example 5-1 : Print the contents of folders and all subfolders
Use the Perl functions opendir and readdir
Need to construct logic to read into subfolders
Use the file tests -f and -d to test for regular files and folders
Section 5 : PDB
Example 5-1
#!/usr/bin/perl
# Example 5-1 Demonstrating how to open a folder and list its contents
# -distinguishing between files and subfolders, which
# are themselves listed

use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

my @files = ( );
my $folder = 'pdb';

# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}

# Read the folder, ignoring special entries "." and ".."
@files = grep (!/^\.\.?$/, readdir(FOLDER));
closedir(FOLDER);

# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name!
foreach my $file (@files) {
    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";
    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {
This Perl code attempts to open a folder and its subfolders to list the files contained in each. The opendir function opens a directory, and readdir returns the list of its entries. Using the grep function and a regular expression, we can cull out the entries . and ..:
@files = grep (!/^\.\.?$/, readdir(FOLDER));
Here !/^\.\.?$/ is checked against each entry in the list created by readdir, keeping anything that is NOT . or .. Also, the embedded condition that checks whether an entry is a folder/directory permits us to dive into each folder and retrieve its files.
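The filter can be tried on a plain list, without touching the filesystem. The entry names below are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up list standing in for what readdir might return
my @entries = ('.', '..', 'pdb1c1f.ent', '.hidden', '...', 'c1');

# Keep every entry that is not exactly "." or ".."
# ^\.   a literal dot at the start
# \.?   optionally followed by a second literal dot
# $     and then the end of the name
my @files = grep (!/^\.\.?$/, @entries);

print "$_\n" for @files;
```

Names that merely begin with a dot, such as .hidden or ..., are kept, because the pattern is anchored at both ends and allows at most two dots.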
Section 5 : Protein Data Bank (PDB)
Example 5-2 : Extract sequence chains from PDB file
Take a look at a PDB file
Can be very lengthy
PDB files are composed of lines of 80 columns that begin with one of several predefined record names and end with a newline.
Let's start with extracting the amino acid sequence data.
To extract the amino acid primary sequence information, you need to parse the record type SEQRES
The SEQRES record type is one of four record types in the Primary Structure Section
It represents the primary structure of the peptide or nucleotide sequence
Here is a SEQRES line from the PDB file:
SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL
Section 5 : PDB
Example 5-2
#!/usr/bin/perl
# Extract sequence chains from PDB file

use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

# Read in PDB file: Warning - some files are very large!
my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);

# Extract the amino acid sequences of all chains in the protein
my @chains = extractSEQRES( $recordtypes{'SEQRES'} );

# Translate the 3-character codes to 1-character codes, and print
foreach my $chain (@chains) {
    print "****chain $chain **** \n";
    print "$chain\n";
    print iub3to1($chain), "\n";
}

exit;

This Perl code incorporates three subroutines:
1. parsePDBrecordtypes() takes an array and returns a hash of key/value pairs.
2. extractSEQRES() takes a scalar containing the SEQRES lines from the hash and returns an array containing the chains of the sequence.
3. iub3to1() changes a string of 3-character IUB amino acid codes (whitespace separated) into a string of 1-character amino acid codes.
Because of the size of PDB files, memory limitations might be a problem. The use of memory can be lessened by not saving the results of reading in the file, but instead passing the file data directly to the parsePDBrecordtypes subroutine:
# Get the file data and parse the record types of the PDB file
%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));
Further savings of memory are possible by rewriting the program to read the file one line at a time while parsing the data into the record types.
Output from Example 5-2:
SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU THR LEU VAL LYS LEU GLU
SGGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVMNSTLKGDNGWETEQRSTNFTLSAGQYFEITLSYDINKFYIDILDGPNLEFPNRYSKEFLPFLSLAGDARLTLVKLE
Section 5 : PDB
Example 5-2
# parsePDBrecordtypes
#
#-given an array of a PDB file, return a hash with
#   keys = record type names
#   values = scalar containing lines for that record type
sub parsePDBrecordtypes {
    my @file = @_;

    use strict;
    use warnings;

    my %recordtypes = ( );

    foreach my $line (@file) {
        # Get the record type name which begins at the
        # start of the line and ends at the first space
        # The pattern (\S+) is returned and saved in $recordtype
        my($recordtype) = ($line =~ /^(\S+)/);

        # .= fails if a key is undefined, so we have to
        # test for definition and use either .= or = depending
        if(defined $recordtypes{$recordtype} ) {
            $recordtypes{$recordtype} .= $line;
        }else{
            $recordtypes{$recordtype} = $line;
        }
    }
    return %recordtypes;
}

parsePDBrecordtypes() parses the PDB record types from an array containing the lines of the PDB record. Follow the comments which describe what's happening. Basically, each line is examined for its record type and is then added to the value of a hash entry with the record type as the key. In the code, the regular expression /^(\S+)/ matches any word at the beginning of the line. Remember that \S is interpreted as any non-whitespace character. The enclosing parentheses capture the matched text, which is saved in $recordtype. The lines that match a given key are associated with that key:
key => value
COMPND => COMPND MOL_ID: 1;\nCOMPND 2 MOLECULE: CONGERIN I;\nCOMPND 3 CHAIN: A;\nCOMPND 4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;\nCOMPND 5 BIOLOGICAL_UNIT: HOMODIMER\n
Every line in the array with the same record type is concatenated onto the rest. The hash is returned from the subroutine.
Section 5 : PDB
Example 5-2
# extractSEQRES
#
#-given a scalar containing SEQRES lines,
#   return an array containing the chains of the sequence
sub extractSEQRES {
    use strict;
    use warnings;

    my($seqres) = @_;

    my $lastchain = '';
    my $sequence = '';
    my @results = ( );

    # make array of lines
    my @record = split ( /\n/, $seqres);

    foreach my $line (@record) {
        # Chain is in column 12, residues start in column 20
        my ($thischain) = substr($line, 11, 1);
        my($residues) = substr($line, 19, 52); # add space at end

        # Check if a new chain, or continuation of previous chain
        if("$lastchain" eq "") {
            $sequence = $residues;
        }elsif("$thischain" eq "$lastchain") {
            $sequence .= $residues;

        # Finish gathering previous chain (unless first record)
        }elsif ( $sequence ) {
            push(@results, $sequence);
            $sequence = $residues;
        }
        $lastchain = $thischain;
    }

    # save last chain
    push(@results, $sequence);

    return @results;
}
Let's examine the subroutine extractSEQRES. The record types have been parsed out, and now we extract the primary amino acid sequence. We need to extract each chain separately and return an array of one or more strings of sequence corresponding to those chains, instead of just one sequence. When passed to the subroutine, the variable $seqres is assigned the required SEQRES record type, which stretches over several lines, in a scalar string that is the value of the key SEQRES in a hash. Here we use the same approach as in the previous parsePDBrecordtypes subroutine, which used iteration over lines (as opposed to regular expressions over multiline strings). The Perl function split enables us to turn a multiline string into an array. As we iterate through the lines of the SEQRES record type, we notice when a new chain is starting, save the previous chain in @results, reset the $sequence string, and reset the $lastchain flag to the new chain. Also, when done with all the lines, we make sure to save the last sequence chain in the @results array.
Section 5 : PDB
Example 5-2
# iub3to1
#-change string of 3-character IUB amino acid codes (whitespace separated)
#   into a string of 1-character amino acid codes
sub iub3to1 {
    my($input) = @_;

    my %three2one = (
        'ALA' => 'A',
        'VAL' => 'V',
        'LEU' => 'L',
        'ILE' => 'I',
        'PRO' => 'P',
        'TRP' => 'W',
        'PHE' => 'F',
        'MET' => 'M',
        'GLY' => 'G',
        'SER' => 'S',
        'THR' => 'T',
        'TYR' => 'Y',
        'CYS' => 'C',
        'ASN' => 'N',
        'GLN' => 'Q',
        'LYS' => 'K',
        'ARG' => 'R',
        'HIS' => 'H',
        'ASP' => 'D',
        'GLU' => 'E',
    );

    # clean up the input
    $input =~ s/\n/ /g;

    my $seq = '';

    # This use of split separates on any contiguous whitespace
    my @code3 = split(' ', $input);

    foreach my $code (@code3) {
        # A little error checking
        if(not defined $three2one{$code}) {
            print "Code $code not defined\n";
            next;
        }
        $seq .= $three2one{$code};
    }
    return $seq;
}
The subroutine iub3to1 translates the three-character codes, in which the PDB sequence information is written, into one-character codes. The hash is defined within the subroutine. The Perl function split is, again, used to create an array from a string. Now the foreach loop simply looks up each array entry among the hash keys. The single character from the translation is added to the end of our sequence.
Section 5 : PDB
Exercises for Section 5
1. Write a recursive subroutine to list a filesystem. Be sure to check if an entry is a file or folder.
2. Write a recursive subroutine to determine the size of an array. You may want to use the pop or unshift functions.
3. Write a recursive subroutine that extracts the primary amino acid sequence from the SEQRES record type of a PDB file.
4. Given an atom and a distance, find all other atoms in a PDB file that are within that distance of the atom.
Section 6 : Basic Local Alignment Search Tool (BLAST)
Search for sequence similarity is very important
BLAST is one of the popular software tools in biological research
BLAST tests a query sequence against a library of known sequences in order to find similarity
A collection of programs with versions for query-to-database pairs:
  nucleotide-nucleotide
  protein-nucleotide
  protein-protein
  nucleotide-protein
Goal of this section is to write Perl code to parse a BLAST output file using regular expressions
The code is basic and efficient, which may lead to more extensive algorithms
Online documentation for BLAST is extensive
Here we are interested in parsing the BLAST file, rather than the theory
Section 6 : Basic Local Alignment Search Tool (BLAST)
Example 6-1 : Extracting annotations and alignments
BLAST file used in this section, blast.txt, created from a BLAST query using the file sample.dna in Sections 3 and 4
Introduce two new subroutines, parse_blast and parse_blast_alignment
Use regular expressions to extract the various bits of data from a scalar string
The main program does no more than call the parsing subroutine and print the results. The arguments, initialized as empty, are passed by reference, using a preceding \ in the subroutine call.
The subroutine parse_blast does the parsing job of separating the three sections of a BLAST output file:
1. the annotation at the beginning,
2. the alignments in the middle,
3. the annotation at the end.
It then calls the parse_blast_alignment subroutine to extract the individual alignments from the middle alignment section. The data is first read in from the named file with the get_file_data subroutine, using the join function to store the array of file data in a scalar string.
The pattern match contains three parenthesized expressions:
1. (.*^ALIGNMENTS\n) is saved in $$beginning_annotation
2. (.*) is saved in $alignment_section
3. (^  Database:.*) is saved in $$ending_annotation
Let's walk through what is happening. The first parenthesized expression matches everything up to and including the word ALIGNMENTS followed by an end-of-line. The second, (.*), collects everything after that, which becomes $alignment_section. The third, (^  Database:.*), matches a line that begins with two spaces and the word Database, followed by the rest of the file. These are the three desired parts of the BLAST output file: the beginning annotation, the alignment section, and the ending annotation.
The last statement fills the hash %alignments by calling the parse_blast_alignment subroutine, which takes one scalar string argument, the section that contains the alignments.
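The three-part split can be tried on a small in-memory string. The sketch below is an illustration under stated assumptions, not the workshop's parse_blast: the helper name split_blast_sections is hypothetical, and the BLAST-like text is invented toy data rather than real BLAST output. The two annotations are returned through scalar references, as described above.

```perl
use strict;
use warnings;

# Invented toy text that only mimics the layout of BLAST output.
my $blast_output = join '',
    "BLASTN toy header\n",
    "Sequences producing significant alignments:\n",
    "ALIGNMENTS\n",
    ">gb|AAA111.1| first toy hit\n",
    " Score = 100 bits\n",
    ">gb|BBB222.1| second toy hit\n",
    " Score = 50 bits\n",
    "  Database: toy_db\n",
    "  Number of sequences: 2\n";

# Hypothetical helper: fills the two referenced scalars and
# returns the middle (alignment) section.
sub split_blast_sections {
    my ($beginning_ref, $ending_ref, $blast_text) = @_;
    my $alignment_section;

    # /m lets ^ match at the start of each line;
    # /s lets . match newlines, so .* can span many lines.
    ($$beginning_ref, $alignment_section, $$ending_ref)
        = ($blast_text =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);

    return $alignment_section;
}

my $beginning_annotation = '';
my $ending_annotation    = '';
my $alignment_section
    = split_blast_sections(\$beginning_annotation, \$ending_annotation, $blast_output);

print "BEGINNING:\n$beginning_annotation\n";
print "ALIGNMENTS:\n$alignment_section\n";
print "ENDING:\n$ending_annotation\n";
```

Passing \$beginning_annotation rather than $beginning_annotation lets the helper assign to the caller's variable through $$beginning_ref.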
The subroutine parse_blast_alignment has one important loop, a while loop around a pattern match. Remember that + means one or more matches, the /g modifier repeats the match throughout the string, and the /m modifier lets ^ match at the beginning of each line. Each time the program cycles through the loop, the pattern match finds the value (the entire alignment) and then determines the key. The keys and values are saved in the hash %alignment_hash.
The regular expression ^>.*\n looks for > at the beginning of a BLAST output line, followed by .*, so it matches the first line of the alignment. The rest of the regular expression, (^(?!>).*\n)+, begins each line with a negative lookahead assertion, (?!>), which ensures that the line does not start with >. The .* matches all non-newline characters, up to the final \n at the end of the line. The surrounding + matches one or more such lines, so the match runs to the end of the alignment.
The call split(/\|/, $value) splits $value into pieces delimited by | characters. That is, the | symbol is used to determine where one list element ends and the next one begins. This function creates a list of the sections of the string that lie before and after the vertical bars. Surrounding the call to split with parentheses and adding an array subscript, [1], selects the second element of that list, which is saved in $key.
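A minimal sketch of this loop, using the regular expression and the split call quoted above, can be run on toy data. The alignment text below is invented for illustration, and the subroutine body is a reconstruction from the description rather than a copy of the code in BeginPerlBioinfo.pm.

```perl
use strict;
use warnings;

# Invented toy alignment section; each alignment starts with a > line.
my $alignment_section = join '',
    ">gb|AAA111.1| first toy hit\n",
    "          Length = 30\n",
    " Score = 100 bits\n",
    ">gb|BBB222.1| second toy hit\n",
    " Score = 50 bits\n";

sub parse_blast_alignment {
    my ($section) = @_;
    my %alignment_hash;

    # One > line, then one or more lines that do not start with >
    while ($section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        my $value = $&;    # the whole alignment just matched

        # second field between the | characters is the key
        my $key = (split(/\|/, $value))[1];

        $alignment_hash{$key} = $value;
    }
    return %alignment_hash;
}

my %alignments = parse_blast_alignment($alignment_section);

foreach my $key (sort keys %alignments) {
    print "$key\n$alignments{$key}\n";
}
```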
Here is our main routine, which merely calls our two subroutines and creates the output. The data are passed by reference to the parse_blast subroutine. Because the subroutines discussed above have been added to BeginPerlBioinfo.pm, this extremely short routine is all we need to run our code.
Section 6 : Basic Local Alignment Search Tool (BLAST)
Example 6-2 : Parse alignments from BLAST output file
Taking Example 6-1 a little further, some of the alignments include more than one aligned string
To parse each alignment, we have to parse out each of the matched strings, which are called high-scoring pairs (HSPs)
Parse each HSP into annotation, query string, and subject string, together with the starting and ending positions of the strings
We include a pair of subroutines:
    one to parse the alignments into their HSPs
    the second to extract the sequences and their start-end positions
The subroutine parse_blast_alignment_HSP takes one of the alignments from the BLAST output and separates out the individual HSP string matches. The first regular expression,
    ($beginning_annotation, $HSP_section ) = ($alignment =~ /(.*?)(^ Score =.*)/ms);
parses out the annotation and the section containing the HSPs. The first parenthesized expression is (.*?). This is minimal matching, which gathers up everything before the first line that begins Score = (without the ? after the *, it would gather everything until the final line that begins Score =). This is exactly what we want: a division between the beginning annotation and the HSP string matches.
The while loop and its regular expression separate the individual HSP string matches:
    while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);
    }
This is the same kind of global string match in a while loop that we saw in Example 6-1; it keeps iterating as long as a match can be found. The other modifier, /m, is the multiline modifier, which enables the metacharacters $ and ^ to match before and after embedded newlines.
The expression within the first pair of parentheses, (^ Score =.*\n), matches a line that begins Score =, which is the beginning of an HSP string match section. The expression within the second pair of parentheses, (^(?! Score =).*\n)+, matches one or more (the +) lines that do not begin with Score =. The ?! at the beginning of the embedded parentheses is the negative lookahead assertion we saw in Example 6-1. So, simply, the regular expression captures a line beginning with Score = and all the following adjacent lines that don't begin with Score =.
Remember, the special variable $& holds the text of the last successful pattern match. Pushing it onto @HSPs on each iteration creates an array of all the HSPs listed in the alignment.
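The two regular expressions quoted above can be exercised on a small alignment. The following is a sketch only: the alignment text is invented toy data, and the subroutine is assembled from the two statements shown in the text, so details may differ from the version installed in BeginPerlBioinfo.pm.

```perl
use strict;
use warnings;

# Invented toy alignment containing two HSPs.
my $alignment = join '',
    ">gb|AAA111.1| toy hit\n",
    "          Length = 30\n",
    "\n",
    " Score = 100 bits (50), Expect = 1e-20\n",
    "Query: 1 acgt 4\n",
    "Sbjct: 11 acgt 14\n",
    "\n",
    " Score = 40 bits (20), Expect = 0.5\n",
    "Query: 7 ggcc 10\n",
    "Sbjct: 21 ggcc 24\n";

sub parse_blast_alignment_HSP {
    my ($alignment) = @_;
    my @HSPs;

    # Minimal match (.*?) stops at the FIRST line starting " Score ="
    my ($beginning_annotation, $HSP_section)
        = ($alignment =~ /(.*?)(^ Score =.*)/ms);

    # Each HSP: a " Score =" line plus the following lines
    # that do not themselves start with " Score ="
    while ($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);
    }
    return @HSPs;
}

my @HSPs = parse_blast_alignment_HSP($alignment);

print "Found ", scalar(@HSPs), " HSPs\n";
print "-----\n$_" for @HSPs;
```

Replacing (.*?) with (.*) in the first expression would leave only the last HSP in $HSP_section, which is a quick way to see the difference between minimal and greedy matching.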
The subroutine extract_HSP_information parses the HSP information and returns the parsed values to the main program. As an exercise, explain how the regular expressions parse the information. Remember some details about REs:
1. \S+ matches one or more non-whitespace characters
2. ( ) saves the enclosed match in a special variable, e.g. $1
3. \d+ matches one or more digits; \D matches a nondigit
Try running the program and look at the output to see what values were created.
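As a warm-up for the exercise, the three RE details can be tried on a single toy HSP. This sketch is not extract_HSP_information itself: the HSP text is invented, it has only one Query and one Sbjct line, and the patterns are illustrative assumptions rather than the ones used in the workshop code.

```perl
use strict;
use warnings;

# Invented toy HSP with one Query line and one Sbjct line.
my $hsp = join '',
    " Score = 100 bits (50), Expect = 1e-20\n",
    "Query: 1 acgt 4\n",
    "Sbjct: 11 acgt 14\n";

# \S+ grabs the run of non-whitespace characters after "Expect = "
my ($expect) = ($hsp =~ /Expect = (\S+)/);

# \d+ captures the start and end positions, \S+ the sequence text;
# each pair of parentheses returns one captured value.
my ($query_start, $query_string, $query_end)
    = ($hsp =~ /^Query:\s+(\d+)\s+(\S+)\s+(\d+)/m);
my ($subject_start, $subject_string, $subject_end)
    = ($hsp =~ /^Sbjct:\s+(\d+)\s+(\S+)\s+(\d+)/m);

my $query_range   = "$query_start..$query_end";
my $subject_range = "$subject_start..$subject_end";

print "Expect:  $expect\n";
print "Query:   $query_string ($query_range)\n";
print "Subject: $subject_string ($subject_range)\n";
```

A real HSP usually spans several Query and Sbjct lines, so a complete solution also has to concatenate the sequence pieces and take the first start and last end position.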
#!/usr/bin/perl
# Example 6-2 Parse alignments from BLAST output file

use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $alignment = '';
my $filename = 'blast.txt';
my @HSPs = ( );
my($expect, $query, $query_range, $subject, $subject_range) = ('','','','','');

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

$alignment = $alignments{'AK017941.1'};

@HSPs = parse_blast_alignment_HSP($alignment);

($expect, $query, $query_range, $subject, $subject_range) = extract_HSP_information($HSPs[1]);

# Print the results
print "\n-> Expect value: $expect\n";
print "\n-> Query string: $query\n";
print "\n-> Query range: $query_range\n";
print "\n-> Subject String: $subject\n";
print "\n-> Subject range: $subject_range\n";

exit;
In this example, the key is coded into the program:
    $alignment = $alignments{'AK017941.1'};
How could we change the code to look for all possible keys? Again, we have installed the two subroutines in BeginPerlBioinfo.pm. Here is the output of running this succinct code:

-> Expect value: 5e-52

-> Query string: ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaat
   ggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
   cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca

-> Query range: 235..400

-> Subject String: ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgaga
   atgggctgagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatg
   attggacttgtgacaactgcaacgagtggttccatggagactgcatccggatca

-> Subject range: 1048..1213
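One possible answer to the question about looking at all keys is to loop over the hash with keys instead of naming one accession. The sketch below uses an invented toy %alignments hash so that it runs on its own; in the real program the hash would be filled by parse_blast, and the loop body would call parse_blast_alignment_HSP and extract_HSP_information for each entry.

```perl
use strict;
use warnings;

# Invented stand-in for the hash that parse_blast would fill.
my %alignments = (
    'AAA111.1' => ">gb|AAA111.1| first toy hit\n Score = 100 bits\n",
    'BBB222.1' => ">gb|BBB222.1| second toy hit\n Score = 50 bits\n",
);

my @report;

# Visit every alignment instead of one hard-coded key.
foreach my $key (sort keys %alignments) {
    # Without /s, . stops at the newline, so this captures line one only.
    my ($first_line) = ($alignments{$key} =~ /\A(.*)\n/);
    push @report, "$key: $first_line";
}

print "$_\n" for @report;
```

Sorting the keys is optional; it only makes the order of the report repeatable, since Perl hashes have no fixed key order.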
References
Perl Programming for Biologists, Curtis D. Jamison, John Wiley & Sons, Inc., 2003
Beginning Perl for Bioinformatics, James Tisdall, O'Reilly Pub., 2001 [****], very much recommended
Mastering Perl for Bioinformatics, James Tisdall, O'Reilly Pub., 2003