Using Perl for Bioinformatics
Science and Technology Support Group High Performance Computing Ohio Supercomputer Center 1224 Kinnear Road Columbus, OH 43212-1163
Table of Contents
Section 1
  Concatenate sequences
  Transcribe DNA to RNA
  Reverse complement of sequences
  Read sequence data from files
  Searching for motifs in DNA or proteins
  Exercises 1
Section 2
  Subroutines
  Mutations and Randomization
  Translating DNA into Proteins using Libraries of Subroutines
  BioPerl Modules
  Exercises 2
Section 3
  Read FASTA Files
  Exercises 3
Section 4
  GenBank Files and Libraries
  Exercises 4
Section 5
  PDB
  Exercises 5
Section 6
  Blast Files
  Exercises 6
Section 1 : Sequences and Regular Expressions

Example 1-1 : Concatenation of two strings of DNA
Concatenating two DNA sequences held in two Perl variables.
  Two character sequences are assigned to scalar variables.
  The two sequences are used to create a third variable.
  The third variable holds the concatenated sequence, built with the dot operator (.).
Use the print command to print the concatenated sequence to stdout.
  Example 1-1 prints the concatenated sequence in several different ways.
  Note the use of the newline character, \n.

Example 1-1
#!/usr/bin/perl -w
# Example 1-1 Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2
$DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
$DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

# Print the DNA onto the screen
print "Here are the original two DNA fragments:\n\n";
print $DNA1, "\n";
print $DNA2, "\n\n";

# Concatenate the DNA fragments into a third variable and print them
# Using "string interpolation"
$DNA3 = "$DNA1$DNA2";
print "Here is the concatenation of the first two fragments (version 1):\n\n";
print "$DNA3\n\n";

# An alternative way using the "dot operator":
# Concatenate the DNA fragments into a third variable and print them
$DNA3 = $DNA1 . $DNA2;
print "Here is the concatenation of the first two fragments (version 2):\n\n";
print "$DNA3\n\n";

# Print the same thing without using the variable $DNA3
print "Here is the concatenation of the first two fragments (version 3):\n\n";
print $DNA1, $DNA2, "\n";

exit;
Note that the two different assignments to $DNA3 achieve the same result:
1. $DNA3 = "$DNA1$DNA2";
2. $DNA3 = $DNA1 . $DNA2;

Results of running example 1-1:
Here are the original two DNA fragments:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC
ATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 1):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 2):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Here is the concatenation of the first two fragments (version 3):

ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

Example 1-2 : Transcribing DNA to RNA
Replacing all thymine with uracil in the DNA.
  Replace all the T characters in the string with U.
  Use the binding operator, =~.
  Use a global regular expression substitution, s/T/U/g.

Example 1-2
#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting
# all Ts with Us.
$RNA = $DNA;
$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";

# Exit the program.
exit;
1. $RNA = $DNA; copies the string held in $DNA into the variable $RNA.
2. $RNA =~ s/T/U/g; substitutes every uppercase T in $RNA with an uppercase U.

Results of running example 1-2:
Here is the starting DNA:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the result of transcribing the DNA to RNA:

ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC

Example 1-3 : Calculating the Reverse Complement of a DNA strand
Find the reverse of the DNA string.
Calculate the complement of the reversed string.
  Substitute each base with its complement.
  A -> T; T -> A; C -> G; G -> C.
We could use the substitution operator of regular expressions:

  $var =~ s/A/T/g;
  $var =~ s/T/A/g;
  $var =~ s/C/G/g;
  $var =~ s/G/C/g;

This would give the wrong result! The first substitution turns every A into T, and the second then turns every T, including those that were originally A, into A. Fortunately Perl has a transliteration operator, tr///, which replaces all the characters in a single pass.
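The failure of the chained substitutions is easy to see on a short string. The following is a minimal sketch; the sequence AACCGGTT and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'AACCGGTT';

# Chained substitutions: each s///g works on the output of the previous one.
my $chained = $dna;
$chained =~ s/A/T/g;    # TTCCGGTT
$chained =~ s/T/A/g;    # AACCGGAA  (the original As are back, the Ts are lost)
$chained =~ s/C/G/g;    # AAGGGGAA
$chained =~ s/G/C/g;    # AACCCCAA

# Transliteration: every character is mapped exactly once, in a single pass.
(my $complement = $dna) =~ tr/ACGT/TGCA/;    # TTGGCCAA

print "original     : $dna\n";
print "chained s/// : $chained\n";
print "tr///        : $complement\n";
```

Only the tr/// version yields the true complement; the chained version ends up containing nothing but A and C.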

Example 1-3
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";

# Make a new, reversed copy of the DNA
$revcom = reverse $DNA;

# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;

# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";

exit;
Note that the transliteration operator, tr///, replaces each character in the first list with the corresponding character in the second list. In this example both uppercase and lowercase bases are translated.

Results of running example 1-3:
Here is the starting DNA:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the reverse complement DNA:

GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT

Example 1-4 : Reading protein sequences from a file
Use open.
  Store the filename in a character string variable.
  open(FILEPOINTER, $filename);
Read in the contents.
  Use angle brackets, <FILEPOINTER>.
  Need to create a loop to read in all lines.
Read from a file named on the command line.
  Use angle brackets, <>.
  Do not need to create a filehandle.
  Read into an array.
  Need to create a loop to process all lines of the array.
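The open call outlined above does not report a failure to open the file. As a hedged sketch of one widely used alternative, the code below uses the three-argument form of open with a lexical filehandle and dies with the system error if the open fails. The temporary file and its two lines of sequence are made up so that the sketch is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so the sketch can run anywhere.
# (Its contents are invented for illustration.)
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print $tmp_fh "MNIDDKLEGL\nFLKCGGIDEM\n";
close $tmp_fh;

# Three-argument open with a lexical filehandle; $! holds the reason
# for the failure if the file cannot be opened.
open(my $fh, '<', $filename)
    or die "Cannot open file \"$filename\": $!\n";

my $protein = '';
while (my $line = <$fh>) {
    chomp $line;          # remove the trailing newline
    $protein .= $line;    # append the line to one long string
}
close $fh;

print "$protein\n";
```

The loop is the same line-at-a-time read described in the list above; only the way the file is opened differs.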

Example 1-4
#!/usr/bin/perl -w
$longprotein = '';
# Example 4-5 Reading protein sequence data from a file
# Usage: perl example1-4.pl

# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';

# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename);

# Now we do the actual reading of the protein sequence from the
# file by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
while ($protein = <PROTEINFILE>) {
    $longprotein .= $protein;
}

# Now that we've got our data, we can close the file.
close PROTEINFILE;

# Print the protein onto the screen
print "Here is the protein:\n\n";
print $longprotein;

exit;

The filename is set by assigning the string variable $proteinfilename. The while loop reads from the file one line at a time. Each line from the file is concatenated onto the end of the previous string. It is good programming practice to close the filehandle when done. Note that the newlines are kept, so each line of the file appears on its own line in the output.

Results of running example 1-4:
Here is the protein:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

Example 1-4b
#!/usr/bin/perl -w
$longprotein = '';
# Example 4-5 Reading protein sequence data from a file
# Usage: perl example1-4b.pl filename

# The filename of the file containing the protein sequence data
# is in the command line. The '<>' is a shortcut for <ARGV>.
# <ARGV> treats the @ARGV array as a list of
# filenames, returning the contents
# of those files one line at a time. The contents of those files are
# available to the program, using the angle brackets <>,
# without a filehandle.
@data_from_file = <>;

# Using the foreach loop, we access the data from the array,
# one line at a time. Removing the 'newline' from the string,
# concatenate to the string variable, making one long protein
# string. (chomp removes only a trailing newline; chop would remove
# the last character whatever it is.)
foreach (@data_from_file) {
    chomp $_;
    $longprotein .= $_;
}

# Print the protein onto the screen
print "Here is the protein:\n";
print $longprotein."\n";

exit;

The filename is given as an argument on the command line. This is much more convenient than writing a different Perl script for each file we need to open. The command @data_from_file = <>; treats each argument on the command line as a filename, opens each file, and then reads each line of the file into the array. Creating a filehandle is not needed. The foreach loop then retrieves each element of the array, discards the newline at the end, and concatenates the string onto the end of the string variable $longprotein.

Results of running example 1-4b:
Here is the protein:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQDSVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQGLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

Example 1-5 : Searching for motifs in DNA or proteins
Prompt the user for the filename and the motif strings.
  Specify a filename to open.
  open(FILEPOINTER, $filename);
Read in the contents.
  Read the lines of the file into an array.
  Concatenate all lines of the array into a scalar variable.
  Remove all newlines and blanks from the scalar variable.
Compare the motif entered from the terminal to the protein string.
  Use regular expression comparison.
  Exit the program when the motif contains only whitespace.

Example 1-5
#!/usr/bin/perl -w
# Example 5-3 Searching for motifs

# Ask the user for the filename of the file containing
# the protein sequence data, and collect it from the keyboard
print "Please type the filename of the protein sequence data: ";
$proteinfilename = <STDIN>;

# Remove the newline from the protein filename
chomp $proteinfilename;

# open the file, or exit
unless ( open(PROTEINFILE, $proteinfilename) ) {
    print "Cannot open file \"$proteinfilename\"\n\n";
    exit;
}

# Read the protein sequence data from the file, and store it
# into the array variable @protein
@protein = <PROTEINFILE>;

# Close the file - we've read all the data into @protein now.
close PROTEINFILE;

# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$protein = join( '', @protein);

# Remove whitespace
$protein =~ s/\s//g;

The filename is read from standard input in answer to the prompt:
  $proteinfilename = <STDIN>;
The unless condition attempts to open the file, exiting if it cannot be opened:
  unless ( open(PROTEINFILE, $proteinfilename) )
Each line of the file is then put into an array, @protein, after which the filehandle is closed:
  @protein = <PROTEINFILE>;
By using join, the lines in the array are put into one long character string, including the newline characters:
  $protein = join( '', @protein);
All whitespace, including newlines, tabs and blanks, is then removed:
  $protein =~ s/\s//g;

Example 1-5 (contd)
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";
    $motif = <STDIN>;

    # Remove the newline at the end of $motif
    chomp $motif;

    # Look for the motif
    if ( $protein =~ /$motif/ ) {
        print "I found it!\n\n";
    } else {
        print "I couldn\'t find it.\n\n";
    }

# exit on an empty user input
} until ( $motif =~ /^\s*$/ );

# exit the program
exit;

The loop controls the search for the character string in the entire protein string. The variable $motif is assigned the character string typed in the shell:
  $motif = <STDIN>;
The newline character is removed from the end of the string:
  chomp $motif;
The character string $motif is compared to the protein string for a match:
  $protein =~ /$motif/
When the user types nothing but whitespace, the program exits:
  until ( $motif =~ /^\s*$/ );

Results from running example1-5.pl:
Please type the filename of the protein sequence data: NM_021964fragment.pep
Enter a motif to search for: SVLQ
I found it!

Enter a motif to search for: sqlv
I couldn't find it.

Enter a motif to search for: QDSV
I found it!

Enter a motif to search for: HERLPQGLQ
I found it!

Enter a motif to search for:
I couldn't find it.
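Because the motif is interpolated into the pattern, anything the user types is interpreted as a regular expression, not as plain text. The sketch below contrasts the two readings of the same input; the short protein string is the first 20 residues of the sequence shown in Example 1-4, and the motif K.E is an arbitrary choice for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First residues of the protein used in Example 1-4.
my $protein = 'MNIDDKLEGLFLKCGGIDEM';

# The motif as a user might type it.
my $motif = 'K.E';

# Interpolated directly, '.' is a metacharacter that matches any residue,
# so the pattern K.E matches the substring KLE.
my $regex_hit = ($protein =~ /$motif/) ? 'yes' : 'no';

# Between \Q and \E metacharacters are escaped, so only the literal
# three characters "K.E" would match.
my $literal_hit = ($protein =~ /\Q$motif\E/) ? 'yes' : 'no';

print "as a regular expression: $regex_hit\n";     # yes
print "as a literal string:     $literal_hit\n";   # no
```

Treating the input as a regular expression is what makes motif searches flexible; the \Q...\E form is only needed when the typed text should be matched character for character.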

Exercises for Section 1
1. Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end of any statement of one of our working programs and examining the error messages that result, if any. Try changing other syntactical items: add a parenthesis or a curly brace; misspell some command, like "print" or some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even after getting to know the language well, it is still common to have some syntax errors as you gradually add code to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the line where the error is?
2. Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write another that prints the DNA in uppercase (ACGT). Use the function tr///.
3. Do the same thing as Exercise 2, but use the string directives \U and \L for upper- and lowercase. For instance, print "\U$DNA" prints the data in $DNA in uppercase.
4. Prompt the user to enter two (short) strings of DNA. Concatenate the two strings of DNA by appending the second to the first using the .= assignment operator. Print the two strings as concatenated, and then print the second string lined up over its copy at the end of the concatenated strings. For example, if the input strings are AAAA and TTTT, print:
   AAAATTTT
       TTTT
5. Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr functions. Use the substr function, and examine each base one at a time in the original while you build up the reverse complement. (Hint: you might find it easier to examine the original right to left, rather than left to right, although either is possible.)
6. Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C in the DNA.)
7. Modify Example 1-5 to not only find motifs by regular expressions but to print out the motif that was found. For example, if you search, using regular expressions, for the motif EE.*EE, your program should print EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set to hold the pattern that was matched.
8. Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl functions substr or slice.)
Section 2 : Mutations, Randomization and Modules

Example 2-1 : Counting bases in a DNA string, using subroutines
Subroutines are very efficient
  Write once, use many times.
  Routines which have a pervasive utility may be stored in a library for future use.
Lexical scoping using the my declaration
  It is important to understand the scope of variables.
  Use my to declare variables within the scope of a block of code.
  The same variable names may be used in different code segments.
  Declare use strict to require that variables be declared with my.
Use the special array @_ to pass arguments to a subroutine
  my($var1, $var2, $var3) = @_;
  This assigns the values of the arguments passed to the subroutine to the named variables.
  Mistake of not using @_
    Variables will not have their passed values.
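The scoping and argument-passing points above can be illustrated with a short sketch. The subroutine name lengthen and its arguments are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'ACGT';    # declared at file scope

sub lengthen {
    # Copy the arguments out of @_ into lexical variables.
    # This $dna is a new variable that exists only inside the subroutine.
    my ($dna, $extra) = @_;
    $dna .= $extra;  # changes only the subroutine's own copy
    return $dna;
}

my $longer = lengthen($dna, 'TT');

print "outer \$dna is still $dna\n";    # ACGT
print "returned value is $longer\n";    # ACGTTT
```

Without the my(...) = @_; line, the names $dna and $extra inside the subroutine would not receive the passed values, and under use strict an undeclared $extra would not even compile.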

Example 2-1
#!/usr/bin/perl -w
# Example 2-1 Counting the number of G's in some DNA on the
# command line
use strict;

# Collect the DNA from the arguments on the command line
# when the user calls the program.
# If no arguments are given, print a USAGE statement and exit.
# $0 is a special variable that has the name of the program.
my($USAGE) = "$0 DNA\n\n";

# @ARGV is an array containing all command-line arguments.
#
# If it is empty, the test will fail and the print USAGE and exit
# statements will be called.
unless(@ARGV) {
    print $USAGE;
    exit;
}

# Read in the DNA from the argument on the command line.
my($dna) = $ARGV[0];

The pragma use strict requires every variable to be declared, here with my, which also limits the scope of each variable. A string variable, $USAGE, is declared to hold the usage line. The unless condition makes sure there are arguments on the command line. The special array @ARGV always exists, but it is empty when no arguments are given on the command line. The character string given on the command line is assigned to the variable $dna. The first element of the argument array, in this case the only argument, is $ARGV[0]. Individual elements of an array are referenced by the syntax $array1[n].

Example 2-1 (contd)
# Call the subroutine that does the real work, and collect the result.
my($num_of_Gs) = countG ( $dna );

# Report the result and exit.
print "\nThe DNA $dna has $num_of_Gs G\'s in it!\n\n";
exit;

########################################
# Subroutines for Example 2-1
########################################
sub countG {
    # return a count of the number of G's in the argument $dna
    # initialize arguments and variables
    my($dna) = @_;
    my($count) = 0;

    # Use the tr on the regular expression for
    # counting nucleotides in DNA
    $count = ( $dna =~ tr/Gg//);

    return $count;
}

The subroutine countG takes a character string as an argument and returns a number. The line my($num_of_Gs) = countG($dna); passes the DNA sequence to the subroutine countG and assigns the returned number to the variable $num_of_Gs. Inside the subroutine, a new variable $dna, lexically scoped to the subroutine, is assigned the value passed. The variable $count is initialized to the value 0. The transliteration $dna =~ tr/Gg// has an empty replacement list, so it leaves the string unchanged and returns the number of upper- or lowercase G characters it found. That number is assigned to the variable $count and returned.

Results from running example2-1.pl:
perl example2-1.pl CGGATTTAGCGCGT

The DNA CGGATTTAGCGCGT has 5 G's in it!
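The counting behaviour of tr/// can be checked directly. This is a small sketch using the same DNA string as the run above; the variable $without_g and the use of the /d modifier are additions for contrast, not part of Example 2-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'CGGATTTAGCGCGT';

# Empty replacement list, no /d modifier: nothing is changed and
# tr/// returns the number of characters it matched.
my $count = ($dna =~ tr/Gg//);

# With the /d modifier the matched characters really are deleted.
# Work on a copy so that $dna itself stays intact.
(my $without_g = $dna) =~ tr/Gg//d;

print "count of G : $count\n";        # 5
print "original   : $dna\n";          # CGGATTTAGCGCGT
print "G deleted  : $without_g\n";    # CATTTACCT
```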

Example 2-2 : Creating mutant DNA using Perl's random number generator
Simulate mutating DNA using a random number generator
  Randomly pick a nucleotide position in a DNA string.
  Randomly pick a base from the four: A, C, T, G.
  Replace the nucleotide at the selected position of the DNA string with the randomly selected base.
Random number algorithms only produce pseudo-random numbers
  With the same seed, a random number generator will produce the same series of numbers.
  Algorithms are designed to give an even distribution of values.
Random numbers require a seed
  It should be selected randomly, as well.
  Different seed values will produce different sequences of random numbers.
  If program security and privacy (patient records, for example) are important, you should consult the Perl documentation, and the Math::Random and Math::TrulyRandom modules from CPAN.
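The statement that the same seed reproduces the same series can be tested with a few lines of Perl. In this sketch the helper name draws and the seed value 42 are arbitrary choices made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Seed the generator with $seed, then draw $n pseudo-random
# integers in the range 0..3 and return them as one string.
sub draws {
    my ($seed, $n) = @_;
    srand($seed);
    return join ',', map { int(rand(4)) } 1 .. $n;
}

my $first  = draws(42, 10);
my $second = draws(42, 10);    # same seed again

print "first : $first\n";
print "second: $second\n";
print "The two series are identical.\n" if $first eq $second;
```

This reproducibility is useful for debugging a simulation, and it is also the reason a fixed or guessable seed is unsuitable where security matters.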

Example 2-2
#!/usr/bin/perl -w
# Example 2-2 Mutate DNA
# using a random number generator to randomly select bases to mutate
use strict;
use warnings;

# Declare the variables
# The DNA is chosen to make it easy to see mutations:
my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

# $i is a common name for a counter variable, short for "integer"
my $i;
my $mutant;

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

$mutant = mutate($DNA);

print "\nMutate DNA\n\n";
print "\nHere is the original DNA:\n\n";
print "$DNA\n";
print "\nHere is the mutant DNA:\n\n";
print "$mutant\n";

# Let's put it in a loop and watch that bad boy accumulate mutations:
print "\nHere are 10 more successive mutations:\n\n";
for ($i=0 ; $i < 10 ; ++$i) {
    $mutant = mutate($mutant);
    print "$mutant\n";
}
exit;

This is the main program, which seeds the random number algorithm and calls the subroutine mutate(). The call to srand() uses the seed time|$$, which ORs the current time with the process id so that the seed differs from run to run. This is not a very secure method, but it will do for our purposes. The argument to mutate() is the current DNA string.

Example 2-2 (contd)
########################################################
# Subroutines for Example 2-2
########################################################
# A subroutine to perform a mutation in a string of DNA
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub mutate {
    my($dna) = @_;
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # Pick a random position in the DNA
    my($position) = randomposition($dna);

    # Pick a random nucleotide
    my($newbase) = randomnucleotide(@nucleotides);

    # Insert the random nucleotide into the random position in the DNA
    # The substr arguments mean the following:
    # In the string $dna at position $position change 1 character to
    # the string in $newbase
    substr($dna,$position,1,$newbase);

    return $dna;
}

The subroutine mutate() takes its argument from the special array @_ and assigns it to the variable $dna. The array @nucleotides is initialized with the values which are our nucleotides. The subroutine randomposition() takes the current DNA string and returns a position within the string. The subroutine randomnucleotide() takes our array of bases and returns a randomly selected value. Finally, the Perl function substr() takes the DNA string, the random position, the length of the substring to replace (here it is 1) and the replacement string, and changes $dna in place. The modified string is then returned.

Example 2-2 (contd)
# A subroutine to randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomelement {
    my(@array) = @_;

    # Here the code is written succinctly, rather than as:
    # return $array[int rand scalar @array];
    return $array[rand @array];
}

# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomnucleotide {
    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array.
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

randomnucleotide() passes our array of bases to the function randomelement() and, in turn, returns the randomly chosen nucleotide. randomelement() is given an array and returns a randomly selected element from it. How is this done? rand() expects a scalar argument, so the array @array is evaluated in scalar context, which gives the size of @array. Perl was designed to take as an array subscript the integer part of a floating-point value. Here $array[rand @array] returns the element whose subscript is randomly chosen from 0 to n-1, where n is the length of the array.
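Both facts that randomelement() relies on, the scalar-context size of an array and the truncation of a fractional subscript, can be seen in isolation. The following is a minimal sketch; the fixed seed 1 and the 1000 draws are arbitrary choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @nucleotides = ('A', 'C', 'G', 'T');

# In scalar context an array yields its number of elements.
my $size = @nucleotides;
print "size is $size\n";                               # 4

# A fractional subscript is truncated to its integer part,
# so 2.7 selects the same element as 2.
print "element 2.7 is ", $nucleotides[2.7], "\n";      # G

# rand @nucleotides therefore yields a value from 0 up to (but not
# including) 4, and the subscript always lands on 0, 1, 2 or 3.
srand(1);
my %seen;
for (1 .. 1000) {
    my $base = $nucleotides[rand @nucleotides];
    $seen{$base}++;
}
print "$_ was picked $seen{$_} times\n" for sort keys %seen;
```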

Example 2-2 (contd)
# randomposition
#
# A subroutine to randomly select a position in a string.
#
# WARNING: make sure you call srand to seed the
# random number generator before you call this function.
sub randomposition {
    my($string) = @_;

    # Notice the "nested" arguments:
    #
    # $string is the argument to length
    # length($string) is the argument to rand
    # rand(length($string)) is the argument to int
    # int(rand(length($string))) is the argument to return
    #
    # rand returns a decimal number between 0 and its argument.
    # int returns the integer portion of a decimal number.
    #
    # The whole expression returns a random number
    # between 0 and length-1,
    # which is how the positions in a string are numbered in Perl.
    #
    return int rand length $string;
}

randomposition() takes a string argument and calculates a random position within the string. It is very concise and useful. The return statement could have been written:
  return (int (rand (length $string)));
Certainly, this is more understandable, but I believe there is no loss of clarity in the shorter form, since chaining single-argument functions is often done in Perl. rand() takes the length as an argument and returns a floating-point number from 0 up to, but not including, the length. int() truncates the floating-point number to an integer in the range 0 to length-1.

Results from running example2-2.pl:

Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAAAAAAAAAAAAGAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAAAAAAAAAAAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGAAAAAG
CAAAAAAAAAAAAAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAAAAAAACAAGATAAAG
CAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAACAAGATAAAG
GAAAAAAAAAAAGAACAAAAGAAGATAAAG
GAAAAAAAAAAAGAACAAAAGCAGATAAAG

Example 2-3 : Translating DNA into proteins using modules
First transcribe DNA to RNA
Translate RNA to amino acids
  Four bases: A, U, C, G
  A codon is defined by a sequence of three bases
  64 possible combinations, 4^3
  There are only 20 amino acids and a stop
  Redundancy with codons: more than one codon represents most amino acids
  Refer to Table 1 on page ??
Use the subroutine defined in BeginPerlBioinfo.pm
  Specify the module name in the Perl code
  If it is not installed in a known library path, use lib pathname is needed to specify where to find the module
The subroutine codon2aa() returns a single-character amino acid from the 3-character codon input
  Need to write a loop which will grab 3 characters at a time while stepping through the sequence
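The count of 64 = 4^3 codons can be confirmed by enumerating them. This is an illustrative sketch only; the variable names are made up, and the RNA alphabet is used to match the list of bases above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'U');
my @codons;

# Three nested loops: 4 choices for each of the 3 positions.
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, "$first$second$third";
        }
    }
}

print scalar(@codons), " codons\n";          # 64 codons
print join(' ', @codons[0 .. 3]), "\n";      # AAA AAC AAG AAU
```

With 64 codons and only 21 meanings (20 amino acids plus stop), several codons necessarily share the same amino acid.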

Example 2-3
#
# codon2aa
#
# A subroutine to translate a DNA 3-character
# codon to an amino acid
# Using hash lookup
sub codon2aa {
    my($codon) = @_;
    $codon = uc $codon;
    my(%genetic_code) = (
    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '_',    # Stop
    'TAG' => '_',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '_',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic Acid
    'GAT' => 'D',    # Aspartic Acid
    'GAA' => 'E',    # Glutamic Acid
    'GAG' => 'E',    # Glutamic Acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    } else {
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}

This subroutine takes, as an argument, a three-character DNA sequence and returns the single-character representation of the amino acid. The data type used is a hash, which works as a lookup table. The condition if (exists $genetic_code{$codon}) searches for a match between the 3 characters of the codon and the list of keys in the hash. The value associated with the key, if found, is returned. Otherwise an error is reported and the program terminates. This subroutine is included in the module BeginPerlBioinfo.pm, which will be used, with other subroutines, throughout the rest of the workshop.

Example 2-3
#!/usr/bin/perl -w
# Example 2-3 : Translate DNA into protein
use lib '../ModLib/';
use strict;
use warnings;
use BeginPerlBioinfo; # This does not require the .pm in the use command

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}
print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";
exit;

This is the Perl code which, with only a few lines, translates DNA into a protein sequence. The command use lib instructs Perl to add the given directory to the search path for libraries such as BeginPerlBioinfo.pm. BeginPerlBioinfo.pm is a part of the book Beginning Perl for Bioinformatics, by James Tisdall. The for loop steps through the DNA string by threes, starting at index 0:

Index : 0  3  6  9  ...
        CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

The 3-character substring is assigned to the $codon variable by the Perl function substr. Then the amino acid returned by the subroutine codon2aa() is appended to the end of the current protein string, $protein.

Results from running example2-3.pl:
I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

 into the protein

RRLRTGLARVGR
Section 2 : BioPerl and CPAN

Example 2-4 : Installing and testing bioperl
http://bioperl.org
The Bioperl Project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research. The Bioperl server provides an online resource for modules, scripts, and web links for developers of Perl-based software for life science research.
Bioperl modules and documentation are very extensive
  Good examples to illustrate uses
Will discuss installation of bioperl
Also take a quick look at some test scripts
In Chapter 9 of Mastering Perl for Bioinformatics, James Tisdall gives a personal account of installing bioperl.
  Depends on installing using the CPAN shell
  Linux installations vary from site to site, so it is advised that someone with administrator privileges install bioperl
Example 2-4 (contd)
My own experiences were slightly different
  Download the core bioperl install file, version 1.4, the most recent
  Follow the make instructions included in the INSTALL documentation
  Carefully follow the make test instruction
    Make sure you have an internet connection
  Note where the test script fails
    You will see module names like LWP, IO::String, etc.
    I noticed that LWP and IO::String were involved in quite a few failures
  Here is where I installed the missing modules using the CPAN shell
    >> perl -MCPAN -e shell
    At the CPAN prompt, install the missing module
    cpan > install LWP
  After exiting the CPAN shell, run make test again to see if the number of failures has gone down
  After concluding that the remaining failures won't impede using bioperl, use make install
    This usually puts the modules in /usr/lib/perl5/5.x.x/site_perl, on Linux systems
Example bptest0.pl
#!/usr/bin/perl -w
use Bio::Perl;
exit;
######################################################
Example bptest1.pl
#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
write_sequence(">roa1.fasta", 'fasta', $seq_object);
exit;
######################################################
Example bptest2.pl
#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
$blast_result = blast_sequence($seq_object);
write_blast(">roa1.blast", $blast_result);
exit;

These simple tests check whether bioperl is installed correctly. Test bptest0.pl simply checks whether Perl can find Bio::Perl. If it doesn't complain, we are one step closer.
The file bptest1.pl needs internet access. The Perl program retrieves a swissprot sequence and prints it to a file, roa1.fasta, in FASTA format.
The last Perl script uses NCBI to BLAST a sequence and saves the results to a file. This should be used judiciously, as we don't want to abuse the computing cycles of NCBI. These requests should be made for individual searches only. Download the blast package locally to do large numbers of BLAST searches.
Section 2 : Mutations, Randomization and Bioperl

Exercises for Section 2
1. Write a subroutine to concatenate two strings of DNA.
2. Write a subroutine to report the percentage of each nucleotide in DNA. Count the number of each nucleotide, divide by the total length of the DNA, then multiply by 100 to get the percentage. Your arguments should be the DNA and the nucleotide you want to report on. The int function can be used to discard digits after the decimal point, if needed.
3. Write a module that contains subroutines that report various statistics on DNA sequences, for instance length, GC content, presence or absence of poly-T sequences (long stretches of mostly T's at the 5' (left) end of many $DNA sequences), or other measures of interest.
4. Write a program that asks you to pick an amino acid and then keeps (randomly) guessing which amino acid you picked.
5. Write a program to mutate protein sequence, similar to the code in Example 2-2 that mutates DNA.
6. Write a program that uses Bioperl to perform a BLAST search at the NCBI web site, then use Bioperl to parse the BLAST output.
Section 3 : Fasta Files and Frames

There are many different formats for saving sequence data and annotations in files
Perhaps as many as 20 such formats for DNA
Some of the most popular:
FASTA and BLAST (Basic Local Alignment Search Tool), both using the FASTA format
Genetic Sequence Data Bank (GenBank)
European Molecular Biology Laboratory (EMBL)
In this section we will focus on reading FASTA format
Sample of FASTA format:

> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga

Example 3-1: Reading FASTA format and extracting sequence data
Write three subroutines and rely on regular expressions

First subroutine gets data from a file
    Read the filename from the command-line argument = filename
    Open the file; if it can't be opened, print an error message and exit
    Read in the data
    Return an array which contains each line of the file, @data

Second subroutine extracts sequence data from the FASTA file
    Read in the array of file data in FASTA format
    Discard all header, blank and comment lines
    If the first character of a line is >, discard the line
    Read in the rest of the file, joined in a scalar, and edit out non-sequence data (white space)
    Return the sequence

Third subroutine writes the sequence data
    More often than not, the sequence to print is longer than most page widths
    Need to specify a length parameter to control the output


Example 3-1
# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
    my($filename) = @_;
    use strict;
    use warnings;
    # Initialize variables
    my @filedata = ( );
    unless ( open(GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}

get_file_data() takes a string argument, the filename. The unless condition attempts to open the file. If unsuccessful, it prints an error message and exits the program. If the file can be opened, each line of the file is saved, one by one, into the array @filedata. After closing the filehandle, the subroutine returns the array to the main routine.
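For comparison, here is a minimal sketch of the same idea written with a lexical filehandle and the three-argument form of open, which many current Perl style guides prefer. The name read_lines() is hypothetical and chosen for this illustration; it is not in BeginPerlBioinfo.pm. The temporary file exists only so the sketch can run on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# read_lines: hypothetical variant of get_file_data()
# Uses a lexical filehandle and reports the system error in $!
sub read_lines {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";
    my @lines = <$fh>;      # one array element per line
    close $fh;
    return @lines;
}

# Write a tiny throwaway FASTA file so there is something to read
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "> demo header\nacgt\n";
close $tmp_fh;

my @lines = read_lines($tmp_name);
print scalar(@lines), " lines read\n";
```

Unlike get_file_data(), this version calls die, so a calling program could trap the failure with eval instead of exiting.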

Example 3-1
# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    use strict;
    use warnings;
    # Declare and initialize variables
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        # discard blank line
        if ($line =~ /^\s*$/) {
            next;
        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;
        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;
        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}

extract_sequence_from_fasta_data() takes the array that holds the contents of the FASTA file. The foreach loop takes each element of the array, a complete line of the file, and assigns it to the variable $line. The different conditions help us ignore the blank, comment and header lines:
/^\s*$/ matches lines that contain only white space from beginning to end
/^\s*#/ matches lines whose first non-blank character is the pound character, that is, comment lines
/^>/ matches lines that have the greater-than symbol at the beginning of the line, the FASTA header line
All other lines are concatenated together into the $sequence variable
When all is done, all white space characters are removed: $sequence =~ s/\s//g;
The sequence is returned to the calling routine.
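The three patterns can be tried on a few in-memory lines without reading a file. In this sketch, line_kind() is a hypothetical helper that only names the branch a line would take; the sample lines are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# line_kind: hypothetical helper naming the branch each line falls into,
# using the same three patterns as extract_sequence_from_fasta_data()
sub line_kind {
    my ($line) = @_;
    return 'blank'   if $line =~ /^\s*$/;
    return 'comment' if $line =~ /^\s*#/;
    return 'header'  if $line =~ /^>/;
    return 'sequence';
}

# Made-up sample lines, for illustration only
my @sample = ("> sample dna\n", "\n", "  # a comment\n", "agat ggcg\n", "gcgc\n");

my $sequence = '';
foreach my $line (@sample) {
    my $kind = line_kind($line);
    print "$kind\n";
    $sequence .= $line if $kind eq 'sequence';
}
$sequence =~ s/\s//g;    # strip newlines and embedded blanks
print "$sequence\n";
```

The order of the tests matters: the blank-line test comes first, so a line of spaces is never mistaken for sequence data.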

Example 3-1
# print_sequence
#
# A subroutine to format and print sequence data
sub print_sequence {
    my($sequence, $length) = @_;
    use strict;
    use warnings;
    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}

Finally, the print_sequence() routine takes the cleaned string and an integer specifying the number of characters to print per line. Again notice that the variables are assigned from the special array, @_. The formatting is accomplished by the for loop and the substr function: on each pass, the print command writes the next substring of the complete string on a new line.
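The same loop can return the pieces instead of printing them, which makes the behavior at the end of the string easy to inspect. This is only a sketch; chunk_sequence() is a hypothetical name and the subroutine is not in BeginPerlBioinfo.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chunk_sequence: hypothetical companion to print_sequence()
# Returns the sequence cut into pieces of at most $length characters
sub chunk_sequence {
    my ($sequence, $length) = @_;
    die "length must be a positive integer\n" unless $length >= 1;
    my @chunks = ();
    for (my $pos = 0; $pos < length($sequence); $pos += $length) {
        # substr returns a shorter piece when fewer characters remain
        push @chunks, substr($sequence, $pos, $length);
    }
    return @chunks;
}

my @chunks = chunk_sequence('agatggcggcgctgaggggtcttgggggct', 25);
print "$_\n" foreach @chunks;
```

The last piece is simply shorter than the rest, which is why the final line of the printed output can hold fewer than $length characters.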
Now that we have produced the subroutines needed for our program, they have been installed in the BeginPerlBioinfo.pm module. Our program may be written succinctly, as in the main program for this example. The final command prints the sequence, passing the character string and the line length to the print_sequence subroutine.

Output from Example 3-1
agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
Example 3-1
#!/usr/bin/perl
# Read a fasta file and extract the sequence data
use lib '../ModLib/';    # Must point to where BeginPerlBioinfo.pm resides
use strict;
use warnings;
use BeginPerlBioinfo;
# Declare and initialize variables
my @file_data = ( );
my $dna = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Print the sequence in lines 25 characters long
print_sequence($dna, 25);
exit;

Example 3-2: Translate a DNA sequence in all six reading frames
Given a sequence of DNA, it is necessary to examine all six reading frames of the DNA to find the coding regions the cell uses to make proteins
Genes very often occur in pieces that are spliced together during the transcription/translation process
Since the codons are three bases long, the translation happens in three "frames", starting at the first base, or the second, or the third
Each starting place gives a different series of codons and, as a result, a different series of amino acids
The other three frames come from reading the reverse complement strand in the same way
Examine all six reading frames of a DNA sequence and look at the resulting protein translations
Stop codons are definite breaks in the DNA => protein translation process
If a stop codon is reached, the translation stops
We need some code to compute the reverse complement of the DNA
Need to break both strands into the representative frames
Translate each frame of DNA to protein

Example 3-2
# revcom
#
# A subroutine to compute the reverse complement of DNA sequence
sub revcom {
    my($dna) = @_;
    # First reverse the sequence
    my($revcom) = reverse($dna);
    # Next, complement the sequence, dealing with upper and lower case
    # A->T, T->A, C->G, G->C
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
    my($seq, $start, $end) = @_;
    my $protein;
    # To make the subroutine easier to use, you won't need to specify
    # the end point--it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }
    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}

We are going to reuse our old code from Section 1, revcom(). We have to rewrite it as a subroutine.
Now we need to design the subroutine which will break the DNA strings into our frames and translate each string into protein. Our old Perl function substr() should do the trick for taking apart our frames.
The unless($end) condition checks for a value in the variable $end; if there is no value, the end is taken to be the length of the sequence.
Positions are counted from 1 in the arguments but from 0 inside substr(). The length of the desired sequence doesn't change with the change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1
To translate to peptides we revisit our codon2aa() subroutine from Section 2. It is used by the subroutine dna2peptide(), which is already in BeginPerlBioinfo.pm.
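The frame arithmetic can be checked on a short made-up sequence without the codon table. In this sketch, codons_in_frame() is a hypothetical helper that only splits a frame into complete codons; it does not translate them, so dna2peptide() is not needed. The revcom() here uses a scalar assignment so that reverse() reverses the characters of the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of a DNA string
sub revcom {
    my ($dna) = @_;
    my $revcom = reverse $dna;          # scalar context: reverse the string
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

# codons_in_frame: hypothetical helper; $frame is 1, 2 or 3
# Returns the complete codons of the frame, dropping leftover bases
sub codons_in_frame {
    my ($seq, $frame) = @_;
    my $tail   = substr($seq, $frame - 1);   # 1-based frame, 0-based substr
    my @codons = ($tail =~ /(...)/g);
    return @codons;
}

my $dna = 'ATGGCCTAA';                       # made-up example sequence
my $rc  = revcom($dna);
foreach my $frame (1 .. 3) {
    print "frame $frame: ", join(' ', codons_in_frame($dna, $frame)), "\n";
}
foreach my $frame (1 .. 3) {
    print "frame ", $frame + 3, ": ", join(' ', codons_in_frame($rc, $frame)), "\n";
}
```

Frames 4 to 6 are simply frames 1 to 3 of the reverse complement, which is how the main program of Example 3-2 numbers them.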

Example 3-2
#!/usr/bin/perl
# Translate a DNA sequence in all six reading frames
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = translate_frame($dna, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";
$protein = translate_frame($dna, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";
$protein = translate_frame($dna, 3);
print_sequence($protein, 70);
# Calculate reverse complement
$revcom = revcom($dna);
print "\n -------Reading Frame 4--------\n\n";
$protein = translate_frame($revcom, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";
$protein = translate_frame($revcom, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 6--------\n\n";
$protein = translate_frame($revcom, 3);
print_sequence($protein, 70);
exit;
Now that we have done all that work, and it appears that our subroutines provide the function we need, these routines are provided in BeginPerlBioinfo.pm. So the Perl program is a short exercise and is very modular.

Output from Example 3-2 (frames 1, 5 and 6 shown)

-------Reading Frame 1--------
RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAE
WSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKP
DINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKS
RERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSP
QPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRI
REDEGAVASSTVKEPPEATATPEPLSDEDL

-------Reading Frame 5--------
RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDS
EGVTGESEEGKYLYDSRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRH
ASHSPHMRADRLICCCCCW_CWLGVATKGCGEDLWGEAEPRASMAPTPVPDPARR
CRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFSLHSRQYHSRM
ALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGG
SGSEPSPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAY
SYCAGPMRRLRCKPVGGRPRAPKTPQRRH

-------Reading Frame 6--------
GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTL
RASLVRARKGSTCTIPGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADM
PHTHHTCGLTV_SAAAAAGDAGWVWPPRAAERICGAKQSPEQAWPQPLSLTLPGA
AGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSLCTPDSTTPGW
PWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEA
LGLNHLPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRT
PIAQAPCVVSAANQ_VAGLEPPRPLSAAI

Exercises for Section 3
1. Add to the Perl program in Example 3-1 a translation from DNA to protein and print out the protein.
2. Write a subroutine that checks a string and returns true if it's a DNA sequence. Write another that checks for protein sequence data.
3. Write a program that can search by name for a gene in an unsorted array.
4. Write a subroutine that inserts an element into a sorted array. Hint: use the splice Perl function to insert the element.
5. Write a subroutine that checks an array of data and returns true if it's in FASTA format. Note that FASTA expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of unknown length. Also, the asterisk (*) represents a stop codon for amino acids. Be careful using an asterisk in regular expressions; use a \* to escape it to match an actual asterisk.
Section 4 : GenBank (Genetic Sequence Data Bank) Files
International repository of known genetic sequences from a variety of organisms
GenBank is a flat file, an ASCII text file, that is easily readable
GenBank is referred to as a databank or data store
By contrast, databases have a relational structure that includes associated indices, links and a query language
Perl modules and constructs are ideal for processing flat files
For additional bioinformatics software, refer to these web sites:
National Center for Biotechnology Information (NCBI)
National Institutes of Health (NIH), http://proxy.lib.ohio-state.edu:2224
European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de/
Let's take a look at a short GenBank file

Example of a short GenBank file:

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp, Tel:+81-3-3353-1211(ex.62692),
            Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      (continued below)

Example of a short GenBank file (cont'd):

        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

For a view of the complete file and its format, look at record.gb in Section 4 of the exercises.
A typical GenBank entry is packed with information. With Perl we will be able to separate the different parts. For instance, by extracting the sequence, we can search for motifs, calculate statistics on the sequence, or compare it with other sequences. Also, by separating the various parts of the data annotation, we have access to ID numbers, gene names, genus and species, publications, etc.
The FEATURES table part of the annotation includes specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.
The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.

Example 4-1 :
#!/usr/bin/perl
# Example 4-1 Extract annotation and sequence from GenBank file
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# declare and initialize variables
my @annotation = ( );
my $sequence = '';
my $filename = 'record.gb';
parse1(\@annotation, \$sequence, $filename);
# Print the annotation, and then
# print the DNA in new format just to check if we got it okay.
print @annotation;
print_sequence($sequence, 50);
exit;

We need a subroutine which will separate the annotation part of the file from the DNA sequence. In the main Perl code, parse1() is that subroutine. But there is a different twist this time. The variables @annotation and $sequence are included as arguments in the call, each preceded by the backslash character. The backslash creates a reference, so the variable is passed by reference rather than by value. The subroutine receives the actual location in memory, and any changes made there remain after the subroutine returns to the calling routine.
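A small sketch shows the effect of passing references on their own, away from the GenBank parsing. The subroutine add_record() and the sample strings are hypothetical, invented only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# add_record: hypothetical subroutine, for illustration only
# $lines_ref is a reference to an array, $seq_ref a reference to a scalar
sub add_record {
    my ($lines_ref, $seq_ref, $line, $bases) = @_;
    push @$lines_ref, $line;    # changes the caller's array
    $$seq_ref .= $bases;        # changes the caller's scalar
    return;
}

my @annotation = ();
my $sequence   = '';

# The backslash creates the references that are passed in
add_record(\@annotation, \$sequence, "LOCUS demo\n",      'acgt');
add_record(\@annotation, \$sequence, "DEFINITION demo\n", 'ttga');

print scalar(@annotation), " annotation lines\n";
print "$sequence\n";
```

Nothing is returned from add_record(), yet both caller variables have changed; this is the same mechanism parse1() relies on.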

Example 4-1 :
###################################################
# Subroutine
###################################################
# parse1
# -parse annotation and sequence from GenBank record
sub parse1 {
    my($annotation, $dna, $filename) = @_;
    # $annotation - reference to array
    # $dna        - reference to scalar
    # $filename   - scalar
    # declare and initialize variables
    my $in_sequence = 0;
    my @GenBankFile = ( );
    # Get the GenBank data into an array from a file
    @GenBankFile = get_file_data($filename);
    # Extract all the sequence lines
    foreach my $line (@GenBankFile) {
        if( $line =~ /^\/\/\n/ ) {         # If $line is end-of-record line //\n,
            last;                          # break out of the foreach loop.
        } elsif( $in_sequence) {           # If we know we're in a sequence,
            $$dna .= $line;                # add the current line to $$dna.
        } elsif ( $line =~ /^ORIGIN/ ) {   # If $line begins a sequence,
            $in_sequence = 1;              # set the $in_sequence flag.
        } else{                            # Otherwise
            push( @$annotation, $line);    # add the current line to @annotation.
        }
    }
    # remove whitespace and line numbers from DNA sequence
    $$dna =~ s/[\s0-9]//g;
}

The parse1() routine takes three arguments: a reference to the annotation array, a reference to the sequence scalar, and the filename. Again notice that the variables are assigned from the special array, @_. The first two variables, $annotation and $dna, hold the references to memory for the caller's @annotation and $sequence.
The array @GenBankFile is parsed in the foreach loop. Each line is extracted and evaluated. Because the annotation part of the file appears first, the flag $in_sequence is initially set to false. The conditional checks are:
1. Check for the end-of-record line, //; last leaves the loop
2. If $in_sequence is true, append the line to the DNA sequence
3. If the word ORIGIN is at the beginning of the line, set the $in_sequence variable to true
4. Otherwise, append the annotation line to the array
The last command deletes all white space and digits from the DNA sequence.
Example 4-2 : Parsing GenBank file using scalars
A second way to separate annotations from sequences in GenBank records is to read the entire record into a scalar variable
Then operate on it with regular expressions
For some kinds of data, this can be a more convenient way to parse the input
Problem: the scalar now contains multiple newlines, \n
Previous regular expressions have used the caret (^), dot (.), and dollar sign ($) metacharacters
The following two pattern modifiers affect these three metacharacters:
The /s modifier assumes you want to treat the whole string as a single line, even with embedded newlines; it makes the dot metacharacter match any character, including newlines.
The /m modifier assumes you want to treat the whole string as multiple lines, with embedded newlines; it extends ^ and $ to match after, or before, a newline embedded in the string.
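The effect of the two modifiers can be seen on a short multi-line string. The string below is a made-up miniature of a GenBank record, used only for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up miniature record with embedded newlines
my $text = "LOCUS demo\nORIGIN\nacgt\n//\n";

# Without /s the dot cannot cross the newline after "demo"
my $without_s = ($text =~ /LOCUS.*ORIGIN/)  ? 1 : 0;
my $with_s    = ($text =~ /LOCUS.*ORIGIN/s) ? 1 : 0;

# Without /m the caret matches only at the very start of the string
my $without_m = ($text =~ /^ORIGIN/)  ? 1 : 0;
my $with_m    = ($text =~ /^ORIGIN/m) ? 1 : 0;

print "dot across newline, no /s: $without_s\n";
print "dot across newline, /s:    $with_s\n";
print "caret inside string, no /m: $without_m\n";
print "caret inside string, /m:    $with_m\n";
```

The two modifiers are independent and can be combined as /sm when a pattern needs both behaviors.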

Example 4-2 :
#!/usr/bin/perl
# Example 4-2 Extract the annotation and sequence sections from the first
# record of a GenBank library
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# Declare and initialize variables
my $annotation = '';
my $dna = '';
my $record = '';
my $filename = 'record.gb';
my $save_input_separator = $/;
# Open GenBank library file
unless (open(GBFILE, $filename)) {
    print "Cannot open GenBank file \"$filename\"\n\n";
    exit;
}
# Set input separator to "//\n" and read in a record to a scalar
$/ = "//\n";
$record = <GBFILE>;
# reset input separator
$/ = $save_input_separator;
# Now separate the annotation from the sequence data
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
# Print the two pieces, which should give us the same as the
# original GenBank file, minus the // at the end
print $annotation, $dna;
exit;

The output of this program is just about the same as that of Example 4-1. There is the annotation section and the DNA section.
The input separator is first set to the two forward slashes followed by a newline, //\n, which marks the end of a GenBank record. The contents of the record are then read into the single string variable, $record.
Now we use the regular expression that parses the annotation and sequence out of the $record variable: $record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s.
There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) and (.*). The parentheses are metacharacters which remember the parts of the data that match the pattern within the parentheses, here the annotation and the sequence.
Also note that, in list context, the pattern match returns a list whose elements are the matched parenthetical patterns. After matching the annotation and the sequence within the pairs of parentheses in the regular expression, the matched patterns are assigned to the two variables $annotation and $dna:
($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
Notice that at the end of the pattern we've added the /s pattern matching modifier, which, as you've seen earlier, allows a dot to match any character, including an embedded newline.
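The record separator technique extends naturally to a library holding several records. The sketch below reads from an in-memory string instead of record.gb so that it is self-contained; the two tiny records are made up. It uses local to restore $/ automatically, an alternative to saving and resetting it by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up two-record library held in a string (stands in for a file)
my $library = "LOCUS A\nORIGIN\n 1 acgt\n//\nLOCUS B\nORIGIN\n 1 ttga\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!\n";

my @records = ();
{
    local $/ = "//\n";      # restored automatically when the block ends
    @records = <$fh>;       # each element is one complete record
}
close $fh;

my @sequences = ();
foreach my $record (@records) {
    # Same pattern as Example 4-2: annotation in $1, sequence lines in $2
    my ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
    next unless defined $dna;
    $dna =~ s/[\s0-9]//g;   # drop whitespace and line numbers
    push @sequences, $dna;
}
print scalar(@records), " records\n";
print "$_\n" foreach @sequences;
```

With a real library file, the same loop could read one record at a time in a while loop rather than slurping every record into an array.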

Exercises for Section 4
1. Looking at a GenBank record, it's interesting to think about how to extract the useful information. Write a Perl program which parses the annotations of a GenBank file.
2. Write a program that takes a long DNA sequence as input and outputs the counts of all four-base subsequences (256 of them in all), sorted by frequency. A four-base subsequence starts at each location 1, 2, 3, and so on.
3. Extend the program in Exercise 4-2 to count all the sequences in a GenBank library.
4. Given an amino acid, find the frequency of occurrence of the adjacent amino acids coded in a DNA sequence, or in a GenBank library.
Section 5 : Protein Data Bank (PDB)
PDB is the main source for information about 3D structures of macromolecules:
proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, carbohydrates
PDB files are, like GenBank records, human-readable ASCII flat files
There may be problems with the consistency of PDB files
Routines/programs which work well with newer PDB files may have problems with older files
We will look at tools which operate on large numbers of files and folders

Example 5-1 : Print the contents of folders and all subfolders
Use the Perl functions opendir and readdir
Need to construct the logic to read into subfolders
Use the file tests -f and -d to test for regular files and folders
Example 5-1
#!/usr/bin/perl
# Example 5-1 Demonstrating how to open a folder and list its contents
# -distinguishing between files and subfolders, which
# are themselves listed
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
my @files = ( );
my $folder = 'pdb';
# Open the folder
unless(opendir(FOLDER, $folder)) {
    print "Cannot open folder $folder!\n";
    exit;
}
# Read the folder, ignoring special entries "." and ".."
@files = grep (!/^\.\.?$/, readdir(FOLDER));
closedir(FOLDER);
# If file, print its name
# If folder, print its name and contents
#
# Notice that we need to prepend the folder name!
foreach my $file (@files) {
    # If the folder entry is a regular file
    if (-f "$folder/$file") {
        print "$folder/$file\n";
    # If the folder entry is a subfolder
    }elsif( -d "$folder/$file") {
        my $folder = "$folder/$file";
        # open the subfolder and list its contents
        unless(opendir(FOLDER, "$folder")) {
            print "Cannot open folder $folder!\n";
            exit;
        }
        my @files = grep (!/^\.\.?$/, readdir(FOLDER));
        closedir(FOLDER);
        foreach my $file (@files) {
            print "$folder/$file\n";
        }
    }
}
exit;
This Perl code attempts to open a folder and its subfolders and to list the files contained in each. The opendir function opens a directory, and readdir returns a list of its entries. Using the grep function and a regular expression, we can cull out the entries . and .. from that list:
@files = grep (!/^\.\.?$/, readdir(FOLDER));
Here !/^\.\.?$/ is tested against each entry in the list returned by readdir, and grep keeps anything that is NOT . or .. in the result. Also, the embedded condition that checks whether an entry is a folder/directory permits us to dive into each subfolder and retrieve its files. Note that this program descends only one level below the starting folder.
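To reach subfolders at any depth, the same opendir/readdir/grep logic can be placed in a subroutine that calls itself. The sketch below is one possible arrangement, not the course's code: list_folder() is a hypothetical name, and the small folder tree is created in a temporary location purely so the example can run anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# list_folder: hypothetical recursive helper
# Returns the paths of all regular files under $folder
sub list_folder {
    my ($folder) = @_;
    my @found = ();
    opendir(my $dh, $folder) or die "Cannot open folder $folder: $!\n";
    # Skip the special entries "." and ".."; sort for a stable order
    my @entries = sort grep { !/^\.\.?$/ } readdir($dh);
    closedir($dh);
    foreach my $entry (@entries) {
        my $path = "$folder/$entry";      # prepend the folder name
        if (-f $path) {
            push @found, $path;
        } elsif (-d $path) {
            push @found, list_folder($path);   # descend into the subfolder
        }
    }
    return @found;
}

# Build a small throwaway tree to exercise the subroutine
my $top = tempdir(CLEANUP => 1);
mkdir "$top/c1"      or die "Cannot create folder: $!\n";
mkdir "$top/c1/deep" or die "Cannot create folder: $!\n";
foreach my $name ("$top/a.ent", "$top/c1/b.ent", "$top/c1/deep/c.ent") {
    open(my $fh, '>', $name) or die "Cannot write $name: $!\n";
    close $fh;
}

my @files = list_folder($top);
print "$_\n" foreach @files;
```

A lexical directory handle is used here because a shared bareword handle such as FOLDER would be overwritten by the recursive call.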
Example 5-2 : Extract sequence chains from PDB file
Take a look at a PDB file
Can be very lengthy
PDB files are composed of lines of 80 columns that begin with one of several predefined record names and end with a newline
Let's start with extracting the amino acid sequence data
To extract the amino acid primary sequence information, you need to parse the record type SEQRES
The SEQRES record type is one of four record types in the Primary Structure Section
It represents the primary structure of the peptide or nucleotide sequence
Here is a SEQRES line from the PDB file:

SEQRES   1 A  136  SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL

Example 5-2 : Extract sequence chains from PDB file
Record Format

COLUMNS   DATA TYPE      FIELD      DEFINITION
--------------------------------------------------------------------------------
 1 -  6   Record name    "SEQRES"
 9 - 10   Integer        serNum     Serial number of the SEQRES record for the
                                    current chain. Starts at 1 and increments
                                    by one each line. Reset to 1 for each chain.
12        Character      chainID    Chain identifier. This may be any single
                                    legal character, including a blank which is
                                    used if there is only one chain.
14 - 17   Integer        numRes     Number of residues in the chain. This value
                                    is repeated on every record.
20 - 22   Residue name   resName    Residue name.
24 - 26   Residue name   resName    Residue name.
28 - 30   Residue name   resName    Residue name.
32 - 34   Residue name   resName    Residue name.
36 - 38   Residue name   resName    Residue name.
40 - 42   Residue name   resName    Residue name.
44 - 46   Residue name   resName    Residue name.
48 - 50   Residue name   resName    Residue name.
52 - 54   Residue name   resName    Residue name.
56 - 58   Residue name   resName    Residue name.
60 - 62   Residue name   resName    Residue name.
64 - 66   Residue name   resName    Residue name.
68 - 70   Residue name   resName    Residue name.
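Because the fields sit in fixed columns, substr() can pick them out directly; a column number n corresponds to substr offset n - 1. The sketch below builds the sample SEQRES line with sprintf so that each field lands in the column given in the table, then reads the fields back. The variable names are chosen for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @residues = qw(SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL);

# Build the line so each field falls in its column:
# record name 1-6, serNum ends at 10, chainID 12, numRes 14-17, residues from 20
my $line = sprintf("SEQRES %3d %1s %4d  %s", 1, 'A', 136, join(' ', @residues));

# Column n is substr offset n - 1
my $record_name = substr($line, 0, 6);
my $ser_num     = substr($line, 8, 2);    # columns 9-10
my $chain_id    = substr($line, 11, 1);   # column 12
my $num_res     = substr($line, 13, 4);   # columns 14-17
my $res_field   = substr($line, 19);      # column 20 to end of line

# Numeric fields are right-justified, so strip the padding blanks
$ser_num =~ s/\s//g;
$num_res =~ s/\s//g;
my @names = split(' ', $res_field);

print "$line\n";
print "record=$record_name serNum=$ser_num chain=$chain_id numRes=$num_res\n";
print scalar(@names), " residues on this line\n";
```

Splitting the whole line on whitespace would also work for this line, but it breaks when the chain identifier is a blank, which the format allows.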
Example 5-2
#!/usr/bin/perl
# Extract sequence chains from PDB file
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;
# Read in PDB file: Warning - some files are very large!
my @file = get_file_data('pdb/c1/pdb1c1f.ent');
# Parse the record types of the PDB file
my %recordtypes = parsePDBrecordtypes(@file);
# Extract the amino acid sequences of all chains in the protein
my @chains = extractSEQRES( $recordtypes{'SEQRES'} );
# Translate the 3-character codes to 1-character codes, and print
foreach my $chain (@chains) {
    print "****chain $chain **** \n";
    print "$chain\n";
    print iub3to1($chain), "\n";
}
exit;

This Perl code incorporates three subroutines:
1. parsePDBrecordtypes() takes an array and returns a hash of key/value pairs.
2. extractSEQRES(), given a scalar containing the SEQRES lines taken from the hash, returns an array containing the chains of the sequence.
3. iub3to1() changes a string of 3-character IUB amino acid codes (whitespace separated) into a string of 1-character amino acid codes.

Because of the size of PDB files, memory limitations might be a problem. The use of memory can be lessened by not saving the results of reading in the file, but instead passing the file data directly to the parsePDBrecordtypes subroutine:
# Get the file data and parse the record types of the PDB file
%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));
Further savings of memory are possible by rewriting the program to read the file one line at a time while parsing the data into the record types.

Output from Example 5-2
SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU THR LEU VAL LYS LEU GLU SGGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVM NSTLKGDNGWETEQRSTNFTLSAGQYFEITLSYDINKFYIDILDGPNLEFPNRYSKEFLPFLSL AGDARLTLVKLE
Example 5-2
# parsePDBrecordtypes
#
#-given an array of a PDB file, return a hash with
# keys = record type names
# values = scalar containing lines for that record type
sub parsePDBrecordtypes {
    my @file = @_;
    use strict;
    use warnings;
    my %recordtypes = ( );
    foreach my $line (@file) {
        # Get the record type name which begins at the
        # start of the line and ends at the first space
        # The pattern (\S+) is returned and saved in $recordtype
        my($recordtype) = ($line =~ /^(\S+)/);
        # .= fails if a key is undefined, so we have to
        # test for definition and use either .= or = depending
        if(defined $recordtypes{$recordtype} ) {
            $recordtypes{$recordtype} .= $line;
        }else{
            $recordtypes{$recordtype} = $line;
        }
    }
    return %recordtypes;
}

parsePDBrecordtypes() parses the PDB record types from an array containing the lines of the PDB record. Follow the comments, which describe what's happening. Basically, each line is examined for its record type and is then added to the value of a hash entry with the record type as the key.
In the code, the regular expression /^(\S+)/ matches any word at the beginning of the line. Remember that \S is interpreted as any non-white-space character. The enclosing parentheses capture the matched text, which is assigned to $recordtype. The lines that share a record type are associated with that key:
key => value
COMPND => COMPND MOL_ID: 1;\nCOMPND 2 MOLECULE: CONGERIN I;\nCOMPND 3 CHAIN: A;\nCOMPND 4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;\nCOMPND 5 BIOLOGICAL_UNIT: HOMODIMER\n
Every line in the array with the same record type is concatenated onto the rest. The hash is returned from the subroutine.
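The grouping step can be tried on a handful of lines held in memory. This sketch appends with .= directly, relying on the fact, documented in perlsyn, that .= applied to a not-yet-defined value is exempt from the uninitialized warning; it is offered as an alternative to the defined test in the listing, not as the course's code. The sample lines are abbreviated from the records shown in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few abbreviated PDB-style lines, for illustration only
my @file = (
    "COMPND    MOL_ID: 1;\n",
    "COMPND   2 MOLECULE: CONGERIN I;\n",
    "SEQRES   1 A  136  SER GLY GLY LEU\n",
    "\n",
);

my %recordtypes = ();
foreach my $line (@file) {
    # The first run of non-whitespace characters is the record type
    my ($recordtype) = ($line =~ /^(\S+)/);
    next unless defined $recordtype;        # skip blank lines
    $recordtypes{$recordtype} .= $line;     # creates the key on first use
}

foreach my $type (sort keys %recordtypes) {
    my $count = () = $recordtypes{$type} =~ /\n/g;   # count the lines stored
    print "$type: $count line(s)\n";
}
```

The extra test for a defined $recordtype guards against blank lines, for which the pattern does not match and the capture is undefined.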
Example 5-2
# extractSEQRES
#
# - given a scalar containing SEQRES lines,
#   return an array containing the chains of the sequence
sub extractSEQRES {
    use strict;
    use warnings;
    my($seqres) = @_;
    my $lastchain = '';
    my $sequence = '';
    my @results = ( );

    # make array of lines
    my @record = split ( /\n/, $seqres);

    foreach my $line (@record) {
        # Chain is in column 12, residues start in column 20
        my ($thischain) = substr($line, 11, 1);
        my($residues) = substr($line, 19, 52); # add space at end

        # Check if a new chain, or continuation of previous chain
        if("$lastchain" eq "") {
            $sequence = $residues;
        }elsif("$thischain" eq "$lastchain") {
            $sequence .= $residues;
        # Finish gathering previous chain (unless first record)
        }elsif ( $sequence ) {
            push(@results, $sequence);
            $sequence = $residues;
        }
        $lastchain = $thischain;
    }

    # save last chain
    push(@results, $sequence);
    return @results;
}
Let's examine the subroutine extractSEQRES. Once the record types have been parsed out, we want to extract the primary amino acid sequence. We need to extract each chain separately and return an array of one or more strings of sequence corresponding to those chains, instead of just one sequence.

When passed to the subroutine, the variable $seqres holds the SEQRES record type, which stretches over several lines, as a scalar string that is the value of the key SEQRES in the hash. Here we use the same approach as in the previous parsePDBrecordtypes subroutine, iterating over lines (as opposed to applying regular expressions to multiline strings). The Perl split function enables us to turn a multiline string into an array of lines.

As we iterate through the lines of the SEQRES record type, we note when a new chain is starting, save the previous chain in @results, reset the $sequence scalar to the residues of the new chain, and set the $lastchain flag to the new chain. When done with all the lines, we make sure to save the sequence of the last chain in the @results array.
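The column arithmetic is easier to see on a small input. The sketch below is a variation on the idea, not the course subroutine: it collects residues per chain in a hash keyed by chain ID instead of tracking $lastchain. The helper seqres_line and the sample residues are hypothetical; sprintf is used only so that the chain ID lands in column 12 and the residues start in column 20.

```perl
use strict;
use warnings;

# Build hypothetical SEQRES lines with sprintf so the fixed columns line up:
# chain ID in column 12, residues from column 20
sub seqres_line {
    my ($serial, $chain, $count, $residues) = @_;
    return sprintf("SEQRES %3d %1s %4d  %s", $serial, $chain, $count, $residues);
}

my $seqres = join("\n",
    seqres_line(1, 'A', 5, 'GLY SER ALA'),
    seqres_line(2, 'A', 5, 'VAL LEU'),
    seqres_line(1, 'B', 2, 'MET LYS'),
) . "\n";

my %chain_residues;   # chain ID => space-separated residue codes
my @chain_order;      # chain IDs in order of first appearance

foreach my $line (split /\n/, $seqres) {
    my $chain    = substr($line, 11, 1);   # column 12 is offset 11
    my $residues = substr($line, 19);      # column 20 to end of line
    push @chain_order, $chain unless exists $chain_residues{$chain};
    $chain_residues{$chain} .= "$residues ";
}

foreach my $chain (@chain_order) {
    print "Chain $chain: $chain_residues{$chain}\n";
}
```

Keying on the chain ID removes the need for the "save last chain" step, at the cost of keeping a separate array to remember the order in which chains appeared.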
Example 5-2

# iub3to1
#
# - change string of 3-character IUB amino acid codes (whitespace separated)
#   into a string of 1-character amino acid codes
sub iub3to1 {
    my($input) = @_;

    my %three2one = (
        'ALA' => 'A',
        'VAL' => 'V',
        'LEU' => 'L',
        'ILE' => 'I',
        'PRO' => 'P',
        'TRP' => 'W',
        'PHE' => 'F',
        'MET' => 'M',
        'GLY' => 'G',
        'SER' => 'S',
        'THR' => 'T',
        'TYR' => 'Y',
        'CYS' => 'C',
        'ASN' => 'N',
        'GLN' => 'Q',
        'LYS' => 'K',
        'ARG' => 'R',
        'HIS' => 'H',
        'ASP' => 'D',
        'GLU' => 'E',
    );

    # clean up the input
    $input =~ s/\n/ /g;

    my $seq = '';

    # This use of split separates on any contiguous whitespace
    my @code3 = split(' ', $input);

    foreach my $code (@code3) {
        # A little error checking
        if(not defined $three2one{$code}) {
            print "Code $code not defined\n";
            next;
        }
        $seq .= $three2one{$code};
    }
    return $seq;
}
The subroutine iub3to1 translates the three-character codes, in which the PDB sequence information is stored, into one-character codes. The hash is defined within the subroutine. The Perl split function is, again, used to create an array from a string. The foreach loop then simply looks up each array entry among the hash keys. The single character from the translation is added to the end of our sequence.
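A shortened version of the lookup shows how split(' ', ...) copes with irregular whitespace. This is only a sketch: the helper name three_to_one is hypothetical, the table is deliberately partial, and unknown codes are turned into X here, whereas the course subroutine prints a message and skips them.

```perl
use strict;
use warnings;

# Partial lookup table for illustration; the full table has all 20 residues
my %three2one = (ALA => 'A', GLY => 'G', SER => 'S', LEU => 'L', VAL => 'V');

sub three_to_one {   # hypothetical helper name, not part of the course module
    my ($input) = @_;
    my $seq = '';
    # split ' ' separates on any run of whitespace, including newlines
    foreach my $code (split ' ', $input) {
        # Unknown codes become 'X' here instead of being skipped
        $seq .= exists $three2one{$code} ? $three2one{$code} : 'X';
    }
    return $seq;
}

my $one_letter = three_to_one("GLY SER\nALA  LEU MSE");
print "$one_letter\n";
```

The input mixes a newline and a double space between codes, and MSE is not in the table, so the result is GSALX.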
Exercises for Section 5
1. Write a recursive subroutine to list a filesystem. Be sure to check if an entry is a file or folder.
2. Write a recursive subroutine to determine the size of an array. You may want to use the pop or shift functions.
3. Write a recursive subroutine that extracts the primary amino acid sequence from the SEQRES record type of a PDB file.
4. Given an atom and a distance, find all other atoms in a PDB file that are within that distance of the atom.
Section 6 : Basic Local Alignment Search Tool (BLAST)
Searching for sequence similarity is very important
BLAST is one of the most popular software tools in biological research
BLAST tests a query sequence against a library of known sequences in order to find similarity
BLAST is a collection of programs, with versions for the different query-to-database pairs:
    nucleotide-nucleotide
    protein-nucleotide
    protein-protein
    nucleotide-protein
The goal of this section is to write Perl code that parses a BLAST output file using regular expressions
The code is basic and efficient, and can serve as a starting point for more extensive algorithms
Online documentation for BLAST is extensive
Here we are interested in parsing the BLAST file, rather than in the theory
Example 6-1 : Extracting annotations and alignments
The BLAST file used in this section, blast.txt, was created from a BLAST query using the file sample.dna from Sections 3 and 4
We introduce two new subroutines, parse_blast and parse_blast_alignment
We use regular expressions to extract the various bits of data from a scalar string

Example 6-1
# parse_blast
#
# - parse beginning and ending annotation, and alignments,
#   from BLAST output file
sub parse_blast {
    my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_;

    # $beginning_annotation - reference to scalar
    # $ending_annotation    - reference to scalar
    # $alignments           - reference to hash
    # $filename             - scalar

    # declare and initialize variables
    my $blast_output_file = '';
    my $alignment_section = '';

    # Get the BLAST program output from a file into a single scalar string
    $blast_output_file = join( '', get_file_data($filename));

    # Extract the beginning annotation, alignments, and ending annotation
    ($$beginning_annotation, $alignment_section, $$ending_annotation)
        = ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);

    # Populate %alignments hash
    # key   = ID of hit
    # value = alignment section
    %$alignments = parse_blast_alignment($alignment_section);
}
The main program does no more than call the parsing subroutine and print the results. The arguments, initialized as empty, are passed by reference, with the preceding \ in the subroutine call.

The subroutine parse_blast does the parsing job of separating the three sections of a BLAST output file:
1. the annotation at the beginning,
2. the alignments in the middle,
3. the annotation at the end.
It then calls the parse_blast_alignment subroutine to extract the individual alignments from that middle alignment section.

The data is first read in from the named file with the get_file_data subroutine, using the join function to collapse the array of file lines into a single scalar string. The pattern match contains three parenthesized expressions:
1. (.*^ALIGNMENTS\n) is assigned to $$beginning_annotation
2. (.*) is saved in $alignment_section
3. (^  Database:.*) is saved in $$ending_annotation

Let's walk through what is happening. The /s modifier lets . match newlines, and the /m modifier lets ^ match at the start of any line. The first parenthesized expression matches everything up to a line that begins with the word ALIGNMENTS followed by an end-of-line. The second, (.*), collects everything after that, which becomes $alignment_section. The third matches a line that begins with two spaces and the word Database, followed by the rest of the file. These are the three desired parts of the BLAST output file: the beginning annotation, the alignment section, and the ending annotation.

The last command fills the hash %alignments by calling the parse_blast_alignment subroutine, which takes one scalar string argument, the section that contains the alignments.
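The three-way split can be checked against a toy report. The sketch below assumes a heavily abbreviated, made-up BLAST-like text (the IDs and numbers in $report are hypothetical) and applies the same three-group pattern with the /m and /s modifiers.

```perl
use strict;
use warnings;

# Hypothetical, heavily abbreviated stand-in for a BLAST report
my $report = join "\n",
    'BLASTN 2.1.3',
    'Query= sample',
    'ALIGNMENTS',
    '>gb|AB000001.1| first hit',
    ' Score = 100 bits',
    '>gb|AB000002.1| second hit',
    ' Score = 50 bits',
    '  Database: nt',
    '    1,000 sequences',
    '';

# /s lets . cross newlines; /m lets ^ match at the start of any line
my ($beginning, $middle, $ending) =
    ($report =~ /(.*^ALIGNMENTS\n)(.*)(^  Database:.*)/ms);

print "--- beginning ---\n$beginning";
print "--- alignments ---\n$middle";
print "--- ending ---\n$ending";
```

Because both .* groups are greedy, the middle section extends to the last line that starts with two spaces and Database:, which is the behaviour wanted when the trailer appears only once.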

Example 6-1
# parse_blast_alignment
#
# - parse the alignments from a BLAST output file,
#   return hash with
#   key   = ID
#   value = text of alignment
sub parse_blast_alignment {
    my($alignment_section) = @_;

    # declare and initialize variables
    my(%alignment_hash) = ( );

    # loop through the scalar containing the BLAST alignments,
    # extracting the ID and the alignment and storing in a hash
    #
    # The regular expression matches a line beginning with >,
    # and containing the ID between the first pair of | characters;
    # followed by any number of lines that don't begin with >
    while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
        my($value) = $&;
        my($key) = (split(/\|/, $value)) [1];
        $alignment_hash{$key} = $value;
    }
    return %alignment_hash;
}
The subroutine parse_blast_alignment has one important loop, the while loop. Remember that + means one or more matches, that the /g modifier finds successive matches throughout the string, and that the /m modifier lets ^ match at the beginning of each line. Each time the program cycles through the loop, the pattern match finds the value (the entire alignment) and then determines the key. The key and value are saved in the hash %alignment_hash.

The regular expression ^>.*\n looks for > at the beginning of a BLAST output line, followed by .*, which matches the rest of that first line of the alignment. The rest of the regular expression, (^(?!>).*\n)+, contains a negative lookahead assertion, (?!>), ensuring that the line does not begin with >. The .* matches all non-newline characters, up to the final \n at the end of the line. The trailing + repeats the group, so it matches as many such lines as are available.

The call to split, split(/\|/, $value), splits $value into pieces delimited by | characters. That is, the | symbol determines where one list element ends and the next one begins. The function returns a list of the sections of the string found before and after the vertical bars. Surrounding the call to split with parentheses and adding a subscript, [1], selects the second element of that list, which is saved in $key.
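The loop can be exercised on a two-hit toy input. This is a sketch under stated assumptions: the alignment text and IDs are made up, and the pattern is wrapped in an outer capture group so that $1 can be used in place of $&, which is a small variation on the course code rather than a copy of it.

```perl
use strict;
use warnings;

# Hypothetical alignment section with two hits (IDs are made up)
my $alignment_section = join "\n",
    '>gb|AB000001.1| first hit',
    ' Score = 100 bits',
    'Query: 1 acgt 4',
    '>gb|AB000002.1| second hit',
    ' Score = 50 bits',
    '';

my %alignment_hash;
# Outer parentheses capture the whole alignment in $1;
# (?:...) groups the continuation lines without capturing them
while ($alignment_section =~ /(^>.*\n(?:^(?!>).*\n)+)/gm) {
    my $value = $1;
    my $key   = (split /\|/, $value)[1];   # text between the first two |
    $alignment_hash{$key} = $value;
}

foreach my $key (sort keys %alignment_hash) {
    my $lines = ($alignment_hash{$key} =~ tr/\n//);   # lines in this alignment
    print "$key: $lines line(s)\n";
}
```

The first hit keeps its header plus two continuation lines; the lookahead stops the match as soon as the next line starts with >.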

Example 6-1
#!/usr/bin/perl
# Example 6-1 Extract annotation and alignments from BLAST output file

use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $filename = 'blast.txt';

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

# Print the beginning annotation, then each alignment
# (set off by rows of X), just to check that we parsed it okay
print $beginning_annotation;

foreach my $key (keys %alignments) {
    print "$key\nXXXXXXXXXXXX\n", $alignments{$key}, "\nXXXXXXXXXXX\n";
}

print $ending_annotation;

exit;
Here is our main routine, which merely calls our two subroutines and creates the output. The data are passed by reference to the parse_blast subroutine. Because the subroutines discussed above have been added to BeginPerlBioinfo.pm, this extremely short routine is all we need to run our code.
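Passing by reference is what lets a subroutine fill in variables that belong to its caller. The sketch below illustrates only that mechanism; the subroutine fill_results and its arguments are hypothetical and unrelated to BLAST parsing.

```perl
use strict;
use warnings;

# Hypothetical subroutine that fills in its caller's variables
sub fill_results {
    my ($summary_ref, $counts_ref, $text) = @_;
    $$summary_ref = "length " . length($text);       # write through scalar ref
    %$counts_ref  = ();                              # reset the caller's hash
    $counts_ref->{$_}++ foreach split //, $text;     # count each character
    return;
}

my $summary = '';
my %counts  = ();

# The backslash creates a reference to each variable
fill_results(\$summary, \%counts, 'acgtaa');

print "$summary\n";
print "$_ => $counts{$_}\n" foreach sort keys %counts;
```

After the call, $summary and %counts hold the values assigned inside the subroutine, which is the same pattern parse_blast uses with $$beginning_annotation and %$alignments.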
Example 6-2 : Parse alignments from BLAST output file
Taking Example 6-1 a little further, some of the alignments include more than one aligned string
To parse each alignment, we have to parse out each of the matched strings, which are called high-scoring segment pairs (HSPs)
We parse each HSP into annotation, query string, and subject string, together with the starting and ending positions of the strings
We include a pair of subroutines:
    one to parse the alignments into their HSPs
    the second to extract the sequences and their start-end positions

Example 6-2

# parse_blast_alignment_HSP
#
# - parse beginning annotation, and HSPs,
#   from BLAST alignment
#   Return an array with first element set to the beginning annotation,
#   and each successive element set to an HSP
sub parse_blast_alignment_HSP {
    my($alignment ) = @_;

    # declare and initialize variables
    my $beginning_annotation = '';
    my $HSP_section = '';
    my @HSPs = ( );

    # Extract the beginning annotation and HSPs
    ($beginning_annotation, $HSP_section )
        = ($alignment =~ /(.*?)(^ Score =.*)/ms);

    # Store the $beginning_annotation as the first entry in @HSPs
    push(@HSPs, $beginning_annotation);

    # Parse the HSPs, store each HSP as an element in @HSPs
    while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);
    }

    # Return an array with first element = the beginning annotation,
    # and each successive element = an HSP
    return(@HSPs);
}
The subroutine parse_blast_alignment_HSP takes one of the alignments from the BLAST output and separates out the individual HSP string matches. The first regular expression,

($beginning_annotation, $HSP_section ) = ($alignment =~ /(.*?)(^ Score =.*)/ms);

parses out the annotation and the section containing the HSPs. The first pair of parentheses in the regular expression is (.*?). This is minimal matching, which gathers up everything before the first line that begins Score = (without the ? after the *, it would gather everything up to the final line that begins Score =). This is exactly what we want: the dividing point between the beginning annotation and the HSP string matches.

The while loop and its regular expression separate the individual HSP string matches:

while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
    push(@HSPs, $&);
}

This is the same kind of global string match in a while loop, which keeps iterating as long as a match can be found. The other modifier, /m, is the multiline modifier, which enables the metacharacters ^ and $ to match just after and just before embedded newlines.

The expression within the first pair of parentheses, (^ Score =.*\n), matches a line that begins Score =, which is the beginning of an HSP string match section. The expression within the second pair of parentheses, (^(?! Score =).*\n)+, matches one or more (the +) lines that do not begin with Score =. The ?! at the beginning of the embedded parentheses is the negative lookahead assertion we saw in Example 6-1. So, simply, the regular expression captures a line beginning with Score = and all the following adjacent lines that don't begin with Score =.

Remember, the RE special variable $& has the value of the last successful pattern match. The loop therefore builds an array of all the HSPs listed for this alignment in the BLAST output.
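The difference between minimal and greedy matching is easiest to see side by side. The following sketch assumes a made-up alignment with two HSPs (the text in $alignment and the helper count_lines are hypothetical) and compares how many annotation lines each form of the pattern captures.

```perl
use strict;
use warnings;

# Hypothetical alignment text with two HSPs
my $alignment = join "\n",
    '>gb|AB000001.1| example hit',
    '          Length = 300',
    ' Score = 100 bits (50), Expect = 1e-20',
    'Query: 1 acgt 4',
    ' Score = 40 bits (20), Expect = 0.001',
    'Query: 9 ttga 12',
    '';

# Count newline characters without modifying the string
sub count_lines {
    my ($text) = @_;
    return ($text =~ tr/\n//);
}

# Minimal match: annotation stops before the FIRST " Score =" line
my ($annotation, $hsp_section) = ($alignment =~ /(.*?)(^ Score =.*)/ms);

# Greedy match: annotation runs up to the LAST " Score =" line
my ($greedy_annotation) = ($alignment =~ /(.*)(^ Score =.*)/ms);

# Split the HSP section into individual HSPs ($1 used instead of $&)
my @hsps;
while ($hsp_section =~ /(^ Score =.*\n(?:^(?! Score =).*\n)+)/gm) {
    push @hsps, $1;
}

print "minimal annotation: ", count_lines($annotation), " lines\n";
print "greedy annotation:  ", count_lines($greedy_annotation), " lines\n";
print "HSPs found:         ", scalar(@hsps), "\n";
```

With the minimal form the annotation is the two header lines; the greedy form swallows the first HSP as well and leaves only the last one for the second group.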

Example 6-2
# extract_HSP_information
#
# - parse a HSP from a BLAST output alignment section
# - return array with elements:
#   Expect value
#   Query string
#   Query range
#   Subject string
#   Subject range
sub extract_HSP_information {
    my($HSP) = @_;

    # declare and initialize variables
    my($expect) = '';
    my($query) = '';
    my($query_range) = '';
    my($subject) = '';
    my($subject_range) = '';

    ($expect) = ($HSP =~ /Expect = (\S+)/);

    $query = join ( '' , ($HSP =~ /^Query(.*)\n/gm) );
    $subject = join ( '' , ($HSP =~ /^Sbjct(.*)\n/gm) );

    $query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));
    $subject_range = join('..', ($subject =~ /(\d+).*\D(\d+)/s));

    # remove everything that is not a base
    $query =~ s/[^acgt]//g;
    $subject =~ s/[^acgt]//g;

    return ($expect, $query, $query_range, $subject, $subject_range);
}
The subroutine extract_HSP_information parses the HSP and returns the parsed values to the main program. As an exercise, explain how the regular expressions parse the information. Remember some details about REs:
1. \S+ matches one or more non-white-space characters
2. () saves the enclosed match in a special variable, e.g. $1
3. \d+ matches one or more digits; \D matches a nondigit
Try running the program and look at the output to see what values were created.
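One way to approach the exercise is to run the same patterns on a tiny HSP and inspect the intermediate values. The sketch below assumes a made-up two-row HSP in the classic "Query: start sequence end" layout; the numbers and bases are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical two-row HSP in the classic "Query: start seq end" layout
my $hsp = join "\n",
    ' Score = 40 bits (20), Expect = 5e-52',
    'Query: 235 ggagatgg 242',
    '           ||||||||',
    'Sbjct: 1048 ggagatgg 1055',
    'Query: 243 ttcag 247',
    '           |||||',
    'Sbjct: 1056 ttcag 1060',
    '';

my ($expect) = ($hsp =~ /Expect = (\S+)/);

# Collect the remainder of every Query line into one string
my $query_raw = join '', ($hsp =~ /^Query(.*)\n/gm);

# The first number and the last number in that string give the range
my $query_range = join '..', ($query_raw =~ /(\d+).*\D(\d+)/s);

# Strip everything that is not a lowercase base
(my $query = $query_raw) =~ s/[^acgt]//g;

print "raw:    $query_raw\n";
print "expect: $expect\n";
print "range:  $query_range\n";
print "query:  $query\n";
```

Printing $query_raw makes it clear why the range pattern works: the greedy .* in the middle forces the second (\d+) to be the last run of digits in the joined text.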

Example 6-2
#!/usr/bin/perl
# Example 6-2 Parse alignments from BLAST output file

use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

# declare and initialize variables
my $beginning_annotation = '';
my $ending_annotation = '';
my %alignments = ( );
my $alignment = '';
my $filename = 'blast.txt';
my @HSPs = ( );
my($expect, $query, $query_range, $subject, $subject_range) = ('','','','','');

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

$alignment = $alignments{'AK017941.1'};

@HSPs = parse_blast_alignment_HSP($alignment);

($expect, $query, $query_range, $subject, $subject_range)
    = extract_HSP_information($HSPs[1]);

# Print the results
print "\n-> Expect value: $expect\n";
print "\n-> Query string: $query\n";
print "\n-> Query range: $query_range\n";
print "\n-> Subject String: $subject\n";
print "\n-> Subject range: $subject_range\n";

exit;
In this example, the key is hard-coded into the program:

$alignment = $alignments{'AK017941.1'};

How could we change the code to look for all possible keys? Again, we have installed the two subroutines in BeginPerlBioinfo.pm. Here is the output of running this succinct code (the long strings are wrapped here for display):

-> Expect value: 5e-52

-> Query string: ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaat
ggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca

-> Query range: 235..400

-> Subject String: ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgaga
atgggctgagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatg
attggacttgtgacaactgcaacgagtggttccatggagactgcatccggatca

-> Subject range: 1048..1213
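One possible answer to the question about all keys is to loop over keys %alignments rather than naming a single ID. The sketch below is self-contained and therefore uses a made-up %alignments hash in place of the one filled by parse_blast; in the real program the loop body would call parse_blast_alignment_HSP and extract_HSP_information for each alignment.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the %alignments hash filled by parse_blast
my %alignments = (
    'AB000001.1' => ">gb|AB000001.1| hit one\n Score = 100 bits, Expect = 1e-20\n",
    'AB000002.1' => ">gb|AB000002.1| hit two\n Score = 40 bits, Expect = 0.001\n",
);

my %expect_for;   # ID => Expect value of the first HSP in that alignment

# Visit every key instead of one hard-coded ID
foreach my $id (sort keys %alignments) {
    my ($expect) = ($alignments{$id} =~ /Expect = (\S+)/);
    $expect_for{$id} = $expect;
    print "$id\t$expect\n";
}
```

Sorting the keys is optional; it only makes the order of the report repeatable, since Perl hashes have no fixed order.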

Exercises for Section 6
1. Perform two BLAST searches with related sequences. Parse the BLAST output of the searches and extract the top 10 hits in the header annotation of each search. Write a program that reports on the differences and similarities between the two searches.
For Further Information

Bioperl at http://bioperl.org
National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (http://proxy.lib.ohio-state.edu:2224) [secure entry]
European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk
European Molecular Biology Laboratory (EMBL) at http://www.embl-heidelberg.de
Protein Data Bank (PDB) at http://www.rcsb.org/pdb/
WU-BLAST, Washington University, at http://blast.wustl.edu
References
Perl Programming for Biologists, Curtis D. Jamison, John Wiley & Sons, Inc., 2003
Beginning Perl for Bioinformatics, James Tisdall, O'Reilly Pub., 2001 [****], very much recommended
Mastering Perl for Bioinformatics, James Tisdall, O'Reilly Pub., 2003
Table 1 : Codon to Amino Acids

from Beginning Perl for Bioinformatics, by James Tisdall