Programming in R
(and some other stuff)
y.wurm@qmul.ac.uk
https://wurmlab.github.io
© Alex Wild & others
2015 11-17-programming inr.key
© National Geographic
Atta leaf-cutter ants
Oecophylla Weaver ants
© ameisenforum.de
© forestryimages.org © wynnie@flickr
Tofilski et al 2008
Forelius pusillus
Tofilski et al 2008
Forelius pusillus hides the nest entrance at night
Before
Workers staying outside die
« preventive self-sacrifice »
Dorylus driver ants: ants with no home
© BBC
Animal biomass (Brazilian rainforest), from Fittkau & Klinge 1973
[Chart categories: other insects; amphibians; reptiles; birds; mammals; earthworms; spiders; soil fauna excluding earthworms, ants & termites; ants & termites]
We use modern technologies to
understand insect societies.
• evolution of social behaviour
• molecules involved in social behaviour
• consequences of environmental change
Big data is invading biology
This changes
everything.
Any lab can
sequence
anything!
http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html
BIG
Big data is invading biology
• Genomics
• Cancer genomics
• Biodiversity assessments
• Stool microbiome sequencing
• Personalized medicine
• Sensor networks - e.g. tracking microclimates, recording sounds
• Huge medical studies
• Aerial surveys (Drones) - e.g. crop productivity; rainforest cover
• Camera traps
Learning to deal with big data takes time
Practicals
• Aim: get relevant data handling skills
• Doing things by hand:
• impossible?
• slow,
• error-prone,
• Automate!
• Basic programming
• in R
• no stats!
Why R?
😳😟
😴😡
😥
Practicals: contents
• Done:
• data accessing/subsetting
• New:
• search/replace
• regular expressions: text search on steroids
• New:
• functions: reusable pieces of work
• loops: repeating the same thing many times
• Friday: (Introduction to Unix & High performance computing)
• create a variable that contains the number 35
• create a variable that contains the string “I love tofu”
• give me a vector containing the sequence of numbers
from 5 to 11
• access the second number
• replace the second number with 42
• add 5 to the second number
• now add 5 to all numbers
• now add an extra number: 1999
• can you sum all the numbers?
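One possible way to work through the warm-up list above, as a minimal sketch. The variable names (`my_number`, `my_string`, `my_vector`, `second`, `total`) are illustrative choices and do not come from the slides.

```r
# Illustrative variable names; any valid R names would do.
my_number <- 35
my_string <- "I love tofu"

my_vector <- 5:11                  # the sequence 5, 6, ..., 11
second <- my_vector[2]             # access the second number
my_vector[2] <- 42                 # replace the second number with 42
my_vector[2] <- my_vector[2] + 5   # add 5 to the second number only
my_vector <- my_vector + 5         # add 5 to every number at once
my_vector <- c(my_vector, 1999)    # append an extra number at the end
total <- sum(my_vector)            # sum all the numbers
```

Because arithmetic in R is vectorised, `my_vector + 5` needs no loop.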
• creating a vector
• give me a vector containing numbers from 5 to 11 (3 variants)
> my_vector <- c(5, 6, 7, 8, 9, 10, 11)
> my_vector <- 5:11
> my_vector <- seq(from=5, to=11, by=1)
> my_vector
[1] 5 6 7 8 9 10 11
> length(my_vector)
[1] 7
> (10 > 30)
[1] FALSE
> my_vector > 8
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
> my_vector[my_vector > 8]
[1] 9 10 11
> other_vector <- my_vector[my_vector > 8]
> other_vector
[1] 9 10 11
> other_vector + 3
• accessing a subset of a vector
> big_vector <- 150:100
> big_vector
 [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
[20] 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113
[39] 112 111 110 109 108 107 106 105 104 103 102 101 100
> big_vector[5]
[1] 146
> mysubset <- big_vector[my_vector]
> mysubset
[1] 146 145 144 143 142 141 140
> big_vector > 130
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE
> subset(x = big_vector, subset = big_vector > 140)
[1] 150 149 148 147 146 145 144 143 142 141
> big_vector[big_vector >= 140]
[1] 150 149 148 147 146 145 144 143 142 141 140
> my_vector
[1] 5 6 7 8 9 10 11
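The session above subsets by position and by logical test. As a small complementary sketch, the same logical vectors can also be counted, converted to positions, or combined. The names `positions`, `n_large` and `middle` are illustrative and not from the slides.

```r
big_vector <- 150:100

# which() turns a logical test into the positions where it is TRUE.
positions <- which(big_vector > 140)

# TRUE counts as 1 in arithmetic, so sum() counts how many values pass.
n_large <- sum(big_vector > 130)

# Two tests combined with & (logical AND) select a range of values.
middle <- big_vector[big_vector > 120 & big_vector <= 125]
```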
Regular expressions (regex):
Text search on steroids.
who dat?
Regular expressions (regex):
Text search on steroids.
Regular expression                         Finds
David                                      David
Dav(e|(id))                                David, Dave
Dav(e|(id)|(ide)|o)                        David, Dave, Davide, Davo
At{1,2}enborough                           Attenborough, Atenborough
Atte[nm]borough                            Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro((ugh)|w){0,1}     Atimbro, Attenbrough, Atenborow
Easy counting, replacing all with “Sir David Attenborough”
Yes: "HATSOMIKTIP"
Yes: "HAVSONYYIKTIP"
Not: "HAVSQMIKTIP"
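A minimal sketch of the "easy counting, replacing" idea in R. The sample sentences are made up for illustration, and the pattern merges two rows of the table into one (an assumption, not a pattern shown on the slide).

```r
# Hypothetical sample text with several spellings; not from the slides.
sentences <- c(
  "Attenborough narrated the film.",
  "I met Atenborough and Attemborough.",
  "No naturalists here."
)

# One or two "t", then "e", then "n" or "m", then "borough".
pattern <- "At{1,2}e[nm]borough"

# Which elements contain at least one match?
has_match <- grepl(pattern, sentences)

# Count every match, including several within the same element.
hits <- regmatches(sentences, gregexpr(pattern, sentences))
n_matches <- sum(lengths(hits))

# Replace every matched spelling with the full title.
cleaned <- gsub(pattern, "Sir David Attenborough", sentences)
```

Matching is case-sensitive by default, so a lowercase "attenborough" would be missed unless `ignore.case = TRUE` is passed.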
Regex special symbols
Regular expression    Finds                            Example
[aeiou]               any single vowel                 “e”
[aeiou]*              between 0 and infinity vowels    “eeooouuu”
[aeoiu]{1,3}          between 1 and 3 vowels           “oui”
a|i                   one of the 2 characters          “a” or “i”
((win)|(fail))        one of the two words in ()       “fail”
More Regex Special symbols
• Google “Regular expression cheat sheet”
• ?regexp
Regular expression    Synonymous with
[:digit:]             [0-9]
[A-z]                 roughly [A-Za-z] (in ASCII, [A-z] also matches [ \ ] ^ _ `)
\s                    whitespace
.                     any single character
.+                    one to many of anything
b*                    between 0 and infinity letter ‘b’
[^abc]                any character other than a, b or c
\(                    (
[:punct:]             any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
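A short sketch of how these symbols are typed in R, where two details tend to trip people up: classes such as `[:digit:]` go inside another pair of brackets, and every regex backslash is doubled inside an R string. The test vector `x` is made up for illustration.

```r
# Hypothetical test strings; not from the slides.
x <- c("ant 42", "bee", "wasp(s)", "a.b")

# POSIX classes such as [:digit:] go inside a bracket expression.
has_digit <- grepl("[[:digit:]]", x)

# In an R string the backslash is itself escaped: "\\s" is the regex \s.
has_space <- grepl("\\s", x)

# "\\(" matches a literal opening parenthesis.
has_paren <- grepl("\\(", x)

# An unescaped dot matches any character; "\\." matches only a real dot.
has_dot <- grepl("\\.", x)

# [^abc] matches any character other than a, b or c; here they are deleted.
stripped <- gsub("[^abc]", "", x)
```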
You want to scan a protein sequence database for a
particular binding site. Type a single regular expression that
will match the first two of the following peptide sequences,
but NOT the last one:
"HATSOMIKTIP"
"HAVSONYYIKTIP"
"HAVSQMIKTIP"
(rubular)
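One possible answer, sketched in R. Many other patterns would also work; `binding_site` and `matched` are illustrative names.

```r
peptides <- c("HATSOMIKTIP", "HAVSONYYIKTIP", "HAVSQMIKTIP")

# T or V in third position, then "SO", then either "M" or "NYY".
binding_site <- "HA[TV]SO(M|NYY)IKTIP"

# TRUE for each peptide that contains the pattern.
matched <- grepl(binding_site, peptides)
```

The third peptide fails because it has "SQ" where the pattern requires "SO".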
Variants of a microsatellite sequence are responsible for
differential expression of vasopressin receptor, and in turn for
differences in social behaviour in voles & others. Create a regular
expression that finds AGAGAGAGAGAGAGAG dinucleotide
microsatellite repeats with lengths of 5 to 500
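A possible sketch, assuming "lengths of 5 to 500" means 5 to 500 consecutive copies of the AG unit. The example sequences are made up. `perl = TRUE` is passed as a precaution, since R's default regex engine may reject repeat counts this large.

```r
# (AG) repeated between 5 and 500 times in a row.
microsatellite <- "(AG){5,500}"

# Hypothetical sequences: one with 8 AG repeats, one with only 3.
sequences <- c(
  paste0("TT", strrep("AG", 8), "CC"),
  paste0("TT", strrep("AG", 3), "CC")
)

found <- grepl(microsatellite, sequences, perl = TRUE)

# Extract the matched stretch and convert its length to a repeat count.
stretches <- regmatches(sequences, regexpr(microsatellite, sequences, perl = TRUE))
repeat_count <- nchar(stretches) / 2
```

The pattern is not anchored, so a run longer than 500 repeats would still be reported as a match on its first 500 units.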
Again
Make a regular expression
• matching “LMTSOMIKTIP” and “LMVSONYYIKTIP” but not
“LMVSQMIKTIP”
• matching all variants of “ok” (e.g., “O.K.”,“Okay”…)
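One way these two could be approached, as a sketch. The "ok" pattern covers only a handful of spellings, and the test strings other than those quoted above are made up.

```r
# Same idea as the earlier peptide exercise, with an "LM" prefix.
lm_pattern <- "LM[TV]SO(M|NYY)IKTIP"
lm_hits <- grepl(lm_pattern, c("LMTSOMIKTIP", "LMVSONYYIKTIP", "LMVSQMIKTIP"))

# O or o, optional dot, K or k, optional dot, optional "ay".
# ^ and $ anchor the pattern so the whole string must be a variant of "ok".
ok_pattern <- "^[Oo]\\.?[Kk]\\.?(ay)?$"
ok_hits <- grepl(ok_pattern, c("ok", "OK", "O.K.", "Okay", "oak"))
```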
Ok… so how do we use this?
• ?grep
• ?gsub
Which species names include ‘y’?
Create a vector with only species names, but replace all ‘y’
with ‘Y’.
ants <- read.table("https://goo.gl/3Ek1dL")
colnames(ants) <- c("genus", "species")
Remove all vowels
Replace all vowels with ‘o’
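A sketch of how `grep()` and `gsub()` could answer these questions. To stay runnable without the download, it uses a small stand-in data frame: the first four rows follow the `head(ants)` output shown later in the deck, and the last two rows are added here purely for illustration (they are an assumption, not confirmed to be in the real file).

```r
# Stand-in for the downloaded table; the last two rows are illustrative additions.
ants <- data.frame(
  genus   = c("Anergates", "Camponotus", "Crematogaster", "Formica", "Formica", "Lasius"),
  species = c("atratulus", "sp.", "scutellaris", "aquilonia", "polyctena", "platythorax"),
  stringsAsFactors = FALSE
)

# Which species names include "y"? value = TRUE returns the names, not positions.
with_y <- grep(pattern = "y", x = ants$species, value = TRUE)

# A vector of species names with every "y" replaced by "Y".
shouting <- gsub(pattern = "y", replacement = "Y", x = ants$species)

# Remove all vowels, then replace all vowels with "o".
no_vowels <- gsub("[aeiou]", "", ants$species)
all_o     <- gsub("[aeiou]", "o", ants$species)
```

`grep()` finds matches while `gsub()` rewrites them; `sub()` would replace only the first match in each element.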
Functions
• R has many. e.g.: plot(), t.test()
• Making your own:
# growth_rates (growth rate per species) is assumed to be defined beforehand
tree_age_estimate <- function(diameter, species) {
  growth_rate <- growth_rates[ species ]
  age_estimate <- diameter / growth_rate
  return(age_estimate)
}
> tree_age_estimate(25, "White Oak")
[1] 66
> tree_age_estimate(60, "Carya ovata")
[1] 190
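The function above relies on a `growth_rates` object that the slide does not show. A self-contained sketch follows, assuming `growth_rates` is a named numeric vector. The two rates are made-up placeholders, not real forestry data, so the ages differ from those on the slide.

```r
# Hypothetical growth rates (diameter units per year); values are placeholders.
growth_rates <- c("White Oak" = 0.4, "Carya ovata" = 0.3)

tree_age_estimate <- function(diameter, species) {
  # [[ ]] looks up one species by name and drops the name from the result.
  growth_rate <- growth_rates[[species]]
  age_estimate <- diameter / growth_rate
  return(age_estimate)
}

oak_age <- tree_age_estimate(20, "White Oak")   # 20 / 0.4, i.e. about 50
```

Looking the rate up inside the function keeps the calculation in one reusable place: changing a rate changes every later estimate.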
Make a function
• That converts fahrenheit to celsius
(subtract 32 then divide the result by 1.8)
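One possible solution, as a minimal sketch; the function and argument names are illustrative.

```r
fahrenheit_to_celsius <- function(fahrenheit) {
  # Subtract 32, then divide the result by 1.8.
  celsius <- (fahrenheit - 32) / 1.8
  return(celsius)
}

# The function is vectorised, so several temperatures convert in one call.
temps <- fahrenheit_to_celsius(c(32, 212))
```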
Loops
“for” loop
> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue',
+   'steel blue', 'royal blue', 'slate blue', 'light blue',
+   'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')
> possible_colours
[1] "blue" "cyan" "sky-blue" "navy blue"
[5] "steel blue" "royal blue" "slate blue" "light blue"
[9] "dark blue" "prussian blue" "indigo" "baby blue"
[13] "electric blue"
> for (colour in possible_colours) {
+   print(paste("The sky is so, oh so", colour))
+ }
[1] "The sky is so, oh so blue"
[1] "The sky is so, oh so cyan"
[1] "The sky is so, oh so sky-blue"
[1] "The sky is so, oh so navy blue"
[1] "The sky is so, oh so steel blue"
[1] "The sky is so, oh so royal blue"
[1] "The sky is so, oh so slate blue"
[1] "The sky is so, oh so light blue"
[1] "The sky is so, oh so dark blue"
[1] "The sky is so, oh so prussian blue"
[1] "The sky is so, oh so indigo"
[1] "The sky is so, oh so baby blue"
[1] "The sky is so, oh so electric blue"
What does this loop do?
for (index in 10:1) {
  print(paste(index, "mins before lunch"))
}
Again
• What does the following code do? (Decompose it with pen and paper.)
for (letter in LETTERS) {
  begins_with <- paste("^", letter, sep="")
  matches <- grep(pattern = begins_with,
                  x = ants$genus)
  print(paste(length(matches), "begin with", letter))
}
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> ants <- read.table("https://goo.gl/3Ek1dL")
> colnames(ants) <- c("genus", "species")
> head(ants)
          genus     species
1     Anergates   atratulus
2    Camponotus         sp.
3 Crematogaster scutellaris
4       Formica   aquilonia
5       Formica cunicularia
6       Formica     exsecta
What does this loop do?
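To check a pen-and-paper answer, here is a self-contained variant of the loop. It assumes a stand-in data frame built from the six rows that `head(ants)` shows above (the real file has more rows), and it stores the counts in a named vector instead of printing them. `counts` and `nonzero` are illustrative names.

```r
# Stand-in for the downloaded table: only the six rows shown by head(ants).
ants <- data.frame(
  genus   = c("Anergates", "Camponotus", "Crematogaster", "Formica", "Formica", "Formica"),
  species = c("atratulus", "sp.", "scutellaris", "aquilonia", "cunicularia", "exsecta"),
  stringsAsFactors = FALSE
)

counts <- integer(0)
for (letter in LETTERS) {
  # "^" anchors the pattern to the start of the string, e.g. "^A".
  begins_with <- paste("^", letter, sep = "")
  matches <- grep(pattern = begins_with, x = ants$genus)
  # Store how many genus names begin with this letter, labelled by the letter.
  counts[letter] <- length(matches)
}

# Keep only the letters that at least one genus name begins with.
nonzero <- counts[counts > 0]
```

With these six rows, one genus name begins with A, two with C and three with F; every other letter gets a count of zero.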
Jasmin Zohren
Bruno Vieira
Rodrigo Pracana
James Wright
Programming in R
?
If/else
Logical Operators
going further
