Intro To Statistics Using R - Session 2
2023-04-12
R basics
R is case sensitive.
Let’s create an object and see what happens if we use the wrong case to call it
hab<-"forest"
hab
## [1] "forest"
#Hab
hab2<-"prairie"
rm(hab2) #deletes the newly created hab2 so you can see that adding spaces changes nothing
hab2 <- "prairie"
First, you can see that adding spaces or not changes nothing. Second, notice the # sign: as explained, you can add any comments you want to your R
code; as long as you put a # in front, the line will not run.
🍀If for some reason R does not respond, or you made a mistake, you can terminate whatever command is currently running by pressing the
Esc key.
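R also works as a simple calculator. The chunk that produced the output below was lost in conversion; it was presumably a basic operation such as (my guess):
2 + 3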
## [1] 5
The [1] in front of the result means that the observation at the beginning of the line is the first observation. Not very useful here for a
simple calculation, but when you get a series of calculations with an output spanning 10 lines, it will come in handy!
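The chunk behind the next six outputs was also lost in conversion; a plausible reconstruction (the inputs 5 and 8 are inferred from the results):
pi #the constant pi
pi*5^2 #area of a circle of radius 5
log(8) #natural logarithm
log10(8) #base-10 logarithm
exp(8) #exponential
sqrt(8) #square root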
## [1] 3.141593
## [1] 78.53982
## [1] 2.079442
## [1] 0.90309
## [1] 2980.958
## [1] 2.828427
As we said during the last session, R has convenient functions for pretty much everything basic. I encourage you to (every now and then) use R when
you have easy calculations to perform, instead of Excel or your computer's built-in calculator, for example. Using R regularly is the only way to
get used to it and become proficient.
As you just saw by running these lines, the results only got displayed in the console. That’s because we did not create any object. So if you want to
keep an output for later use, don’t forget to create objects.
Create objects
To create an object you use the operator <-, sometimes referred to as the "gets operator". So, the first line below reads "r gets log(25)". You
can also use =
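The first line of this chunk was lost in conversion; from the sentence above, it was presumably:
r <- log(25)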
diam = r*2
As you can see the objects are stored in the Environment pane. You can create objects with multiple values (we’ll see that in a minute).
🍀Tips about naming objects: not as easy as it seems. You can call an object pretty much anything you want, but there are a few rules. 1. Keep the
names as short as possible, while keeping them informative (easier said than done). 2. NO special characters, that's just opening the door for trouble.
3. If you name an object with multiple words, use . or _ to separate them, or capitalize each word (ex.: my_data, my.model, MyOutput). 4. A few
words are not allowed because they are reserved for specific cases, like TRUE or NA. You can try, but R won't let you (I actually encourage you to
try, to see what happens). 5. Don't give an object the name of a function: R might or might not let you, but if it does let you, you will definitely run
into problems (ex.: instead of data, name it dat or df).
Let's see how to create objects with multiple values for different data types.
Now, don't forget that when writing characters you need to add quotation marks (""). Let's see what happens if you forget.
#sky<- star
You get an error: R thinks you are calling an object that doesn't exist.
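The corrected chunk was lost in conversion; given the congratulations that follow, it presumably wrapped the quoted value(s) in c(), something like (everything beyond "star" is my guess):
sky <- c("star", "moon")
sky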
Here we go, it works! And congrats, you have been using an R function: c() is a function, and short for concatenate. Functions in R are ALWAYS
followed by round brackets, and everything you put into the function is separated by commas.
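The chunks defining the vectors used below were also lost; based on the outputs, they looked something like this (only the 2nd entry of chr is known for certain):
int <- c(1, 2, 3) #mean = 2 and sd = 1, matching the outputs below
chr <- c("Hi!", "Howdy!", "Hello!") #length 3; entries 1 and 3 are guesses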
mean(int)
## [1] 2
sd(int)
## [1] 1
length(chr)
## [1] 3
Just to name a few. For most common calculations, R has a pre-written function. Don't hesitate to re-read last session's code to remember
the different functions we used.
When dealing with a vector of length > 1, you can extract specific values from your vector. For example, we want the 2nd entry of vector chr
chr[2]
## [1] "Howdy!"
chr[c(1,2)]
Let's create a vector containing a sequence of numbers. You have (like for pretty much everything else) quite a few ways to do it. Below are two
common ways to create a sequence of integers within a certain range.
my.vec<-5:20
vec<-seq(from = 5, to = 20, by = 1)
The first way is shorter but only works when you want your sequence to increase in steps of 1. With the second you can customize further (see
below).
Let’s go back to my.vec: we can ask R to tell us which entries are bigger than 10
my.vec[my.vec > 10] #in my.vec which values of my.vec are > 10
## [1] 11 12 13 14 15 16 17 18 19 20
my.vec > 10
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE
Now, let’s see the most common logical operators: |, &, ==, !=, >=, <=
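The commands behind the next three outputs were lost in conversion; plausible reconstructions, matched to the outputs below (my guesses):
my.vec[my.vec > 5 & my.vec < 10] #& : both conditions must be TRUE
my.vec[my.vec <= 10] #<= : smaller than or equal to
my.vec[my.vec == 5 | my.vec >= 9] #| : either condition can be TRUE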
## [1] 6 7 8 9
## [1] 5 6 7 8 9 10
## [1] 5 9 10 11 12 13 14 15 16 17 18 19 20
my.vec[my.vec != 20 & my.vec != 10] #adding ! in front of = means you want to exclude it
## [1] 5 6 7 8 9 11 12 13 14 15 16 17 18 19
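The next output shows the vector with its second value replaced by 800; the lost command was presumably an assignment into the vector, something like:
my.vec[2] <- 800 #overwrite the 2nd entry (the value 6)
my.vec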
## [1] 5 800 7 8 9 10 11 12 13 14 15 16 17 18 19 20
my.vec2<- my.vec*2
log.vec<- log(my.vec)
You can also do these calculations on data frame columns. Take the code from Session 1: we did just that when asking for the mean of column
wing_span from df with mean(df$wing_span).
🍀Tip: If you want to save your work, remember these 2 functions: save and save.image
#save(nameOfObject, file = "name_of_file.RData") #to save an object. Very useful when your object is a model that has been running for 10 days!
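The tip also mentions save.image(), which saves your entire workspace at once. A commented sketch mirroring the call above (the file name is just a placeholder):
#save.image(file = "name_of_workspace.RData") #to save every object currently in your environment in one go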
To load an RData object, there is the… wait for it… load() function.
#load(file = "name_of_file_Date.RData")
You've noticed I put a # in front of these 3 commands. That's because, when I print this .Rmd file as an HTML file, R runs everything, and if one
line returns an error, the printing aborts. If you want to try, replace the generic object name I inserted with a real one, and don't forget to delete the
#.
🍀Tips for naming a file on your computer (this works for any file, from Word docs and ppt presentations to R files or databases): 1. Never
overwrite former versions unless you have a very good reason to. You may want to trace back the changes you've been making, especially when
collaborating on the same file with other people. 2. Whether you already have multiple versions of the same file or not, always add the date; that will
help you stay organized. And please don't make the rookie mistake of naming a file V2.2 (they're not software) or V3, or Final, or FinalVersion;
trust me, I've seen many a computer with 10 FinalVersions of the same file… that's never a good idea. If you collaborate internationally,
remember that we all have different conventions for writing dates: 05/11/2023 will be either November 5th, 2023, or May 11th, 2023, depending on
where you're from. I strongly recommend using YYYYMMDD; it's a widely used format for international collabs because it is unambiguous. 3. Keep
the name as short as you can, while still providing all the key info. Your future You will thank Past You for it, and collaborators will appreciate it. For example,
eagle_data.csv is a terrible name, because we don't have any info besides the fact that it's about eagles.
Canada_baldeagle_morpho_data_20102022_20230412.csv is indeed longer, but we have the info we need now: the db contains the morphometric
data of the Canadian Bald Eagle population from 2010 to 2022, and was last updated on April 12, 2023. The same goes for an R workspace:
"R_seminar_series_session2_20231204.RData" is much better than "R_code_FinalVersion.RData".
If you're not sure anymore of the exact name of a function, you can use the help.search function
help.search("read csv")
R now proposes a few different options that correspond to your key words.
Now, good practice wants you to systematically report the versions of R and of the packages you used to perform an analysis. That's also key info to
provide if you're asking for help online (just like for everything else computer-related): some bugs or specific behaviors are linked to your software version.
sessionInfo() tells you everything about your session, including the R version, the platform and your OS, the current timezone and language… and the
loaded packages, attached or not. Check it out:
sessionInfo()
packageVersion("dplyr")
## [1] '1.1.0'
citation("dplyr")
##
## To cite package 'dplyr' in publications use:
##
## Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
## Grammar of Data Manipulation_. R package version 1.1.0,
## <https://CRAN.R-project.org/package=dplyr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {dplyr: A Grammar of Data Manipulation},
## author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
## year = {2023},
## note = {R package version 1.1.0},
## url = {https://CRAN.R-project.org/package=dplyr},
## }
As for most commands in R, there are many ways to define the working directory (wd). Let's see a couple of them.
#wd<- setwd("C:/Users/crodrigues/Documents/R_seminar_session2")
Why do I dislike it? Because it's hardly a reproducible line of code: since every person on earth organizes their computer files differently, the
path will change for each of us!
Even worse: say I finally decide to properly organize the files on my computer while I have many R projects going on. I delete folders,
create new ones, move files around, etc.… I would have to rewrite every setwd command in my codes… Huge waste of time!
Also, the longer the path, the more likely you are to make a mistake when writing it and get errors such as Error in setwd("xxx") :
cannot change working directory. You'll then need to pinpoint where you wrote something wrong. Again, a waste of time…
I could go on about why I hate setting the wd the classic way. Coding is for lazy people, because they always find the easiest way to do it 😜
My (easier) solution here is:
#setwd(choose.dir())
I put a # because, unfortunately, this command often creates issues when knitting an Rmd file into HTML, but in a normal R session it opens a pop-up window in which
you can directly choose which folder you want to use as your working directory (note that choose.dir() only exists on Windows; on macOS/Linux you would need an
alternative such as rstudioapi::selectDirectory()). Plus, if you send your code to friends, they don't need to change that line:
they can use it directly with their own file organization.
You can either set it directly as above, or store the path in an object we will call wd, which then appears under Values in the Environment pane. If you need to feed the
path of the wd to another function at some point, you can just use the name of the object, wd.
#setwd(choose.dir()); wd <- getwd() #setwd() returns the *previous* directory, so we grab the new path with getwd()
Now we have our working directory, we want to load our data set.
df<- read.csv("RF_mercury_long_20230406.csv")
Just as you chose the wd from a pop-up window, you can do the same for whatever file you want to import.
#df<- read.delim(file.choose())
But unlike for the wd, I do not recommend this selection method. If you have multiple versions of your files (since you should never overwrite a
previous version), it's better to write down the name of the file you last worked with. Say you're publishing your paper and the analyses were done
a year ago (or more!), and a reviewer asks you to re-do or check something: trust me, you will thank Past You for having written down the exact file
name you used last. That will save you from possibly not finding the same results you reported in your "Results" section. Of course, that
assumes you are not renaming your databases on a regular basis…
Data exploration is a crucial part of data analysis. In fact, most of the time, you will spend way more time exploring your data than actually
modelling them: running a linear model takes 1 min, but choosing it (and then validating it, a step we will cover in a future session) will take much,
much longer, and will guarantee that you did a proper job and can trust your results. If you don't provide detailed information on these steps,
reviewers will reject your paper (or at least they should, if they know what they're doing; I totally would), because there would be no way to know
whether we can trust your claims. Plus, it ensures reproducibility: if I take your data and follow the steps you describe in your "Methods" section, I should
1. be easily able to do so, just by reading your Methods, and 2. find the same results. Reproducibility ensures that, as scientists, we do our job right.
colSums(is.na(df))
which(colSums(is.na(df))>0)
## THg
## 7
names(which(colSums(is.na(df))>0))
## [1] "THg"
The first line returns the number of NAs for each column (this is the line I always use first, because it is the most
informative). The function is.na() basically transforms each cell of your data frame into a logical value: is it NA? TRUE or FALSE. Then the
colSums() function counts the number of TRUEs per column.
The second line names and gives you the position of the columns where the count of NAs is > 0, and the third line only returns the column name.
Now that we know that we only have missing values in THg, we want to check that there is no pattern. So, we will switch to package dplyr (which you
will get to know quite well in future sessions) to summarize the NAs by group.
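The library() call itself did not survive conversion; as the explanation further down confirms, it was simply:
library(dplyr)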
##
## Attaching package: 'dplyr'
df<- as.data.frame(unclass(df),
                   stringsAsFactors = TRUE) #this line converts all characters into factors at once
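The two counting chunks were also lost in conversion; given the outputs below, they presumably mirrored the by-age version shown further down:
df %>% # Count NA by sex
  group_by(sex) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))
df %>% # Count NA by age category
  group_by(age_cat) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))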
## # A tibble: 2 × 3
## sex count_na n
## <fct> <int> <int>
## 1 F 26 182
## 2 M 29 203
## # A tibble: 2 × 3
## age_cat count_na n
## <fct> <int> <int>
## 1 adult 27 175
## 2 juvenile 28 210
#To check by age, we can plot it because it will be easier to visualize for most people than a table with numbers for so many categories
#First, we create a table for age like above
dCount.age2 <- df %>% # Count NA by age
  group_by(age) %>%
  summarize(count_na = sum(is.na(THg)), n = length(THg))
#Then we create a new variable: the proportion of NAs (NAs per age / n)
dCount.age2$prop_na<- dCount.age2$count_na/dCount.age2$n
#Finally we load ggpubr (a wrapper for ggplot2, easier to use) and make a scatterplot with the correlation R value and the p value of a Pearson correlation
library(ggpubr)
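The plotting call and its figure were lost in conversion; a plausible sketch with ggpubr's ggscatter() (the exact aesthetics are my assumptions):
ggscatter(dCount.age2, x = "age", y = "prop_na",
          add = "reg.line", #add a regression line
          cor.coef = TRUE, cor.method = "pearson") #display R and the p value of a Pearson correlation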
So, missing values seem to be pretty random. The R we obtained for age is quite high, but that’s because we have very few old animals, so we can
safely consider that the apparent pattern is just due to sampling.
We first loaded the dplyr package using the library() function, and the second line is a command to convert all characters from your data set into
factors, which are (in general) easier to deal with in R. The usual command to convert from one data type to another follows the pattern
as.NewDataType(as.OldDataType()), e.g. as.numeric(as.character(x)) to safely turn a factor x into numbers.
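The chunk building thg.SexAge was lost in conversion; given the printed tibble and the filter line discussed below, it presumably looked like:
thg.SexAge <- df %>%
  filter(!is.na(THg)) %>% #drop the missing THg values first
  group_by(sex, age_cat) %>%
  summarise(min = min(THg), max = max(THg), mean = mean(THg),
            se = sd(THg)/sqrt(length(THg)), n = length(THg))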
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
thg.SexAge
## # A tibble: 4 × 7
## # Groups: sex [2]
## sex age_cat min max mean se n
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 F adult 20.9 2747. 682. 134. 39
## 2 F juvenile 10.6 2510. 341. 49.9 117
## 3 M adult 31.9 3495. 677. 70.4 109
## 4 M juvenile 15.4 2501. 364. 68.9 65
We have to filter out the NAs; otherwise they will cause problems (take out the filter line and run the code to see).
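The equivalent chunk per trapping location was lost too; presumably something like this (the object name thg.loc is my placeholder):
thg.loc <- df %>%
  filter(!is.na(THg)) %>%
  group_by(traploc) %>%
  summarise(min = min(THg), max = max(THg), mean = mean(THg),
            se = sd(THg)/sqrt(length(THg)), n = length(THg))
thg.loc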
## # A tibble: 10 × 6
## traploc min max mean se n
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Button Bay 15.4 2510. 420. 95.5 39
## 2 Goose Creek 42.8 2468. 658. 243. 13
## 3 Line 25 20.0 2595. 573. 152. 21
## 4 Mack Lake area 17.3 107. 57.2 15.5 5
## 5 North River 12.4 3424. 589. 91.6 63
## 6 Seal River 10.6 3495. 603. 141. 30
## 7 Southknife Lake 45.8 2641. 721. 173. 23
## 8 Town of Churchill 17.5 2501. 407. 67.4 78
## 9 Twin Lakes 16.7 2097. 352. 70.7 49
## 10 Wakeworth Lake 25.8 2090. 661. 266. 9
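The two xtabs lines discussed below were lost in conversion as well; hypothetical reconstructions matching the printed tables (the ftable() wrapper for the nested layout is my guess):
xtabs(~ traploc + sex, data = df) #count of sex per location
ftable(xtabs(~ traploc + age_cat + sex, data = df)) #count of sex * age_cat per location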
## sex
## traploc F M
## Button Bay 21 21
## Goose Creek 14 7
## Line 25 0 21
## Mack Lake area 7 0
## North River 42 21
## Seal River 7 35
## Southknife Lake 7 21
## Town of Churchill 28 63
## Twin Lakes 42 14
## Wakeworth Lake 14 0
## sex F M
## traploc age_cat
## Button Bay adult 0 7
## juvenile 21 14
## Goose Creek adult 7 7
## juvenile 7 0
## Line 25 adult 0 14
## juvenile 0 7
## Mack Lake area adult 0 0
## juvenile 7 0
## North River adult 14 21
## juvenile 28 0
## Seal River adult 0 21
## juvenile 7 14
## Southknife Lake adult 7 21
## juvenile 0 0
## Town of Churchill adult 7 28
## juvenile 21 35
## Twin Lakes adult 7 7
## juvenile 35 7
## Wakeworth Lake adult 7 0
## juvenile 7 0
We saw that some locations (namely Goose Creek and Wakeworth Lake) may be associated with higher THg in fox tissues, but we also
saw above that adult females seem to have more mercury in their tissues. We thus need to check whether the distribution of sex per location is balanced
(the 2 xtabs lines, which we also saw last session, show you the count of sex - line 1 - and sex*age_cat - line 2 - per location), and it is not balanced. If
we were going to analyze these data today, that info should get you thinking about possibly excluding one of the variables sex or traploc, as they may
be strongly associated, and should definitely make you test for that association specifically. We'll talk about correlation between explanatory
variables further in future sessions.
Next, we can produce a table summarizing THg per tissue per sex and age. We will use a different way this time, one that does not involve dplyr.
dfc<-df[complete.cases(df),]
sum.tab <- aggregate(dfc$THg,
                     by = list(dfc$sex, dfc$age_cat, dfc$organ),
                     FUN = function(x) c(min = min(x), max = max(x),
                                         median = median(x), mean = mean(x),
                                         se = sd(x)/sqrt(length(x)),
                                         n = length(x)))
sum.tab <- do.call(data.frame, sum.tab)
colnames(sum.tab) <- c("sex", "age", "organ", "min", "max", "median", "mean", "se", "n")
Like with dplyr, the NAs will cause problems if we keep them. We need to get rid of them, and we did that with the function complete.cases(), which only keeps
rows that are complete. Then, we used the function aggregate(). The arguments you give to aggregate() are the column you want to summarize by
group (here, THg), then the groups (as a list), and finally the functions you want (here, we asked for min, max, median, mean, standard error*,
and sample size). *Remember, the standard error se does not have a dedicated function, so you need to compute it by hand: sd (standard deviation) / sqrt
(square root) of n (sample size).
Then, we converted sum.tab into a real data frame and renamed the columns.
💪 We have our summary table. Now, we’ll see how to make it pretty.
library(rempsyc)
## Suggested APA citation: Thériault, R. (2022). rempsyc: Convenience functions for psychology
## (R package version 0.1.1) [Computer software]. https://rempsyc.remi-theriault.com
##
## Attaching package: 'flextable'
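The first table-making call and its rendered output were lost in conversion; it was presumably simply:
nice_table(sum.tab)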
Now we have a nice table, but we need to rearrange it: what we really want is multilevel headers (sex and age group above the statistics), and for
that the function expects each group as a separate set of columns.
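The chunk creating sum.tab2 and the four sex-by-age subsets (fa, fj, ma, mj) did not survive conversion. A hypothetical reconstruction consistent with the code below (the original comment suggests sum.tab2 may have carried an extra trailing column, dropped via the [,1:7] and [,2:7] indexing):
sum.tab2 <- sum.tab[, c("organ", "min", "max", "median", "mean", "se", "n")] #keep organ + the 6 statistics
fa <- sum.tab2[sum.tab$sex == "F" & sum.tab$age == "adult", ]
fj <- sum.tab2[sum.tab$sex == "F" & sum.tab$age == "juvenile", ]
ma <- sum.tab2[sum.tab$sex == "M" & sum.tab$age == "adult", ]
mj <- sum.tab2[sum.tab$sex == "M" & sum.tab$age == "juvenile", ]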
#And create a new df with the 4 dfs bound column-wise - we need to exclude the first column from 3 of the dfs and the last column from all of them
dat <- cbind(fa[,1:7], fj[,2:7], ma[,2:7], mj[,2:7])
#Now we rename the columns (except the first one) the proper way, so that the function understands what header it should take
names(dat)[-1] <- c(paste0("Female.adult.", names(sum.tab2[2:7])),
                    paste0("Female.juvenile.", names(sum.tab2[2:7])),
                    paste0("Male.adult.", names(sum.tab2[2:7])),
                    paste0("Male.juvenile.", names(sum.tab2[2:7])))
#Now we'll rename the organs the way we want them to appear in the table
# Renaming factor levels dplyr
dat <- dat %>%
  mutate(organ = recode(organ, "GH" = "Guard hair", "brain" = "Brain",
                        "claw" = "Claw", "KidCort" = "Renal cortex",
                        "KidMed" = "Renal medulla", "liver" = "Liver",
                        "muscle" = "Muscle"))
(Preview of dat, truncated at the page margin: one row per organ - Brain, Claw, Guard hair, Renal cortex, Renal medulla, Liver, Muscle - and one column per group x statistic: Female.adult.min … Female.adult.n, Female.juvenile.min …, and so on for each group; the complete values appear in the finished table below.)
#All seems in order, ready for the last step which consists of separating headers
nice_table(dat, separate.header = TRUE, italics = seq(dat))
              Female adult                                 Female juvenile                              Male adult                                    Male juvenile
              min      max      median   mean     se     n    min    max      median  mean     se     n    min    max      median   mean     se     n    min    max      median  mean    se     n
Brain         20.86 109.93 54.95 54.46 16.15 5 | 10.57 126.31 30.14 34.89 6.46 17 | 31.92 269.44 117.16 114.44 14.31 17 | 15.45 45.71 20.03 25.88 4.33 7
Claw          1,079.20 1,732.24 1,424.47 1,413.12 133.00 6 | 209.40 1,764.90 512.78 704.44 128.21 16 | 523.99 2,134.76 1,054.77 1,197.70 150.83 13 | 209.85 1,727.20 607.53 777.40 196.29 9
Guard hair    1,976.55 2,747.21 2,159.25 2,260.72 116.88 6 | 481.09 2,509.83 926.77 1,240.99 194.82 16 | 658.73 3,424.01 1,973.80 1,813.03 236.37 13 | 411.50 2,501.03 1,327.61 1,248.34 264.80 9
Renal cortex  178.62 455.45 363.94 340.49 63.62 4 | 107.40 495.49 245.96 262.25 29.90 15 | 326.75 3,494.81 809.66 1,018.81 208.39 15 | 166.53 406.48 268.49 272.45 25.68 9
Renal medulla 29.50 220.63 138.79 131.93 43.06 4 | 28.25 167.54 62.39 77.60 9.92 15 | 49.63 1,035.02 235.06 289.59 64.87 15 | 37.56 141.14 73.10 82.97 11.11 9
Liver         154.98 430.22 180.34 246.80 41.92 7 | 48.51 259.36 101.92 114.94 12.02 19 | 181.33 1,348.47 340.33 500.97 73.38 18 | 86.92 263.93 118.35 133.53 15.71 11
Muscle        35.98 184.30 61.68 92.58 21.90 7 | 17.33 101.55 42.57 46.22 5.18 19 | 48.86 679.73 175.75 225.81 40.89 18 | 24.77 121.54 43.03 53.19 8.22 11
💪
That's all for this example. Using flextable, you can really customize your tables any way you wish. Here is the link to the flextable user
guide: https://ardata-fr.github.io/flextable-book/
We will stop here for today, we will see much more of R during the next sessions, as we will slowly shift from
mostly theoretical seminars to mostly practical ones.