0% found this document useful (0 votes)

17 views

Intro To Statistic Using R - Session 2

Code associated with presentation 2 from the seminar series Intro to statistics using R by Chloé Warret Rodrigues, PhD, DVM, MSc

Uploaded by

Chloé Warret Rodrigues

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views

Intro To Statistic Using R - Session 2

Code associated with presentation 2 from the seminar series Intro to statistics using R by Chloé Warret Rodrigues, PhD, DVM, MSc

Uploaded by

Chloé Warret Rodrigues

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 1

Intro to statistics using R - session 2

2023-04-12

Introduction to statistics using R - Session 2

Introduction to R and RStudio - R intermission
During this short intro to R, we will see the main basic functions, and we will learn how to perform the steps (and associated basic functions) from
importing a data set, that I prepared as an example, to summarizing the descriptive statistics of our data set.

R basics
R is case sensitive.
Let’s create an object and see what happens if we use the wrong case to call it

hab<-"forest"
hab

## [1] "forest"

#Hab

That’s right, we get an error message.

R tolerates extra spaces

For clarity, I encourage you to use spaces.

hab2<-"prairie"
rm(hab2) #deletes the newly created hab2 so you can see that adding spaces changes nothing
hab2 <- "prairie"

First, you can see that adding spaces or not changes nothing. Second, notice the # sign: as explain you can add whatever comments in your r
code, as long as you add a # in front, it will not run.

🍀If for some reason R does not respond, or you made a mistake you can terminate whatever current command you are running by pressing the
esc key.

R is a calculator, with some useful functions!

3+2 #simple addition

## [1] 5

The [1] in front of the result means that the observation number at the beginning of the line is the first observation. Not very useful here for a
simple calculation, but when you get a series of calculations with an output taking 10 lines, that may come handy!

pi #pi is a built-in constant in R

## [1] 3.141593

pi5^2 #pir^2 i.e., area of a circle with r = 5

## [1] 78.53982

log(8) #log base e

## [1] 2.079442

log10(8) #log base 10

## [1] 0.90309

exp(8) #exponential or natural anti-log

## [1] 2980.958

sqrt(8) #square root

## [1] 2.828427

As we said during last session, R has convenient functions for pretty much everything basic. I encourage you to (every now and then) use R when
you have easy calculation to perform, instead of using excel or the computer integrated calculator for example. Using R regularly is the only way to
get used to it, and become proficient.

As you just saw by running these lines, the results only got displayed in the console. That’s because we did not create any object. So if you want to
keep an output for later use, don’t forget to create objects.

Create objects

r <- log(25) #we created an object r (radius)

area<-pi*r^2 #now we can create an object area reusing the object r directly

To create an object you use the operator <- sometimes referred to as “gets operator”. So, in the first line above you would read r gets log(25). You
can also use =

diam = r*2

As you can see the objects are stored in the Environment pane. You can create objects with multiple values (we’ll see that in a minute).

🍀Tips about naming objects: not as easy as it seems. You can an object pretty much anything you want, but there are a few rules. 1. keep the
names as short as possible, while keeping them informative (easier said than done), 2. NO special characters, that’s just opening doors for trouble,
3. if you name an object with multiple words, use . or _ to separate them, or capitalize each word (ex.: my_data, my.model, MyOutput), 4. a few
words are not allowed because they are reserved for specific cases like TRUE or NA. You can try but R won’t let you (I actually encourage you to
try to see what happens, 5. Don’t call an object with the name of a function, R might or might not let you but if it does let you, you will definitely run
into problems (ex.: instead of data name dat, or df).

Let’s see how to create objects with multiple values for different data type

int<- c(1, 2, 3) #see int is numeric and has 3 obs [1:3]

chr<- c("Hello world!", "Howdy!", "great day!") #and now we have a character variable, with 3 values [1:3]

Now don’t forget when writing characters you need to add ““. Let see what happens if you forget.

#sky<- star

You get an error, R thinks, you were calling an object that doesn’t exist.

sky<- c("star", "moon", "sun")

Here, we go, it works! And congrats, you have been using an R function, c() is a function and a short for concatenate. Functions in R are ALWAYS
followed by round brackets, and everything you put into the function is separated by commas.

R functions and vectors

Remember (Remember the fifth of November…): During last session we used many functions.

mean(int)

## [1] 2

sd(int)

## [1] 1

length(chr)

## [1] 3

Just to name a few. Most common calculation, R has a pre-written function for it. Don’t hesitate to re-read last session’s code again, to remember
the different functions we used.

When dealing with a vector of length > 1, you can extract specific values from your vector. For example, we want the 2nd entry of vector chr

chr[2]

## [1] "Howdy!"

# you can stored this value for later use

h<- chr[2]

Using c, you can extract multiple values

chr[c(1,2)]

## [1] "Hello world!" "Howdy!"

let’s create a vector containing a sequence of numbers. You have (like for pretty much everything else) quite a few ways to do it. Below are two
common ways to create a suite of integers withing a certain range.

my.vec<-5:20
vec<-seq(from = 5, to = 20, by = 1)

The first way is shorter but only works when we want our sequence to be every 1 number. With the second you could customize further (see
below).

vec2<-seq(5, 20, by = 0.5)

vec3<-seq(5, 20, by = 0.1)

Let’s go back to my.vec: we can ask R to tell us which entries are bigger than 10

my.vec[my.vec > 10] #in my.vec which values of my.vec are > 10

## [1] 11 12 13 14 15 16 17 18 19 20

or you can ask if specific values are bigger than 10

my.vec > 10

## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE

In the command above, R returns logical values TRUE and FALSE

Now, let’s see the most common logical operators: |, &, ==, !=, >=, <=

my.vec[my.vec < 10 & my.vec > 5] # & is AND

## [1] 6 7 8 9

my.vec[my.vec <= 10 & my.vec >= 5]

## [1] 5 6 7 8 9 10

my.vec[my.vec > 8 | my.vec ==5] # | is OR and == is equal to

## [1] 5 9 10 11 12 13 14 15 16 17 18 19 20

my.vec[my.vec != 20 & my.vec != 10] #adding ! in front of = means you want to exclude it

## [1] 5 6 7 8 9 11 12 13 14 15 16 17 18 19

You can also replace elements in your vector

my.vec[2]<- 800 #just 1

my.vec

## [1] 5 800 7 8 9 10 11 12 13 14 15 16 17 18 19 20

my.vec[c(6, 7)] <- 500# or many

my.vec

## [1] 5 800 7 8 9 500 500 12 13 14 15 16 17 18 19 20

my.vec[my.vec > 19] <- 1000 #conditionally

my.vec

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19

## [16] 1000

my.vec[my.vec < 1000 & my.vec > 100]<- 5 #double condition

my.vec

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19

## [16] 1000

You can also perform calculation on entire vectors.

my.vec2<- my.vec*2
log.vec<- log(my.vec)

You can also do these calculations on data frame columns. Take the code from session 1, we have done just that (like asking the mean of column
wing_span from df mean(df$wing_span).

Ok, that’s it for the basics. Just 2 last pieces of info:

🍀Tip: If you want to save your work, remember these 2 functions: save and save.image
#save(nameOfObject, file = "name_of_file.RData") #to save an object. Very useful when your object is a model that
has been running for 10 days!

#save.image(file = "name_of_file_Date.RData") #To save all your workspace at once

Try these functions out!

To load a RData object, there is the wait for it… load() function.

#load(file = "name_of_file_Date.RData")

You’ve notice I’ve put a # in front of these 3 commands. That’s because, when I’ll print this .Rmd file as an HTML file, R runs everything and if one
line returns an error, the printing aborts. If you want to try replave the generic object name I’ve inserted by a real one and don’t forget to delete the
#.

🍀Tips for naming a file on your computer (that works for any file, from word docs, and ppt presentations to R files or data bases): 1. never
overwrite former versions unless you have a very good reason to. You may want to trace back the changes you’ve been making, especially when
collaborating on the same file with other people. 2. Whether you already have multiple version of the same file or not, always add the date, that will
help you being organize, and please don’t make the rookie mistake of naming a file V2.2 (they’re not softwares) or V3, or Final, or FinalVersion,
trust me, I’ve seen many a computer with 10 FinalVersions of the same file… that’s never a good idea. If you collaborate with internationals,
remember that we all have different conventions to write dates: 05/11/2023 will either be November 5th, 2023, or May 11th, 2023 depending on
where you’re from. I strongly recommend to use YYYYMMDD, it’s a widely used format for international collabs because not ambiguous. 3. Keep
the name a short as you can, but still provide all key infos. Your future You will thank Past You for it, and collaborators may appreciate. For example
eagle_data.csv is a terrible name, because we don’t have any info beside the fact that it’s about eagle.
Canada_boldeagle_morpho_data_20102022_20230412.csv is indeed longer, but we have the info we need now: the db contains the morphometric
data of the Canadian Bald eagle population from 2010 to 2022, and was last updated on April 12 2023. The same goes for an R workspace:
“R_seminar_series_session2_20231204.RData” is much better than “R_code_FinalVersion.RData”.

Asking R for help or information

There are a few ways. Let’s take for example the read.csv function

?read.csv # the documentation now appears in the Output pane

## starting httpd help server ... done

If you’re not sure anymore of the exact name of the function you can use the help.search function

help.search("read csv")

R now proposes a few different options that correspond to your key words.

Now, good practice wants you to systematically report the version of R and packages you used to perform an analysis. That’s also key info to
provide if you’re asking help online (just like for everything else computer): some bugs or specific behaviors are linked to your software version.

SessionInfo(), tells you everything about your session including R version, the platform, and your OS, current timezone, language…m and the
loaded packages attached or not. Check it out:

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)

## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] cachem_1.0.7 rlang_1.1.0 cli_3.6.0 rstudioapi_0.14
## [9] jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.21 tools_4.2.2
## [13] xfun_0.37 yaml_2.3.7 fastmap_1.1.0 compiler_4.2.2
## [17] htmltools_0.5.4 knitr_1.42 sass_0.4.5

To get the version of a specific package you can use packageVersion()

packageVersion("dplyr")

## [1] '1.1.0'

And to cite your packages right, use citation()

citation("dplyr")

##
## To cite package 'dplyr' in publications use:
##
## Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
## Grammar of Data Manipulation_. R package version 1.1.0,
## <https://CRAN.R-project.org/package=dplyr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {dplyr: A Grammar of Data Manipulation},
## author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
## year = {2023},
## note = {R package version 1.1.0},
## url = {https://CRAN.R-project.org/package=dplyr},
## }

Let’s play with a real data set!

Presentation of the data set

The data set I have provided is a modified version of a real-life data set that was build to assess the relationship between total mercury
concentration (THg) in diverse organs of red foxes. The data set contains the following variables: individual ID, age in number of years as an
integer, age category (adult or juvenile), sex (F or M), organ (liver, muscle, renal cortex, renal medulla, brain, claw and guard hair), and Total
mercury concentration (THg) as a continuous variable . As it is often the case in real life data sets, we have some NA for THg in diverse tissues.

Steps to produce descriptive stats of your data set

1. Setting your working directory
First, we need to define our working directory (wd). Here, I recommend that you create a new file called “R_seminar_session2” in Documents to
run this code.

As for most command in R, there are many ways to define the wd. Let’s see a couple ways.

The classic way (which I very much dislike)

#wd<- setwd("C:/Users/crodrigues/Documents/R_seminar_session2")

Why do I dislike it? Because it’s hardly a reproducible line of code, since each person on earth has organized their computer files differently, the
path will change for each of us!

Because worse even: say I finally decide to properly organize my files on my computer, but I have many R projects going on. I deleted files,
created new ones, moved objects into new files etc… I would have to rewrite every setwd command in my codes… Huge waste of time!

Also, because the longer the path, the more likely you may make mistakes when writing the path and get errors such as Error in setwd(“xxx”) :
cannot change working directory. You’ll need to pinpoint where you wrote something wrong. Again, waste of time…

I could go on, on why I hate setting the wd the classic way. Coding is for lazy people, because they always find the easier way to do it 😜
My (easier) solution here is:

#setwd(choose.dir())

I put a # because, unfortunately, it often creates issues when knitting a Rmd file into HTML, but in normal R it will open a pop up window in which
you can directly choose which file you want to use as working directory. Plus you send your code to friends, they don’t need to change that line,
they can directly use it with their own file organization.

You can either set it directly as above, or store the path in an object we will call wd, that appears in value as you can see. If you need to feed the
path of the wd to another function at some point, you can just use the name of the object “wd”.

#wd<-setwd(choose.dir())

Now we have our working directory, we want to load our data set.

2. Loading the data set

You need your data as a text format, so ideally csv or txt. Here, we have a csv, so let’s load it

df<- read.csv("RF_mercury_long_20230406.csv")

The command is read.csv(), we call our dataframe df (How original! 😜).

If you work with Frenchies, they will likely have a weird format of csv with “;” as separator instead of “,” and a “,” for decimal point instead of “.”. In
that case, the command will be read.csv2()

If you have a .txt file the command is read.table().

The same way you chose the wd from a pop-up window, you could do the same for whatever files you want to import.

#df<- read.delim(file.choose(""))

But unlike for the wd, I do not recommend this selection method. If you have multiple versions of your files (since you should never overwrite a
previous version), it’s better to write down the name of the file you last worked with. Say you’re publishing your paper and analysis have been done
like a year ago (or more!), and a reviewer ask you to re-do or check something, trust me, you will thank Past you to have written down the exact file
name you’ve been using last. That will avoid you possibly not finding the same results you reported in your “Results” section. Of course that
suggest you are not renaming your data bases on a regular basis…

3. Summarizing the data for data exploration

The next step will be the very first steps of the data exploration process, which allows you to get to know your data set, like it’s your best friend.

Data exploration is a crucial part of data analysis. In fact, most of the times, you will spend way more time exploring your data than actually
modelling them. running a linear model, take 1min, but choosing (and then validating it, a step we will cover in a future session), will take much
much longer, and will guarantee that you did a proper job, and can trust your results. If you don’t provide detailed information on these steps,
reviewers will reject your paper (or at least they should, if they know what they’re doing. I totally would), because there would be no way to know if
we can trust your claims. Plus, it will ensure reproducibility: if I take your data and follow the steps you describe in your “Methods” section, I should
1. be easily able to do so, just by reading your Methods, 2. find the same results. Reproducibility ensures that, as scientists, we do our job right.

3.1 Check out the missing values

First, we will check how many NAs we have and where they are, to make sure they’re not following a pattern, and thus be a problem.

colSums(is.na(df))

## id age age_cat sex traploc organ THg

## 0 0 0 0 0 0 55

which(colSums(is.na(df))>0)

## THg
## 7

names(which(colSums(is.na(df))>0))

## [1] "THg"

The first line returns the number of NAs for each column (this is the line I personally always use in first intention, because it is the most
informative). The function is.na() basically transform each cell of your data frame into a logical value: is it NA? TRUE or FALSE. Then the
colSums() function counts the number of TRUE per column.

The second line names and gives you the position of columns where the count of NAs > 0. and the third line only provides the column name.

Now that we know that we only have missing values in THg, we want to check there is no pattern. So, we will switch to package dplyr (which you
will get to know quite well in future sessions), to summarize the NAs by group.

library(dplyr) #first we load the package

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##
## filter, lag

## The following objects are masked from 'package:base':

##
## intersect, setdiff, setequal, union

df<- as.data.frame(unclass(df),
stringsAsFactors = TRUE) #this line convert all characters into factors at once

dCount.sex <- df %>% # Count NA by sex

group_by(sex) %>%
summarize(count_na = sum(is.na(THg)), n = length(THg))
dCount.sex

## # A tibble: 2 × 3
## sex count_na n
## <fct> <int> <int>
## 1 F 26 182
## 2 M 29 203

dCount.age <- df %>% # Count NA by age category

group_by(age_cat) %>%
summarize(count_na = sum(is.na(THg)), n = length(THg))
dCount.age

## # A tibble: 2 × 3
## age_cat count_na n
## <fct> <int> <int>
## 1 adult 27 175
## 2 juvenile 28 210

#To check by age, we can plot it because it will be easier to visualize for most people than a table with numbers
for so many categories
#First, we create a table for age like above
dCount.age2 <- df %>% # Count NA by age
group_by(age) %>%
summarize(count_na = sum(is.na(THg)), n = length(THg))

#Then we create a new variable: that is the proportion of NAs (Nas per age/n)
dCount.age2$prop_na<- dCount.age2$count_na/dCount.age2$n

#Finally we load ggpubr (wrapper for ggplot2, easier to use) and make a scatterplot with the correlation R value
and a p value for a pearson correlation
library(ggpubr)

## Loading required package: ggplot2

ggscatter(dCount.age2, x = "age", y = "prop_na",

add = "reg.line", # Add regression line
conf.int = TRUE, # Add confidence interval
add.params = list(color = "blue",
fill = "lightgray")
)+
stat_cor(method = "pearson", label.x = 0.2, label.y = 1) # Add correlation coefficient

So, missing values seem to be pretty random. The R we obtained for age is quite high, but that’s because we have very few old animals, so we can
safely consider that the apparent pattern is just due to sampling.

We first loaded the dplyr package using the library() function, and the second line is a command to convert all characters from your data set into
factors, which are (in general) easier to deal with in R. The usual command to convert from one data type to another is
as.NewDataType(as.OldDataType())

df$id<- as.character(as.factor(df$id)) #back to character

df$id<- as.factor(as.character(df$id)) #and again convert id to factors

3.2 Summarizing THg per group

Here, we will use dplyr again, but this time to summarize THg concentration per group.

thg.SexAge <- df %>%

filter(!is.na(df$THg)) %>%
group_by(sex, age_cat) %>%
summarize(min = min(THg), max = max(THg), mean = mean(THg),
se = sd(THg)/sqrt(length(THg)), n = length(THg))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

thg.SexAge

## # A tibble: 4 × 7
## # Groups: sex [2]
## sex age_cat min max mean se n
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 F adult 20.9 2747. 682. 134. 39
## 2 F juvenile 10.6 2510. 341. 49.9 117
## 3 M adult 31.9 3495. 677. 70.4 109
## 4 M juvenile 15.4 2501. 364. 68.9 65

We have to filter out the NAs otherwise, they will cause problems (take out the filter line, and run the code to see).

Let’s do the same for trapping location.

thg.loc <- df %>%

filter(!is.na(df$THg)) %>%
group_by(traploc) %>%
summarize(min = min(THg), max = max(THg), mean = mean(THg),
se = sd(THg)/sqrt(length(THg)), n = length(THg))
thg.loc

## # A tibble: 10 × 6
## traploc min max mean se n
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Button Bay 15.4 2510. 420. 95.5 39
## 2 Goose Creek 42.8 2468. 658. 243. 13
## 3 Line 25 20.0 2595. 573. 152. 21
## 4 Mack Lake area 17.3 107. 57.2 15.5 5
## 5 North River 12.4 3424. 589. 91.6 63
## 6 Seal River 10.6 3495. 603. 141. 30
## 7 Southknife Lake 45.8 2641. 721. 173. 23
## 8 Town of Churchill 17.5 2501. 407. 67.4 78
## 9 Twin Lakes 16.7 2097. 352. 70.7 49
## 10 Wakeworth Lake 25.8 2090. 661. 266. 9

xtabs(~ traploc + sex, data = df)

## sex
## traploc F M
## Button Bay 21 21
## Goose Creek 14 7
## Line 25 0 21
## Mack Lake area 7 0
## North River 42 21
## Seal River 7 35
## Southknife Lake 7 21
## Town of Churchill 28 63
## Twin Lakes 42 14
## Wakeworth Lake 14 0

tab<-xtabs(~ traploc + age_cat + sex, data = df)

ftable(tab)

## sex F M
## traploc age_cat
## Button Bay adult 0 7
## juvenile 21 14
## Goose Creek adult 7 7
## juvenile 7 0
## Line 25 adult 0 14
## juvenile 0 7
## Mack Lake area adult 0 0
## juvenile 7 0
## North River adult 14 21
## juvenile 28 0
## Seal River adult 0 21
## juvenile 7 14
## Southknife Lake adult 7 21
## juvenile 0 0
## Town of Churchill adult 7 28
## juvenile 21 35
## Twin Lakes adult 7 7
## juvenile 35 7
## Wakeworth Lake adult 7 0
## juvenile 7 0

We saw, that some locations (namely GosseCreek and Wakeworth Lake) may be associated with higher THg mercury in fox tissues, but we also
saw above that adult females seem to have more mercury in their tissues. We, thus, need to check if the distribution of sex per location is balanced
(the 2 xtabs lines that we also saw last session show you the count of sex - line1 - and sex*age_cat - line 2- per location), and it is not balanced. If
we were going to analyze these data today, that info should get you to think about possibly excluding one of the variable sex or traploc as they may
be strongly associated, and should definitely make you test for that association specifically. We’ll talk about correlation between explanatory
variables further in future sessions.

Next, we can produce a table summarizing THg per tissue per sex and age. We will use a different way this time, one that does not involve dplyr.

dfc<-df[complete.cases(df),]
sum.tab <- aggregate(dfc$THg,
by = list(dfc$sex, dfc$age_cat, dfc$organ),
FUN = function(x) c(min = min(x), max = max(x),
median = median(x), mean = mean(x),
se = sd(x)/sqrt(length(x)),
n = length(x)))
sum.tab<-do.call(data.frame, sum.tab)
colnames(sum.tab)<- c("sex", "age", "organ", "min", "max", "median", "mean", "se", "n")

Like with dplyr, ignoring the NAs will cause problems. We need to get rid of them, and we did that with function complete.cases() which only keeps
rows that are complete. Then, we used the function aggregate(). The arguments you give to aggregate are the column you want to summarize by
group (here, the THg), then the groups (as a list), and finally the functions you want (here, we asked for min, max, median, mean, standard error*,
and sample size). *Remember Standard error se does not have a specific function, so you need to add it by hand sd (Standard deviation)/ sqrt
(square root) of n (sample size).

Then, we converted sum.tab into a real data frame, and renamed the columns.

Yay! We have our summary table! 🎉

Below, is another bit of code to make the table pretty. We won’t cover it in class, but I encourage you to try to run it.

💪 We have our summary table. Now, we’ll see how to make it pretty.
library(rempsyc)

## Warning: package 'rempsyc' was built under R version 4.2.3

## Suggested APA citation: Thériault, R. (2022). rempsyc: Convenience functions for psychology
## (R package version 0.1.1) [Computer software]. https://rempsyc.remi-theriault.com

library(flextable) #load the two necessary libraries

## Warning: package 'flextable' was built under R version 4.2.3

##
## Attaching package: 'flextable'

## The following objects are masked from 'package:ggpubr':

##
## border, font, rotate

nice_table(sum.tab) #nice function, eh?

sex age organ min max median mean se n

F adult brain 20.86 109.93 54.95 54.46 16.15 5

M adult brain 31.92 269.44 117.16 114.44 14.31 17

F juvenile brain 10.57 126.31 30.14 34.89 6.46 17

M juvenile brain 15.45 45.71 20.03 25.88 4.33 7

F adult claw 1,079.20 1,732.24 1,424.47 1,413.12 133.00 6

M adult claw 523.99 2,134.76 1,054.77 1,197.70 150.83 13

F juvenile claw 209.40 1,764.90 512.78 704.44 128.21 16

M juvenile claw 209.85 1,727.20 607.53 777.40 196.29 9

F adult GH 1,976.55 2,747.21 2,159.25 2,260.72 116.88 6

M adult GH 658.73 3,424.01 1,973.80 1,813.03 236.37 13

F juvenile GH 481.09 2,509.83 926.77 1,240.99 194.82 16

M juvenile GH 411.50 2,501.03 1,327.61 1,248.34 264.80 9

F adult KidCort 178.62 455.45 363.94 340.49 63.62 4

M adult KidCort 326.75 3,494.81 809.66 1,018.81 208.39 15

F juvenile KidCort 107.40 495.49 245.96 262.25 29.90 15

M juvenile KidCort 166.53 406.48 268.49 272.45 25.68 9

F adult KidMed 29.50 220.63 138.79 131.93 43.06 4

M adult KidMed 49.63 1,035.02 235.06 289.59 64.87 15

F juvenile KidMed 28.25 167.54 62.39 77.60 9.92 15

M juvenile KidMed 37.56 141.14 73.10 82.97 11.11 9

F adult liver 154.98 430.22 180.34 246.80 41.92 7

M adult liver 181.33 1,348.47 340.33 500.97 73.38 18

F juvenile liver 48.51 259.36 101.92 114.94 12.02 19

M juvenile liver 86.92 263.93 118.35 133.53 15.71 11

F adult muscle 35.98 184.30 61.68 92.58 21.90 7

M adult muscle 48.86 679.73 175.75 225.81 40.89 18

F juvenile muscle 17.33 101.55 42.57 46.22 5.18 19

M juvenile muscle 24.77 121.54 43.03 53.19 8.22 11

Now we have a nice table, but we need to arrange it because we technically have multilevel headers. and instead here we have each group as a
separate column.

#First, we'll combine sex and age in the same column

sum.tab$sex_age<- paste0(substr(sum.tab$sex, 1, 1), ".",
substr(sum.tab$age, 1, 1))

# Remove columns using select()

sum.tab2 <- sum.tab %>% select(-c(sex, age))

#Then, we'll separate the rows by group sex_age

fa<- subset(sum.tab2, sex_age == "F.a")
ma<- subset(sum.tab2, sex_age == "M.a")
fj<- subset(sum.tab2, sex_age == "F.j")
mj<- subset(sum.tab2, sex_age == "M.j")

#And create a new df with the 4 df bound columnwise - we need to exclude the first column from 3 of teh df and th
e last column from all of them
dat <- cbind(fa[,1:7], fj[,2:7], ma[,2:7], mj[,2:7])

#Now we rename the column (except the first one) the proper way so that the function understands what header it s
hould take
names(dat)[-1] <- c(paste0("Female.adult.", names(sum.tab2[2:7])),
paste0("Female.juvenile.", names(sum.tab2[2:7])),
paste0("Male.adult.", names(sum.tab2[2:7])),
paste0("Male.juvenile.", names(sum.tab2[2:7])))

#Now we'll rename the organs the way we want them to appear in the table
# Renaming factor levels dplyr
dat <- dat %>%
mutate(organ=recode(organ, "GH"="Guard hair", "brain" = "Brain",
"claw"="Claw", "KidCort" = "Renal cortex",
"KidMed" = "Renal medulla", "liver"= "Liver",
"muscle"= "Muscle"))

#Now we use nice_table()

nice_table(dat)

organ Female.adult.min Female.adult.max Female.adult.median Female.adult.mean Female.adult.se Female.adult.n Female.juvenile.min Female.juvenile.max Female.juvenile.median Female.juvenile.mean Female.juvenile.se Fema

Brain 20.86 109.93 54.95 54.46 16.15 5.00 10.57 126.31 30.14 34.89 6.46

Claw 1,079.20 1,732.24 1,424.47 1,413.12 133.00 6.00 209.40 1,764.90 512.78 704.44 128.21

Guard
1,976.55 2,747.21 2,159.25 2,260.72 116.88 6.00 481.09 2,509.83 926.77 1,240.99 194.82
hair

Renal
178.62 455.45 363.94 340.49 63.62 4.00 107.40 495.49 245.96 262.25 29.90
cortex

Renal
29.50 220.63 138.79 131.93 43.06 4.00 28.25 167.54 62.39 77.60 9.92
medulla

Liver 154.98 430.22 180.34 246.80 41.92 7.00 48.51 259.36 101.92 114.94 12.02

Muscle 35.98 184.30 61.68 92.58 21.90 7.00 17.33 101.55 42.57 46.22 5.18

#All seems in order, ready for the last step which consists of separating headers
nice_table(dat, separate.header = TRUE, italics = seq(dat))

Female Male

organ adult juvenile adult juvenile

min max median mean se n min max median mean se n min max median mean se n min max median mean se n

Brain 20.86 109.93 54.95 54.46 16.15 5 10.57 126.31 30.14 34.89 6.46 17 31.92 269.44 117.16 114.44 14.31 17 15.45 45.71 20.03 25.88 4.33 7

Claw 1,079.20 1,732.24 1,424.47 1,413.12 133.00 6 209.40 1,764.90 512.78 704.44 128.21 16 523.99 2,134.76 1,054.77 1,197.70 150.83 13 209.85 1,727.20 607.53 777.40 196.29 9

Guard
1,976.55 2,747.21 2,159.25 2,260.72 116.88 6 481.09 2,509.83 926.77 1,240.99 194.82 16 658.73 3,424.01 1,973.80 1,813.03 236.37 13 411.50 2,501.03 1,327.61 1,248.34 264.80 9
hair

Renal
178.62 455.45 363.94 340.49 63.62 4 107.40 495.49 245.96 262.25 29.90 15 326.75 3,494.81 809.66 1,018.81 208.39 15 166.53 406.48 268.49 272.45 25.68 9
cortex

Renal
29.50 220.63 138.79 131.93 43.06 4 28.25 167.54 62.39 77.60 9.92 15 49.63 1,035.02 235.06 289.59 64.87 15 37.56 141.14 73.10 82.97 11.11 9
medulla

Liver 154.98 430.22 180.34 246.80 41.92 7 48.51 259.36 101.92 114.94 12.02 19 181.33 1,348.47 340.33 500.97 73.38 18 86.92 263.93 118.35 133.53 15.71 11

Muscle 35.98 184.30 61.68 92.58 21.90 7 17.33 101.55 42.57 46.22 5.18 19 48.86 679.73 175.75 225.81 40.89 18 24.77 121.54 43.03 53.19 8.22 11

💪
That’s all for this example. Using flextables, you can really customize your tables any way you wish. Here, is the link to package flextable user
guide: https://ardata-fr.github.io/flextable-book/

We will stop here for today, we will see much more of R during the next sessions, as we will slowly shift from
mostly theoretical seminars to mostly practical ones.

Introduction To R, Version 2
No ratings yet
Introduction To R, Version 2
51 pages
R For Absolute Beginners - Hands-On R Tutorial: June 2018
No ratings yet
R For Absolute Beginners - Hands-On R Tutorial: June 2018
43 pages
Network Analysis and Visualization With R and Igraph
No ratings yet
Network Analysis and Visualization With R and Igraph
62 pages
R Commands
No ratings yet
R Commands
18 pages
An R Tutorial Starting Out
No ratings yet
An R Tutorial Starting Out
9 pages
Week 1-R Programming Notes
No ratings yet
Week 1-R Programming Notes
15 pages
CIND123 Swirl Lesson 15
No ratings yet
CIND123 Swirl Lesson 15
46 pages
data anlytics using r notes
No ratings yet
data anlytics using r notes
14 pages
R Intro STAT5000
No ratings yet
R Intro STAT5000
17 pages
Introduction To R PDF
No ratings yet
Introduction To R PDF
56 pages
Practical 1_Data Frame Manipulation_072502
No ratings yet
Practical 1_Data Frame Manipulation_072502
16 pages
Data Analysis Using R and Vectors
No ratings yet
Data Analysis Using R and Vectors
35 pages
Introduction To R
No ratings yet
Introduction To R
6 pages
Introduction To R and Rstudio, R Script, Calling Functions, Running Code
No ratings yet
Introduction To R and Rstudio, R Script, Calling Functions, Running Code
10 pages
R Programming
No ratings yet
R Programming
22 pages
R Programming Swirl
No ratings yet
R Programming Swirl
22 pages
Source Code 1
No ratings yet
Source Code 1
40 pages
R
No ratings yet
R
38 pages
KD Lab - 1 Introductions To R
No ratings yet
KD Lab - 1 Introductions To R
12 pages
R-Unit 2
No ratings yet
R-Unit 2
81 pages
R - Lecture 2
No ratings yet
R - Lecture 2
51 pages
cours
No ratings yet
cours
33 pages
R Lab
No ratings yet
R Lab
114 pages
This Is The Course Script
No ratings yet
This Is The Course Script
9 pages
Basic Data Science With R
100% (1)
Basic Data Science With R
364 pages
Prerequis R
No ratings yet
Prerequis R
38 pages
First Course On R
No ratings yet
First Course On R
26 pages
R study material I
No ratings yet
R study material I
8 pages
An Introduction To R: Biostatistics 615/815
No ratings yet
An Introduction To R: Biostatistics 615/815
59 pages
Pse2023 1
No ratings yet
Pse2023 1
453 pages
Da Session 4
No ratings yet
Da Session 4
75 pages
R
No ratings yet
R
13 pages
Programming With R: Lecture #4
No ratings yet
Programming With R: Lecture #4
34 pages
R Intro
No ratings yet
R Intro
109 pages
An Introduction To R: W. N. Venables, D. M. Smith and The R Development Core Team
No ratings yet
An Introduction To R: W. N. Venables, D. M. Smith and The R Development Core Team
100 pages
R Session A
No ratings yet
R Session A
107 pages
Part I: Introductory Materials: Introduction To R
No ratings yet
Part I: Introductory Materials: Introduction To R
25 pages
Intro2R
No ratings yet
Intro2R
206 pages
R Fundamentals (Hadley Wickham - Rice Univ)
No ratings yet
R Fundamentals (Hadley Wickham - Rice Univ)
66 pages
Intro 2 R
No ratings yet
Intro 2 R
206 pages
The Essential R Reference
From Everand
The Essential R Reference
Mark Gardener
No ratings yet
Learn Programming Using C#
From Everand
Learn Programming Using C#
Taurius Litvinavicius
No ratings yet
Amazing Java: Learn Java Quickly
From Everand
Amazing Java: Learn Java Quickly
Andrei Besedin
No ratings yet
10 Lessons in Front-end
From Everand
10 Lessons in Front-end
Krasimir Tsonev
2/5 (1)
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
Beyond the Basics of JavaScript
From Everand
Beyond the Basics of JavaScript
Tom Henricksen
No ratings yet
Just the basics of JavaScript
From Everand
Just the basics of JavaScript
Tom Henricksen
No ratings yet
Python An Introduction
From Everand
Python An Introduction
Renier Engelbrecht
No ratings yet
Javascript - 50 functions and tutorial
From Everand
Javascript - 50 functions and tutorial
Nino Paiotta
No ratings yet
Quick JavaScript Learning In Just 3 Days: Fast-Track Learning Course
From Everand
Quick JavaScript Learning In Just 3 Days: Fast-Track Learning Course
Vijay K.R.
No ratings yet
JavaScript.
From Everand
JavaScript.
Tom Henricksen
No ratings yet
C++ Functions and tutorial
From Everand
C++ Functions and tutorial
Nino Paiotta
No ratings yet
PHP programming
From Everand
PHP programming
Nino Paiotta
No ratings yet
More on C# in Front Office
From Everand
More on C# in Front Office
Xing Zhou
No ratings yet
JavaScript Patterns JumpStart Guide (Clean up your JavaScript Code)
From Everand
JavaScript Patterns JumpStart Guide (Clean up your JavaScript Code)
Dan Wahlin
4.5/5 (3)
Profound Python Libraries
From Everand
Profound Python Libraries
Onder Teker
No ratings yet
Gd Script
From Everand
Gd Script
Marijo Trkulja
No ratings yet
Bash Command Line Pro Tips
From Everand
Bash Command Line Pro Tips
Jason Cannon
4.5/5 (8)
Python. Easy Steps to Learning.
From Everand
Python. Easy Steps to Learning.
Renier Engelbrecht
No ratings yet
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
From Everand
Composing Software: An Exploration of Functional Programming and Object Composition in JavaScript
Eric Elliott
No ratings yet
Fluxinin Zorrillaetal Cons Biol 2014
No ratings yet
Fluxinin Zorrillaetal Cons Biol 2014
6 pages
Introduction To Statistics Using R - Session 2
No ratings yet
Introduction To Statistics Using R - Session 2
68 pages
Introduction To Statistics Using R - Session 1 - Ar
No ratings yet
Introduction To Statistics Using R - Session 1 - Ar
101 pages
Bulletin Ecologic Soc America - 2020 - Grogan - Writing Science What Makes Scientific Writing Hard and How To Make It
No ratings yet
Bulletin Ecologic Soc America - 2020 - Grogan - Writing Science What Makes Scientific Writing Hard and How To Make It
8 pages
Intro To Statistic Using R - Session 1
No ratings yet
Intro To Statistic Using R - Session 1
1 page
Void Former SD Filcor Cordek Ramp
No ratings yet
Void Former SD Filcor Cordek Ramp
1 page
Danfoss SM185
No ratings yet
Danfoss SM185
1 page
#74 Virtual Tour of London - Level Up English
No ratings yet
#74 Virtual Tour of London - Level Up English
1 page
Ior Kri Event Slides Ac
No ratings yet
Ior Kri Event Slides Ac
15 pages
Musicians' Guide
No ratings yet
Musicians' Guide
157 pages
Statement of Account: Date Narration Chq./Ref - No. Value DT Withdrawal Amt. Deposit Amt. Closing Balance
No ratings yet
Statement of Account: Date Narration Chq./Ref - No. Value DT Withdrawal Amt. Deposit Amt. Closing Balance
8 pages
Tmux Basics Cheat Sheet
No ratings yet
Tmux Basics Cheat Sheet
1 page
Mini V3 Manual
No ratings yet
Mini V3 Manual
90 pages
Simulab Activity 3.1: Ammeter, Voltmeter, and Ohmmeter
No ratings yet
Simulab Activity 3.1: Ammeter, Voltmeter, and Ohmmeter
11 pages
Polaris Sportsman700-800-800 X2 EFI PDF
No ratings yet
Polaris Sportsman700-800-800 X2 EFI PDF
393 pages
York VRF System
No ratings yet
York VRF System
96 pages
Servo-Hydraulic Actuator SHA Fields of Application: RE 08137, Edition: 2018-02, Bosch Rexroth AG
No ratings yet
Servo-Hydraulic Actuator SHA Fields of Application: RE 08137, Edition: 2018-02, Bosch Rexroth AG
12 pages
Assignment Two
No ratings yet
Assignment Two
6 pages
CLO GRP Jacking Pipe Brochure 2023 v3-WEB
No ratings yet
CLO GRP Jacking Pipe Brochure 2023 v3-WEB
9 pages
Project Board Agendas
No ratings yet
Project Board Agendas
4 pages
MW - ANTENA - 2.4m W6G Dual Polarization Antenna
No ratings yet
MW - ANTENA - 2.4m W6G Dual Polarization Antenna
3 pages
Spare Parts: Boilers Type M, ME, MCS & CPI12
No ratings yet
Spare Parts: Boilers Type M, ME, MCS & CPI12
18 pages
The History of Computer - 2
No ratings yet
The History of Computer - 2
5 pages
For Example, A Queue of Customers at The Checkout Point in A Supermarket or Cars Backed Up at Traffic Lights
No ratings yet
For Example, A Queue of Customers at The Checkout Point in A Supermarket or Cars Backed Up at Traffic Lights
10 pages
The Eye As An Optical Instrument From Camera Obscura To Helmholtz S Perspective
No ratings yet
The Eye As An Optical Instrument From Camera Obscura To Helmholtz S Perspective
4 pages
CV Template Marketing Manager
No ratings yet
CV Template Marketing Manager
2 pages
Chapter 8 Introduction
No ratings yet
Chapter 8 Introduction
3 pages
EP Sheet For Assistant
No ratings yet
EP Sheet For Assistant
164 pages
Draft Raporti Hidrologjia Eng1
No ratings yet
Draft Raporti Hidrologjia Eng1
41 pages
ZTBL Atm Application Form
No ratings yet
ZTBL Atm Application Form
6 pages
CMT-Concrete-Computation
No ratings yet
CMT-Concrete-Computation
15 pages
KIADB Booklet-22Feb11-Mysore Dist.
No ratings yet
KIADB Booklet-22Feb11-Mysore Dist.
11 pages
Schedule 40 & 80 PVC Conduit Fittings
No ratings yet
Schedule 40 & 80 PVC Conduit Fittings
29 pages
Capture Board Connection (Eng)
No ratings yet
Capture Board Connection (Eng)
25 pages
DDC SVC Oth 0030
No ratings yet
DDC SVC Oth 0030
2 pages

Intro To Statistic Using R - Session 2

Uploaded by

Intro To Statistic Using R - Session 2

Uploaded by

Intro to statistics using R - session 2

Introduction to statistics using R - Session 2

That’s right, we get an error message.

R tolerates extra spaces

R is a calculator, with some useful functions!

3+2 #simple addition

pi #pi is a built-in constant in R

pi*5^2 #pi*r^2 i.e., area of a circle with r = 5

log(8) #log base e

log10(8) #log base 10

exp(8) #exponential or natural anti-log

sqrt(8) #square root

r <- log(25) #we created an object r (radius)

int<- c(1, 2, 3) #see int is numeric and has 3 obs [1:3]

sky<- c("star", "moon", "sun")

R functions and vectors

# you can stored this value for later use

Using c, you can extract multiple values

## [1] "Hello world!" "Howdy!"

vec2<-seq(5, 20, by = 0.5)

or you can ask if specific values are bigger than 10

In the command above, R returns logical values TRUE and FALSE

my.vec[my.vec < 10 & my.vec > 5] # & is AND

my.vec[my.vec <= 10 & my.vec >= 5]

my.vec[my.vec > 8 | my.vec ==5] # | is OR and == is equal to

You can also replace elements in your vector

my.vec[2]<- 800 #just 1

my.vec[c(6, 7)] <- 500# or many

## [1] 5 800 7 8 9 500 500 12 13 14 15 16 17 18 19 20

my.vec[my.vec > 19] <- 1000 #conditionally

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19

my.vec[my.vec < 1000 & my.vec > 100]<- 5 #double condition

## [1] 5 1000 7 8 9 1000 1000 12 13 14 15 16 17 18 19

You can also perform calculation on entire vectors.

Ok, that’s it for the basics. Just 2 last pieces of info:

#save.image(file = "name_of_file_Date.RData") #To save all your workspace at once

Try these functions out!

Asking R for help or information

?read.csv # the documentation now appears in the Output pane

## starting httpd help server ... done

## R version 4.2.2 (2022-10-31 ucrt)

To get the version of a specific package you can use packageVersion()

And to cite your packages right, use citation()

Let’s play with a real data set!

Presentation of the data set

Steps to produce descriptive stats of your data set

The classic way (which I very much dislike)

2. Loading the data set

The command is read.csv(), we call our dataframe df (How original! 😜).

If you have a .txt file the command is read.table().

3. Summarizing the data for data exploration

3.1 Check out the missing values

## id age age_cat sex traploc organ THg

library(dplyr) #first we load the package

## The following objects are masked from 'package:stats':

## The following objects are masked from 'package:base':

dCount.sex <- df %>% # Count NA by sex

dCount.age <- df %>% # Count NA by age category

## Loading required package: ggplot2

ggscatter(dCount.age2, x = "age", y = "prop_na",

df$id<- as.character(as.factor(df$id)) #back to character

3.2 Summarizing THg per group

thg.SexAge <- df %>%

Let’s do the same for trapping location.

thg.loc <- df %>%

xtabs(~ traploc + sex, data = df)

tab<-xtabs(~ traploc + age_cat + sex, data = df)

Yay! We have our summary table! 🎉

## Warning: package 'rempsyc' was built under R version 4.2.3

library(flextable) #load the two necessary libraries

## Warning: package 'flextable' was built under R version 4.2.3

## The following objects are masked from 'package:ggpubr':

nice_table(sum.tab) #nice function, eh?

sex age organ min max median mean se n

F adult brain 20.86 109.93 54.95 54.46 16.15 5

M adult brain 31.92 269.44 117.16 114.44 14.31 17

pi5^2 #pir^2 i.e., area of a circle with r = 5