This document provides an agenda and overview for a class on using R to work with data. The class covers topics like calculating, joining, and grouping data in R; using R to build databases in Google Sheets; and introducing R Markdown for automating reporting. Specific sessions will demonstrate generating fake data from GitHub, data transformations with dplyr, different types of joins, uploading/downloading from Google Sheets, and creating dashboards in DataStudio.
2. AGENDA
01 INTRODUCTION
Who is IMPACT EXTEND, and how do we work with data?
02 WHAT MAKES R SO AWESOME?
Pros and cons of using R to extract, transform and load data, based on use cases.
03 CALCULATING, JOINING AND GROUPING DATA
Unifying and transforming data, always.
04 CREATE, WRITE AND READ FROM GOOGLE SHEET
Using R to build a free database to be used for reporting, data storage or Google Data Studio.
05 INTRODUCTION TO R MARKDOWN
Automate your reporting framework by leveraging R Markdown, Shiny and simple HTML.
06 SCHEDULE R SCRIPTS ON YOUR MACHINE
How can you do as little as possible?
3. 01. INTRODUCTION
Who is IMPACT EXTEND, and how do we work with data?
4. About me
• Copenhagen based
• Lead analyst at IMPACT EXTEND
• 2 years of working with R
• 5 years of GTM and GA work
• 2 years of assorted SEO and website work
5. About Rasmus
• Kickass analyst when it comes to understanding humans
• BI specialist who builds crazy dashboards in Power BI
• Former Google Analytics class educator
• The nerd who is always curious about taking things to the next step
• He also built an entire GA validator by himself, which is quite cool
6. 100% focus on digital commerce
Long customer relations
7 x Gazelle
AARHUS - COPENHAGEN - LISBOA
126 EMPLOYEES
ESTABLISHED IN 1998
Market leader in commerce
Established in 2018
150+ Employees
Aarhus - Copenhagen - Lisbon
Part of IMPACT A/S
Clients: Largest retailers in the Nordics
Focus is on data-driven marketing
7. OUR OFFERINGS
ATTRACT AND SELL
TRAFFIC & INSIGHTS
SERVE AND GROW
DIALOGUE & LOYALTY
DATA AND INSIGHTS
DMP & INTELLIGENCE
DIGITAL MARKETING STRATEGY
Full-service approach with combined services delivering holistic solutions to address Marketing’s primary pains and objectives with digital marketing strategies
9. OUR APPROACH TO WORKING WITH DATA
BEHAVIORAL DATA: User ID, Sessions, Cross-device
CRM DATA: User ID, Purchase, Channels (web/store)
IMPRESSION DATA: User ID, Conversions, Store Visits
ENGAGEMENT DATA: User ID, Mails, Open/click
MARKETING DB: Data consolidation, Segmentation, Engagement, LTV
Segmentation, Personalization, Dynamic content, Triggers
11. 02. WHAT MAKES R SO AWESOME?
Pros and cons of using R to extract, transform and load data, based on use cases.
17. Extract
Get data from APIs
Scrape web data
Work with normal worksheets
Transform
Do all your calculations automatically
Split data apart and assemble it with other data
Do huge workloads fast, as there is no traditional GUI like Excel
Load
Send data to databases
Create dashboards
Make automated reports
Get the data the way you need it
Make sure that it looks like you want it
Do whatever you need your data to do
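19. GENERATE FAKE DATA FROM A GITHUB REPOSITORY
# install and load RCurl, which is used to download the script
install.packages("RCurl")
library(RCurl)
# go to https://bit.ly/2PSb6FB and copy the URL of the script
url <- "thepasted url"
# download the script as text (note: this disables SSL peer verification)
script <- getURL(url, ssl.verifypeer = FALSE)
# run the downloaded script in the current session
eval(parse(text = script))
This should give you 300 rows of data that we can use for various calculations and modifications.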
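If the script cannot be downloaded (for example when working offline), a stand-in dataset with a similar shape can be simulated locally. This is only a sketch: the column names CustomerID, GA and sessions are taken from the later slides, while the helper name make_fake_ids, the value ranges and the ID format are assumptions and not the contents of the GitHub script.

```r
# Hypothetical stand-in for the downloaded script: builds a 300-row
# data frame with the columns used on the later slides.
make_fake_ids <- function(n = 300, n_customers = 100, seed = 42) {
  set.seed(seed)  # make the fake data reproducible
  data.frame(
    CustomerID = sample(seq_len(n_customers), n, replace = TRUE),
    # fake Google Analytics cookie IDs (the format is an assumption)
    GA = paste0("GA.", sample(100000:999999, n, replace = TRUE)),
    # sessions recorded on each row (assumed to be a count)
    sessions = sample(1:5, n, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

ID <- make_fake_ids()
str(ID)
```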
22. WITH THE IDS WE CAN CHECK FOR DUPLICATES
This determines whether one or more users appear in the dataset more than once. When we know that the same user occurs several times, we can aggregate the data by user.
duplicated(ID$CustomerID)
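duplicated() returns one TRUE/FALSE value per row, so it usually needs a follow-up step before it answers the question. A minimal sketch on a small hypothetical sample (the six rows below are made up for illustration and stand in for the class dataset):

```r
# Small hypothetical sample; in the class, ID comes from the downloaded script.
ID <- data.frame(
  CustomerID = c(1, 2, 2, 3, 3, 3),
  GA = c("a", "b", "c", "d", "d", "e"),
  stringsAsFactors = FALSE
)

# duplicated() flags every occurrence after the first one
dup_flags <- duplicated(ID$CustomerID)

# TRUE if at least one customer appears more than once
has_repeats <- any(dup_flags)

# The customers that appear more than once
repeat_customers <- unique(ID$CustomerID[dup_flags])

# Number of rows per customer
rows_per_customer <- table(ID$CustomerID)

print(repeat_customers)
print(rows_per_customer)
```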
23. TO UNDERSTAND HOW THIS DATA LOOKS AGGREGATED ON A USER LEVEL, IN EXCEL IT WOULD LOOK LIKE THIS
Here, the Google Analytics cookie ID is combined with the visits to the site each day. As each ID is connected to a GA cookie ID, we can see how many devices each user goes through within a user journey.
24. TO DO THE SAME, DPLYR HAS SOME GREAT WAYS OF WORKING WITH DATA
PIVOT BY ID WILL PRODUCE THIS
# count devices per customer
ID %>%
  group_by(CustomerID) %>%
  summarise(devices = n_distinct(GA))
To find out how many devices people are using, we can group the rows by customer ID and count the distinct Google Analytics IDs.
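The grouping step can be tried on a few rows to see what it returns. A minimal sketch, assuming dplyr is installed; the sample rows are hypothetical and only mimic the CustomerID and GA columns of the class dataset:

```r
library(dplyr)

# Hypothetical sample: customer 2 is seen with three different GA cookie IDs
ID <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  GA = c("a", "a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# One row per customer, with the number of distinct GA cookie IDs (devices)
devices_per_customer <- ID %>%
  group_by(CustomerID) %>%
  summarise(devices = n_distinct(GA))

print(devices_per_customer)
```

Customer 1 appears twice with the same cookie ID and therefore counts as one device, which is the difference between n_distinct() and simply counting rows with n().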
25. TO DO THE SAME, DPLYR HAS SOME GREAT WAYS OF WORKING WITH DATA
PIVOT BY SESSIONS WILL PRODUCE THIS
# sum sessions per customer (assumes sessions holds a count per row)
ID %>%
  group_by(CustomerID) %>%
  summarise(sessions = sum(sessions))
To find out how many sessions the users had in total, you can use this.
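Several summaries can be computed in one pass over the groups. The sketch below assumes that the sessions column holds a session count per row, which the slides do not state explicitly; the sample rows are hypothetical:

```r
library(dplyr)

# Hypothetical sample with a per-row session count
ID <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  GA = c("a", "a", "b", "c", "d", "e"),
  sessions = c(1, 2, 1, 1, 3, 2),
  stringsAsFactors = FALSE
)

# Devices and total sessions per customer, most active customers first
per_customer <- ID %>%
  group_by(CustomerID) %>%
  summarise(
    devices = n_distinct(GA),
    sessions = sum(sessions)
  ) %>%
  arrange(desc(sessions))

print(per_customer)
```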
26. JOINS
INNER JOIN | LEFT JOIN | RIGHT JOIN | FULL JOIN
inner_join()
Returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
left_join()
Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join()
Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join()
Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the one missing.
Note: FULL OUTER JOIN can potentially return very large result-sets!
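The four definitions are easiest to compare on two tiny tables that only partly overlap. A minimal sketch, assuming dplyr is installed; the tables x and y and their columns are made up for illustration:

```r
library(dplyr)

# UserID 1 exists only in x, UserID 4 only in y, UserIDs 2 and 3 in both
x <- data.frame(UserID = c(1, 2, 3), purchases = c(5, 0, 2))
y <- data.frame(UserID = c(2, 3, 4), sessions = c(7, 1, 9))

inner <- inner_join(x, y, by = "UserID")  # keeps UserID 2, 3
left  <- left_join(x, y, by = "UserID")   # keeps UserID 1, 2, 3
right <- right_join(x, y, by = "UserID")  # keeps UserID 2, 3, 4
full  <- full_join(x, y, by = "UserID")   # keeps UserID 1, 2, 3, 4

# Number of rows returned by each join type
row_counts <- sapply(
  list(inner = inner, left = left, right = right, full = full),
  nrow
)
print(row_counts)
```

In the left join the sessions value for UserID 1 is NA, and in the right join the purchases value for UserID 4 is NA, because those users have no match in the other table.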
27. JOINS
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
semi_join(x, y, by = NULL, copy = FALSE, ...)
anti_join(x, y, by = NULL, copy = FALSE, ...)
x, y: tbls to join.
by: a character vector of variables to join by. If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right (to suppress the message, simply explicitly list the variables that you want to join). To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
copy: If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation so you must opt into it.
suffix: If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.
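The by and suffix arguments matter as soon as the key columns have different names or both tables carry a column with the same name. A minimal sketch, assuming dplyr is installed; the tables customers and visits and the suffixes .crm and .web are hypothetical:

```r
library(dplyr)

# The key is called CustomerID in one table and UserID in the other,
# and both tables have a non-key column named "value"
customers <- data.frame(CustomerID = c(1, 2, 3), value = c(10, 20, 30))
visits    <- data.frame(UserID = c(1, 2, 2), value = c(0.5, 0.7, 0.9))

# Named vector: match customers$CustomerID to visits$UserID
joined <- inner_join(
  customers, visits,
  by = c("CustomerID" = "UserID"),
  suffix = c(".crm", ".web")
)
print(names(joined))

# anti_join() keeps the rows of x that have no match in y
without_visits <- anti_join(customers, visits, by = c("CustomerID" = "UserID"))
print(without_visits)
```

Customer 2 matches two visit rows, so it appears twice in the joined result; this is the "all combinations of the matches" behaviour described in the join definitions.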
29. INNER JOIN
inner_join()
Returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
What does this mean?
We keep only the rows whose UserID is present in both tables.
inner_join(Dataset1, Dataset2, by = "UserID",
           copy = FALSE, suffix = c(".x", ".y"))
30. LEFT JOIN
left_join()
Returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
What does this mean?
We keep every row of Dataset1 and add the matching columns from Dataset2.
left_join(Dataset1, Dataset2, by = "UserID",
          copy = FALSE, suffix = c(".x", ".y"))
31. RIGHT JOIN
right_join()
Returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
What does this mean?
We keep every row of the second table and add the matching columns from the first table.
32. FULL JOIN
full_join()
Returns all rows and all columns from both x and y. Where there are no matching values, returns NA for the one missing.
What does this mean?
We take table 1 and join it with table 2, keeping every row from both tables.
33. 04. CREATE, WRITE AND READ FROM GOOGLE SHEET
Using R to build a free database to be used for reporting, data storage or Google Data Studio.
34. AUTHENTICATION
• We use the googleAuthR package created by Mark Edmondson
• This allows us to generate a token which we can use to work with Google's products
# install and load googlesheets
install.packages("googlesheets")
library(googlesheets)
# opens a browser window to authorise access and create the token
googlesheets::gs_auth()
35. CREATE A GOOGLE SHEET
gs_new(title = "impactextendrclass")
gs <- gs_title("impactextendrclass")
gs_browse(gs, ws = 1)
39. LET'S ADD SOME MORE DATA TO IT!
# generate a fresh batch of fake data
eval(parse(text = script))
# first free cell in column A: existing rows + header row + 1
n <- paste0("A", nrow(ID) + 2)
gs_edit_cells(gs, ws = 1, input = ID, anchor = n, byrow = FALSE,
              col_names = FALSE, trim = FALSE, verbose = TRUE)
We use the paste function to build the cell reference where the new data should start, so that we append below the old data instead of overwriting it.
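The anchor is just a cell reference such as "A302", so it can be computed and checked before anything is written to the sheet. A minimal sketch using only base R; the helper name next_anchor is hypothetical, and it assumes the sheet has one header row and that the data starts in column A:

```r
# Hypothetical helper: returns the first free cell below the existing data.
# Assumes data starts in the given column and that an optional header
# occupies row 1.
next_anchor <- function(n_existing_rows, column = "A", has_header = TRUE) {
  first_free_row <- n_existing_rows + as.integer(has_header) + 1
  paste0(column, first_free_row)
}

# 300 data rows plus a header occupy rows 1 to 301
anchor <- next_anchor(300)
print(anchor)
```

The resulting string can be passed as the anchor argument when appending; using paste0() rather than paste() avoids having to set sep = "".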
40. DOWNLOAD AND MODIFY GS DATA
EXTRACT
# download gs data
download <- gs_read(gs)
TRANSFORM
upload <-
  download %>%
  group_by(CustomerID, sessions) %>%
  summarise(devices = n_distinct(GA))
LOAD
gs %>%
  gs_ws_new(ws_title = "aggregated", input = upload)
41. WHICH SHOULD GIVE YOU THIS
There are many ways to do similar tasks, and the use cases are basically endless. For larger datasets we recommend that you send the data to BigQuery or another database that can handle more information. With BigQuery the approach is the same, except that it requires you to link your credit card to the account.
47. 05. INTRODUCTION TO R MARKDOWN
Automate your reporting framework by leveraging R Markdown, Shiny and simple HTML.
48. What is R Markdown?
• An adaptation of general Markdown, which is used for documentation etc.
• R Markdown makes it possible to generate different types of documents such as HTML, Word, PDF, slides etc.
• R Markdown is really easy to write and keeps formatting clean and simple
• Use the cheat sheet to play around
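To make the idea concrete, the sketch below writes a very small R Markdown file from R and renders it to HTML when the tools are available. The file name report.Rmd, the title and the chunk contents are hypothetical; rendering assumes the rmarkdown package and pandoc are installed, and is skipped otherwise.

```r
# Build the chunk delimiter programmatically (three backticks)
fence <- strrep("`", 3)

# A minimal document: YAML header, one heading, one R chunk
rmd_lines <- c(
  "---",
  "title: \"Weekly report\"",
  "output: html_document",
  "---",
  "",
  "## Summary",
  "",
  paste0(fence, "{r summary, echo=FALSE}"),
  "summary(cars)",
  fence
)

rmd_path <- file.path(tempdir(), "report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

Changing the output field in the header to word_document or pdf_document selects another output type; PDF output additionally needs a LaTeX installation.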
49. Example - HTML
• To make sure that our GTM setups were GDPR compliant, we wrote a script that pulls data from GTM and then runs through everything to check that it is set up according to the right compliance rules.
• Today we have this document generated once every 6 months, and it flags any issues we need to take care of
51. DOING VISUALIZATIONS
• To be able to visualize anything, we need to have the data saved locally on our machine
• It also needs to be loaded every time you run your document
# save both data frames to one file
save(upload, download,
     file = "data.RData")
# restore them in the document
load("data.RData")
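The save and load round trip can be checked in isolation before relying on it inside a document. A minimal sketch using base R only; the two small data frames are hypothetical stand-ins for upload and download, and the file is written to a temporary directory instead of the working directory:

```r
# Hypothetical stand-ins for the data frames created earlier in the class
upload   <- data.frame(CustomerID = c(1, 2), devices = c(1, 3))
download <- data.frame(CustomerID = c(1, 1, 2), GA = c("a", "a", "b"))

# Save both objects into a single .RData file
data_file <- file.path(tempdir(), "data.RData")
save(upload, download, file = data_file)

# Remove them to simulate a fresh R session, then load them back
rm(upload, download)
loaded <- load(data_file)  # load() returns the names of the restored objects
print(loaded)
```

Because load() restores objects under their original names, it can silently overwrite objects that already exist in the session; for a single object, saveRDS() and readRDS() let you choose the name on reading.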
52. MAKING TABLES
• To be able to build a table, we need to have the data saved locally on our machine
• It also needs to be loaded every time you run your document
save(upload, download,
     file = "data.RData")
load("data.RData")
```{r table, echo=TRUE, message=FALSE, warning=FALSE}
library(ggplot2)
library(kableExtra)
library(dplyr)
library(knitr)
# render the first rows as a styled HTML table
head(upload) %>%
  kable(format = "html") %>%
  kable_styling()
```
53. MAKING TABLES
The cool thing here is that you can apply any HTML and CSS styling to your documents. This means that you can do basically anything that is possible within HTML and CSS.
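As one illustration of combining tables with your own CSS, the sketch below gives an HTML table a class and styles it with a small style block. It assumes the knitr package is installed; the class name report-table, the CSS rule and the sample data are hypothetical.

```r
library(knitr)

# Hypothetical aggregated data
devices <- data.frame(CustomerID = c(1, 2, 3), devices = c(1, 3, 1))

# table.attr is passed through to the opening <table> tag for HTML output
html_table <- kable(
  devices,
  format = "html",
  table.attr = "class=\"report-table\""
)

# A plain CSS rule that targets the class set above
css <- "<style>.report-table td { padding: 4px 12px; }</style>"

html <- paste(css, paste(as.character(html_table), collapse = "\n"), sep = "\n")
cat(html)
```

Inside an R Markdown chunk, output like this is typically emitted with the chunk option results='asis' so that the HTML is inserted into the document unchanged.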
60. PLAY AROUND WITH R MARKDOWN AND PLOTS –
GOOGLE IS YOUR FRIEND FOR SEEING THE
POSSIBILITIES!
61. 06. SCHEDULE TASKS
How can you do as little as possible?
62. SCHEDULA(R)
Tools → Addins → Browse Addins
Choose the file that should be executed by the scheduler.
Choose the frequency, start date and start time at which the file should be executed.
65. HOW TO STOP IT AGAIN!
• On PC:
  - Task Scheduler
  - See and kill the process.
• On Mac:
  - Begin Automator. Click “Applications” on the Dock of your Mac. ...