R is a widely used open-source environment for statistical computing, with more than 2 million users among data scientists, statisticians and analysts. It is based on the S programming language, and its development is supported by the R Foundation. R provides a flexible and powerful environment for statistical analysis, modeling, and data visualization. Key advantages are that it is free, has a large community for support, and allows automated replication through scripting. Its drawbacks include a steep learning curve and scripts that can be difficult to understand.
CuRious about R in Power BI? End to end R in Power BI for beginners
12. “R is a free software environment for statistical computing and graphics.”
• GNU project: open source
• Based on “S” (programming language developed by John Chambers at Bell Labs)
• R Foundation: non-profit organization for the development of R
13. • Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
22. • Mean – this is the average
• Median – splits the data in two halves
• Mode – the most frequent value
23. • Variance – average squared difference between the data points
and the mean
• Standard Deviation – square root of the variance, more intuitive
• Percentiles – dataset is divided into 100 equal parts
• Quartiles – dataset is divided into four equal parts
• Interquartile range – the range covered by the middle 50% of data points
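The measures on the two slides above can be sketched in base R as follows. The vector `x` and the name `mode_value` are made up for illustration. Note two assumptions: base R has no built-in function for the statistical mode (`mode()` returns the storage type), so it is derived from a frequency table here, and `var()` divides by n - 1 (sample variance) rather than n.

```r
# Illustrative data, not taken from the slides
x <- c(2, 4, 4, 5, 7, 9, 11)

mean(x)    # average: 6
median(x)  # middle value: 5

# Mode: most frequent value, taken from a frequency table
freq <- table(x)
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_value # 4

var(x)     # sample variance (divides by n - 1): 10
sd(x)      # standard deviation: square root of the variance

# Quartiles are the 25th, 50th and 75th percentiles
quantile(x, probs = c(0.25, 0.50, 0.75))

# Interquartile range: 75th percentile minus 25th percentile
IQR(x)     # 4
```

If the population variance (dividing by n) is wanted instead, `mean((x - mean(x))^2)` is one way to compute it.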
24. Advantages
• Free
• “Lingua franca” in methodological research: new statistical
procedures are often developed with R
• Large community: most problems are discussed on the internet
• No “point and click”: scripts make procedures transparent
and reproducible
• Flexible programming allows for automated replication with
new data
25. Drawbacks
• Not very intuitive
• No “Point and Click”: handling only through command line
and scripts
• Documentation is very technical at times
• Community-based: packages come from different developers, so conventions differ and compatibility is sometimes lacking
• Slow with very large data sets
27. Run code:
Enter from command line
Ctrl + Enter from script
Assign variables:
x <- 2
Comments:
# Comment
Comment selection with Ctrl + Shift + C
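A minimal sketch of assignment and comments in a script; the variable `y` is an illustrative addition, not from the slide.

```r
# A full-line comment starts with a hash sign
x <- 2      # assign the value 2 to x
y <- x * 3  # reuse x in a new assignment
print(y)    # prints 6
```

In RStudio, each line can be sent to the console from the script with Ctrl + Enter.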
44. Code             Description
    y ~ x            x has an effect on y
    y ~ x1 + x2      x1 and x2 have an effect on y
    y ~ x1 + x2 + 0  intercept set to zero
    y ~ I(x1 + x2)   y is influenced by the sum x1 plus x2
    y ~ . - x1       model of all variables except x1
    y ~ x1 * x2      x1, x2 and the interaction between them
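To see how the formula notation behaves in practice, here is a small sketch that fits each form with `lm()` on the built-in `mtcars` data set. The choice of `mpg`, `wt` and `hp` as y, x1 and x2 is an assumption made for illustration only.

```r
# y = mpg, x1 = wt, x2 = hp (illustrative choice from the built-in mtcars data)
fit_one  <- lm(mpg ~ wt, data = mtcars)           # y ~ x
fit_two  <- lm(mpg ~ wt + hp, data = mtcars)      # y ~ x1 + x2
fit_zero <- lm(mpg ~ wt + hp + 0, data = mtcars)  # no intercept
fit_sum  <- lm(mpg ~ I(wt + hp), data = mtcars)   # one predictor: the sum wt + hp
fit_rest <- lm(mpg ~ . - wt, data = mtcars)       # all other columns except wt
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)      # main effects plus interaction

# Coefficient names show which terms each formula produced
names(coef(fit_zero))  # "wt" "hp"
names(coef(fit_sum))   # "(Intercept)" "I(wt + hp)"
names(coef(fit_int))   # "(Intercept)" "wt" "hp" "wt:hp"
```

Without `I()`, the `+` inside a formula adds a term to the model; wrapping the expression in `I()` makes R treat it as ordinary arithmetic.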
76. Tidy code is easier to write, read and maintain, and is frequently even faster than its base R counterparts.
It is also easier to learn.
So here we are!
77. ● Tidy Data is a standard approach to structure datasets
● Good for Data Analysis and Data Visualization
● Variables make up the columns
● Observations make up the rows
● Values go into cells
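As a rough illustration of these rules, the sketch below reshapes a small "wide" table into tidy form with `pivot_longer()` from the tidyr package. The data and the column names `country`, `year` and `sales` are invented for the example, and using tidyr is an assumption; the slide does not name a package.

```r
library(tidyr)

# Wide layout: one column per year, so the variable "year" is hidden in the headers
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(12, 23),
  check.names = FALSE
)

# Tidy layout: each variable is a column, each observation is a row
tidy <- pivot_longer(
  wide,
  cols      = -country,   # every column except country
  names_to  = "year",     # former column headers become values of "year"
  values_to = "sales"     # former cell contents become values of "sales"
)

tidy
```

After reshaping, each row describes one country in one year, which is the shape most analysis and plotting functions expect.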
78. ● A Variable is a measurement
● Also known as:
  ● Independent or dependent variables
  ● Features (Microsoft’s terminology)
  ● Predictors (if you have a machine learning background)
  ● Outcomes (if you have a social sciences background)
  ● The Response (if you have a statistics background)
  ● Attributes (if you have a dimensional modelling background)
79. ● A Variable can fall into three categories:
  ● Fixed Variables
    ● Variables known before the investigation starts
  ● Measured Variables
    ● Data captured during the investigative process
  ● Derived Variables
    ● Think of a calculated column in DAX or SQL
80. ● readr ingests data from different sources
● There are lots of options for working with the file
  ● Headers
  ● Delimiters
● See https://cran.r-project.org/web/packages/readr/readr.pdf for more information
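A minimal sketch of the header and delimiter options, assuming the readr package documented at the URL above. The temporary file and its contents are invented so the example runs on its own.

```r
library(readr)

# Write a small semicolon-delimited file to a temporary location (illustrative data)
path <- tempfile(fileext = ".csv")
writeLines(c("id;city;sales", "1;Oslo;10.5", "2;Bergen;7.25"), path)

# delim sets the field separator; col_names = TRUE treats the first line as headers
sales <- read_delim(path, delim = ";", col_names = TRUE)

sales
```

For plain comma-separated files, `read_csv()` is the shorter equivalent with the delimiter already set.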
81. ● dplyr: easy data manipulation
● Built for data frames
● There are equivalents in SQL
● Performance-critical parts are written in C++, so it’s faster
82. ● 6 verbs for data manipulation
  ● select
  ● filter
  ● mutate
  ● group_by
  ● summarize
  ● tally
● There are equivalents in SQL
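The six verbs can be chained into one pipeline. The sketch below assumes they refer to the dplyr package and uses the built-in `mtcars` data; the derived column `kpl` and its approximate conversion factor are illustrative. The comments note the rough SQL counterpart of each step.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt) %>%        # SQL: SELECT mpg, cyl, wt
  filter(wt < 4) %>%              # SQL: WHERE wt < 4
  mutate(kpl = mpg * 0.425) %>%   # SQL: calculated column (approx. km per litre)
  group_by(cyl) %>%               # SQL: GROUP BY cyl
  summarize(mean_kpl = mean(kpl), # SQL: AVG(kpl)
            n = n())              # SQL: COUNT(*)

# tally() counts the rows in each group and returns them in a column named n
counts <- mtcars %>%
  group_by(cyl) %>%
  tally()

result
counts
```

`mutate()` is also how a derived variable, like a calculated column in DAX or SQL, would typically be added to a data frame.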