Unit 1 - R Programming
• R programming is a leading tool for machine learning, statistics, and data analysis, allowing
for the easy creation of objects, functions, and packages.
• R is a powerful tool for data analysis and visualization, and is widely used in various fields,
including finance, healthcare, marketing, and social sciences.
• Data Manipulation and Analysis: R provides various libraries and functions for data cleaning,
filtering, sorting, and analysis.
• Data Visualization: R offers a wide range of libraries, such as ggplot2 for static graphics and
Shiny for interactive and dynamic visualizations.
• Statistical Modeling: R supports various statistical techniques, such as linear regression, time
series analysis, and hypothesis testing.
• Machine Learning: R provides libraries (e.g., caret, randomForest) for building machine learning
models and performing predictive analytics.
• Data Mining: R offers tools for data mining, text analysis, and natural language processing.
• Web Development: R can be used to build web applications and interactive dashboards using
Shiny.
• Research and Academia: R is widely used in research and academia for data analysis,
simulation, and visualization.
• Scripting Language: R is a scripting language, allowing users to write scripts and programs.
• Dynamic Typing: R is dynamically typed, meaning variable types are determined at runtime.
1.2. Why R
There are several reasons why R programming is widely used and popular among data analysts,
data scientists, and researchers:
• Free and Open-Source: R is free to download and use, making it accessible to anyone.
• Large Community: R has a massive user base and a vibrant community, ensuring there are
plenty of resources available.
• Extensive Libraries: R has a vast collection of libraries and packages for various tasks, such
as data visualization, machine learning, and statistical analysis.
• Data Analysis: R is specifically designed for data analysis, making it a powerful tool for data
manipulation, visualization, and modeling.
• Academic and Research: R is widely used in academia and research for data analysis,
simulation, and visualization.
• Easy to Learn: R has a relatively low barrier to entry, making it easy for new users to learn.
• Flexible: R can be used for a wide range of applications, from data analysis to web
development.
• Constantly Evolving: R is constantly being updated and improved, with new packages and
features being added regularly.
• Machine Learning: R has a wide range of machine learning libraries, making it easy to build
predictive models.
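1.3. What is an IDE
• An IDE (Integrated Development Environment) is software that bundles the tools needed to
write, run, and debug code in one place. Typical features include: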
• Code Editor: A text editor with features like syntax highlighting, code completion, and code
folding.
• Debugger: A tool to identify and fix errors in the code, allowing you to step through code
execution, set breakpoints, and inspect variables.
• Project Explorer: A file management system to organize and navigate your project's files and
directories.
• Build Automation: Tools to automate the process of compiling, linking, and packaging your
code.
• Version Control: Integration with version control systems like Git to manage changes to your
codebase.
• Visualization: Tools for visualizing data, such as graphs, charts, and plots.
• Plugin Support: Ability to extend the IDE's functionality with plugins and extensions.
IDEs for R
• RStudio: A widely used and highly regarded IDE for R, offering features like code completion,
debugging tools, and visualization support.
• R GUI (R Graphical User Interface): A basic IDE that comes bundled with the R installation,
providing a simple interface for writing and running R code.
• Visual Studio Code with R Extension: A lightweight, versatile code editor with an R extension
that offers features like code completion, debugging, and visualization support.
• Eclipse with StatET Plugin: A popular IDE for various programming languages, including R,
offering features like code completion, debugging, and project management.
• Jupyter Notebook with R Kernel: A web-based interactive environment for working with R,
allowing you to create and share documents containing live code, equations, and visualizations.
1.4. R Syntax
• In R, the print statement is simply print(). You can use it to print the output of any expression
or variable. Here are some examples:
• print(mean(c(1, 2, 3, 4, 5))) prints the mean of the vector c(1, 2, 3, 4, 5), which is 3.
• You can also use the cat() function to print output, which is similar to print() but has some
differences in how it handles strings and concatenation.
• Note that in R, the result of an expression is automatically printed if it's not assigned to a
variable, so you often don't need to use print() explicitly. For example:
• 5 prints the number 5.
• mean(c(1, 2, 3, 4, 5)) prints the mean of the vector c(1, 2, 3, 4, 5), which is 3.
• However, in scripts or functions, it's often a good practice to use print() or cat() to explicitly
control what output is displayed.
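The difference between print(), cat(), and automatic printing can be sketched as follows. This is a minimal illustration; the message text passed to cat() is an assumption made up for the example.

```r
# print() displays the value of an expression; for a vector it adds an index prefix such as [1]
print(mean(c(1, 2, 3, 4, 5)))

# cat() concatenates its arguments and writes them as plain text;
# it does not add a newline, so "\n" is supplied explicitly
cat("The mean is", mean(c(1, 2, 3, 4, 5)), "\n")

# At the console, an expression that is not assigned is printed automatically
mean(c(1, 2, 3, 4, 5))
```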
1.5. R Comments
• In R, comments are used to add notes or explanations to your code. They are ignored by the
R interpreter, so they don't affect the execution of your code.
• Hash symbol (#): Any text following the hash symbol on the same line is considered a
comment.
• Comment blocks: You can use the # symbol at the beginning of each line to create a block of
comments.
• Example:
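A minimal sketch of both comment styles, using a hypothetical variable x:

```r
# This is a single-line comment
x <- 5  # a comment can also follow code on the same line

# R has no dedicated multi-line comment syntax,
# so a comment block is simply a run of lines
# that each start with the hash symbol.
print(x)
```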
1.6. R Variables
• In R, a variable is a name given to a value or a set of values. You can think of it as a labeled
container that holds data. Here are some key things to know about variables in R:
• Assignment: Use the assignment operator <- to assign a value to a variable. For example: x <-
5 assigns the value 5 to the variable x.
• Names: Variable names can contain letters, numbers, periods, and underscores, but they can't
start with a number or an underscore. For example: x, y, z_1, etc.
• Scope: Variables have a scope, which determines their visibility. Global variables are
accessible from anywhere, while local variables are only accessible within the function that
creates them. Loops do not create a new scope in R.
• Environment: R has an environment, which is a collection of variables and their values. You
can use the ls() function to list all variables in the current environment.
• Accessing values: Type the name of a variable (for example, x) to retrieve its value.
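The points above can be sketched in a few lines. The variable names x and z_1 come from the text; the values are illustrative assumptions.

```r
x <- 5            # assignment with the <- operator
z_1 <- "hello"    # a valid name containing a letter, an underscore, and a digit

print(x)          # accessing the value by name
print(z_1)

print(ls())       # names of the variables in the current environment
```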
1.7. R Function
Arguments
• Information can be passed into functions as arguments.
• Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.
• The following example has a function with one argument (fname). When the function is
called, we pass along a first name, which is used inside the function to print the full name
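A minimal sketch of such a function. The function name my_function and the family name "Griffin" are hypothetical, chosen only for illustration.

```r
# One parameter, fname; the family name is fixed inside the function (illustrative choice)
my_function <- function(fname) {
  full_name <- paste(fname, "Griffin")
  print(full_name)   # print() also returns its argument invisibly
}

my_function("Peter")
my_function("Lois")
```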
Parameters or Arguments?
• The terms "parameter" and "argument" can be used for the same thing: information that is
passed into a function.
• A parameter is the variable listed inside the parentheses in the function definition. An
argument is the value that is sent to the function when it is called.
Number of Arguments
• By default, a function must be called with the correct number of arguments. Meaning that if
your function expects 2 arguments, you have to call the function with 2 arguments, not more,
and not less:
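A sketch of what happens when the argument count does not match, using a hypothetical two-parameter function full_name. Note that R evaluates arguments lazily, so a missing argument only raises an error at the point where it is actually used; tryCatch() is used here only so the script keeps running.

```r
full_name <- function(fname, lname) {
  paste(fname, lname)
}

# Correct: two arguments for two parameters
print(full_name("Peter", "Griffin"))

# Too few: lname has no default, so using it inside paste() raises an error
too_few <- tryCatch(full_name("Peter"), error = function(e) "error")
print(too_few)

# Too many: R rejects the call with an "unused argument" error
too_many <- tryCatch(full_name("Peter", "Lois", "Griffin"), error = function(e) "error")
print(too_many)
```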
Return Values
• To let a function return a result, use the return() function
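A minimal sketch of a function that returns a value; the name times_five is hypothetical.

```r
times_five <- function(x) {
  return(5 * x)
}

print(times_five(3))
print(times_five(9))
```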
1.8. R Global Variables
• Variables created outside of a function are global variables. They can be used anywhere, both
inside and outside of functions.
• If you create a variable with the same name inside a function, this variable will be local, and
can only be used inside the function. The global variable with the same name will remain as it
was, global and with the original value.
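The shadowing behaviour described above can be sketched like this; the names txt and show_local are illustrative assumptions.

```r
txt <- "global"

show_local <- function() {
  txt <- "local"          # creates a new local variable; the global txt is untouched
  paste("inside:", txt)
}

print(show_local())
print(paste("outside:", txt))
```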
Global Assignment Operator
• Normally, when you create a variable inside a function, that variable is local, and can only be
used inside that function.
• To create a global variable inside a function, you can use the global assignment operator <<-
• Use the global assignment operator if you want to change a global variable inside a function
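A small sketch of the global assignment operator, assuming the code runs at the top level of a script. The names counter and increment are hypothetical.

```r
counter <- 0

increment <- function() {
  # <<- looks for counter in the enclosing environments and updates it there
  counter <<- counter + 1
}

increment()
increment()
print(counter)
```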
1.9. R Vector
• A vector is simply a list of items that are of the same type.
• To combine the list of items to a vector, use the c() function and separate the items by a
comma.
• For example, a vector variable called fruits can combine several strings
Vector Length
• To find out how many items a vector has, use the length() function
Sort a Vector
• To sort items in a vector alphabetically or numerically, use the sort() function
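Creating, measuring, and sorting a vector can be sketched as follows. The variable name fruits comes from the text; the items and the numbers vector are illustrative assumptions.

```r
fruits <- c("banana", "apple", "orange")
numbers <- c(13, 3, 5, 7, 20, 2)

print(length(fruits))   # number of items in the vector
print(sort(fruits))     # alphabetical order
print(sort(numbers))    # numerical order
```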
Access Vectors
• You can access the vector items by referring to its index number inside brackets []. The first
item has index 1, the second item has index 2, and so on
• You can also access multiple elements by referring to different index positions with the c()
function
• You can also use negative index numbers to access all items except the ones specified
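The three ways of indexing a vector, sketched with an illustrative fruits vector:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")

print(fruits[1])          # first item (indexing starts at 1)
print(fruits[c(1, 3)])    # first and third items
print(fruits[-1])         # every item except the first
```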
1.10. R List
• A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.
• You can access the list items by referring to its index number, inside brackets. The first item
has index 1, the second item has index 2, and so on:
• To find out if a specified item is present in a list, use the %in% operator
• To add an item to the end of the list, use the append() function
• To add an item to the right of a specified index, add "after=index number" in the append()
function
• You can also remove list items. For example, a negative index creates a new, updated list
without an "apple" item
• You can specify a range of indexes by specifying where to start and where to end the range,
by using the : operator
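The list operations above can be sketched together. The list name thislist and its items are assumptions chosen for illustration.

```r
thislist <- list("apple", "banana", "cherry")

print(thislist[1])               # single brackets return a sub-list holding the first item
print("apple" %in% thislist)     # membership test

longer   <- append(thislist, "orange")             # add to the end
inserted <- append(thislist, "orange", after = 2)  # add after the second item
newlist  <- thislist[-1]                           # new list without the first item ("apple")
middle   <- thislist[2:3]                          # range of indexes

print(length(longer))
print(inserted[[3]])
```

Double brackets, as in thislist[[1]], return the item itself rather than a one-element list.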
1.11. R Matrices
• A matrix is a two dimensional data set with columns and rows.
• A matrix is created with the matrix() function. Specify the nrow and ncol parameters to set the
number of rows and columns
• You can access the items by using [ ] brackets. The first number "1" in the bracket specifies
the row-position, while the second number "2" specifies the column-position
• The whole row can be accessed if you specify a comma after the number in the bracket
• The whole column can be accessed if you specify a comma before the number in the bracket
• Use the dim() function to find the number of rows and columns in a Matrix
• Use the length() function to find the total number of elements in a Matrix (rows multiplied by
columns)
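A short sketch of creating and inspecting a matrix. The name m and the values are illustrative; note that matrix() fills column by column unless told otherwise.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

print(m[1, 2])    # row 1, column 2
print(m[2, ])     # whole second row
print(m[, 2])     # whole second column
print(dim(m))     # number of rows and columns
print(length(m))  # total number of elements
```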
1.12. R Arrays
• Compared to matrices, arrays can have more than two dimensions.
• We can use the array() function to create an array, and the dim parameter to specify the
dimensions
• In a dim value such as c(4, 3, 2), the first and second numbers specify the number of rows and
columns.
• The last number specifies the size of the third dimension, that is, how many row-by-column
layers the array contains.
• You can access the array elements by referring to the index position. You can use the []
brackets to access the desired elements from an array
• To find out if a specified item is present in an array, use the %in% operator
• Use the dim() function to find the number of rows, columns, and layers in an array
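A minimal sketch of a three-dimensional array. The name a and the values 1 to 24 are illustrative assumptions.

```r
# 4 rows, 3 columns, 2 layers, filled with the values 1 to 24
a <- array(1:24, dim = c(4, 3, 2))

print(a[2, 3, 2])   # row 2, column 3, layer 2
print(2 %in% a)     # membership test
print(dim(a))       # size of each dimension
print(length(a))    # total number of elements
```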
1.13. R Data Frames
• Data Frames are data displayed in a table format, and they can hold different types of data.
While the first column can be character, the second and third can be numeric or logical.
However, all values within a single column must have the same type.
• We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame
• Use the dim() function to find the amount of rows and columns in a Data Frame
• You can also use the ncol() function to find the number of columns and nrow() to find the
number of rows
• Use the length() function to find the number of columns in a Data Frame (similar to ncol())
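The access and size functions above can be sketched with a small data frame. The name df, the column names, and the values are made up for illustration.

```r
df <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse    = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

print(df[1])              # single brackets: a data frame holding the first column
print(df[["Training"]])   # double brackets: the column as a vector
print(df$Pulse)           # $ also returns the column as a vector

print(dim(df))            # rows and columns
print(nrow(df))
print(ncol(df))
print(length(df))         # same as ncol() for a data frame
```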
1.14. R Factors
• Factors are used to categorize data. To create a factor, use the factor() function and add a
vector as argument.
• Use the length() function to find out how many items there are in the factor
• To access the items in a factor, refer to the index number, using [] brackets
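A short sketch of a factor. The name music_genre and its values are illustrative; levels() is an additional base R function, not mentioned above, that lists the distinct categories.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

print(music_genre)           # the values followed by the set of levels
print(levels(music_genre))   # the distinct categories
print(length(music_genre))   # number of items
print(music_genre[3])        # third item
```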
1.15. R Plot
• The plot() function is used to draw points (markers) in a diagram.
• At its simplest, you can use the plot() function to plot two numbers against each other
• For better organization, when you have many values, it is better to use variables
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)  # example values; y must have the same length as x
plot(x, y)
• If you want to draw dots in a sequence, on both the x-axis and the y-axis, use the : operator
plot(1:10)
• The plot() function also takes a type parameter with the value l to draw a line to connect all
the points in the diagram
plot(1:10, type="l")
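• The plot() function also accepts other parameters, such as main, xlab and ylab, if you want to
customize the graph with a main title and different labels for the x and y-axis
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y-axis")
• Use the col parameter to add a color to the points
plot(1:10, col="red")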
• Use cex=number to change the size of the points (1 is default, while 0.5 means 50% smaller,
and 2 means 100% larger)
plot(1:10, cex=2)
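• Use pch with a value from 0 to 25 to change the point shape format
plot(1:10, pch=25, cex=2)
• To draw a line graph instead of points, use the type parameter with the value "l"
plot(1:10, type="l")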
• The line color is black by default. To change the color, use the col parameter
• To change the width of the line, use the lwd parameter (1 is default, while 0.5 means 50%
thinner, and 2 means 100% thicker)
• The line is solid by default. Use the lty parameter with a value from 0 to 6 to specify the line
format.
• For example, lty=3 will display a dotted line instead of a solid line
• To display more than one line in a graph, use the plot() function together with
the lines() function
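The line parameters above can be combined in one sketch. The vectors line1 and line2 and the chosen colors are assumptions for illustration; in a non-interactive session the plot is written to a graphics file rather than shown on screen.

```r
line1 <- c(1, 2, 3, 4, 5, 10)
line2 <- c(2, 5, 7, 8, 9, 10)

# First line: solid (lty = 1), blue, double width
plot(line1, type = "l", col = "blue", lwd = 2, lty = 1)

# Second line added to the same graph: dotted (lty = 3), red
lines(line2, type = "l", col = "red", lwd = 2, lty = 3)
```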
• A scatter plot is drawn with the plot() function. It needs two vectors of the same length, one
for the x-axis (horizontal) and one for the y-axis (vertical)
• That might not be clear for someone who sees the graph for the first time, so let's add a
header and different labels to describe the scatter plot better
• To compare the plot with another plot, use the points() function
• You can change the start angle of the pie chart with the init.angle parameter. The value
of init.angle is defined with angle in degrees, where default angle is 0
• Use the labels parameter to add a label to each slice of the pie chart, and use the main
parameter to add a header
• You can add a color to each slice with the col parameter
• To add a legend that explains each slice, use the legend() function
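A sketch that combines the pie chart options above. The values, labels, colors, and the start angle of 90 degrees are illustrative assumptions.

```r
x <- c(10, 20, 30, 40)
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
colors <- c("blue", "yellow", "green", "black")

# Pie chart with labels, a header, one color per slice, and a start angle of 90 degrees
pie(x, labels = mylabel, main = "Fruits", col = colors, init.angle = 90)

# Legend in the bottom-right corner, using the same colors as the slices
legend("bottomright", mylabel, fill = colors)
```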
Mean
• The mean is calculated by taking the sum of the values and dividing by the number of values in
a data series. The mean() function is used in R to calculate this value.
Median
• The middle most value in a data series is called the median. The median() function is used in
R to calculate this value.
Mode
• The mode is the value that has the highest number of occurrences in a set of data. Unlike mean
and median, mode can be calculated for both numeric and character data.
• R does not have a standard in-built function to calculate mode. So we create a user function
to calculate mode of a data set in R. This function takes the vector as input and gives the mode
value as output.
• which.max() - used to return the location of the first maximum value in the Numeric Vector
• match() - is used to return the positions of the first match of the elements of the first vector
in the second vector.
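One way to write such a user function is sketched below, built from match() and which.max() as described above together with unique() and tabulate(). The function name get_mode and the sample data are assumptions; when several values tie, this version returns the one that appears first.

```r
get_mode <- function(v) {
  uniq <- unique(v)                      # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniq))     # how often each distinct value occurs
  uniq[which.max(counts)]                # value with the highest count
}

v <- c(2, 1, 2, 3, 1, 2, 4)
print(mean(v))
print(median(v))
print(get_mode(v))

# The same function works for character data
words <- c("o", "it", "the", "it", "it")
print(get_mode(words))
```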
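Linear Regression
• In simple linear regression, the response variable and the predictor variable are related
through the equation
y = ax + b
Following is the description of the parameters used:
• y is the response variable.
• x is the predictor variable.
• a and b are constants called the coefficients.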
• A simple example of regression is predicting the weight of a person when their height is
known. To do this we need to have the relationship between the height and weight of a person.
• Find the coefficients from the model created and create the mathematical equation using
these
• Get a summary of the relationship model to know the average error in prediction, also
called the residuals.
lm() Function
• This function creates the relationship model between the predictor and the response
variable.
lm(formula,data)
• To predict the response for new values, use the predict() function
predict(object, newdata)
• object is the model which is already created using the lm() function.
• newdata is a data frame containing the new values for the predictor variable.
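The height and weight example can be sketched end to end as follows. The data values are made up for illustration, and the variable names height, weight, model, and new_person are assumptions.

```r
# Illustrative sample: heights in cm and weights in kg
height <- c(150, 155, 160, 165, 170, 175, 180, 185)
weight <- c(52, 55, 60, 62, 68, 71, 77, 80)

# Fit weight as a linear function of height
model <- lm(weight ~ height)

print(coef(model))      # intercept (b) and slope (a) of y = ax + b
print(summary(model))   # includes the residuals and the residual standard error

# Predict the weight of a person who is 172 cm tall;
# the column name in newdata must match the predictor name used in the formula
new_person <- data.frame(height = 172)
predicted <- predict(model, newdata = new_person)
print(predicted)
```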
• For multiple regression, lm() creates the relationship model between several predictor
variables and the response variable.
lm(y ~ x1+x2+x3...,data)
• formula is a symbolic description of the relation between the response variable and the
predictor variables.
• Sum of Squares Regression (SSR) – The sum of squared differences between predicted data
points (ŷi) and the mean of the response variable (ȳ).
• Sum of Squares Error (SSE) – The sum of squared differences between predicted data points
(ŷi) and observed data points (yi).
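Both quantities can be computed directly from a fitted model, as in the sketch below. The data are made up for illustration. The total sum of squares (SST) is not defined above; it is included only as a check, since for a least-squares fit with an intercept SST equals SSR plus SSE.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

model <- lm(y ~ x)
y_hat <- fitted(model)            # predicted data points

ssr <- sum((y_hat - mean(y))^2)   # predicted values versus the mean of the response
sse <- sum((y_hat - y)^2)         # predicted values versus observed values
sst <- sum((y - mean(y))^2)       # observed values versus the mean (assumed extra check)

print(c(SSR = ssr, SSE = sse, SST = sst))
print(isTRUE(all.equal(sst, ssr + sse)))
```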