Unit 1 - R Programming

Uploaded by

anju.k10301

1.1. What is R Programming?


• R is a programming language and environment specifically designed for statistical computing
and graphics. It's widely used by statisticians, data miners, and data analysts for tasks like data
analysis, visualization, and statistical modeling. R provides a wide variety of statistical and
graphical techniques.

• R programming is a leading tool for machine learning, statistics, and data analysis, allowing
for the easy creation of objects, functions, and packages.

• R is a powerful tool for data analysis and visualization, and is widely used in various fields,
including finance, healthcare, marketing, and social sciences.

• R programming is a statistical programming language and environment for data analysis,
visualization, and modeling. R is widely used by data analysts, data scientists, and researchers
for:

• Data Manipulation and Analysis: R provides various libraries and functions for data cleaning,
filtering, sorting, and analysis.

• Data Visualization: R offers a wide range of libraries (e.g., ggplot2, Shiny) for creating
interactive and dynamic visualizations.

• Statistical Modeling: R supports various statistical techniques, such as linear regression, time
series analysis, and hypothesis testing.

• Machine Learning: R provides libraries (e.g., caret, randomForest) for building machine
learning models and performing predictive analytics.

• Data Mining: R offers tools for data mining, text analysis, and natural language processing.

• Web Development: R can be used to build web applications and interactive dashboards using
Shiny.

• Research and Academia: R is widely used in research and academia for data analysis,
simulation, and visualization.

R programming language features:

• Scripting Language: R is a scripting language, allowing users to write scripts and programs.

• Object-Oriented: R supports object-oriented programming (OOP) concepts.


• Functional Programming: R supports functional programming principles.

• Dynamic Typing: R is dynamically typed, meaning variable types are determined at runtime.

1.2. Why R

There are several reasons why R programming is widely used and popular among data analysts,
data scientists, and researchers:

• Free and Open-Source: R is free to download and use, making it accessible to anyone.

• Large Community: R has a massive user base and a vibrant community, ensuring there are
plenty of resources available.

• Extensive Libraries: R has a vast collection of libraries and packages for various tasks, such
as data visualization, machine learning, and statistical analysis.

• Cross-Platform: R can run on Windows, Mac, and Linux operating systems.

• Programming Language: R is a programming language, allowing users to write scripts and
programs.

• Data Analysis: R is specifically designed for data analysis, making it a powerful tool for data
manipulation, visualization, and modeling.

• Academic and Research: R is widely used in academia and research for data analysis,
simulation, and visualization.

• Easy to Learn: R has a relatively low barrier to entry, making it easy for new users to learn.

• Flexible: R can be used for a wide range of applications, from data analysis to web
development.

• Constantly Evolving: R is constantly being updated and improved, with new packages and
features being added regularly.

Some specific reasons why R is popular among data scientists include:

• Data Visualization: R provides an extensive range of data visualization libraries, making it
easy to create interactive and dynamic visualizations.

• Machine Learning: R has a wide range of machine learning libraries, making it easy to build
predictive models.

• Statistical Analysis: R has a strong foundation in statistical analysis, making it a popular
choice for hypothesis testing and confidence intervals.

1.3. R IDE Environment


• R is a popular programming language for statistical computing and graphics. An IDE
(Integrated Development Environment) is a software application that provides a comprehensive
interface for writing, debugging, and testing code.

What is an IDE

• An Integrated Development Environment (IDE) is a software application that provides a
comprehensive interface for writing, debugging, testing, and managing code. It's essentially a
suite of tools that helps developers create, modify, and maintain software programs.

A typical IDE environment includes:

• Code Editor: A text editor with features like syntax highlighting, code completion, and code
folding.

• Debugger: A tool to identify and fix errors in the code, allowing you to step through code
execution, set breakpoints, and inspect variables.

• Project Explorer: A file management system to organize and navigate your project's files and
directories.

• Build Automation: Tools to automate the process of compiling, linking, and packaging your
code.

• Version Control: Integration with version control systems like Git to manage changes to your
codebase.

• Code Analysis: Tools for code refactoring, formatting, and optimization.

• Visualization: Tools for visualizing data, such as graphs, charts, and plots.

• Plugin Support: Ability to extend the IDE's functionality with plugins and extensions.

Here are some popular IDEs for R:

• RStudio: A widely-used and highly-regarded IDE for R, offering features like code completion,
debugging tools, and visualization support.

• R GUI (R Graphical User Interface): A basic IDE that comes bundled with the R installation,
providing a simple interface for writing and running R code.

• Visual Studio Code with R Extension: A lightweight, versatile code editor with an R extension
that offers features like code completion, debugging, and visualization support.

• Eclipse with StatET Plugin: A popular IDE for various programming languages, including R,
offering features like code completion, debugging, and project management.

• Jupyter Notebook with R Kernel: A web-based interactive environment for working with R,
allowing you to create and share documents containing live code, equations, and visualizations.

1.4. R Syntax

• In R, output is printed with the print() function. You can use it to print the value of any
expression or variable. Here are some examples:

• print("Hello, World!") prints the string "Hello, World!".

• x <- 5; print(x) prints the value of x, which is 5.

• print(mean(c(1, 2, 3, 4, 5))) prints the mean of the vector c(1, 2, 3, 4, 5), which is 3.

• You can also use the cat() function to print output, which is similar to print() but has some
differences in how it handles strings and concatenation.

• Note that in R, the result of an expression is automatically printed if it's not assigned to a
variable, so you often don't need to use print() explicitly. For example:
• 5 prints the number 5.

• mean(c(1, 2, 3, 4, 5)) prints the mean of the vector c(1, 2, 3, 4, 5), which is 3.

• However, in scripts or functions, it's often a good practice to use print() or cat() to explicitly
control what output is displayed.
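The difference between print() and cat() described above can be sketched as follows. The helper name show_mean is an assumption made for this illustration only.

```r
# print() shows a value the way the console does, including the [1] index prefix
print("Hello, World!")

x <- 5
print(x)

# cat() writes the raw text, joins its arguments with a space,
# and adds a newline only when "\n" is supplied
cat("The value of x is", x, "\n")

# Inside a function, a value that is not the function's result is not
# auto-printed, so print() is used to display it explicitly
show_mean <- function(values) {
  m <- mean(values)
  print(m)        # explicit print so the value is displayed
  invisible(m)    # return the mean without printing it a second time
}
result <- show_mean(c(1, 2, 3, 4, 5))
```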

1.5. R Comments

• In R, comments are used to add notes or explanations to your code. They are ignored by the
R interpreter, so they don't affect the execution of your code.

To add a comment in R, you can use the following methods:

• Hash symbol (#): Any text following the hash symbol on the same line is considered a
comment.

• Example: x <- 5 # This is a comment

• Comment blocks: You can use the # symbol at the beginning of each line to create a block of
comments.

• Example:

• # This is a comment block

• # You can write multiple lines of comments here

• # And they will all be ignored by R

Some best practices for commenting in R include:

• Use comments to explain complex code or logic

• Use comments to document functions and packages

• Keep comments concise and clear

• Avoid commenting on obvious code (e.g., x <- 5 # assign 5 to x)

• Use English for comments, even if your code is in another language

1.6. R Variables
• In R, a variable is a name given to a value or a set of values. You can think of it as a labeled
container that holds data. Here are some key things to know about variables in R:

• Assignment: Use the assignment operator <- to assign a value to a variable. For example: x <-
5 assigns the value 5 to the variable x.

• Names: Variable names can contain letters, numbers, periods, and underscores, but they can't
start with a number or an underscore. For example: x, y, z_1, my.value, etc.

• Data Types: R has several data types, including:

• Numeric (e.g., x <- 5)

• Character (e.g., x <- "hello")

• Logical (e.g., x <- TRUE)

• Factor (e.g., x <- factor(c("a", "b", "c")))

• Data Frame (e.g., x <- data.frame(a = c(1, 2), b = c("a", "b")))

• Scope: Variables have a scope, which determines their visibility. Global variables are
accessible from anywhere, while local variables are only accessible within the function that
creates them. (Loops such as for and while do not create a new scope in R.)

• Environment: R has an environment, which is a collection of variables and their values. You
can use the ls() function to list all variables in the current environment.

Some common operations with variables in R include:

• Assigning values: x <- 5

• Accessing values: x

• Reassigning values: x <- 10

• Deleting variables: rm(x)

• Listing variables: ls()
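A minimal sketch of the operations listed above, with class() used to inspect the data type of each value. All variable names here are illustrative.

```r
x <- 5                      # assign
x                           # access (auto-prints the value at the console)
x <- 10                     # reassign
type_numeric <- class(x)    # "numeric"

greeting <- "hello"
type_character <- class(greeting)   # "character"

flag <- TRUE
type_logical <- class(flag)         # "logical"

f <- factor(c("a", "b", "c"))
type_factor <- class(f)             # "factor"

df <- data.frame(a = c(1, 2), b = c("a", "b"))
type_df <- class(df)                # "data.frame"

ls()     # list the variable names in the current environment
rm(x)    # delete x
x_still_listed <- "x" %in% ls()     # FALSE after rm(x)
```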

1.7. R Function

• A function is a block of code which only runs when it is called.

• You can pass data, known as parameters, into a function.

• A function can return data as a result.


Creating a Function

• To create a function, use the function() keyword
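As a small sketch, a function is created by assigning function() to a name and called by writing that name followed by parentheses. The name my_function is chosen only for illustration.

```r
# Create a function with no arguments
my_function <- function() {
  print("Hello World!")
}

# Call the function: the code inside the braces runs only now
my_function()
```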

Arguments
• Information can be passed into functions as arguments.

• Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.

• For example, a function can have one argument (fname). When the function is called, we
pass along a first name, which is used inside the function to print the full name.
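One way the fname example could look is sketched below. The function name full_name and the surname "Griffin" are assumptions added for illustration.

```r
# fname is the parameter; the surname is a hypothetical fixed value
full_name <- function(fname) {
  paste(fname, "Griffin")
}

# Each call passes a different first name as the argument
print(full_name("Peter"))
print(full_name("Lois"))
```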

Parameters or Arguments?
• The terms "parameter" and "argument" can be used for the same thing: information that is
passed into a function.

• From a function's perspective:

• A parameter is the variable listed inside the parentheses in the function definition.

• An argument is the value that is sent to the function when it is called.

Number of Arguments

• By default, a function must be called with the correct number of arguments. If your function
expects 2 arguments, you have to call it with 2 arguments, not more and not fewer. Otherwise
R raises an error, unless the missing parameter has a default value or is never used.
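The sketch below illustrates what happens with too few or too many arguments. tryCatch() is used here only so the script keeps running after each error; the function and variable names are illustrative assumptions.

```r
full_name <- function(fname, lname) {
  paste(fname, lname)
}

ok <- full_name("Peter", "Griffin")   # correct number of arguments

# Too few: lname is used inside the function but was never supplied
too_few <- tryCatch(full_name("Peter"),
                    error = function(e) "error")

# Too many: R rejects the unused third argument
too_many <- tryCatch(full_name("Peter", "Lois", "Griffin"),
                     error = function(e) "error")

# A default value makes a parameter optional
with_default <- function(fname, lname = "Griffin") {
  paste(fname, lname)
}
defaulted <- with_default("Peter")
```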
Return Values
• To let a function return a result, use the return() function
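A minimal sketch of return(), using an illustrative function name:

```r
# return() hands the computed value back to the caller
multiply_by_five <- function(x) {
  return(5 * x)
}

a <- multiply_by_five(3)
b <- multiply_by_five(9)
print(a)
print(b)
```

If return() is omitted, an R function returns the value of the last expression it evaluated.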

1.8. R Global Variables


• Variables that are created outside of a function are known as global variables.

• Global variables can be used anywhere, both inside of functions and outside.

• If you create a variable with the same name inside a function, this variable will be local, and
can only be used inside the function. The global variable with the same name will remain as it
was, global and with the original value.
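The shadowing behaviour described above can be sketched like this. The variable and function names are illustrative.

```r
txt <- "awesome"            # global variable

describe <- function() {
  txt <- "fantastic"        # local variable with the same name
  paste("R is", txt)
}

inside <- describe()          # uses the local value
outside <- paste("R is", txt) # the global value is unchanged
print(inside)
print(outside)
```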
Global Assignment Operator
• Normally, when you create a variable inside a function, that variable is local, and can only be
used inside that function.

• To create a global variable inside a function, you can use the global assignment operator <<-

• Use the global assignment operator if you want to change a global variable inside a function
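A short sketch of the <<- operator changing a variable defined outside the function. The names counter and increment are assumptions for this example.

```r
counter <- 0

increment <- function() {
  # <<- modifies the existing outer variable instead of creating a local one
  counter <<- counter + 1
}

increment()
increment()
print(counter)
```

With a plain <- inside increment(), the outer counter would stay at 0, because the assignment would only create a local copy.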

1.9. R Vector
• A vector is simply a list of items that are of the same type.

• To combine the list of items to a vector, use the c() function and separate the items by a
comma.

• For example, a vector variable called fruits can combine several strings

• To create a vector with numerical values in a sequence, use the : operator


• You can also create numerical values with decimals in a sequence, but note that if the last
element does not belong to the sequence, it is not used
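The ways of creating a vector described above might look like this (the values are illustrative):

```r
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Vector of numerical values in a sequence
numbers <- 1:10

# Sequence with decimals: 1.5 2.5 3.5 4.5 5.5 6.5
decimals <- 1.5:6.5

# The last element (6.3) does not belong to the sequence, so it is not used:
# 1.5 2.5 3.5 4.5 5.5
cut_short <- 1.5:6.3

print(fruits)
print(numbers)
print(decimals)
print(cut_short)
```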

Vector Length
• To find out how many items a vector has, use the length() function

Sort a Vector
• To sort items in a vector alphabetically or numerically, use the sort() function
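A brief sketch of length() and sort() on illustrative vectors:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

n_fruits <- length(fruits)          # number of items

sorted_fruits <- sort(fruits)       # alphabetical order
sorted_numbers <- sort(numbers)     # ascending numerical order
descending <- sort(numbers, decreasing = TRUE)

print(n_fruits)
print(sorted_fruits)
print(sorted_numbers)
print(descending)
```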

Access Vectors
• You can access the vector items by referring to its index number inside brackets []. The first
item has index 1, the second item has index 2, and so on
• You can also access multiple elements by referring to different index positions with the c()
function

• You can also use negative index numbers to access all items except the ones specified

• To change the value of a specific item, refer to the index number

• To repeat vectors, use the rep() function
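The access, update, and repeat operations above can be sketched as follows, using illustrative values:

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")

first <- fruits[1]                 # first item (indexing starts at 1)
first_and_third <- fruits[c(1, 3)] # several positions at once
all_but_first <- fruits[-1]        # negative index: everything except item 1

fruits[1] <- "pear"                # change the value at index 1

repeat_each <- rep(c(1, 2, 3), each = 3)    # 1 1 1 2 2 2 3 3 3
repeat_whole <- rep(c(1, 2, 3), times = 3)  # 1 2 3 1 2 3 1 2 3

print(fruits)
print(repeat_each)
print(repeat_whole)
```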


• The : operator, introduced earlier, creates a vector of numerical values in a sequence with
a step of 1.

• To make bigger or smaller steps in a sequence, use the seq() function
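A small sketch of seq() with a larger and a smaller step than the default of 1:

```r
# from, to, and by set the start, the end, and the step size
big_steps <- seq(from = 0, to = 100, by = 20)   # 0 20 40 60 80 100
small_steps <- seq(from = 0, to = 1, by = 0.25) # 0 0.25 0.5 0.75 1

print(big_steps)
print(small_steps)
```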

1.10. R List
• A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.

• To create a list, use the list() function

• You can access the list items by referring to its index number, inside brackets. The first item
has index 1, the second item has index 2, and so on:

• To change the value of a specific item, refer to the index number


• To find out how many items a list has, use the length() function

• To find out if a specified item is present in a list, use the %in% operator

• To add an item to the end of the list, use the append() function

• To add an item to the right of a specified index, add "after=index number" in the append()
function
• You can also remove list items, for example by creating a new, updated list that leaves out
an "apple" item

• You can specify a range of indexes by specifying where to start and where to end the range,
by using the : operator
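The list operations in this section can be sketched in one pass. The item values are illustrative, and the double-bracket form [[ ]] is an addition not mentioned above: it returns the element itself rather than a one-item list.

```r
thislist <- list("apple", "banana", "cherry")

thislist[1]      # a list holding the first item
thislist[[1]]    # the first item itself: "apple"

thislist[2] <- "blackcurrant"        # change item 2
n_items <- length(thislist)          # 3
has_apple <- "apple" %in% thislist   # TRUE

# Add to the end, then add to the right of index 2
thislist <- append(thislist, "orange")
thislist <- append(thislist, "kiwi", after = 2)
# thislist is now: apple, blackcurrant, kiwi, cherry, orange

# New list without the "apple" item (a negative index drops position 1)
newlist <- thislist[-1]

# Range of indexes from 2 to 4
middle <- thislist[2:4]

print(unlist(newlist))
print(unlist(middle))
```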

1.11. R Matrices
• A matrix is a two dimensional data set with columns and rows.

• A column is a vertical representation of data, while a row is a horizontal representation of
data.

• A matrix can be created with the matrix() function.

• Specify the nrow and ncol parameters to set the number of rows and columns

• You can access the items by using [ ] brackets. In an expression such as m[1, 2] (where m is
a matrix), the first number "1" in the bracket specifies the row-position, while the second
number "2" specifies the column-position

• The whole row can be accessed if you specify a comma after the number in the bracket

• The whole column can be accessed if you specify a comma before the number in the bracket

• Use the cbind() function to add additional columns in a Matrix

• Use the rbind() function to add additional rows in a Matrix

• Use negative indexes with the c() function to remove rows and columns in a Matrix

• Use the dim() function to find the number of rows and columns in a Matrix

• Use the length() function to find the total number of elements in a Matrix
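A minimal sketch of the matrix operations above. The matrix name m and its values are assumptions for illustration.

```r
# matrix() fills column by column by default:
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6
m <- matrix(1:6, nrow = 3, ncol = 2)

item <- m[1, 2]        # row 1, column 2
second_row <- m[2, ]   # whole row: comma after the number
second_col <- m[, 2]   # whole column: comma before the number

wider <- cbind(m, c(7, 8, 9))    # add a column
taller <- rbind(m, c(10, 11))    # add a row

# Negative indexes remove row 1 and column 1; what remains here is a
# single column, which R simplifies to a plain vector
smaller <- m[-c(1), -c(1)]

shape <- dim(m)        # rows and columns
total <- length(m)     # total number of elements

print(wider)
print(taller)
```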

1.12. R Arrays
• Compared to matrices, arrays can have more than two dimensions.

• We can use the array() function to create an array, and the dim parameter to specify the
dimensions

• An array with dim=c(4,3,2) holds 4 × 3 × 2 = 24 values, so it can be filled with the values
1 to 24.

• How does dim=c(4,3,2) work?

• The first and second numbers in the bracket specify the number of rows and columns.

• The last number in the bracket specifies how many matrices of that size (layers along the
third dimension) we want.
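As a sketch of how dim works, the array below holds two matrices of 4 rows and 3 columns each. The variable names are illustrative.

```r
values <- 1:24

# 4 rows, 3 columns, 2 layers
multiarray <- array(values, dim = c(4, 3, 2))

# Printing shows the layers one after another, labelled ", , 1" and ", , 2"
print(multiarray)
```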

Accessing Array Element

• You can access the array elements by referring to the index position. You can use the []
brackets to access the desired elements from an array
• To find out if a specified item is present in an array, use the %in% operator

• Use the dim() function to find the size of each dimension of an array (rows, columns, and
so on)

• Use the length() function to find the total number of elements in an array
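Accessing and inspecting an array might look like the sketch below, assuming an array of the values 1 to 24 with dimensions 4 x 3 x 2.

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

# Index order is [row, column, layer]
one_item <- multiarray[2, 3, 2]     # row 2, column 3 of the second layer
first_row <- multiarray[1, , 1]     # whole first row of the first layer
first_col <- multiarray[, 1, 1]     # whole first column of the first layer

has_two <- 2 %in% multiarray        # is the value 2 present?
shape <- dim(multiarray)            # size of each dimension
total <- length(multiarray)         # total number of elements

print(one_item)
print(first_row)
print(first_col)
```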

1.13. R Data Frames

• Data Frames are data displayed in a table format.

• A Data Frame can hold different types of data. While the first column can be character, the
second and third can be numeric or logical. However, all values within a single column should
have the same type.

• Use the data.frame() function to create a data frame


• Use the summary() function to summarize the data from a Data Frame

• We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame

• Use the rbind() function to add new rows in a Data Frame

• Use the cbind() function to add new columns in a Data Frame


• Use negative indexes with the c() function to remove rows and columns in a Data Frame

• Use the dim() function to find the number of rows and columns in a Data Frame

• You can also use the ncol() function to find the number of columns and nrow() to find the
number of rows

• Use the length() function to find the number of columns in a Data Frame (similar to ncol())
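The data frame operations above can be sketched as follows. The column names and all values are made up for illustration.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

print(summary(Data_Frame))          # per-column summary

col_df <- Data_Frame[1]             # single brackets: a one-column data frame
col_vec <- Data_Frame[["Training"]] # double brackets: the column as a vector
col_dollar <- Data_Frame$Training   # $: the column as a vector

# New row supplied as a one-row data frame so column types are preserved
more_rows <- rbind(Data_Frame,
                   data.frame(Training = "Strength", Pulse = 110, Duration = 110))

# New column
more_cols <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))

# Remove the first row and the first column
fewer <- Data_Frame[-c(1), -c(1)]

shape <- dim(Data_Frame)
n_cols <- ncol(Data_Frame)
n_rows <- nrow(Data_Frame)
n_len <- length(Data_Frame)         # same as ncol() for a data frame
```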


1.14. R Factors
• Factors are used to categorize data.

• To create a factor, use the factor() function and add a vector as argument.

• To only print the levels, use the levels() function

• Use the length() function to find out how many items there are in the factor

• To access the items in a factor, refer to the index number, using [] brackets
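A short sketch of a factor, using made-up category values:

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
                        "Pop", "Jazz", "Rock", "Jazz"))

print(music_genre)                  # the items followed by the levels

genre_levels <- levels(music_genre) # only the distinct categories
n_items <- length(music_genre)      # number of items, not number of levels
third <- music_genre[3]             # item at index 3

print(genre_levels)
print(third)
```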

1.15. R Plot
• The plot() function is used to draw points (markers) in a diagram.

• The function takes parameters for specifying points in the diagram.

• Parameter 1 specifies points on the x-axis.

• Parameter 2 specifies points on the y-axis.

• At its simplest, you can use the plot() function to plot two numbers against each other

plot(1, 3) or plot(c(1, 8), c(3, 10))


• You can plot as many points as you like, just make sure you have the same number of points
on both axes

plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))

• For better organization, when you have many values, it is better to use variables

x <- c(1, 2, 3, 4, 5)

y <- c(3, 7, 8, 9, 12)

plot(x, y)

• If you want to draw dots in a sequence, on both the x-axis and the y-axis, use the : operator

plot(1:10)

• The plot() function also takes a type parameter; with the value "l" it draws a line that
connects all the points in the diagram

plot(1:10, type="l")

• The plot() function also accepts other parameters, such as main, xlab and ylab, if you want
to customize the graph with a main title and different labels for the x-axis and y-axis

plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")

• Use col="color" to add a color to the points

plot(1:10, col="red")

• Use cex=number to change the size of the points (1 is default, while 0.5 means 50% smaller,
and 2 means 100% larger)

plot(1:10, cex=2)

• Use pch with a value from 0 to 25 to change the point shape format

plot(1:10, pch=25, cex=2)


• A line graph has a line that connects all the points in a diagram. To create a line, use
the plot() function and add the type parameter with a value of "l"

plot(1:10, type="l")

• The line color is black by default. To change the color, use the col parameter

plot(1:10, type="l", col="blue")

• To change the width of the line, use the lwd parameter (1 is default, while 0.5 means 50%
smaller, and 2 means 100% larger)

plot(1:10, type="l", lwd=2)

• The line is solid by default. Use the lty parameter with a value from 0 to 6 to specify the line
format.

• For example, lty=3 will display a dotted line instead of a solid line

plot(1:10, type="l", lwd=5, lty=3)

• To display more than one line in a graph, use the plot() function together with
the lines() function
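A minimal sketch of drawing two lines in one graph. The data values are made up for illustration.

```r
line1 <- c(1, 2, 3, 4, 5, 10)
line2 <- c(2, 5, 7, 8, 9, 10)

# plot() opens the graph and draws the first line
plot(line1, type = "l", col = "blue")

# lines() adds a second line to the existing graph
lines(line2, type = "l", col = "red")
```

The axes are set by the first plot() call, so the second line is only fully visible if its values fall within the same range.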

1.16. R Scatter Plots


• A "scatter plot" is a type of plot used to display the relationship between two numerical
variables, and plots one dot for each observation.

• It needs two vectors of the same length, one for the x-axis (horizontal) and one for the
y-axis (vertical)

• A bare scatter plot might not be clear to someone who sees the graph for the first time, so
add a header and different labels to describe the scatter plot better

• To compare the plot with another plot, use the points() function
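The scatter plot steps above might be sketched like this. The two data sets, titles, and axis labels are invented for illustration.

```r
# First set of observations (hypothetical values)
x1 <- c(5, 7, 8, 7, 2, 2, 9, 4, 11, 12)
y1 <- c(99, 86, 87, 88, 111, 103, 87, 94, 78, 77)

# Second set of observations to compare against (hypothetical values)
x2 <- c(2, 2, 8, 1, 15, 8, 12, 9, 7, 3)
y2 <- c(100, 105, 84, 105, 90, 99, 90, 95, 94, 100)

# Scatter plot with a header and axis labels
plot(x1, y1,
     main = "Observation of Cars",
     xlab = "Car age",
     ylab = "Car speed",
     col = "red", cex = 2)

# points() adds the second set to the same plot
points(x2, y2, col = "blue", cex = 2)
```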

1.17. R Pie Charts


• A pie chart is a circular graphical view of data. Use the pie() function to draw pie charts.

• You can change the start angle of the pie chart with the init.angle parameter. The value
of init.angle is given in degrees, and the default angle is 0.

• Use the labels parameter to add labels to the pie chart, and use the main parameter to add a
header

• You can add a color to each slice with the col parameter

• To add a list of explanations for each slice, use the legend() function

• The legend can be positioned at any of: bottomright, bottom, bottomleft, left, topleft, top,
topright, right, center
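A sketch that combines the pie chart options above. The slice values, labels, and colors are made up for illustration.

```r
slices <- c(10, 20, 30, 40)
slice_labels <- c("Apples", "Bananas", "Cherries", "Dates")
slice_colors <- c("blue", "yellow", "green", "black")

# Pie chart starting at 90 degrees, with labels, a header, and colors
pie(slices,
    labels = slice_labels,
    main = "Fruits",
    col = slice_colors,
    init.angle = 90)

# Legend in the bottom-right corner, one colored box per slice
legend("bottomright", legend = slice_labels, fill = slice_colors)
```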

1.18. R - Mean, Median and Mode


• Statistical analysis in R is performed using many built-in functions. Most of these functions
are part of the R base package. They take an R vector as input, along with other arguments,
and return the result.

Mean
• It is calculated by taking the sum of the values and dividing by the number of values in a
data series.

• The function mean() is used to calculate this in R.

Median
• The middle value in a sorted data series is called the median. The median() function is used
in R to calculate this value.
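A small sketch of mean() and median() on made-up data:

```r
x <- c(2, 4, 4, 6, 9)

# Mean: sum of the values (25) divided by the number of values (5)
x_mean <- mean(x)

# Median: the middle value of the sorted series 2 4 4 6 9
x_median <- median(x)

# With an even number of values, the median is the average of the two middle ones
even_median <- median(c(1, 3, 5, 7))

print(x_mean)
print(x_median)
print(even_median)
```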

Mode

• The mode is the value that has the highest number of occurrences in a set of data. Unlike
mean and median, mode can be computed for both numeric and character data.

• R does not have a standard built-in function to calculate the mode, so we create a user
function to calculate the mode of a data set in R. This function takes a vector as input and
gives the mode value as output. It can be built from the following functions:

• which.max() - returns the location of the first maximum value in a numeric vector

• tabulate() - counts how often each element (given as a positive integer) occurs in the vector

• match() - returns the positions of the first match of the elements of the first vector
in the second vector.
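One possible user function for the mode, built from the three functions above plus unique(), is sketched here. The name get_mode is an assumption. When several values tie, this version returns the one that appears first in the data.

```r
get_mode <- function(v) {
  uniq <- unique(v)                 # distinct values, in order of appearance
  positions <- match(v, uniq)       # map each item to its position in uniq
  counts <- tabulate(positions)     # how often each distinct value occurs
  uniq[which.max(counts)]           # value with the first maximum count
}

numeric_mode <- get_mode(c(1, 2, 2, 3, 3, 3, 4))
character_mode <- get_mode(c("o", "it", "the", "it", "it"))

print(numeric_mode)
print(character_mode)
```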

1.19. Linear Regression


• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose value
is derived from the predictor variable.

• The general mathematical equation for a linear regression is

y = ax + b
Following is the description of the parameters used −
• y is the response variable.

• x is the predictor variable.

• a and b are constants which are called the coefficients.

Steps to Establish a Regression

• A simple example of regression is predicting weight of a person when his height is known. To
do this we need to have the relationship between height and weight of a person.

The steps to create the relationship are −


• Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.

• Create a relationship model using the lm() function in R.

• Find the coefficients from the model created and create the mathematical equation using
these

• Get a summary of the relationship model to see the errors in prediction, also
called residuals.

• To predict the weight of new persons, use the predict() function in R.

lm() Function

• This function creates the relationship model between the predictor and the response
variable.

• The basic syntax for lm() function in linear regression is −

lm(formula,data)

• formula is a symbolic description of the relation between x and y, written as y ~ x.

• data is the data frame containing the variables to which the formula will be applied.


predict() Function

• The basic syntax for predict() in linear regression is −

• predict(object, newdata)

Following is the description of the parameters used −

• object is the model which is already created using the lm() function.

• newdata is a data frame containing the new values for the predictor variable.
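The regression steps above can be sketched as follows. The height and weight values are made-up sample data for illustration only, and the names people, model, and predicted are assumptions.

```r
# Step 1: sample of observed heights (cm) and weights (kg), hypothetical values
people <- data.frame(
  height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131),
  weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
)

# Step 2: relationship model, weight as response and height as predictor
model <- lm(weight ~ height, data = people)

# Step 3: coefficients b (intercept) and a (slope) of y = ax + b
print(coef(model))

# Step 4: summary, including the residuals
print(summary(model))

# Step 5: predict the weight of a new person with height 170
predicted <- predict(model, newdata = data.frame(height = 170))
print(predicted)
```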

1.20. Multiple Linear Regressions


• Multiple regression is an extension of linear regression to the relationship between more
than two variables. In simple linear regression we have one predictor and one response
variable, but in multiple regression we have more than one predictor variable and one
response variable.

• The general mathematical equation for multiple regression is

y = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −

• y is the response variable.

• a, b1, b2...bn are the coefficients.

• x1, x2, ...xn are the predictor variables.


lm() Function

• This function creates the relationship model between the predictor and the response
variable.

• The basic syntax for lm() function in multiple regression is −

lm(y ~ x1+x2+x3...,data)

Following is the description of the parameters used −

• formula is a symbolic description of the relation between the response variable and the
predictor variables.

• data is the data frame containing the variables to which the formula will be applied.
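A sketch of multiple regression using mtcars, a data set that ships with R. The choice of mpg as the response and disp, hp, and wt as predictors is an assumption for illustration, as are the values of the hypothetical new_car.

```r
# Response: mpg; predictors: disp, hp, wt
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Intercept a and one coefficient b per predictor
print(coef(model))

# Predict mpg for a hypothetical car; newdata needs one column per predictor
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted <- predict(model, newdata = new_car)
print(predicted)
```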

1.21. SST, SSR, SSE


• Sum of Squares Total (SST) – The sum of squared differences between individual data points
($y_i$) and the mean of the response variable ($\bar{y}$).

$SST = \sum_i (y_i - \bar{y})^2$

• Sum of Squares Regression (SSR) – The sum of squared differences between predicted data
points ($\hat{y}_i$) and the mean of the response variable ($\bar{y}$).

$SSR = \sum_i (\hat{y}_i - \bar{y})^2$

• Sum of Squares Error (SSE) – The sum of squared differences between predicted data points
($\hat{y}_i$) and observed data points ($y_i$).

$SSE = \sum_i (\hat{y}_i - y_i)^2$
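The three sums of squares can be computed directly from a fitted model, as in the sketch below. It assumes a simple regression of mpg on wt from R's built-in mtcars data set. For a least-squares fit with an intercept, SST equals SSR + SSE, and SSR / SST is the R-squared value.

```r
model <- lm(mpg ~ wt, data = mtcars)

y <- mtcars$mpg            # observed values
y_hat <- fitted(model)     # predicted values
y_bar <- mean(y)           # mean of the response variable

sst <- sum((y - y_bar)^2)      # total variation
ssr <- sum((y_hat - y_bar)^2)  # variation explained by the regression
sse <- sum((y_hat - y)^2)      # unexplained variation (squared residuals)

print(c(SST = sst, SSR = ssr, SSE = sse))

# Check the decomposition and compare with the reported R-squared
print(all.equal(sst, ssr + sse))
print(all.equal(ssr / sst, summary(model)$r.squared))
```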
