
R Programming for Data Analysis (1)

The document provides an overview of R programming for data analysis, covering topics such as data exploration, data types, loops, functions, and advanced visualization techniques. It highlights R's features, including its open-source nature, comprehensive tools for data manipulation, and integration capabilities with other programming languages. Additionally, it explains how to create and manipulate various data structures like vectors, matrices, and data frames, as well as the use of built-in and user-defined functions.

Uploaded by

Devanshi
Copyright
© All Rights Reserved

R Programming for Data Analysis
Agenda

• Exploring Data With R
• Data Types and Data Structures
• Loops and Functions in R
• Fundamentals in R
• Matrices
• Data Frames
• Advanced Visualization With GGPlot
1
Exploring Data With R
Introduction to R

“R is a programming language and software environment for statistical computing and
graphics. It is widely used among data scientists, statisticians, and researchers for data
analysis, data visualization, and machine learning tasks.”

R is an open-source language, which means it is freely available and can be modified
and distributed by anyone. The R software is maintained by the R Core Team and is
supported by a large and active community of users and developers.

Some of the features that make R popular among data scientists and statisticians
include its powerful data analysis and visualization capabilities, its support for a wide
range of statistical techniques and models, and its ability to integrate with other
programming languages and software tools.

R can be used for a variety of applications, including data cleaning and preprocessing,
exploratory data analysis, statistical modeling, machine learning, and data
visualization. It is also frequently used in research and academic settings, as well as in
industry for applications such as financial analysis, marketing research, and predictive
modeling.
How to Download and Install R?

Here are some steps to get started with exploring R:


• Install R on your computer: You can download R from the official website (https://cran.r-project.org/). R is
available for Windows, Mac, and Linux operating systems.

• Install an Integrated Development Environment (IDE): RStudio is a popular IDE for R. It provides a user-friendly
interface for writing R code and analyzing data. You can download RStudio from its website
(https://www.rstudio.com/).
Features of R Programming

R is a programming language and environment that is widely used for statistical
computing and graphics. Here are some of its key features:
o Open-source: R is an open-source language, meaning that its source code is available
to everyone, and it can be freely used and distributed.
o Comprehensive: R has a comprehensive set of tools for data analysis, including
statistical modeling, machine learning, data visualization, and data manipulation.
o Object-oriented: R is an object-oriented language, which means that it uses objects to
represent data and functions. This makes it easy to work with complex data structures.
o Syntax: R has a simple and intuitive syntax that is easy to learn, even for people who
don't have a background in programming.
o Graphics: R has powerful graphics capabilities, allowing users to create a wide range of
static and interactive visualizations.
o Community: R has a large and active community of users and developers who
contribute to its development and provide support through forums, mailing lists, and
other resources.
o Integration: R can be easily integrated with other programming languages, such as
Python and SQL, as well as with other tools for data analysis, such as Excel and
Tableau.
o Packages: R has a large and growing collection of packages that extend its capabilities
for specific tasks, such as time-series analysis, geospatial analysis, and network
analysis.
2
Data Types and Data Structure
Data Types in R

R is a dynamic, object-oriented programming language that provides a wide range of data types for storing and
manipulating data. Some of the commonly used data types in R are:

Double

If you do calculations on numbers, you can use the data type double to represent them.

Doubles are numbers like 3.14, 8.0 and 9.1. They are used to represent continuous variables like the weight or length
of a person.

x <- 6.14
y <- 1.0
z <- 7.0 + 13.9

Integer

Integers are whole numbers. They can be used to represent counting variables, for example the number of students in a
class. Use the function as.integer(), or the L suffix as in 5L, to create objects of type integer.

Complex

Objects of type "complex" are used to represent complex numbers. In statistical analysis you will not need them often.
Use the function as.complex() or complex() to create objects of type complex.
Logical

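The logical data type is used to store Boolean values, which are either TRUE or FALSE.

Logical expressions are often built from comparison operators:

1. Smaller than (<)
2. Larger than (>)
3. Smaller than or equal to (<=)
4. Larger than or equal to (>=)
5. Is equal to (==)
6. Is unequal to (!=)

x <- c(9, 166)
y <- (3 < x) & (x <= 10)

Here y is the logical vector TRUE FALSE, because only 9 is greater than 3 and at most 10. Note that R is case-sensitive,
so x and X are different objects.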

Character

A character object is a sequence of characters enclosed in double quotes ("), for example "x", "test
character" and "hello world!". One way to create character objects is as follows.

x <- c("a", "b", "c")


Date:

Date data type is used to store dates. In R, dates are represented by the "Date" class.

Time:

Time data type is used to store time. In R, times are represented by the "POSIXct" class.

Factors:

A factor is a data type used for categorical data. It is a vector of integers that have labels associated with them.
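
The data types described in this section can be tried out with a short sketch like the one below. All variable names and values are illustrative assumptions, not part of the original slides.

```r
# Atomic types (names and values are illustrative)
weight <- 72.5                                   # double
count  <- as.integer(30)                         # integer; 30L is equivalent
z      <- complex(real = 1, imaginary = 2)       # complex
flag   <- weight > 50                            # logical
label  <- "hello world!"                         # character

# Date, date-time, and factor objects
d  <- as.Date("2024-03-15")
tm <- as.POSIXct("2024-03-15 14:30:00", tz = "UTC")
f  <- factor(c("red", "blue", "red", "green"))

typeof(weight)   # "double"
typeof(count)    # "integer"
class(d)         # "Date"
levels(f)        # "blue" "green" "red"
as.integer(f)    # 3 1 3 2 (the integer codes behind the labels)
```

The last line shows why a factor is described as a vector of integers with labels: each value is stored as the position of its label in levels(f).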
Data Structure in R
In R, there are many built-in data structures that can be used to organize and manipulate data. Some common
examples of data structures in R include vectors, matrices, arrays, lists, and data frames.

Here is a short description of each data structure:

• Vectors: Vectors are one-dimensional arrays that can hold homogeneous data. They can be created using the c()
function.

• Matrices: Matrices are two-dimensional arrays that can hold homogeneous data. They can be created using the
matrix() function.

• Arrays: Arrays are multi-dimensional arrays that can hold homogeneous data. They can be created using the array()
function.

• Lists: Lists are collections of objects that can be of different types and sizes. They can be created using the list()
function.

• Data frames: Data frames are two-dimensional objects whose columns can hold different data types. They can be
created using the data.frame() function.
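
As a minimal sketch, each of the structures listed above can be created as follows. The object names and contents are made up for illustration.

```r
v   <- c(1, 2, 3)                                  # vector
m   <- matrix(1:6, nrow = 2, ncol = 3)             # matrix with 2 rows, 3 columns
a   <- array(1:24, dim = c(2, 3, 4))               # 3-dimensional array
lst <- list(name = "Ann", scores = c(90, 85), passed = TRUE)   # list of mixed types
df  <- data.frame(id = 1:3, city = c("Pune", "Delhi", "Surat")) # data frame

dim(m)        # 2 3
dim(a)        # 2 3 4
lst$scores    # 90 85
str(df)       # compact description of the data frame
```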
3
Loops and Functions in R
Loops

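In R, a loop is a programming construct that allows you to repeatedly execute a block of code. A loop can be useful
when you need to perform a certain operation multiple times, such as processing a large data set. This section covers
three loop constructs, followed by two conditional constructs (if-else and switch) that choose between blocks of code
rather than repeat them:

• For loops
• While loops
• Repeat loops
• If-else statements
• Switch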

While Loop

In the R programming language, a while loop repeatedly executes a block of code for as long as a condition remains
TRUE. Its general form is while (condition) { code block }.

The condition is an expression that is evaluated before each iteration of the loop. If the condition is TRUE, the code
block is executed. After each iteration, the condition is evaluated again. The loop continues until the condition
becomes FALSE.
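
A minimal sketch of a while loop, using illustrative variable names, sums the integers from 1 to 5:

```r
i <- 1
total <- 0
while (i <= 5) {       # condition is checked before every iteration
  total <- total + i
  i <- i + 1           # without this update the loop would never end
}
total  # 15
```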

For Loop

The for loop is used to iterate over a sequence of values, such as a vector or a list. Its general form is
for (variable in sequence) { code block }.

For loops can be used in many different contexts in R, such as iterating over rows or columns of a data frame,
performing calculations on a matrix, or generating sequences of numbers. The key is to identify the task you need to
perform and then figure out how to use the loop construct to accomplish it.
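
As an illustrative sketch (the vector and names are assumptions), a for loop can square each element of a vector:

```r
values  <- c(2, 4, 6)
squares <- numeric(length(values))   # pre-allocate the result vector
for (k in seq_along(values)) {       # k takes the values 1, 2, 3
  squares[k] <- values[k]^2
}
squares  # 4 16 36
```

seq_along() is used instead of 1:length(values) so that the loop body is skipped cleanly when the vector is empty.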

Repeat Loop

The repeat loop executes a block of code indefinitely. It has no condition of its own, so the block must call break
once a stopping condition is met. Its general form is repeat { code block }.
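
A minimal sketch of a repeat loop, with an illustrative stopping rule, doubles a number until it exceeds 100:

```r
n <- 1
repeat {
  n <- n * 2
  if (n > 100) break   # the only way out of a repeat loop
}
n  # 128
```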

If-Else

In R programming, the if-else statement is used to execute different blocks of code based on whether a specified
condition is true or false. Its basic form is if (condition) { code block } else { other code block }.
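
The following sketch wraps an if-else chain in a hypothetical helper function named classify(), which is not part of the original slides:

```r
classify <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

classify(5)    # "positive"
classify(-3)   # "negative"
classify(0)    # "zero"
```

Note that if () tests a single TRUE or FALSE value; for element-wise tests on a whole vector, the ifelse() function is the usual choice.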

Switch

In R programming, the switch function is used to select one of several alternatives based on the value of a given
expression. Its general form is switch(EXPR, CASE1 = RESULT1, CASE2 = RESULT2, ..., DEFAULT).

Here, EXPR is the expression whose value will be used to select the alternative. CASE1, CASE2, etc. are the different
possible values that EXPR can take, and RESULT1, RESULT2, etc. are the corresponding results that will be returned
if EXPR matches that case. DEFAULT is an optional, unnamed final argument that specifies what to return if EXPR
doesn't match any of the cases.
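
As a sketch, switch() can pick a formula by name. The function area() and its shape names are hypothetical and chosen only for illustration:

```r
area <- function(shape, size) {
  switch(shape,
    square = size^2,
    circle = pi * size^2,
    NA_real_               # unnamed last argument acts as the default
  )
}

area("square", 3)     # 9
area("triangle", 3)   # NA, because no case matches
```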
Function In R Programming

In R programming, functions can be classified into two main types: pre-defined (or built-in) functions and
user-defined functions.

Pre-defined functions are available in R without the need to define them. Many are written in the R language itself,
while core functions such as sum() and sqrt() are implemented in compiled code for speed. They are designed to
perform common operations, such as mathematical calculations, data manipulation, and statistical analysis. Examples
of pre-defined functions in R include mean(), sum(), sqrt(), and sd().

R programming language comes with a wide range of built-in functions that can be used for various purposes. Here
is a list of some commonly used built-in functions in R:

• Math Functions: abs, sqrt, exp, log, log10, cos, sin, tan, acos, asin, atan, ceiling, floor, round, trunc.

• Statistical Functions: mean, median, sd, var, cor, cov, min, max, quantile, summary, density, ecdf, t.test,
wilcox.test.

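• Data Manipulation Functions: cbind, rbind, subset, merge, aggregate, apply, tapply, sapply, lapply, reshape. The
functions dcast and melt are often listed alongside these, but they come from add-on packages such as reshape2 and
data.table rather than base R.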

• Data Import and Export Functions: read.csv, read.table, write.csv, write.table, load, save.

• Graphics Functions: plot, hist, boxplot, barplot, pie, lines, points, legend, title, text, abline.

• Miscellaneous Functions: help, class, dim, length, names, str, system.time, rm, print, rnorm, c, seq, rep,
is.numeric, is.integer, is.double, is.character, typeof, paste. Comments are not functions; they are written with
the # character.

These are just a few examples of the many built-in functions available in R. The R community has also developed
many packages that extend the functionality of R with additional functions.

User-defined functions, on the other hand, are created by the user to perform specific tasks that
may not be covered by the pre-defined functions. They are defined using the function keyword and can be given
any valid name, although reusing the name of a pre-defined function masks the original and is best avoided. Once
defined, user-defined functions can be called and used like any pre-defined function in R.

A function is a set of instructions that performs a specific task. It can take zero or more arguments as input and
can return a value or an object.

The basic syntax for creating a function in R is my_function <- function(arg1, arg2) { statements; return(value) }.

Here, my_function is the name of the function, arg1, arg2, etc. are the arguments that the
function takes as input, and value is the value that the function returns.

For example, a simple function might add two numbers together and return the result.
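
A minimal sketch of such a function is shown below. The name add_numbers() and its argument names are illustrative assumptions.

```r
add_numbers <- function(a, b) {
  result <- a + b
  return(result)     # explicit return; the last evaluated value would also be returned
}

add_numbers(3, 4)    # 7
```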
Vectorized Functions and Environments in R Programming

In R programming, vectorization is the process of performing operations on entire vectors of data rather than
individual elements. Vectorized functions take advantage of this feature to operate on entire vectors of data, which
can significantly improve computational efficiency.

Suppose we have two numeric vectors of equal length. To add them together element-wise, we could use a loop to
iterate over each element. Alternatively, we could use a vectorized operator like + to add the vectors in a single
expression. This approach is more efficient because it avoids the overhead of an interpreted loop.
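
The two approaches can be compared in a short sketch. The vectors a and b are made-up example data.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Loop version: one addition per iteration
loop_sum <- numeric(length(a))
for (i in seq_along(a)) {
  loop_sum[i] <- a[i] + b[i]
}

# Vectorized version: a single expression
vec_sum <- a + b

identical(loop_sum, vec_sum)  # TRUE
```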

Environments in R are objects that contain variables and their values. Each environment has a parent
environment, which can be thought of as a container that encloses the environment. The global environment is
the top-level environment that contains all the variables that are defined in the current R session.

Environments can be useful for managing and organizing variables in complex R programs. For example, you
can create a new environment to store variables related to a specific task, and then pass that environment to
functions that need access to those variables.
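
As a sketch of that pattern, the code below stores task-specific variables in a new environment and passes it to a function. The names task_env, threshold, and count_above() are hypothetical.

```r
task_env <- new.env()                       # a fresh, empty environment
assign("threshold", 10, envir = task_env)   # one way to store a variable
task_env$values <- c(4, 12, 25)             # $ works as well

count_above <- function(env) {
  # Reads both variables from the environment that was passed in
  sum(env$values > env$threshold)
}

count_above(task_env)   # 2
ls(task_env)            # "threshold" "values"
```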

Overall, vectorized functions and environments are two important features of R programming that can help
improve the efficiency and organization of your code.
The Apply Family of Functions in R Programming

In R, the "Apply" family of functions is a set of functions that are used to apply a specific function to a set of data
elements, such as a matrix, a data frame or a list. The "Apply" functions are particularly useful when working with
large datasets, as they provide a way to efficiently perform operations on groups of data without having to write
complicated loops.

Here are some of the most commonly used "Apply" functions in R:

• apply(): This function applies a function to either rows or columns of a matrix or data frame.

• lapply(): This function applies a function to each element of a list and returns a list.

• sapply(): This function is similar to lapply(), but it tries to simplify the result to a vector or matrix if
possible.

• mapply(): This function applies a function to multiple lists or vectors element-wise.

These functions provide an efficient and easy-to-use way to apply a function to a set of data elements in R.
By using these functions, you can simplify your code and save time when working with large datasets.
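
The four functions can be compared side by side in a small sketch. The matrix and list used here are illustrative.

```r
m <- matrix(1:6, nrow = 2)             # rows are (1, 3, 5) and (2, 4, 6)
row_totals <- apply(m, 1, sum)         # 1 = over rows:    9 12
col_totals <- apply(m, 2, sum)         # 2 = over columns: 3 7 11

words <- list("a", "bb", "ccc")
lens_list <- lapply(words, nchar)      # returns a list: 1, 2, 3
lens_vec  <- sapply(words, nchar)      # simplified to a vector: 1 2 3

pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```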
4
Fundamentals in R
What is Vector in R Programming?

In R programming, a vector is a basic data structure that represents an ordered collection of elements of the same
data type. It can be created using the c() function or by coercing other data structures such as lists or matrices using
the as.vector() function.

Vectors can be of different data types such as numeric, character, logical, and complex. They can also have different
lengths and can be indexed using numerical or logical values.

Let’s Create Some Vectors

Some common ways to create vectors are:

Creating a Numeric Vector:
To create a numeric vector, use the c() function and separate the values with commas, for example c(1.5, 2, 3.7).

Creating a Character Vector:
To create a character vector, use the c() function and enclose each value in quotes, for example c("a", "b", "c").

Creating a Logical Vector:
To create a logical vector, use the c() function with the values TRUE and FALSE, for example c(TRUE, FALSE, TRUE).

Creating a Sequence Vector:
To create a sequence vector, use the : operator to define a range of values, for example 1:10.

Creating a Repeat Vector:
To create a repeat vector, use the rep() function and specify the value and the number of times to repeat it, for
example rep(0, 5).
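
The vector types above, together with numeric and logical indexing, can be sketched as follows. All values are illustrative.

```r
nums  <- c(1.5, 2, 3.7)          # numeric
chars <- c("a", "b", "c")        # character
flags <- c(TRUE, FALSE, TRUE)    # logical
sq    <- 1:10                    # sequence
reps  <- rep(0, 5)               # repeated value

nums[2]        # 2        (numeric index)
nums[flags]    # 1.5 3.7  (logical index keeps the TRUE positions)
length(sq)     # 10
reps           # 0 0 0 0 0
```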
How to Install Packages in R?

To install packages in R, you can use the install.packages() function. Here are the steps:

1) Open R or RStudio.

2) Make sure you have an active internet connection.

3) Type the command install.packages("package_name") in the R console or in an R script, replacing package_name
with the name of the package you want to install. For example, to install the tidyverse package, you would type
install.packages("tidyverse").

4) Press Enter.

5) R will download the package from CRAN (the Comprehensive R Archive Network) and install it on your computer.
This may take a few minutes, depending on the size of the package and your internet connection.

6) Once the installation is complete, you can load the package into your R session using the library() function. For
example, to load the tidyverse package, you would type library(tidyverse).

That's it! You can now use the functions and features of the installed package in your R code.
5
Matrices

In R, a matrix is a two-dimensional array-like structure that contains elements of the same data type. Matrices are useful for
performing mathematical operations such as linear algebra, and they can be created and manipulated using various functions and
operations in R.

Here's how to create a matrix in R:

• Use the matrix() function to create a matrix. Its main arguments are the data elements to populate the matrix
(data), the number of rows (nrow), and the number of columns (ncol). By default the matrix is filled column by
column; setting byrow = TRUE fills it row by row.

• Because matrix() accepts a vector as its data, converting a vector into a matrix only requires passing the vector
and specifying the number of rows and columns.

• You can also create a matrix using the cbind() or rbind() functions to combine vectors into a matrix by columns
or rows, respectively.

You can manipulate matrices using various functions and operations in R. For example, you can access
individual elements of a matrix using indexing (mat[row, column]) or select a subset of a matrix using slicing
(mat[row_start:row_end, col_start:col_end]). You can also perform mathematical operations on matrices
such as addition, subtraction, multiplication, and division using the standard arithmetic operators (+, -, *, /).
These operators work element by element; matrix multiplication in the linear-algebra sense uses the %*% operator.
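
The following sketch illustrates creation, indexing, and the difference between element-wise and matrix multiplication. The matrices are small made-up examples.

```r
mat    <- matrix(1:6, nrow = 2, ncol = 3)     # filled column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE) # filled row by row
cols   <- cbind(c(1, 2), c(3, 4))             # vectors become columns
rows   <- rbind(c(1, 2), c(3, 4))             # vectors become rows

mat[2, 3]         # 6 (row 2, column 3)
mat[1:2, 2:3]     # the 2 x 2 block made of columns 2 and 3

elementwise <- cols * cols     # 1 9 / 4 16
product     <- cols %*% cols   # 7 15 / 10 22
```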
Naming Dimensions

In R, naming dimensions refers to the process of assigning names to the dimensions of an array or matrix. This can
be useful for making the data more interpretable and for indexing the data using the names of the dimensions
rather than the numeric indices.
To assign names to the dimensions of an array or matrix, you can use the dimnames() function. It is assigned a list
containing one character vector of names per dimension, and the length of each character vector must match the
extent of the corresponding dimension.

For example, consider a matrix mat with 3 rows and 4 columns. You can assign names to its rows and columns with
dimnames(mat) <- list(row_names, col_names), where the placeholder row_names holds 3 names and col_names holds 4.
The elements of the matrix can then be accessed by name, as in mat["some_row", "some_column"].

Similarly, you can assign names to the dimensions of an array. The order of the character vectors in the list
corresponds to the order of the dimensions in the array.
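
A minimal sketch of naming dimensions follows. The specific row, column, and array names are illustrative assumptions.

```r
mat <- matrix(1:12, nrow = 3, ncol = 4)
dimnames(mat) <- list(c("r1", "r2", "r3"),        # 3 row names
                      c("a", "b", "c", "d"))      # 4 column names
mat["r2", "c"]        # 8

arr <- array(1:8, dim = c(2, 2, 2))
dimnames(arr) <- list(c("x1", "x2"), c("y1", "y2"), c("z1", "z2"))
arr["x2", "y1", "z2"] # 6
```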
Visualizing with Matplot

In R, the matplot() function is used to visualize multiple sets of data on the same plot. The matplot() function takes a
matrix or data frame as input, where each column corresponds to a different set of data to be plotted.

Consider an example that uses matplot() to plot two sets of data. In this example, x is a vector of x-coordinates,
and y1 and y2 are vectors of y-coordinates for two different sets of data. The cbind() function is used to combine
y1 and y2 into a matrix named data.

The matplot() function is then used to plot the data. The first argument to matplot() is the x-coordinates (x), and the
second argument is the matrix of y-coordinates (data). The type argument specifies that a line plot should be used (type
= "l"), and the lty and col arguments are used to specify line style and color for each set of data.

The xlab, ylab, and main arguments are used to add labels and a title to the plot, and the legend() function is used to
add a legend to the plot.

This code produces a plot with two lines, one for sin(x) and one for cos(x), with the x-axis labeled "X", the y-axis labeled
"Y", and a title "Multiple Data Sets". A legend is also included in the plot to indicate which line corresponds to which set
of data.
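
The example described above can be reconstructed roughly as follows. The range of x, the line styles, the colors, and the legend position are assumptions, since the original code is not available; the labels and title are taken from the description.

```r
x  <- seq(0, 2 * pi, length.out = 100)   # assumed range for x
y1 <- sin(x)
y2 <- cos(x)
data <- cbind(y1, y2)                    # one column per data set

matplot(x, data, type = "l",
        lty = c(1, 2), col = c("blue", "red"),
        xlab = "X", ylab = "Y", main = "Multiple Data Sets")
legend("topright", legend = c("sin(x)", "cos(x)"),
       lty = c(1, 2), col = c("blue", "red"))
```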
Subsetting in R

Subsetting in R means selecting a subset of a data object (such as a vector, matrix, or data frame) based on certain
conditions or criteria. There are several ways to subset data in R, including:

• Using square brackets [ ]: You can use square brackets to extract a subset of a data object based on the indices of
the elements you want to select. For example, to extract the second and fourth elements of a vector named "x",
you can use x[c(2, 4)]. You can also use logical expressions inside the square brackets to select elements based
on a certain condition. For example, to select all elements in a vector named "x" that are greater than 5, you can
use x[x > 5].

• Using the subset() function: The subset() function allows you to select rows of a data frame based on logical
expressions. For example, to select all rows in a data frame named "df" where the value in the "age" column is
greater than 30, you can use subset(df, age > 30).

• Using the filter() function (dplyr package): The filter() function from the dplyr package is another way to select
rows of a data frame based on logical expressions. For example, to select all rows in a data frame named "df"
where the value in the "age" column is greater than 30, you can use filter(df, age > 30).

These are just a few examples of the many ways you can subset data in R. It's important to choose the appropriate
method based on the type of data object you are working with and the specific subset you want to extract.
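
The base R methods can be tried with the sketch below, which uses made-up data. The dplyr call appears only as a comment so that the block runs without extra packages.

```r
x <- c(3, 8, 1, 9, 6)
x[c(2, 4)]     # 8 9    (by position)
x[x > 5]       # 8 9 6  (by condition)

df <- data.frame(name = c("Ana", "Ben", "Cy"), age = c(28, 35, 41))
older_subset  <- subset(df, age > 30)    # rows where age exceeds 30
older_bracket <- df[df$age > 30, ]       # same rows using square brackets

# With dplyr installed, dplyr::filter(df, age > 30) selects the same rows.
```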
6
Data Frames
Importing Data In R

To import data in R, there are several options depending on the format of the data. Here are some
commonly used methods:

• CSV file: Use the read.csv() function to import data from a CSV file. For example, if your CSV file is
named "data.csv" and is located in the working directory, you can import it with read.csv("data.csv").

• Excel file: Use the readxl package to import data from an Excel file. First, you need to install the package by
running install.packages("readxl"). Then, after loading it with library(readxl), you can use the read_excel()
function and pass it the path to the workbook.

• Text file: Use the read.table() function to import data from a text file. For example, if your text file is named
"data.txt" and is tab-separated, you can import it with read.table("data.txt", header = TRUE, sep = "\t"), where
header = TRUE assumes the first line holds the column names.

• SQL database: Use the RMySQL or RODBC package to connect to a SQL database and import data. For example, to
connect to a MySQL database and import a table named "data", you can open a connection with dbConnect() and then
read the table with dbReadTable(). Note that you need to replace "username", "password", and "database" in the
connection call with your own information. The RODBC package works in a similar way, but can connect to other
types of databases as well.
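
The file-based imports can be tried without any external data by writing a small table to temporary files first. The sketch below uses only base R; the data frame demo and the temporary paths are illustrative, and the Excel and database cases are left out because they need extra packages and a live server.

```r
# Create sample files to read back (stand-ins for data.csv and data.txt)
demo <- data.frame(id = 1:3, score = c(9.5, 7.0, 8.25))
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
write.csv(demo, csv_path, row.names = FALSE)
write.table(demo, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

from_csv <- read.csv(csv_path)                               # comma-separated
from_txt <- read.table(txt_path, header = TRUE, sep = "\t")  # tab-separated

all.equal(from_csv, from_txt)   # TRUE
```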
Exploring Your Dataset in R

Exploring a dataset is an important first step in any data analysis project, and R provides many useful
tools for doing so. Here are a few key functions you can use to explore your datasets in R:

• str(): The str() function provides a quick way to inspect the structure of a dataset. It shows the data
type of each variable along with the overall dimensions. For example, str(mtcars) displays information about
the "mtcars" dataset, including the names of the variables, their data types, and the number of observations.

• summary(): The summary() function provides a summary of the distribution of each variable in the dataset. For
numeric variables, it shows the minimum, maximum, mean, median, and quartiles. For factor (categorical) variables,
it shows the frequency of each level. For example, summary(mtcars) displays these statistics for every column of
the "mtcars" dataset, all of which are numeric.

• head() and tail(): The head() and tail() functions provide a quick way to view the first or last few rows of a
dataset. For example, head(mtcars) and tail(mtcars) display the first and last six rows of the "mtcars" dataset.

• unique(): The unique() function returns the distinct values of a variable in the dataset. For example,
unique(mtcars$gear) displays the unique values of the "gear" variable in the "mtcars" dataset.

• dim() and nrow(): The dim() and nrow() functions provide information about the dimensions of the dataset. dim()
returns a vector of the number of rows and columns in the dataset, while nrow() returns the number of rows only.
For example, dim(mtcars) displays the number of rows and columns in the "mtcars" dataset, and nrow(mtcars)
displays the number of rows only.

These are just a few examples of the many functions available in R for exploring datasets. By using these and other
functions, you can gain a better understanding of the data you are working with, and identify potential issues or areas
of interest for further analysis.
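
These exploration functions can be run together on the built-in mtcars dataset, as in the following sketch:

```r
str(mtcars)                    # structure: 32 observations of 11 numeric variables
summary(mtcars$mpg)            # min, quartiles, median, mean, max of one column
head(mtcars, 3)                # first 3 rows
tail(mtcars, 3)                # last 3 rows
sort(unique(mtcars$gear))      # 3 4 5
dim(mtcars)                    # 32 11
nrow(mtcars)                   # 32
```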
Basic Operation with Data Frames in R

Data frames are one of the most commonly used data structures in R, and they allow you to store and
manipulate tabular data. Here are some basic operations you can perform with data frames in R:

• Creating a data frame
• Viewing a data frame
• Subsetting a data frame
• Adding a new column to a data frame
• Renaming columns in a data frame
• Aggregating data in a data frame
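
Since the original code for these operations is not available, the following is a sketch of one way to carry out each of them in base R. The data frame and its column names are made up for illustration.

```r
# Creating a data frame
df <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  dept  = c("ops", "ops", "dev", "dev"),
  sales = c(10, 20, 30, 50)
)

# Viewing a data frame
head(df)

# Subsetting: rows with sales above 15, two columns only
high <- df[df$sales > 15, c("name", "sales")]

# Adding a new column
df$bonus <- df$sales * 0.1

# Renaming a column
names(df)[names(df) == "sales"] <- "revenue"

# Aggregating: total revenue per department
totals <- aggregate(revenue ~ dept, data = df, FUN = sum)
totals
```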
7
Advanced Visualization With GGPlot
What is Factor?
In R, a factor is a data object that represents categorical or nominal data. It is used to store data that
belongs to a fixed set of categories, where each category is represented by a unique value, called a level.

For example, suppose we have a dataset containing information about the color of cars. The color column
would be a categorical variable, and the possible categories would be "red", "blue", "green", etc. We can
represent this variable as a factor in R by assigning each category a level.

To create a factor in R, we can use the factor() function. The function takes a vector of categorical data
and converts it into a factor object. We can also specify the levels of the factor, which determine the
set of values the factor is allowed to hold and the order in which they are listed.

For example, factor(c("red", "blue", "red", "green")) creates a factor whose levels are "blue", "green", and
"red", because unspecified levels are sorted alphabetically.
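
A minimal sketch of creating factors, with and without explicit levels, is shown below. The vectors are illustrative.

```r
car_colors   <- c("red", "blue", "green", "red", "blue")
color_factor <- factor(car_colors)
levels(color_factor)     # "blue" "green" "red"
table(color_factor)      # counts per level: blue 2, green 1, red 2

# Explicit levels fix the order and allow levels with no observations
sized <- factor(c("low", "high", "low"), levels = c("low", "medium", "high"))
levels(sized)            # "low" "medium" "high"
table(sized)             # low 2, medium 0, high 1
```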


Aesthetic

In ggplot2, the term "aesthetic" refers to a visual property of a plot, such as the color, shape, size,
or position of the plot elements. Aesthetics are typically defined using the aes() function,
which is used to map variables in a dataset to the visual properties of a plot.

For example, let's say we have a dataset with columns for x, y, and group. We can create a scatter plot of x and y, and
color the points based on the group column, with ggplot(data, aes(x = x, y = y, color = group)) + geom_point().

In this example, the aes() function is used to specify the x and y variables for the plot, and the color aesthetic is
used to map the group variable to the color of the points. This results in a scatter plot where the points
are colored differently based on their group membership.

Aesthetics can also be modified in the various geom_* functions, which are used to add different types of
geometric objects to a plot (e.g., points, lines, bars, etc.) and modify their visual properties. For example,
the geom_point() function is used to add points to a scatter plot, and the size parameter can be used to
modify the size of the points. Similarly, the geom_line() function is used to add lines to a plot, and the linetype
parameter can be used to modify the type of line.
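
The scatter plot described in this section can be sketched as follows, assuming the ggplot2 package is installed. The data frame is randomly generated for illustration and its contents are an assumption.

```r
library(ggplot2)

set.seed(1)
data <- data.frame(
  x     = rnorm(30),
  y     = rnorm(30),
  group = rep(c("A", "B", "C"), each = 10)
)

# color is mapped to the group column inside aes();
# size is a fixed property passed to geom_point()
p <- ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point(size = 3)
p
```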
Plotting with Layer in R

In R, there are several ways to create plots, and one of them is by using the ggplot2 package,
which uses the concept of layers to build the plot.

Consider an example that creates a scatter plot with a regression line using ggplot2 and adds layers to customize
the plot.

In this example, we first create a data frame with two variables, x and y, using the rnorm function to generate
random normal values.

We then create the plot object, p, by using the ggplot function and specifying the data frame and the variables to
use for the x and y axes. We add the first layer to the plot using the geom_point function to plot the points.

We then add a second layer to the plot using the geom_smooth function with the method argument set to "lm" to add
a linear regression line to the plot. We also set the se argument to FALSE to remove the shaded confidence band
around the line.

We customize the plot further by adding a title using the ggtitle function, changing the colors of the points and
line using the scale_color_manual function, and changing the x and y axis labels using the xlab and ylab functions.

Finally, we display the plot by printing the plot object p.
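
The steps described above can be reconstructed roughly as follows, assuming ggplot2 is installed. The relationship between x and y, the title, the labels, and the colors are assumptions, because the original code is not available.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(50))
df$y <- 2 * df$x + rnorm(50)          # assumed linear relationship plus noise

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(color = "Points")) +                       # layer 1: points
  geom_smooth(aes(color = "Linear fit"), method = "lm",
              formula = y ~ x, se = FALSE) +                # layer 2: regression line
  ggtitle("Scatter Plot with Regression Line") +
  scale_color_manual(values = c("Points" = "steelblue",
                                "Linear fit" = "firebrick")) +
  xlab("X values") +
  ylab("Y values")
p
```

The constant labels inside aes() give scale_color_manual() something to act on; without a mapped color aesthetic, that scale would have no effect.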


Mapping Vs Setting in R
In ggplot2, mapping means tying an aesthetic to a variable inside aes(), so that the visual property varies with the
data, while setting means fixing an aesthetic to a constant value outside aes(), for example
geom_point(color = "blue"). More broadly, mapping and setting can refer to different concepts in R depending on the
context of the analysis. Here are some possible interpretations of these terms:

Mapping:

• In data analysis, mapping can refer to the process of transforming data from one form to another, such as
converting categorical variables into numerical values or summarizing data by groups. This can be done
using functions like apply or lapply in base R, or map from the purrr package.
• In spatial analysis, mapping can refer to the visualization of geographical data, such as creating maps of
locations, boundaries, or patterns. This can be done using packages like ggplot2, leaflet, or mapview in R.
• In machine learning, mapping can refer to the process of fitting a model to data by finding a mathematical
function that can predict an outcome variable based on input features. This can be done using functions like
lm, glm, or randomForest in R.

Setting:

• In R, setting can refer to the configuration of various options and parameters that control the behavior of R
functions and packages. For example, you can set the working directory, change the plot colors, or adjust
global options using functions like setwd, par, or options in R.
• In data analysis, setting can refer to the context in which data is collected or analyzed, such as the sampling
method, data cleaning procedures, or data transformations. These settings can have a significant impact on
the validity and reliability of the results.
• In statistical modeling, setting can refer to the assumptions and conditions that are made about the data and
the model, such as the distributional assumptions, the linearity assumptions, or the homoscedasticity
assumptions. These settings can affect the accuracy and precision of the model predictions.
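
In the ggplot2 context of this chapter, the distinction between mapping and setting is usually drawn as in the sketch below, which assumes ggplot2 is installed and uses the built-in mtcars data. The choice of columns is illustrative.

```r
library(ggplot2)

# Mapping: color depends on a variable, so it goes inside aes()
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Setting: color is a constant, so it goes outside aes()
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

mapped
fixed
```

The mapped plot gets a legend with one color per cylinder count, while the fixed plot draws every point in blue and has no legend.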
Histogram and Density chart in R

In R, you can create histograms and density plots using various built-in functions and
packages. The base R approach for each chart is described here.

Histogram:

To create a histogram in R, you can use the hist() function, for example hist(x) for a numeric vector x.

Density plot:

To create a density plot in R, you can use the density() function to estimate the probability density function (PDF) of
the data, and then plot the result using the plot() or lines() functions, for example plot(density(x)).
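
Both charts can be produced with base R, as in the following sketch. The simulated data, the number of breaks, and the titles are illustrative assumptions.

```r
set.seed(123)
x <- rnorm(200, mean = 50, sd = 10)    # simulated data

# Histogram of counts
h <- hist(x, breaks = 20, main = "Histogram of x", xlab = "x")

# Kernel density estimate drawn on its own
d <- density(x)
plot(d, main = "Density of x")

# Density curve overlaid on a histogram scaled to densities
hist(x, freq = FALSE, breaks = 20, main = "Histogram with density curve", xlab = "x")
lines(d, col = "red")
```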
Statistical transformations
Statistical transformations are used in data analysis to create new variables from existing variables by
applying mathematical functions or operations to them. In R, statistical transformations can be performed
using various functions and packages. Here are a few examples:

• dplyr package: The dplyr package is a popular package for data manipulation in R. It provides a set
of functions for filtering, summarizing, grouping, and transforming data. One of the most commonly
used functions in dplyr for statistical transformation is mutate(). This function is used to create new
variables by applying mathematical functions to existing variables. For example,
mutate(mtcars, cyl_disp = cyl * disp) creates a new variable cyl_disp by multiplying the values of the cyl and
disp columns in the mtcars data frame.

• base R functions: R also provides several base functions for statistical transformations. For example, the log()
function can be used to compute the natural logarithm of a variable, while the sqrt() function can be used to
compute the square root of a variable. Assigning log(mtcars$disp) and sqrt(mtcars$hp) to new columns creates two
new variables, log_disp and sqrt_hp, from the disp and hp columns, respectively.

• tidyr package: The tidyr package is another popular package for data manipulation in R. It provides functions
for reshaping data, such as gather() and spread() (superseded in current versions of tidyr by pivot_longer() and
pivot_wider()), which can be combined with statistical transformations. For example, the mtcars data frame can be
reshaped into a longer format using the gather() function, after which a new variable log_value is created by
applying the log() function to the value column.

These are just a few examples of how statistical transformations can be performed in R. There are many other
functions and packages that can be used for this purpose depending on the specific requirements of the
analysis.
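
The column transformations described in this section can be sketched in base R alone, so that no extra packages are required. The copy named mt is an illustrative choice that leaves the built-in mtcars untouched.

```r
mt <- mtcars                          # work on a copy of the built-in dataset

mt$cyl_disp <- mt$cyl * mt$disp       # product of two columns
mt$log_disp <- log(mt$disp)           # natural logarithm
mt$sqrt_hp  <- sqrt(mt$hp)            # square root

head(mt[, c("cyl", "disp", "hp", "cyl_disp", "log_disp", "sqrt_hp")], 3)

# With dplyr installed, the first transformation can also be written as
# dplyr::mutate(mtcars, cyl_disp = cyl * disp)
```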
Thank You
