Da Unit-1
Da Unit-1
Da Unit-1
Unit 1: An overview of R, Vectors, factors, univariate time series, Data frames, matrices, Functions, operators, loops,
Graphics, Revealing views of the data, Data summary, Statistical analysis questions, aims, and strategies; Statistical
models, Distributions: models for the random component, Simulation of random numbers and random samples,
Model assumptions
Text Books:
1. Data Analysis and Graphics Using R – an Example-Based Approach, John Maindonald, W. John Braun, Third
Edition, 2010
An overview of R
R is a programming language.
R is often used for statistical computing and graphical presentation to analyze and visualize data.
Why Use R?
It is a great resource for data analysis, data visualization, data science and machine learning
It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
It is easy to draw graphs in R, like pie charts, histograms, box plot, scatter plot, etc
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has a large community support
It has many packages (libraries of functions) that can be used to solve different problems
How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows, Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.
The screenshot below shows how it may look like when you run R on a Windows PC:
If you type 5 + 5, and press enter, you will see that R outputs 10.
The command line prompt (>) is an invitation to type commands or expressions. Once the command or expression
is complete, and the Enter key is pressed, R evaluates and prints the result in the console window. This allows the
use of R as a calculator.
For example, type 2+2 and press the Enter key. Here is what appears on the screen:
> 2+2
[1] 4
>
The first element is labeled [1] even when, as here, there is just one element! The final > prompt indicates that R is
ready for another command.
Anything that follows a # on the command line is taken as comment and ignored by R.
A continuation prompt, by default +, appears following a carriage return when the command is not yet complete.
For example, an interruption of the calculation of 3*4ˆ2 by a carriage return could appear as
> 3*4ˆ
+2
[1] 48
Multiple commands may appear on one line, with a semicolon (;) as the separator. For example,
> 3*4ˆ2; (3*4)ˆ2
[1] 48
[1] 144
Example
5+5
Print
Unlike many other programming languages, you can output code in R without using a print function:
Example
"Hello World!"
However, R does have a print() function available if you want to use it. This might be useful if you are familiar with
other programming languages, such as Python, which often uses the print() function to output code.
Example
print("Hello World!")
And there are times you must use the print() function to output code, for example when working with for loops
(which you will learn more about in a later chapter):
Example
for (x in 1:10)
{
print(x)
}
Conclusion: It is up to you whether you want to use the print() function to output code. However, when your code
is inside an R expression (for example inside curly braces {} like in the example above), use the print() function to
output the result.
Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it.
To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name:
Example
name <- "John"
age <- 40
name # output "John"
age # output 40
From the example above, name and age are variables, while "John" and 40 are values.
In other programming language, it is common to use = as an assignment operator. In R, we can use both = and <- as
assignment operators.
However, <- is preferred in most cases because the = operator can be forbidden in some context in R.
Data Types
In programming, data type is an important concept.
Variables can store data of different types, and different types can do different things.
In R, variables do not need to be declared with any particular type, and can even change type after they have been
set:
Example
my_var <- 30 # my_var is type of numeric
my_var <- "Sally" # my_var is now of type character (aka string)
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number with or without a decimal, like: 10.5,
55, 787:
Example
x <- 10.5
y <- 55
Integer
Integers are numeric data without decimals. This is used when you are certain that you will never create a variable
that should contain decimals. To create an integer variable, you must use the letter L after the integer value:
Example
x <- 1000L
y <- 55L
Complex
A complex number is written with an "i" as the imaginary part:
Example
x <- 3+5i
y <- 5i
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
Example
x <- 1L # integer
y <- 2 # numeric
Simple Math
In R, you can use operators to perform common mathematical operations on numbers.
The + operator is used to add together two values:
Example
10 + 5
And the - operator is used for subtraction:
Example
10 - 5
sqrt()
The sqrt() function returns the square root of a number:
Example
sqrt(16)
abs()
The abs() function returns the absolute (positive) value of a number:
Example
abs(-4.7)
String Literals
A character, or strings, are used for storing text. A string is surrounded by either single quotation marks, or double
quotation marks:
"hello" is the same as 'hello':
Example
"hello"
'hello'
Multiline Strings
You can assign a multiline string to a variable like this:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
If you want the line breaks to be inserted at the same position as in the code, use the cat() function:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."
cat(str)
String Length
There are many usesful string functions in R.
For example, to find the number of characters in a string, use the nchar() function:
Example
str <- "Hello World!"
nchar(str)
Check a String
Use the grepl() function to check if a character or a sequence of characters are present in a string:
Example
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
paste(str1, str2)
Escape Characters
To insert characters that are illegal in a string, you must use an escape character.
An escape character is a backslash \ followed by the character you want to insert.
An example of an illegal character is a double quote inside a string that is surrounded by double quotes:
Example
str <- "We are the so-called "Vikings", from the north."
str
Result:
Error: unexpected symbol in "str <- "We are the so-called "Vikings"
str
cat(str)
Note that auto-printing the str variable will print the backslash in the output. You can use the cat() function to
print it without backslash.
Other escape characters in R:
Code Result
\\ Backslash
\n New Line
\r Carriage Return
\t Tab
\b Backspace
a>b
You can also run a condition in an if statement, which you will learn much more about in the if..else chapter.
Example
a <- 200
b <- 33
if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
Example
10 + 5
R Arithmetic Operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator Name Example
+ Addition x+y
- Subtraction x-y
* Multiplication x*y
/ Division x/y
^ Exponent x^y
R Assignment Operators
Assignment operators are used to assign values to variables:
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
my_var # print my_var
Note: <<- is a global assigner..
R Comparison Operators
Comparison operators are used to compare two values:
Operator Name Example
== Equal x == y
!= Not equal x != y
R Logical Operators
Logical operators are used to combine conditional statements:
Operator Description
& Element-wise Logical AND operator. It returns TRUE if both elements are TRUE
&& Logical AND operator - Returns TRUE if both statements are TRUE
R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:
Operator Description Example
== Equal x == y
!= Not equal x != y
if (b > a) {
print("b is greater than a")
}
In this example we use two variables, a and b, which are used as a part of the if statement to test whether b is
greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to screen that "b is
greater than a".
R uses curly brackets { } to define the scope in the code.
Else If
The else if keyword is R's way of saying "if the previous conditions were not true, then try this condition":
Example
a <- 33
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
In this example a is equal to b, so the first condition is not true, but the else if condition is true, so we print to
screen that "a and b are equal".
You can use as many else if statements as you want in R.
If Else
The else keyword catches anything which isn't caught by the preceding conditions:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}
In this example, a is greater than b, so the first condition is not true, also the else if condition is not true, so we go to
the else condition and print to screen that "a is greater than b".
You can also use else without else if:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else {
print("b is not greater than a")
}
Nested If Statements
You can also have if statements inside if statements, this is called nested if statements.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
AND
The & symbol (and) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500
OR
The | symbol (or) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, or if c is greater than a:
a <- 200
b <- 33
c <- 500
R While Loops
With the while loop we can execute a set of statements as long as a condition is TRUE:
Example
Print i as long as i is less than 6:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
In the example above, the loop will continue to produce numbers ranging from 1 to 5. The loop will stop at 6
because 6 < 6 is FALSE.
The while loop requires relevant variables to be ready, in this example we need to define an indexing variable, i,
which we set to 1.
Note: remember to increment i, or else the loop will continue forever.
Break
With the break statement, we can stop the loop even if the while condition is TRUE:
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
The loop will stop at 3 because we have chosen to finish the loop by using the break statement when i is equal to 4
(i == 4).
With the next statement, we can skip an iteration without terminating the loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
When the loop passes the value 3, it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
If the loop passes the values ranging from 1 to 5, it prints "No Yahtzee". Whenever it passes the value 6, it prints
"Yahtzee!".
For Loops
A for loop is used for iterating over a sequence:
Example
for (x in 1:10) {
print(x)
}
This is less like the for keyword in other programming languages, and works more like an iterator method as found
in other object-orientated programming languages.
With the for loop we can execute a set of statements, once for each item in a vector, array, list, etc..
You will learn about lists and vectors, etc in a later chapter.
Example
Print every item in a list:
for (x in fruits) {
print(x)
}
Example
Print the number of dices:
for (x in dice) {
print(x)
}
The for loop does not require an indexing variable to set beforehand, like with while loops.
Break
With the break statement, we can stop the loop before it has looped through all the items:
Example
Stop the loop at "cherry":
for (x in fruits) {
if (x == "cherry") {
break
}
print(x)
}
The loop will stop at "cherry" because we have chosen to finish the loop by using the break statement when x is
equal to "cherry" (x == "cherry").
With the next statement, we can skip an iteration without terminating the loop:
Example
Skip "banana":
for (x in fruits) {
if (x == "banana") {
next
}
print(x)
}
When the loop passes "banana", it will skip it and continue to loop.
Yahtzee!
If .. Else Combined with a For Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" If the dice number is 6:
for(x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
If the loop reaches the values ranging from 1 to 5, it prints "No Yahtzee" and its number. When it reaches the value
6, it prints "Yahtzee!" and its number.
Nested Loops
You can also have a loop inside of a loop:
Example
Print the adjective of each fruit in a list:
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parenthesis, like my_function():
Example
my_function <- function() {
print("Hello World!")
}
Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you
want, just separate them with a comma.
The following example has a function with one argument (fname). When the function is called, we pass along a first
name, which is used inside the function to print the full name:
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Parameters or Arguments?
The terms "parameter" and "argument" can be used for the same thing: information that are passed into a function.
A parameter is the variable listed inside the parentheses in the function definition.
Number of Arguments
By default, a function must be called with the correct number of arguments. Meaning that if your function expects 2
arguments, you have to call the function with 2 arguments, not more, and not less:
Example
This function expects 2 arguments, and gets 2 arguments:
my_function("Peter", "Griffin")
If you try to call the function with 1 or 3 arguments, you will get an error:
Example
This function expects 2 arguments, and gets 1 argument:
my_function("Peter")
Default Parameter Value
The following example shows how to use a default parameter value.
Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Return Values
To let a function return a result, use the return() function:
Example
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
The output of the code above will be:
[1] 15
[1] 25
[1] 45
Nested Functions
There are two ways to create a nested function:
Nested_function(Nested_function(2,2), Nested_function(3,3))
Example Explained
The function tells x to add y.
We need to create a new variable called output and give it a value, which is 3 here.
We then print the output with the desired value of "y", which in this case is 5.
Recursion
R also accepts function recursion, which means a defined function can call itself.
Recursion is a common mathematical and programming concept. It means that a function calls itself. This has the
benefit of meaning that you can loop through data to reach a result.
The developer should be very careful with recursion as it can be quite easy to slip into writing a function which
never terminates, or one that uses excess amounts of memory or processor power. However, when written
correctly, recursion can be a very efficient and mathematically-elegant approach to programming.
In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We use the k variable as
the data, which decrements (-1) every time we recurse. The recursion ends when the condition is not greater than
0 (i.e. when it is 0).
To a new developer it can take some time to work out how exactly this works, best way to find out is by testing and
modifying it.
Example
tri_recursion <- function(k) {
if (k > 0) {
result <- k + tri_recursion(k - 1)
print(result)
} else {
result = 0
return(result)
}
}
tri_recursion(6)
Global Variables
Variables that are created outside of a function are known as global variables.
Global variables can be used by everyone, both inside of functions and outside.
Example
Create a variable outside of a function and use it inside the function:
txt <- "awesome"
my_function <- function() {
paste("R is", txt)
}
my_function()
If you create a variable with the same name inside a function, this variable will be local, and can only be used inside
the function. The global variable with the same name will remain as it was, global and with the original value.
Example
Create a variable inside of a function with the same name as the global variable:
txt <- "global variable"
my_function <- function() {
txt = "fantastic"
paste("R is", txt)
}
my_function()
If you try to print txt, it will return "global variable" because we are printing txt outside the function.
my_function()
print(txt)
Also, use the global assignment operator if you want to change a global variable inside a function:
Example
To change the value of a global variable inside a function, refer to the variable by using the global assignment
operator <<-:
txt <- "awesome"
my_function <- function() {
txt <<- "fantastic"
paste("R is", txt)
}
my_function()
Vectors
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c() function and separate the items by a comma.
In the example below, we create a vector variable called fruits, that combine strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
# Print numbers
numbers
To create a vector with numerical values in a sequence, use the : operator:
Example
# Vector with numerical values in a sequence
numbers <- 1:10
numbers
You can also create numerical values with decimals in a sequence, but note that if the last element does not belong
to the sequence, it is not used:
Example
# Vector with numerical decimals in a sequence
numbers1 <- 1.5:6.5
numbers1
# Vector with numerical decimals in a sequence where the last element is not used
numbers2 <- 1.5:6.3
numbers2
Result:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
[1] 1.5 2.5 3.5 4.5 5.5
log_values
Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")
length(fruits)
Sort a Vector
To sort items in a vector alphabetically or numerically, use the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
Access Vectors
You can access the vector items by referring to its index number inside brackets []. The first item has index 1, the
second item has index 2, and so on:
Example
fruits <- c("banana", "apple", "orange")
You can also access multiple elements by referring to different index positions with the c() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
You can also use negative index numbers to access all items except the ones specified:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Print fruits
fruits
Repeat Vectors
To repeat vectors, use the rep() function:
Example
Repeat each value:
repeat_each <- rep(c(1,2,3), each = 3)
repeat_each
Example
Repeat the sequence of the vector:
repeat_times <- rep(c(1,2,3), times = 3)
repeat_times
Example
Repeat each value independently:
repeat_indepent <- rep(c(1,2,3), times = c(5,2,1))
repeat_indepent
numbers
To make bigger or smaller steps in a sequence, use the seq() function:
Example
numbers <- seq(from = 0, to = 100, by = 20)
numbers
Note: The seq() function has three parameters: from is where the sequence starts, to is where the sequence stops,
and by is the interval of the sequence.
Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double,
complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector
types.
# Atomic vector of type character.
print("abc");
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result −
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
When we execute the above code, it produces the following result −
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are
recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
Lists are the R objects which contain elements of different types like − numbers, strings, vectors and another list
inside it. A list can also contain a matrix or a function as its elements. List is created using list() function.
Creating a List
Following is an example to create a list containing strings, numbers, vectors and a logical values.
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
# Access the thrid element. As it is also a list, all its elements will be printed.
print(list_data[3])
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$<NA>
NULL
Merging Lists
You can merge many lists into one list by placing all the lists inside one list() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
list2 <-list(10:14)
print(list2)
print(v1)
print(v2)
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout. They
contain elements of the same atomic types. Though we can create a matrix containing only characters or only
logical values, they are not of much use. We use matrices containing numeric elements to be used in mathematical
calculations.
A Matrix is created using the matrix() function.
Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical clue. If TRUE then the input vector elements are arranged by row.
dimname is the names assigned to the rows and columns.
Example
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators. The result of the operation
is also a matrix.
The dimensions (number of rows and columns) should be same for the matrices involved in the operation.
Arrays are the R data objects which can store data in more than two dimensions. For example − If we create an
array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can
store only data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter
to create an array.
Example
The following example creates an array of two 3x3 matrices each with 3 rows and 3 columns.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
,,2
, , Matrix2
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −
,,1
,,2
[1] 56 68 60
Factors are the data objects which are used to categorize the data and store it as levels. They can store both strings
and integers. They are useful in the columns which have a limited number of unique values. Like "Male, "Female"
and True, False etc. They are useful in data analysis for statistical modeling.
Factors are created using the factor () function by taking a vector as input.
Example
print(data)
print(is.factor(data))
print(factor_data)
print(is.factor(factor_data))
When we execute the above code, it produces the following result −
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
Factors in Data Frame
On creating any data frame with a column of text data, R treats the text column as categorical data and creates
factors on it.
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one
variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain same number of data items.
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
When we execute the above code, it produces the following result −
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result −
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
R packages are a collection of R functions, complied code and sample data. They are stored under a directory
called "library" in the R environment. By default, R installs a set of packages during installation. More packages are
added later, when they are needed for some specific purpose. When we start the R console, only the default
packages are available by default. Other packages which are already installed have to be loaded explicitly to be
used by the R program that is going to use them.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.
library()
When we execute the above code, it produces the following result. It may vary depending on the local settings of
your pc.
Install a New Package
There are two ways to add new R packages. One is installing directly from the CRAN directory and another is
downloading the package to your local system and installing it manually.
Install directly from CRAN
The following command gets the packages directly from CRAN webpage and installs the package in the R
environment. You may be prompted to choose a nearest mirror. Choose the one appropriate to your location.
install.packages("Package Name")
Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the time data
processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and
columns of a data frame but there are situations when we need the data frame in a format that is different from
format in which we received it. R has many functions to split, merge and change the rows to columns and vice-
versa in a data frame.
Joining Columns and Rows in a Data Frame
We can join multiple vectors to create a data frame using the cbind()function. Also we can merge two data frames
using rbind() function.
# Print a header.
cat("# # # # The First data frame\n")
# Print a header.
cat("# # # The Second data frame\n")
# Print a header.
cat("# # # The combined data frame\n")
library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
When we execute the above code, it produces the following result −
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
Melting and Casting
One of the most interesting aspects of R programming is about changing the shape of the data in multiple steps to
get a desired shape. The functions used to do this are called melt() and cast().
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
4 A 65 75 1095 4
5 A 70 60 1512 6
.............
.............
8 A 75 75 2244 11
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
............
............
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
............
............
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
When we execute the above code, it produces the following result −
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
............
............
9 B 60 period 60
10 B 60 period 75
11 B 65 period 60
12 B 65 period 75
13 B 70 period 60
...........
...........
41 A 60 service 127
42 A 60 service 63
43 A 65 service 1095
...........
...........
70 D 70 service 1208
71 D 75 service 0
72 D 75 service 2051
73 E 60 service 45
74 E 60 service 0
75 E 65 service 789
...........
...........
101 C 70 incidents 6
102 C 70 incidents 2
103 C 75 incidents 0
104 C 75 incidents 1
105 D 60 incidents 0
106 D 60 incidents 0
...........
...........
Cast the Molten Data
We can cast the molten data into a new form where the aggregate of each type of ship for each year is created. It is
done using the cast() function.
recasted.ship <- cast(molten.ships, type+year~variable,sum)
print(recasted.ship)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
In R, we can read data from files stored outside the R environment. We can also write data into files which will be
stored and accessed by the operating system. R can read and write into various file formats like csv, excel, xml etc.
In this chapter we will learn to read data from a csv file and then write data into a csv file. The file should be
present in current working directory so that R can read it. Of course we can also set our own directory and read
files from there.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new
working directory using setwd()function.
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
You can create this file using windows notepad by copying and pasting this data. Save the file as input.csv using the
save As All files(*.*) option in notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Reading a CSV File
Following is a simple example of read.csv() function to read a CSV file available in your current working directory −
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −
[1] TRUE
[1] 5
[1] 8
Once we read data in a data frame, we can apply all the functions applicable to data frames as explained in
subsequent section.
[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria similar to a SQL where clause.