UNIT 2_Advanced Data Structures
ADVANCED DATA
STRUCTURES ON R
Ms Devibala Subramanian
Assistant Professor
PG & Research Department of Computer Science
Sri Ramakrishna College of Arts and Science
Coimbatore
Basics of R
R is a powerful tool for all manner of calculations, data manipulation and scientific computations.
Like most languages, R has its share of mathematical capability, variables, functions and data
types.
Basic Math
Being a statistical programming language, R can certainly be used to do basic math.
In the console there is a right angle bracket (>) where code should be entered. A simple test is 1 + 1.
> 1 + 1
[1] 2
If this returns 2, then everything is great; if not, then something is very, very wrong.
Complicated expressions:
> 1 + 2 + 3
[1] 6
> 3 * 7 * 2
[1] 42
> 4 / 2
[1] 2
> 4 / 3
[1] 1.333333
These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division,
Addition and Subtraction (PEMDAS).
This means operations inside parentheses take priority over other operations. Next on the priority
list is exponentiation. After that, multiplication and division are performed, followed by addition
and subtraction.
This is why the first two lines in the following code have the same result, while the third is
different.
> 4 * 6 + 5
[1] 29
> (4 * 6) + 5
[1] 29
> 4 * (6 + 5)
[1] 44
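Exponentiation fits into the same ordering. The following is a short sketch with values chosen purely for illustration (they do not appear in the examples above); the variable names are hypothetical.

```r
# Exponentiation is evaluated before multiplication, and multiplication before addition.
no_parens   <- 2 + 3 * 4^2    # 4^2 first, then * 3, then + 2
with_parens <- (2 + 3) * 4^2  # parentheses force the addition first
neg_power   <- -2^2           # the exponent is applied before the unary minus
power_neg   <- (-2)^2         # parentheses make the base negative

print(c(no_parens, with_parens, neg_power, power_neg))
```

The last two lines are a common source of surprises: without parentheses, the minus sign is applied after the exponent.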
Variables are an integral part of any programming language and R offers a great deal of flexibility.
Unlike statically typed languages such as C++, R does not require variable types to be declared.
A variable can also hold any R object, such as a function, the result of an analysis or a plot.
A single variable can at one point hold a number, then later hold a character and then later a
number again.
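This flexibility can be sketched as follows. The variable name value and the values assigned to it are illustrative assumptions, not taken from the original.

```r
# The same name can be rebound to values of different types.
value <- 2
first_class <- class(value)    # numeric

value <- "two"
second_class <- class(value)   # character

value <- 2
third_class <- class(value)    # numeric again

value <- sqrt                  # a variable can also hold a function
fourth_class <- class(value)   # function

print(c(first_class, second_class, third_class, fourth_class))
```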
Variable Assignment
There are a number of ways to assign a value to a variable, and again, this does not depend
on the type of value being assigned.
The valid assignment operators are <- and =, with the first being preferred.
> x <- 2
> x
[1] 2
> y = 5
> y
[1] 5
> 3 -> z
> z
[1] 3
The assignment operation can be used successively to assign a value to multiple variables
simultaneously.
> b
[1] 7
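The assignment that produced the output above is not shown. A minimal sketch of successive assignment, assuming a second variable named a (a hypothetical name), could look like this:

```r
# Assignment chains from right to left: 7 is stored in b,
# and the result of that assignment is then stored in a.
a <- b <- 7
print(a)
print(b)
```

Both variables end up holding the same value, but they are independent: changing one afterwards does not change the other.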
A more laborious, though sometimes necessary, way to assign variables is to use the assign
function.
> assign("j", 4)
> j
[1] 4
Variable names can contain any combination of alphanumeric characters along with periods (.)
and underscores ( _ ). However, they cannot start with a number or an underscore.
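One way to check whether a name is acceptable is to compare it with the output of make.names, which returns its input unchanged only when the input is already a syntactically valid name. R rejects names that begin with a digit or an underscore. The helper is_valid_name and the candidate names below are hypothetical and serve only as a sketch.

```r
# make.names() repairs invalid names, so a name is valid
# exactly when make.names() leaves it untouched.
is_valid_name <- function(name) make.names(name) == name

candidates <- c("my.var", "my_var", "var2", "2var", "_var")
validity <- is_valid_name(candidates)
names(validity) <- candidates
print(validity)
```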
The most common form of assignment in the R community is the left arrow (<-).
Using the arrow also has a particularly nice benefit for people coming from languages like SQL, where a
single equal sign (=) tests for equality: assignment and comparison stay visually distinct.
It is generally considered best practice to use actual names, usually nouns, for variables instead of
single letters.
Variables that are no longer needed can be removed with remove or its shortcut rm.
> j
[1] 4
> rm(j)
Removing a variable frees up memory so that R can store more objects, although it does not necessarily free up
memory for the operating system.
To prompt that, use gc, which performs garbage collection and may release unused memory to the
operating system.
> THEVARIABLE
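R variable names are case sensitive, so an uppercase name such as the one above does not resolve if the variable was created in a different case. A minimal sketch, assuming the variable was created as theVariable with the value 17 (both are hypothetical, since the original assignment is not shown):

```r
theVariable <- 17                        # hypothetical name and value

# exists() reports whether a name is bound to an object.
lower_exists <- exists("theVariable")    # the name exactly as it was created
upper_exists <- exists("THEVARIABLE")    # different case, so a different name

print(c(lower_exists, upper_exists))
```

Typing THEVARIABLE directly at the console in this situation stops with an "object not found" error rather than returning FALSE.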
Data Types
There are numerous data types in R that store various kinds of data.
The four main types of data most likely to be used are numeric, character (string), Date/POSIXct
(time-based) and logical (TRUE/FALSE).
The type of data contained in a variable is checked with the class function.
> class(x)
[1] "numeric"
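The same check works for each of the four main types. The sketch below uses illustrative variable names and values that are assumptions, not part of the original; the time zone is fixed to UTC only to keep the example reproducible.

```r
number <- 2
text   <- "data"
day    <- as.Date("2012-06-28")
moment <- as.POSIXct("2012-06-28 17:42", tz = "UTC")
flag   <- TRUE

print(class(number))   # numeric
print(class(text))     # character
print(class(day))      # Date
print(class(moment))   # POSIXct and POSIXt (two classes)
print(class(flag))     # logical
```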
Numeric Data
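The most commonly used numeric type is numeric, which is similar to a float or double in other languages. It handles integers and decimals, both positive and negative, and of course, zero.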
> is.numeric(x)
[1] TRUE
Another important type is integer, which holds whole numbers only. To assign an integer to a variable, append an L to the value.
> i <- 5L
> i
[1] 5
> is.integer(i)
[1] TRUE
Do note that, even though i is an integer, it will also pass a numeric check.
> is.numeric(i)
[1] TRUE
R promotes integers to numeric when needed. This is obvious when multiplying an integer by a
numeric, but importantly it works when dividing an integer by another integer, resulting in a
decimal number.
> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L * 2.8
[1] 11.2
> class(4L * 2.8)
[1] "numeric"
> class(5L)
[1] "integer"
> class(2L)
[1] "integer"
> 5L / 2L
[1] 2.5
> class(5L / 2L)
[1] "numeric"
Character Data
Even though it is not explicitly mathematical, the character (string) data type is very common in
statistical analysis and must be handled with care.
R has two primary ways of handling character data: character and factor.
While they may seem similar on the surface, they are treated quite differently.
x contains the word “data” encapsulated in quotes, while y has the word “data” without quotes and
a second line of information about the levels of y.
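The definitions of x and y are not shown. A minimal sketch, assuming x was created from a plain string and y with the factor function (an assumption that matches the description above):

```r
x <- "data"            # plain character value
y <- factor("data")    # the same word stored as a factor

print(x)               # printed with quotes
print(y)               # printed without quotes, followed by a Levels line
print(levels(y))       # the distinct values the factor can take
```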
Characters are case sensitive, so “Data” is different from “data” or “DATA”.
To find the number of characters in a character (or numeric) value, use the nchar function.
> nchar(x)
[1] 4
> nchar("hello")
[1] 5
> nchar(3)
[1] 1
> nchar(452)
[1] 3
Dates
Dealing with dates and times can be difficult in any language, and to further complicate matters R has
numerous different types of dates.
Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the
number of days (Date) or seconds (POSIXct) since January 1, 1970.
> date1 <- as.Date("2012-06-28")
> class(date1)
[1] "Date"
> as.numeric(date1)
[1] 15519
> date2 <- as.POSIXct("2012-06-28 17:42")
> date2
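How date2 prints depends on the time zone of the R session. The sketch below fixes the time zone to UTC, which is an assumption made here for reproducibility and not part of the original, and shows the number of seconds stored underneath. The names date2_utc and seconds are hypothetical.

```r
# tz = "UTC" is an assumption so that the result does not depend
# on the local time zone of the machine running the code.
date2_utc <- as.POSIXct("2012-06-28 17:42", tz = "UTC")

print(class(date2_utc))            # POSIXct and POSIXt
seconds <- as.numeric(date2_utc)   # seconds since 1970-01-01 00:00 UTC
print(seconds)
print(seconds / 86400)             # the same instant expressed in days
```

Dividing by 86400 (the number of seconds in a day) gives 15519 whole days plus a fraction for the time 17:42, which ties the POSIXct value back to the day count stored in a Date.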
Easier manipulation of date and time objects can be accomplished using the lubridate and chron
packages.
Using functions such as as.numeric or as.Date does not merely change the formatting of an
object but actually changes the underlying type.
> class(date1)
[1] "Date"
> class(as.numeric(date1))
[1] "numeric"
Logical
Logicals are a way of representing data that can be either TRUE or FALSE.
Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while
FALSE * 5 equals 0.
> TRUE * 5
[1] 5
> FALSE * 5
[1] 0
Similar to other types, logicals have their own test, using the is.logical function.
> is.logical(k)
[1] TRUE
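The assignment of k is not shown. A minimal sketch, assuming k holds the value TRUE (an assumption), together with a few contrasting checks:

```r
k <- TRUE                 # assumed value; the original does not show it
print(class(k))           # logical

checks <- c(
  is.logical(k),          # TRUE:  k is a logical
  is.logical(1),          # FALSE: 1 is numeric, even though TRUE equals 1 numerically
  is.logical("TRUE"),     # FALSE: quoted text is a character
  is.logical(NA)          # TRUE:  a bare NA is a logical missing value
)
print(checks)
```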
R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not
to use them, as they are simply variables storing the values TRUE and FALSE and can be
overwritten, which can cause a great deal of frustration as seen in the following example.
> TRUE
[1] TRUE
> T
[1] TRUE
> class(T)
[1] "logical"
> T <- 7
> T
[1] 7
> class(T)
[1] "numeric"
Logicals can result from comparing two numbers or two characters.
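A few such comparisons, with values chosen only for illustration:

```r
# Each comparison returns a logical value.
print(2 == 3)              # FALSE: is 2 equal to 3?
print(2 != 3)              # TRUE:  is 2 different from 3?
print(2 < 3)               # TRUE
print(2 >= 3)              # FALSE
print("data" == "stats")   # FALSE: the two strings differ
print("data" < "stats")    # TRUE:  characters compare in alphabetical order

comparison_class <- class(2 < 3)
print(comparison_class)    # logical
```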
Vectors
A vector is a collection of elements, all of the same type. For example, c("R", "Excel", "SAS", "Excel") is a vector of the character elements “R”, “Excel”,
“SAS”, and “Excel”.
More than being simple containers, vectors in R are special in that R is a vectorized language.
That means operations are applied to each element of the vector automatically, without the need to
loop through the vector.
This is a powerful concept that may seem foreign to people coming from other languages, but it is
one of the greatest things about R.
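A minimal sketch of vectorized operations, using small illustrative vectors that are assumptions rather than data from the original:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

tripled <- x * 3     # every element is multiplied by 3, no loop required
roots   <- sqrt(x)   # functions are applied element by element as well
sums    <- x + y     # two vectors of equal length are added element-wise

print(tripled)
print(roots)
print(sums)
```

In a non-vectorized language each of these lines would typically need an explicit loop over the elements.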
Vectors do not have a dimension, meaning there is no such thing as a column vector or row vector.
These vectors are not like the mathematical vector, where there is a difference between row and
column orientation.
The “c” stands for combine because multiple elements are being combined into a vector.
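Both points can be checked directly. In the sketch below the names tools and more_tools, and the extra element "SPSS", are hypothetical additions for illustration.

```r
tools <- c("R", "Excel", "SAS", "Excel")   # c() combines the elements into a vector

print(length(tools))   # 4 elements
print(dim(tools))      # NULL: a vector has a length but no dimensions

# c() can also combine an existing vector with further elements.
more_tools <- c(tools, "SPSS")
print(length(more_tools))
```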