
UNIT 2_Advanced Data Structures

This document provides an overview of advanced data structures in R, covering basic mathematical operations, variable assignment, and data types including numeric, character, Date/POSIXct, and logical types. It explains how to manipulate variables, including removal and checking their types, as well as the concept of vectors in R. The document emphasizes the flexibility of R as a programming language and its vectorized operations, which allow for efficient data manipulation.

UNIT 2

ADVANCED DATA STRUCTURES ON R
Ms Devibala Subramanian
Assistant Professor
PG & Research Department of Computer Science
Sri Ramakrishna College of Arts and Science
Coimbatore
Basics of R

R is a powerful tool for all manner of calculations, data manipulation and scientific computations.

Like most languages, R has its share of mathematical capability, variables, functions and data
types.

Basic Math
Being a statistical programming language, R can certainly be used to do basic math.

In the console there is a right angle bracket (>) where code should be entered.

Simply test R by running


> 1 + 1
[1] 2

If this returns 2, then everything is great; if not, then something is very, very wrong.
More complicated expressions can be entered in the same way:

> 1 + 2 + 3
[1] 6

> 3 * 7 * 2
[1] 42

> 4 / 2
[1] 2

> 4 / 3
[1] 1.333333

These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division,
Addition and Subtraction (PEMDAS).

This means operations inside parentheses take priority over other operations. Next on the priority
list is exponentiation. After that, multiplication and division are performed, followed by addition
and subtraction.
This is why the first two lines in the following code have the same result, while the third is
different.

> 4 * 6 + 5
[1] 29

> (4 * 6) + 5
[1] 29

> 4 * (6 + 5)
[1] 44

So far, white space has been placed around each operator, such as * and /.

This is not necessary but is encouraged as good coding practice.
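
The examples above do not include exponentiation. As a small illustrative sketch (the expressions and variable names here are additions, not from the original slides), the following shows where the ^ operator falls in the order of operations:

```r
# Exponentiation is evaluated before multiplication
two_times_square <- 2 * 3 ^ 2      # 2 * 9

# Parentheses override the default order
square_of_product <- (2 * 3) ^ 2   # 6 ^ 2

# Unary minus is applied after ^, so this is -(2 ^ 2)
negative_square <- -2 ^ 2

print(c(two_times_square, square_of_product, negative_square))
```

When in doubt about precedence, adding parentheses makes the intent explicit.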


Variables

Variables are an integral part of any programming language and R offers a great deal of flexibility.

Unlike statically typed languages such as C++, R does not require variable types to be declared.

A variable can take on any available data type.

It can also hold any R object such as a function, the result of an analysis or a plot.

A single variable can at one point hold a number, then later hold a character and then later a
number again.
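
A minimal sketch of this flexibility, using a hypothetical variable named value and the class function to show the type at each step:

```r
# The same name can be rebound to values of different types
value <- 10            # a number
class_as_number <- class(value)

value <- "ten"         # now a character
class_as_text <- class(value)

value <- sqrt          # now a function
class_as_function <- class(value)

value <- 10.5          # and a number again
class_again <- class(value)

print(c(class_as_number, class_as_text, class_as_function, class_again))
```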

Variable Assignment

There are a number of ways to assign a value to a variable, and again, this does not depend
on the type of value being assigned.
The valid assignment operators are <- and =, with the first being preferred.

For example, let’s save 2 to the variable x and 5 to the variable y.

> x <- 2
> x
[1] 2

> y = 5
> y
[1] 5

The arrow operator can also point in the other direction.

> 3 -> z
> z
[1] 3
The assignment operation can be used successively to assign a value to multiple variables
simultaneously.

> a <- b <- 7


> a
[1] 7

> b
[1] 7

A more laborious, though sometimes necessary, way to assign variables is to use the assign
function.

> assign("j", 4)
> j
[1] 4
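
One situation where assign can be necessary is when the variable name is only known at run time. The sketch below is illustrative: the "score_" prefix and the values are hypothetical, and get is used to read a variable back by name.

```r
# Build variable names programmatically; the "score_" prefix is hypothetical
for (i in 1:3) {
  assign(paste0("score_", i), i * 10)
}

# get() retrieves a variable by its name given as a string
second_score <- get("score_2")
print(second_score)
```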
Variable names can contain any combination of alphanumeric characters along with periods (.)
and underscores ( _ ).

However, they cannot start with a number or an underscore.
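
One way to check these rules, sketched here with hypothetical candidate names, is the base function make.names, which returns its input unchanged when it is already a valid name and alters it otherwise:

```r
# Hypothetical candidate names: two valid, two invalid
candidates <- c("sales.2012", "sales_2012", "2012sales", "_sales")

# A name is valid if make.names() leaves it unchanged
is_valid <- make.names(candidates) == candidates

print(data.frame(candidates, is_valid))
```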

The most common form of assignment in the R community is the left arrow (<-).

It makes sense, as the variable is sort of pointing to its value.

There is also a particularly nice benefit for people coming from languages like SQL, where a
single equal sign (=) tests for equality.

It is generally considered best practice to use actual names, usually nouns, for variables instead of
single letters.

This provides more information to the person reading the code.


Removing Variables
For various reasons a variable may need to be removed. This is easily done using remove or its
shortcut rm.

> j
[1] 4
> rm(j)

> # now it is gone


> j

Error in eval(expr, envir, enclos): object 'j' not found

This frees up memory so that R can store more objects, although it does not necessarily free up
memory for the operating system.

To encourage that, use gc, which performs garbage collection and may release unused memory to the
operating system.

R automatically does garbage collection periodically, so this function is not essential.
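
A short sketch of removal and garbage collection together. The object name big_vector is hypothetical, and exists is used only to confirm whether the variable is still defined:

```r
# A hypothetical large object used only for illustration
big_vector <- numeric(1e6)
exists_before <- exists("big_vector")

# Remove the variable, then check again
rm(big_vector)
exists_after <- exists("big_vector")

# gc() triggers a garbage collection and returns a memory usage summary;
# invisible() suppresses printing that summary here
invisible(gc())

print(c(exists_before, exists_after))
```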


Variable names are case sensitive

> theVariable <- 17


> theVariable
[1] 17

> THEVARIABLE

Error in eval(expr, envir, enclos): object 'THEVARIABLE' not found

Data Types

There are numerous data types in R that store various kinds of data.

The four main types of data most likely to be used are numeric, character (string), Date/POSIXct
(time-based) and logical (TRUE/FALSE).

The type of data contained in a variable is checked with the class function.

> class(x)
[1] "numeric"
Numeric Data

R excels at running numbers, so numeric data is the most common type in R.

The most commonly used numeric type is simply called numeric.

This is similar to a float or double in other languages.

It handles integers and decimals, both positive and negative, and of course, zero.

A numeric value stored in a variable is automatically assumed to be numeric.

Testing whether a variable is numeric is done with the function is.numeric.

> is.numeric(x)
[1] TRUE
Another important type is integer.

As the name implies this is for whole numbers only, no decimals.

To assign an integer to a variable, it is necessary to append an L to the value.

As with checking for a numeric, the is.integer function is used.

> i <- 5L
> i
[1] 5

> is.integer(i)
[1] TRUE

Do note that, even though i is an integer, it will also pass a numeric check.

> is.numeric(i)
[1] TRUE
R promotes integers to numeric when needed. This is obvious when multiplying an integer by a
numeric, but importantly it works when dividing an integer by another integer, resulting in a
decimal number.

> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L * 2.8
[1] 11.2
> class(4L * 2.8)
[1] "numeric"
> class(5L)
[1] "integer"
> class(2L)
[1] "integer"
> 5L / 2L
[1] 2.5
> class(5L / 2L)
[1] "numeric"
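
When an integer result is wanted instead, base R offers operators that do not promote. The following is a sketch with illustrative variable names: %/% performs integer division, %% gives the remainder, and as.integer drops the decimal part of a numeric.

```r
# Integer division and remainder keep integer inputs as integers
quotient <- 5L %/% 2L
remainder <- 5L %% 2L
quotient_class <- class(quotient)

# as.integer() truncates toward zero rather than rounding
truncated <- as.integer(2.8)

print(c(quotient, remainder, truncated))
print(quotient_class)
```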
Character Data

Even though it is not explicitly mathematical, the character (string) data type is very common in
statistical analysis and must be handled with care.

R has two primary ways of handling character data: character and factor.

While they may seem similar on the surface, they are treated quite differently.

> x <- "data"


> x
[1] "data"

> y <- factor("data")


> y
[1] data
Levels: data

x contains the word “data” encapsulated in quotes, while y has the word “data” without quotes and
a second line of information about the levels of y.
Characters are case sensitive, so “Data” is different from “data” or “DATA”.

To find the number of characters in a character (or numeric) value, use the nchar function.

> nchar(x)
[1] 4
> nchar("hello")
[1] 5
> nchar(3)
[1] 1
> nchar(452)
[1] 3

This will not work for factor data.


> nchar(y)

Error in nchar(y): 'nchar()' requires a character vector
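
If the character count of a factor is needed, one option is to convert it to character first. This is a minimal sketch, not part of the original slides:

```r
y <- factor("data")

# Convert the factor to character before counting
n_chars <- nchar(as.character(y))

# levels() already returns a character vector, so nchar() works on it directly
level_chars <- nchar(levels(y))

print(c(n_chars, level_chars))
```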


Dates

Dealing with dates and times can be difficult in any language, and to further complicate matters R has
numerous different types of dates.

The most useful are Date and POSIXct.

Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the
number of days (Date) or seconds (POSIXct) since January 1, 1970.

> date1 <- as.Date("2012-06-28")


> date1
[1] "2012-06-28"

> class(date1)
[1] "Date"

> as.numeric(date1)
[1] 15519
> date2 <- as.POSIXct("2012-06-28 17:42")
> date2

[1] "2012-06-28 17:42:00 EDT"


> class(date2)

[1] "POSIXct" "POSIXt"


> as.numeric(date2)
[1] 1340919720

Easier manipulation of date and time objects can be accomplished using the lubridate and chron
packages.

Using functions such as as.numeric or as.Date does not merely change the formatting of an
object but actually changes the underlying type.

> class(date1)
[1] "Date"

> class(as.numeric(date1))
[1] "numeric"
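
Because dates are stored as counts from January 1, 1970, ordinary arithmetic works on them. The sketch below uses base R only. The second date and the variable names are illustrative, and tz = "UTC" is set explicitly as an assumption so that the result does not depend on the local time zone.

```r
date1 <- as.Date("2012-06-28")

# Adding a number to a Date moves it forward by that many days
thirty_days_later <- date1 + 30

# Subtracting two Dates gives the gap in days
days_until_christmas <- as.numeric(as.Date("2012-12-25") - date1)

# The origin itself is day zero
origin_value <- as.numeric(as.Date("1970-01-01"))

# POSIXct counts seconds; fixing the time zone makes the count reproducible
date2_utc <- as.POSIXct("2012-06-28 17:42", tz = "UTC")
seconds_since_origin <- as.numeric(date2_utc)

print(thirty_days_later)
print(days_until_christmas)
```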
Logical

Logicals are a way of representing data that can be either TRUE or FALSE.

Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while
FALSE * 5 equals 0.

> TRUE * 5
[1] 5
> FALSE * 5
[1] 0
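
The same numeric equivalence can be seen through explicit conversion. A brief sketch with illustrative variable names:

```r
# Logical to numeric
true_as_number <- as.numeric(TRUE)
false_as_number <- as.numeric(FALSE)

# Logicals can be added like numbers
sum_of_trues <- TRUE + TRUE

# Numeric to logical: zero becomes FALSE, any other number becomes TRUE
zero_as_logical <- as.logical(0)
five_as_logical <- as.logical(5)

print(c(true_as_number, false_as_number, sum_of_trues))
print(c(zero_as_logical, five_as_logical))
```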

Similar to other types, logicals have their own test, using the is.logical function.

> k <- TRUE


> class(k)
[1] "logical"

> is.logical(k)
[1] TRUE
R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not
to use them, as they are simply variables storing the values TRUE and FALSE and can be
overwritten, which can cause a great deal of frustration as seen in the following example.

> TRUE
[1] TRUE
> T
[1] TRUE

> class(T)
[1] "logical"
> T <- 7
> T
[1] 7

> class(T)
[1] "numeric"
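
By contrast, TRUE itself is a reserved word and cannot be reassigned. The sketch below illustrates the difference. The variable names are illustrative, and the tryCatch wrapper is used only so the failed assignment does not stop the script.

```r
# Overwriting T creates a new variable that hides the built-in shortcut
T <- 7
t_is_true_while_masked <- isTRUE(T)

# Removing that variable makes the built-in T visible again
rm(T)
t_is_true_after_rm <- isTRUE(T)

# Attempting the same with TRUE raises an error, which is caught here
attempt <- tryCatch(
  eval(parse(text = "TRUE <- 7")),
  error = function(e) "assignment to TRUE failed"
)

print(c(t_is_true_while_masked, t_is_true_after_rm))
print(attempt)
```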
Logicals can result from comparing two numbers, or characters.

> # does 2 equal 3?


> 2 == 3
[1] FALSE

> # does 2 not equal three?


> 2 != 3
[1] TRUE

> # is two less than three?


> 2 < 3
[1] TRUE

> # is two less than or equal to three?


> 2 <= 3
[1] TRUE
> # is two greater than three?
> 2 > 3
[1] FALSE

> # is two greater than or equal to three?


> 2 >= 3
[1] FALSE

> # is "data" equal to "stats"?


> "data" == "stats"
[1] FALSE

> # is "data" less than "stats"?


> "data" < "stats"
[1] TRUE
Vectors

A vector is a collection of elements, all of the same type.

For instance, c(1, 3, 2, 1, 5) is a vector consisting of the numbers 1, 3, 2, 1, 5, in that order.

Similarly, c("R", "Excel", "SAS", "Excel") is a vector of the character elements, “R”, “Excel”,
“SAS”, and “Excel”.

A vector cannot be of mixed type.
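
If elements of different types are combined, R does not raise an error. Instead it converts them all to a single common type. A minimal sketch with illustrative values:

```r
# Mixing a number, a character and a logical yields a character vector
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

# Mixing a number and a logical yields a numeric vector
number_and_logical <- c(1.5, TRUE)
number_and_logical_class <- class(number_and_logical)

print(mixed)
print(number_and_logical)
```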

Vectors play a crucial, and helpful, role in R.

More than being simple containers, vectors in R are special in that R is a vectorized language.

That means operations are applied to each element of the vector automatically, without the need to
loop through the vector.

This is a powerful concept that may seem foreign to people coming from other languages, but it is
one of the greatest things about R.
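
A short sketch of what vectorization looks like in practice, compared with an explicit loop. The vector values and the variable names are illustrative.

```r
values <- c(1, 2, 3, 4, 5)

# Vectorized: the operation is applied to every element at once
doubled <- values * 2
roots <- sqrt(values)
pairwise_sum <- values + c(10, 20, 30, 40, 50)

# The equivalent explicit loop, shown only for comparison
doubled_loop <- numeric(length(values))
for (i in seq_along(values)) {
  doubled_loop[i] <- values[i] * 2
}

print(doubled)
print(pairwise_sum)
print(identical(doubled, doubled_loop))
```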
Vectors do not have a dimension, meaning there is no such thing as a column vector or row vector.

These vectors are not like the mathematical vector, where there is a difference between row and
column orientation.

The most common way to create a vector is with c.

The “c” stands for combine because multiple elements are being combined into a vector.

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)


> x
 [1]  1  2  3  4  5  6  7  8  9 10
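
The absence of a dimension can be checked directly. In this sketch, dim returns NULL for a plain vector, while a matrix built from the same values (added here purely for contrast) reports its rows and columns:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# A plain vector has a length but no dimension
x_length <- length(x)
x_has_no_dim <- is.null(dim(x))

# A matrix, by contrast, does have a dimension: here 1 row and 10 columns
x_as_row <- matrix(x, nrow = 1)
row_dims <- dim(x_as_row)

print(x_length)
print(x_has_no_dim)
print(row_dims)
```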
