R Programming Language Notes

This document provides an overview of the R programming language and its use for statistical analysis and data science. It compares R to Python and discusses some key features of R, including its use of data frames as the primary data type and its focus on statistical analysis, graphics, and data analysis. The document then provides examples of loading and manipulating data frames in R, including selecting rows and columns, sorting, aggregation, and joining data. It also demonstrates some plotting and string manipulation functions.

Uploaded by

Foster Karmon
Copyright
© © All Rights Reserved

R programming language

R is a programming language and software environment for statistical computing, graphics, and, more recently, data analysis.

● Dynamically typed and interpreted, like Python

● The primary data type is the "data frame", similar to a relational table or a pandas DataFrame.

R
● Easier for experienced programmers
● Tends to be favored by academics, researchers, hard-core data scientists
● Shorter code for complex analysis, statistics, graphics
● Extremely slow!

Python
● Good for beginners and experienced programmers
● Used by software engineers of all types
● Better integrated for general-purpose coding
● Not especially fast

Open CSV files and load data into data frames


In [ ]:
C1 = open('Cities.csv').read()
C2 = open('Countries.csv').read()
In [ ]:
%%R -i C1 -i C2
cities <- read.csv(text=C1)
countries <- read.csv(text=C2)
Data frame introduction

Indexing takes two positions: the first (before the comma) selects rows and the second (after the comma) selects columns.
In [ ]:
%%R
cities[1,2]
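
A minimal sketch of the row/column indexing rule, using a small toy data frame with invented values in place of the real Cities.csv (the name `toy` and its contents are assumptions for illustration only). It shows what each form of indexing returns.

```r
# Toy data frame with invented values (not the real Cities.csv)
toy <- data.frame(
  city = c("Aberdeen", "Bilbao", "Graz"),
  country = c("United Kingdom", "Spain", "Austria"),
  temperature = c(8.1, 11.4, 6.9)
)

cell <- toy[1, 2]        # row 1, column 2: a single value
first_row <- toy[1, ]    # row 1, all columns: a one-row data frame
second_col <- toy[, 2]   # all rows, column 2: a plain vector

print(cell)
print(first_row)
print(second_col)
```

Leaving a position empty means "everything" along that dimension, which is why `toy[1, ]` keeps every column.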

Printing the number of rows and the number of columns


In [ ]:
%%R
print(nrow(cities))
print(ncol(cities))

#Prints first row, and all columns


In [ ]:
%%R
cities[1,]

#Prints rows 1 through 10


In [ ]:
%%R
cities[1:10,]

#Loop over the first 10 rows; on each pass, print the ith row of cities.
The {} play the same role as indentation in a Python for loop.
In [ ]:
%%R
for (i in 1:10) { print(cities[i,]) }

#After comma is columns, so this prints the 2nd column.


In [ ]:
%%R
cities[,2]
# change to cities[,4]
#Prints 5th row, 4th column
In [ ]:
%%R
cities[5,4]
# change to cities[5:10,2:4]

#Will print first 10 rows


In [ ]:
%%R
head(cities,10)
# also show number of rows, tail()

#Will print last 10 rows


In [ ]:
%%R
tail(cities,10)
# also show number of rows, tail()

Basic data operations

#Select single column


In [ ]:
%%R
cities[,'city']
# change to cities['city']

#Select multiple columns (c() combines the column names into a vector)


In [ ]:
%%R
cities[,c('city','temperature')]

#Select all rows in the dataframe where the longitude is less than 0
In [ ]:
%%R
cities[cities$longitude < 0,]
#Select rows and columns
In [ ]:
%%R
cities[cities$latitude > 50 & cities$temperature > 9,
c('city','latitude','temperature')]
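
The row condition before the comma is just a vector of TRUE/FALSE values, one per row. A small sketch with an invented toy data frame (names and values are assumptions, not the real data) makes that visible:

```r
# Toy data with invented values
toy <- data.frame(
  city = c("A", "B", "C", "D"),
  latitude = c(57.2, 43.3, 52.5, 60.1),
  temperature = c(8.1, 11.4, 9.8, 5.0),
  stringsAsFactors = FALSE
)

# & combines the two comparisons element by element
keep <- toy$latitude > 50 & toy$temperature > 9
print(keep)

# Rows where keep is TRUE, restricted to two columns
result <- toy[keep, c("city", "temperature")]
print(result)
```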

#Sort the rows by temperature. order() goes BEFORE the comma because it is the rows we sort.
decreasing = TRUE, passed inside order(), reverses the sort order.
In [ ]:
%%R
cities[order(cities$temperature, decreasing = TRUE),]
# descending country with ascending temperature?
# can use - on string columns with as.numeric(factor())

#Sort by increasing country, then by increasing temperature within each country (like grouping)
# ascending count
In [ ]:
%%R
cities[order(cities$country,cities$temperature),]
NOTE: a minus sign before cities$temperature sorts temperature in decreasing order while
country remains increasing.

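# To sort by decreasing country and increasing temperature instead, the minus sign needs
numbers, so convert the country column with as.numeric(). In R 4.0 and later, read.csv() loads
strings as character vectors rather than factors, so wrap the column in factor() first.
# ascending count
In [ ]:
%%R
cities[order(-as.numeric(factor(cities$country)),cities$temperature),]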
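
A sketch of mixed-direction sorting on a toy data frame with invented values (the names `toy`, `idx`, and `sorted` are assumptions for illustration). Converting the string column to a factor gives integer codes in alphabetical order, so negating the codes reverses the alphabetical order.

```r
# Toy data with invented values
toy <- data.frame(
  city = c("A", "B", "C", "D"),
  country = c("Spain", "Austria", "Spain", "Austria"),
  temperature = c(11.4, 6.9, 15.0, 9.2),
  stringsAsFactors = FALSE
)

# Descending country (negated factor codes), then ascending temperature
idx <- order(-as.numeric(factor(toy$country)), toy$temperature)
sorted <- toy[idx, ]
print(sorted)
```

Here the Spain rows come first (A, then C), followed by the Austria rows (B, then D).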

# Selecting rows and sorting rows both go before the comma, so to do both we use a
"temporary" data frame cities2: first select the rows and columns we want, then order
cities2 by decreasing temperature.
In [ ]:
%%R
cities2 <- cities[cities$longitude < 0 & cities$temperature > 12,
c('city','temperature')]
cities2[order(-cities2$temperature),]
Your Turn
Find all countries that are not in the EU and don't have coastline, together with their populations,
sorted by country name in reverse alphabetical order. Note: equality uses '==' and strings can be
single (') or double (") quoted.
In [6]:
%%R
countries2 <- countries[countries$EU =='no' & countries$coastline == 'no',
c('country','population')]
countries2[order(countries2$country, decreasing=TRUE),]

Aggregation

#Overall average temperature


In [ ]:
%%R
mean(cities$temperature)

#Average temperature of cities in each country


In [ ]:
%%R
aggregate(cities$temperature, by=list(cities$country), FUN=mean)
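
As a sketch on invented toy data, the same grouped mean can also be written with the formula form of `aggregate()`, which labels the output columns with the original names instead of `Group.1` and `x`. The data frame `toy` and its values are assumptions for illustration.

```r
# Toy data with invented values
toy <- data.frame(
  country = c("Spain", "Austria", "Spain", "Austria"),
  temperature = c(11, 7, 15, 9),
  stringsAsFactors = FALSE
)

# by= form, as in the cell above: generic output column names
by_list <- aggregate(toy$temperature, by = list(toy$country), FUN = mean)

# Formula form: "temperature grouped by country"
by_formula <- aggregate(temperature ~ country, data = toy, FUN = mean)

print(by_list)
print(by_formula)
```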

#Overall min and Max


In [ ]:
%%R
print(min(cities$temperature))
print(max(cities$temperature))

#Grouped aggregation by EU and Coastline


In [ ]:
%%R
aggregate(countries$population, by=list(countries$EU,countries$coastline),
FUN=mean)

  EU  Coastline  Average
1 no  no          4.35375
2 yes no          6.99000
3 no  yes        19.59571
4 yes yes        21.37818
#Number of cities west of the Prime Meridian (i.e., longitude < 0) - error then fix
In [ ]:
%%R
cities2 <- cities[cities$longitude < 0,]
nrow(cities2)

Your Turn
Considering only cities with latitude < 40, find the average temperature for each country. Then
considering only cities with latitude > 60, find the average temperature for each country. Remember
print() is needed to see a result unless it's the last line.
In [12]:
%%R
south <- cities[cities$latitude < 40,]
north <- cities[cities$latitude > 60,]
print(aggregate(south$temperature, by=list(south$country), FUN=mean))
print(aggregate(north$temperature, by=list(north$country), FUN=mean))

Joining

#Cities not in the EU with latitude > 50; return city, country, latitude, and whether country has
coastline
In [ ]:
%%R
citiesext <- merge(cities,countries)
citiesext[citiesext$EU == 'no' & citiesext$latitude > 50,
c('city','country','latitude','coastline')]
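
By default `merge()` joins on the columns the two data frames have in common. The sketch below, on invented toy data, checks which columns are shared and then names the join column explicitly with `by`; the data frames `toy_cities` and `toy_countries` are assumptions for illustration.

```r
# Toy data with invented values
toy_cities <- data.frame(
  city = c("A", "B", "C"),
  country = c("Spain", "Austria", "Norway"),
  latitude = c(43.3, 47.1, 60.4),
  stringsAsFactors = FALSE
)
toy_countries <- data.frame(
  country = c("Austria", "Norway", "Spain"),
  EU = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Columns present in both data frames: the default join key
shared <- intersect(names(toy_cities), names(toy_countries))
print(shared)

# Same join with the key spelled out
joined <- merge(toy_cities, toy_countries, by = "country")
print(joined)
```

Spelling out `by` guards against an accidental join on some other column that happens to share a name.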
Miscellaneous features

#String operations - countries with 'ia' in their name


In [ ]:
%%R
countries[grepl('ia',countries$country),]
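
`grepl()` returns one TRUE/FALSE per element, which is why it can sit before the comma as a row selector. A short sketch on a hand-written vector of names (an assumption, not read from Countries.csv):

```r
# Hand-written example names
names_vec <- c("Austria", "Iceland", "Italy", "Latvia")

# TRUE where the name contains the substring "ia"
has_ia <- grepl("ia", names_vec)
print(has_ia)

# Use the logical vector to pick the matching names
matches <- names_vec[has_ia]
print(matches)
```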

#Add fahrenheit column


In [ ]:
%%R
cities['fahrenheit'] <- (cities$temperature * 9/5) + 32
head(cities, 10)

#Print using cat( )


In [ ]:
%%R
cat('Minimum temperature:', min(cities$temperature), '\n')
cat('Maximum temperature:', max(cities$temperature), '\n')

Plotting
Scatterplots

#Temperature versus latitude


In [ ]:
%%R
plot(cities$latitude, cities$temperature)
# add xlab='latitude', ylab='temperature', col='blue', pch=16

#Latitude versus longitude colored by temperature


In [ ]:
%%R
for (i in 1:nrow(cities))
{ if (cities[i,'temperature'] < 7) cities[i,'category'] <- 'blue'
else if (cities[i,'temperature'] < 11) cities[i,'category'] <- 'yellow'
else cities[i,'category'] <- 'red'
}
plot(cities$longitude, cities$latitude, xlab='longitude', ylab='latitude',
col=cities$category, pch=16)
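
The row-by-row loop above can also be written without a loop. As a sketch on an invented vector of temperatures (the name `temps` and its values are assumptions), nested `ifelse()` calls apply the same three thresholds to every element at once:

```r
# Invented temperatures for illustration
temps <- c(5.0, 7.0, 9.5, 11.0, 14.2)

# Same thresholds as the loop: < 7 blue, < 11 yellow, otherwise red
category <- ifelse(temps < 7, "blue",
                   ifelse(temps < 11, "yellow", "red"))
print(category)
```

With the real data this would be assigned to `cities$category` in a single statement.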
Bar charts

#Bar chart showing populations of countries with 'ia' in their name


In [ ]:
%%R
bars <- countries[grepl('ia',countries$country), 'country']
heights <- countries[grepl('ia',countries$country), 'population']
barplot(heights, names.arg=bars, xlab='country', ylab='population',
col='blue')
# add las=2 for vertical labels

Pie charts

#Pie chart showing number of EU countries versus non-EU countries


In [ ]:
%%R
slices <- c(nrow(countries[countries$EU == 'yes',]),
nrow(countries[countries$EU == 'no',]))
labels <- c('EU', 'not EU')
pie(slices, labels)
# add col=c('blue','red')
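
The two `nrow()` counts can also be obtained in one step with `table()`, which counts how often each value occurs. A sketch on an invented vector of EU flags (the name `eu` and its values are assumptions):

```r
# Invented EU membership flags
eu <- c("yes", "no", "yes", "yes", "no")

# Count occurrences of each distinct value
counts <- table(eu)
print(counts)

# Plain vectors suitable for pie(): the counts and their labels
slices <- as.vector(counts)
labels <- names(counts)
```

These two vectors can then be passed to `pie(slices, labels)` in the same way as in the cell above.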
