My Learning From Data Science Classes
********************
********************
Price<-c(100,200)   # Create an object Price
Price               # View the object created
length(Price)       # Gives the total number of elements in an object
Character object:
*****************
Names<-c("John","Robert","NA","Catherine")   # Create a character object
length(Names)                                # Gives the total number of elements in Names
Sequence:
*********
Sequence<-seq(1970,2000)               # Assign to Sequence all the integers from 1970 to 2000
Sequence_By<-seq(from=1,to=5,by=0.5)   # Sequence of numbers from 1 to 5 with an interval of 0.5
Sequence_By                            # Result will be: 1.0,1.5,2.0,...,4.5,5.0
Repeat:
******
rep(1,5)           # Repeat 1 five times, result will be 1,1,1,1,1
rep(1:5,2)         # Repeat 1 to 5 two times, result will be 1,2,3,4,5,1,2,3,4,5
rep(1:5,each=2)    # Repeat each of 1 to 5 two times, result will be 1,1,2,2,3,3,4,4,5,5
#Both numeric and character data
mixed<-c(1,2,3,"hi")
mixed              # Result will be "1","2","3","hi"
class(mixed)       # This will give "character", as the whole object is converted into characters
Vectors:
*******
>The simplest structure in R.
>If data has only one dimension, like a set of digits, then a vector can be used to
represent it.
Matrices:
********
>Used when data has two dimensions (rows and columns).
>But contains only data of a single class, e.g. only character or only numeric.
Data Frames:
***********
>It is like a single table with rows and columns of data.
>Its columns can hold different classes of data (one class per column).
Lists:
*****
>Used when data cannot be represented by data frames.
>It contains all kinds of other objects, including other lists or data frames.
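A minimal sketch of the four structures described above, using made-up values (all object names here are hypothetical and only for illustration):

```r
# Vector: one dimension, a single class
price <- c(100, 200, 300)

# Matrix: two dimensions, still a single class
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: a table whose columns may have different classes
df <- data.frame(name = c("John", "Robert"), price = c(100, 200),
                 stringsAsFactors = FALSE)

# List: can hold any objects, including data frames and other lists
lst <- list(prices = price, grid = m, table = df, nested = list(a = 1))

class(price)       # class of the vector
dim(m)             # rows and columns of the matrix
sapply(df, class)  # class of each data frame column
names(lst)         # names of the list elements
```

The list is the only one of the four that can nest the other three, which is why it is the fallback when a data frame is too rigid.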
#Saving an object
save(Names,file="Names.rda")
#Saving the entire workspace
save.image("all_work.RData")
#How do I load my object back?
load("Names.rda")
#Summary of a dataset
summary() gives us a summary of a dataset across each of its columns
(elements)
summary(iris)
# It is good to know the following information before starting work on any
dataset:
1. Presence of a header line.
2. Kind of value separator.
3. Representation of missing values.
4. Notation of comment characters or quotes.
5. Existence of any unfilled or blank lines.
6. Classes of the variables.
2. read.table function:
Reads a file in table format and creates a data frame from it.
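As a rough sketch, each item in the checklist above maps onto a read.table argument. The file contents below are made up and passed inline through text= so the example runs without a file on disk; the column names and the "?" missing-value marker are assumptions for illustration only.

```r
# Hypothetical file contents, supplied inline instead of a file path
raw <- "id;income;status
# rows below are made up
1;2500;Good

2;?;Bad
3;4100;Good"

dat <- read.table(text = raw,
                  header = TRUE,            # 1. first line holds the column names
                  sep = ";",                # 2. value separator
                  na.strings = "?",         # 3. how missing values are written
                  comment.char = "#",       # 4. lines starting with # are ignored
                  blank.lines.skip = TRUE,  # 5. empty lines are dropped
                  colClasses = c("integer", "numeric", "character"))  # 6. classes
str(dat)
```

When reading a real file, the first argument would be the file path instead of text=.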
Data Exploration
****************
****************
Read a file and specify how missing values (NA) are written in that file:
cr<-read.csv("Credit.csv",na.strings=c("","NA"))
Percentile breakup:
******************
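library(dplyr)   # provides %>%, filter(), mutate() and ntile()
quantile(cr$RevolvingUtilizationOfUnsecuredLines,probs=c(1:100)/100)   # 1st to 100th percentile
cr%>%filter(RevolvingUtilizationOfUnsecuredLines<=2)%>%nrow()          # number of rows that would be kept
cr%>%filter(RevolvingUtilizationOfUnsecuredLines<=2)->cr               # keep only those rows
cr$MonthlyIncome<-ifelse(cr$MonthlyIncome==0,NA,cr$MonthlyIncome)      # treat zero income as missing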
For a continuous variable we use the ntile function to divide it into
deciles, i.e. to convert the continuous variable into a categorical variable
***********************************************************************************
**********************************************************************
cr%>%mutate(quantile=ntile(MonthlyIncome,10))%>%group_by(Good_Bad,quantile)%>%
summarize(N=n())%>%filter(Good_Bad=="Bad")->dat
The function below can also be used to get sample data/a subset of the dataset:
*******************************************************************************
library(caret)
indexPC<-createDataPartition(y=cr$Good_Bad,times=1,p=0.70,list=FALSE)   # row indices of a 70% sample, stratified on Good_Bad
train_crC<-cr[indexPC,]    # 70% of the rows
test_crC<-cr[-indexPC,]    # remaining 30% of the rows
Cumulative = TRUE in the BINOM.DIST formula gives us the result for
successes <= x (i.e. the sum of the probabilities from success = 0 to success = x).
Cumulative = FALSE in the BINOM.DIST formula gives us the result for
successes = x only.
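The same two cases can be sketched in R, where dbinom plays the role of Cumulative = FALSE and pbinom the role of Cumulative = TRUE. The values of n, p and x below are made up for illustration.

```r
n <- 10    # number of trials (hypothetical)
p <- 0.2   # probability of success on each trial (hypothetical)
x <- 3     # number of successes of interest (hypothetical)

p_exact <- dbinom(x, size = n, prob = p)          # P(X = x), like Cumulative = FALSE
p_cum   <- pbinom(x, size = n, prob = p)          # P(X <= x), like Cumulative = TRUE
p_sum   <- sum(dbinom(0:x, size = n, prob = p))   # cumulative value summed by hand

c(exact = p_exact, cumulative = p_cum, summed = p_sum)
```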
Hypergeometric Distribution:
***************************
If a selection is made and the selected item is not replaced back into the
population, the population (and so the probability of success) changes from draw to
draw, and the Binomial distribution no longer applies. In this case we use the
Hypergeometric Distribution.
>Excel formula for the Hypergeometric Distribution is as below:
HYPGEOM.DIST()
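In R the corresponding function is dhyper. A small sketch with made-up numbers: a lot of 20 items contains 5 defective ones, and 4 items are drawn without replacement.

```r
defective <- 5    # "successes" in the population (hypothetical)
good      <- 15   # "failures" in the population (hypothetical)
drawn     <- 4    # sample size, drawn without replacement (hypothetical)

# P(exactly 1 defective item among the 4 drawn)
p_one <- dhyper(1, m = defective, n = good, k = drawn)

# Same probability from the counting formula
p_by_hand <- choose(defective, 1) * choose(good, drawn - 1) /
  choose(defective + good, drawn)

c(dhyper = p_one, by_hand = p_by_hand)
```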
Negative Binomial:
*****************
Used to find the probability that a given number of trials is needed to get X successes.
What is the probability that the 30th purchase in my store will happen with the
100th customer, when the probability of purchase for any customer is 20%?
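The store question above can be worked out in R with dnbinom, which counts failures before the target number of successes rather than total trials. This is a sketch of one way to set it up, not the only one.

```r
p         <- 0.20                 # probability of purchase for any customer
successes <- 30                   # we want the 30th purchase ...
trials    <- 100                  # ... to happen with the 100th customer
failures  <- trials - successes   # 70 customers who do not buy

# P(the 30th success occurs exactly on the 100th trial)
prob_nb <- dnbinom(failures, size = successes, prob = p)

# Same value from the formula: 29 successes in the first 99 trials, then a success
prob_by_hand <- choose(trials - 1, successes - 1) * p^successes * (1 - p)^failures

c(dnbinom = prob_nb, by_hand = prob_by_hand)
```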
Geometric Distribution:
**********************
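Used when we are interested in the probability of the first success occurring on the rth
trial.
The same NEGBINOM.DIST formula is used for the Geometric Distribution, with the number of
successes set to 1 and the number of failures set to r-1. Cumulative = FALSE gives the
probability that the first success occurs exactly on the rth trial; Cumulative = TRUE
gives the probability that it occurs on or before the rth trial.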
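A short R sketch of the same idea, using dgeom and pgeom. Like the negative binomial functions, they take the number of failures before the first success, so trial r corresponds to r - 1 failures. The values of p and r are made up.

```r
p <- 0.2   # probability of success on each trial (hypothetical)
r <- 5     # trial on which we want the first success (hypothetical)

p_exact <- dgeom(r - 1, prob = p)   # first success exactly on trial r
p_by_r  <- pgeom(r - 1, prob = p)   # first success on or before trial r

# The geometric case is the negative binomial with one success
p_nb <- dnbinom(r - 1, size = 1, prob = p)

c(exact = p_exact, by_r = p_by_r, negbinom = p_nb)
```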
Data Manipulation:
*****************
To get the value in the first row and third column we can use:
**************************************************************
oj[1,3]
oj[c(1,2,8,456),c(1,3,6)]   # rows 1,2,8,456 of columns 1,3,6
#Selecting only those rows where the brand bought is tropicana:
dat<-oj[oj$brand=='tropicana',]
We can also perform OR/AND operations while selecting rows and columns:
dat1<-oj[oj$brand=='tropicana'|oj$brand=='dominicks',]
head(dat1)
Difference between logical vectors vs. the which statement
#consider vector sales with missing values
sales<-c(100,200,NA,300,400,NA,500,600,700,NA,1000,1500,NA,NA)
#subset data using logical operator
sales[sales>600]
[1] NA NA 700 NA 1000 1500 NA NA   # as you can see, NA is also included in the results of the above logical query
#subset data using which
sales[which(sales>600)]
[1] 700 1000 1500
#Selecting Columns:
dat4<-oj[,c("week","brand")]
head(dat4)
#Adding new columns:
*******************
oj$logInc<-log(oj$INCOME)   # this new column will hold the log of income
Ordering of numbers:
*******************
numbers<-c(10,100,5,8)
order(numbers)    # returns the indices of the numbers in ascending order: 3 4 1 2
order(-numbers)   # returns the indices of the numbers in descending order: 2 1 4 3
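Because order returns indices rather than sorted values, it is mostly used to reorder the rows of a data frame. A small sketch with a made-up data frame (the names df, item and price are hypothetical):

```r
df <- data.frame(item = c("a", "b", "c", "d"),
                 price = c(10, 100, 5, 8),
                 stringsAsFactors = FALSE)

idx <- order(df$price)                                    # indices in ascending price order
sorted_asc  <- df[idx, ]                                  # rows from cheapest to dearest
sorted_desc <- df[order(df$price, decreasing = TRUE), ]   # rows from dearest to cheapest

sorted_asc
sorted_desc
```

Using decreasing = TRUE instead of negating the values also works for character columns, where a minus sign would fail.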
GroupWise operations:
********************
aggregate(oj$price,by=list(oj$brand),mean)   # mean price within each brand
#Cross tabulations
#Number of rows for each brand, split by whether a feature advertisement was run or not
table(oj$brand,oj$feat)
xtabs can also be used for cross tabulation; with a variable on the left of ~ it sums
that variable within each cell:
xtabs(oj$INCOME~oj$brand+oj$feat)   # sum of the incomes by brand and by whether a feature advertisement was run or not
dplyr
*****
1. Works only with Data frames.
#Selecting columns
Suppose we have to select the columns brand, INCOME and feat; we can do that using
the following command:
dat10<-select(oj,brand,INCOME,feat)
#We can drop columns using the - sign before the column name:
dat1<-select(oj,-brand,-INCOME,-feat)
#Arranging data
dat13<-arrange(oj,INCOME)   # arrange the oj dataset in ascending order of INCOME
#Summarizing data
#group wise summaries
*********************
gr_brand<-group_by(oj,brand)
summarize(gr_brand,mean(INCOME),sd(INCOME))
#Find the mean price for all the people whose income is >=10.5.
#Base R code
mean(oj[oj$INCOME>=10.5,"price"])
#dplyr code
summarize(filter(oj,INCOME>=10.5),mean(price))
Pipe operator:
*************
oj%>%filter(INCOME>=10.5)%>%summarize(mean(price))
Subset the data based on price>=2.5, create a column logIncome, and compute the
mean, standard deviation and median of column logIncome:
oj%>%filter(price>=2.5)%>%mutate(logIncome=log(INCOME))%>%
summarize(mean(logIncome),sd(logIncome),median(logIncome))
Code | Value
%d   | Day of month (decimal number)
%m   | Month (decimal number)
%b   | Month (abbreviated)
%B   | Month (full name)
%y   | Year (2 digits)
%Y   | Year (4 digits)
25/Aug/04: "%d/%b/%y"
25-August-2004: "%d-%B-%Y"
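A quick sketch of how these codes are used with as.Date, on the two example strings above. Month names are locale dependent, so the sketch switches the time locale to "C" (English names); that call is an assumption added here so the example behaves the same everywhere.

```r
# Month names (%b, %B) depend on the locale; "C" gives English names
invisible(Sys.setlocale("LC_TIME", "C"))

d1 <- as.Date("25/Aug/04", format = "%d/%b/%y")        # 2-digit year, abbreviated month
d2 <- as.Date("25-August-2004", format = "%d-%B-%Y")   # 4-digit year, full month name

d1
d2
format(d1, "%d %B %Y")   # the same codes also work for printing a date
```

Both strings parse to the same date, which shows that the format string only describes how the text is laid out, not what is stored.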
The months() function returns the month name of a date stored in Date format:
months(fd$FlightDate)           # gives the month of each date in the column
unique(months(fd$FlightDate))   # gives the unique months present in the date column
#The difftime function gives the time interval between two dates in weeks, days or hours
difftime(fd$FlightDate[3000],fd$FlightDate[90])   # units are chosen automatically
difftime(fd$FlightDate[3000],fd$FlightDate[90],units="weeks")
difftime(fd$FlightDate[3000],fd$FlightDate[90],units="days")
difftime(fd$FlightDate[3000],fd$FlightDate[90],units="hours")
#Whenever data has time information along with the date, R uses the POSIXct and POSIXlt
classes to deal with it
date1<-Sys.time()
date1
[1] "2015-03-02 17:35:47 IST"
class(date1)
[1] "POSIXct" "POSIXt"
To use the weekdays() and months() functions, the argument passed needs to be a Date or
POSIXct/POSIXt object, not a character string
weekdays(date1)
[1] "Monday"
months(date1)
[1] "March"
Function (lubridate) | Date
dmy()                | 26/11/2008
ymd()                | 2008/11/26
mdy()                | 11/26/2008
dmy_hm()             | 26/11/2008 20:15
dmy_hms()            | 26/11/2008 20:15:30
Joining Dataframes:
******************
Inner join: Joining two tables based on a key column,such that rows matching in
both tables are selected.
Full outer join: All the rows of both tables are retained, irrespective of any match
between the rows.
Left outer join: All the rows of the left table are retained, along with the matching
rows of the right table.
df1:
  CustomerID Product
1          1 Toaster
2          2 Toaster
3          3 Toaster
4          4   Radio
5          5   Radio
6          6   Radio
df2:
  CustomerID   State
1          2 Alabama
2          4 Alabama
3          6    Ohio
Right Outer Join : All the rows of right table are retained while matching rows of
left table are displayed
>merge(x = df1, y = df2, by = "CustomerID",all.y=TRUE) # Right Join
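All four joins can be sketched with base merge on the two small tables from this section. The data frames are rebuilt here so the block runs on its own; only the all, all.x and all.y arguments change between the joins.

```r
df1 <- data.frame(CustomerID = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerID = c(2, 4, 6),
                  State = c("Alabama", "Alabama", "Ohio"),
                  stringsAsFactors = FALSE)

inner <- merge(x = df1, y = df2, by = "CustomerID")                 # only matching rows
full  <- merge(x = df1, y = df2, by = "CustomerID", all = TRUE)     # all rows of both tables
left  <- merge(x = df1, y = df2, by = "CustomerID", all.x = TRUE)   # all rows of df1
right <- merge(x = df1, y = df2, by = "CustomerID", all.y = TRUE)   # all rows of df2

left   # customers 1, 3 and 5 have no match, so their State is NA
```

In this particular data every CustomerID in df2 also exists in df1, so the right join returns the same rows as the inner join and the full join the same rows as the left join.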
RESHAPE function:
****************
It helps in converting data from wide to long format and from long to wide format.
We can convert the data from one format to the other using the functions melt and
dcast in the reshape2 library
library(reshape2)
person<-c("Sankar","Aiyar","Singh")
age<-c(26,24,25)
weight<-c(70,60,65)
wide<-data.frame(person,age,weight)
the result of the above command will be :
>wide
person age weight
1 Sankar 26 70
2 Aiyar 24 60
3 Singh 25 65
melted<-melt(wide,id.vars="person",value.name="Demo_value")
  person variable Demo_value
1 Sankar      age         26
2  Aiyar      age         24
3  Singh      age         25
4 Sankar   weight         70
5  Aiyar   weight         60
6  Singh   weight         65
>dcast(melted,person~variable,value.var="Demo_value")
  person age weight
1  Aiyar  24     60
2 Sankar  26     70
3  Singh  25     65
b<-"Bat-Man"
c<-"Bat/Man"
strsplit(c,split="/")   # split the string, specifying "/" as the separator
#Sometimes we want to know where a pattern occurs in a vector of strings, so we use the grep
command:
c(b,c)
grep("-",c(b,c))   # returns the index of each element that contains "-" (here 1), not a count
Sometimes we want to find out whether a pattern exists in each string or not:
>c(b,c)
"Bat-Man" "Bat/Man"
>grepl("/",c(b,c))
FALSE TRUE
#order by statement
library(sqldf)
oj_s<-sqldf("Select store,brand,week,logmove,feat,price,income from oj order by income asc")
#Base Plotting
ir<-iris
#To understand the relationship between two continuous variables we use a
bivariate (scatter) plot
plot(x=ir$Petal.Width,y=ir$Petal.Length)   # whatever we want on the x axis we assign to x,
                                           # whatever we want on the y axis we assign to y
#Adding xlabels, ylabels and title
#Adding colors
plot(x=ir$Petal.Width,y=ir$Petal.Length,main="Petal Width Vs Petal Length",
xlab="Petal Width",ylab="Petal Length",col="red")
#Adding a legend
plot(x=ir$Petal.Width,y=ir$Petal.Length,main="Petal Width Vs Petal Length",
xlab="Petal Width",ylab="Petal Length",pch=as.numeric(ir$Species))
legend(0.2,7,c("Setosa","Versicolor","Virginica"),pch=1:3)
#Box Plot:
boxplot(ir$Petal.Length)
Histograms:
**********
hist(ir$Sepal.Width,col="orange")
The parameter labels=TRUE is added to show the count above each bin.
GGPLOT2:
*******
Based on the grammar of graphics: simple syntax, interfaces with ggmap and other
packages.
Grammar of Graphics:
*******************
A plot is composed of: Aesthetic Mapping, Geoms, Statistical Transformation,
Coordinate Systems and Scales.
Aesthetic Mapping: which component of the data appears on the X axis and the Y axis, and
how the color, size, fill and position of elements are related to the data.
p<-ggplot(ch,aes(x=temp,y=dewpoint,colour=season))
Many times the data and the locational information are not in the same file.
Most geospatial data is stored in shapefiles.
Shapefile = Data + Location data
How to extract long/lat data from the Spatial points data frame:
****************************************************************
shape2<-readOGR(dsn="Subway","DoITT_SUBWAY_ENTRACE_01_13SEPT2010")   # "Subway" is the folder name and "DoITT_..." is the file (layer) name
data%>%filter(Industry=="Luxury")->data2
p<-ggplot(data2,aes(x=Company.Advertising,y=Brand.Revenue,size=Brand.Value,color=Brand))
q<-p+geom_point()
q+xlab("Company Advertising in Billions of $")+ylab("Brand Revenue in Billions of $")+
scale_size(range=c(2,4),breaks=c(10,28.1),name="Brand Value $ (Billions)")+
geom_text(aes(label=Brand),hjust=0.7,vjust=1.7)+guides(color=FALSE)+theme_light()+
theme(legend.key=element_rect(fill="light blue",color="black"))+
scale_x_continuous(breaks=seq(0,6,0.1))