Data Mining Project
ANALYSIS
Roll no: 2K17/CS/02
Contents
1. Introduction
2. Data Preprocessing
2.1 Data Cleaning
2.1.1 Initial Exploratory Analysis
2.1.2 Visualization of Data
2.1.2.1 Histograms
2.1.2.2 BoxPlots
2.1.3 Cleaning the Errors
2.2 Data Transformation
2.3 Data Reduction
3. Data Evaluation and Presentation
3.1 Libraries Used
3.2 Color Vectors
3.3 Analysis by Graphs
3.3.1 Hour wise
3.3.2 Hour-Month wise
3.3.3 Day wise
3.3.4 Day-Month wise
3.3.5 Month-Week wise
3.3.6 Day-Hour wise
4. Decision Making
4.1 Fare Manipulation
4.2 Availability of Cabs
4.3 Discount Offers
5. Bibliography
Introduction
For any kind of business in today's scenario, data collection and data analysis are as important as providing the goods and services themselves. Data collection has no meaning unless the data is processed and analyzed into a form that can be understood by different people and applied in the real environment.
Raw data needs to be cleaned and represented in forms which are useful; of the different forms available, visual graphs are among the most beneficial.
This project deals with a cab service data set, which is analyzed and processed into graphs using data visualization in R. The visualization makes use of important R libraries and of concepts from Data Science and Data Mining. The results are graphs which can be used to apply the analyzed insights to the business and make constructive decisions.
Data Preprocessing
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. It consists of three steps:
1. Data Cleaning
2. Data Transformation
3. Data Reduction
Data Cleaning
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Such data is usually not necessary or helpful during analysis, because it may hinder the process or produce inaccurate results.
1. Initial Exploratory Analysis
setwd("D:/CabridesAnalysis/CabridesDataset")
apr_data <- read.csv("cabrides-raw-data-apr.csv", header = TRUE, stringsAsFactors = FALSE)
[1] "data.frame"
dim(sep_data)
[1] 1028136 4
sum(is.na(apr_data))
[1] 1
summary(apr_data)
View(apr_data)
apr_data$Base <- replace(apr_data$Base, apr_data$Base == "B02512", NA)
sum(is.na(apr_data))
[1] 35535
2. Visualization of Data
There are two types of plots that are especially useful during the cleaning process: the histogram and the boxplot.
1. Histogram
hist(apr_data$Lat)
hist(apr_data$Lon)
2. BoxPlot
Boxplots are very useful because they show the median along with the first and third quartiles, and they are one of the best ways of spotting outliers in a data frame.
boxplot(apr_data$Lat)
3. Cleaning the Errors
Following the above two steps, we have come across some inconsistent or null values which could significantly hamper the result of the analysis, so we now correct all the errors which might affect us.
sum(is.na(apr_data))
[1] 35535
View(apr_data)
any(is.integer(apr_data))
[1] FALSE
sum(is.na(apr_data))
[1] 35535
[1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[29] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[57] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[85] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[113] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[141] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[169] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[197] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[225] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
10
10
[253] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[281] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[309] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[337] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[365] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[393] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[421] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[449] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[477] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[505] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[533] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[561] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[589] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[617] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[645] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[673] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[701] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
11
11
[729] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[757] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[785] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[813] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[841] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[869] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[897] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[925] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[953] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[981] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0"
[ reached getOption("max.print") -- omitted 563516 entries
]
# Keep only the first 100,000 rows of each month so the combined data stays manageable
apr_data <- apr_data[1:100000, ]
may_data <- may_data[1:100000, ]
jun_data <- jun_data[1:100000, ]
jul_data <- jul_data[1:100000, ]
aug_data <- aug_data[1:100000, ]
sep_data <- sep_data[1:100000, ]
data_cabrides <- rbind(apr_data, may_data, jun_data, jul_data, aug_data, sep_data)
str(data_cabrides)
View(data_cabrides)
class(data_cabrides)
[1] "data.frame"
dim(data_cabrides)
[1] 600000 4
summary(data_cabrides)
sum(is.na(data_cabrides))
[1] 0
any(is.factor(data_cabrides))
[1] FALSE
Now we have combined cab ride data for 6 months, with 600,000 observations of 4 variables.
Data Transformation
This step transforms the data into forms appropriate for the mining process.
data_cabrides$Date.Time <- as.POSIXct(data_cabrides$Date.Time, format = "%m/%d/%Y %H:%M:%S")
head(data_cabrides)
Now that the data is in a date-time format, we can create a new variable holding the time in hours, minutes and seconds only.
data_cabrides$Time <- format(as.POSIXct(data_cabrides$Date.Time, format = "%m/%d/%Y %H:%M:%S"), format = "%H:%M:%S")
head(data_cabrides)
Data Reduction
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume grows. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and to reduce data storage and analysis costs.
In the table below, day and month have been made into separate factor variables.

  Date.Time            Lat      Lon       Base    Time      day     month   year    dayofweek  hour
  <S3: POSIXct>        <dbl>    <dbl>     <fctr>  <chr>     <fctr>  <fctr>  <fctr>  <ord>      <fctr>
6 2014-04-01 00:33:00  40.7383  -74.0403  B02512  00:33:00  1       Apr     2014    Tue        0
6 rows
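The code that created these variables is a standard use of lubridate's accessor functions; a minimal sketch, assuming lubridate (loaded in the next section) and that the new columns match the table above:

# Derive calendar components from the parsed Date.Time and store them as factors
data_cabrides$day <- factor(lubridate::day(data_cabrides$Date.Time))
data_cabrides$month <- factor(lubridate::month(data_cabrides$Date.Time, label = TRUE))
data_cabrides$year <- factor(lubridate::year(data_cabrides$Date.Time))
data_cabrides$dayofweek <- factor(lubridate::wday(data_cabrides$Date.Time, label = TRUE))
data_cabrides$hour <- factor(lubridate::hour(data_cabrides$Date.Time))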
Similarly, the hour of the ride has been added as a different variable in factor form. Now, for plotting, we create a new data set grouped by hour, as sketched below.
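A minimal sketch of this grouping, assuming dplyr (loaded in the next section) and that the result is named hour_data:

# Count the number of trips in every hour of the day
hour_data <- data_cabrides %>%
  dplyr::group_by(hour) %>%
  dplyr::summarize(Total = n())
head(hour_data)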
  hour    Total
  <fctr>  <int>
  0       13024
  1        7838
  2        5201
  3        6028
  4        6890
  5       10676
6 rows
The data preprocessing shown above is carried out on the file apr_data and is carried out in a similar manner on the other monthly files too.
Data Evaluation and Presentation
The packages used to carry out the evaluation are stated below:
library(lubridate)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(DT)
library(plyr)
library(tidyr)
library(stringr)
library(scales)
1. lubridate
lubridate is an R package that makes it easier to work with dates and times.
lubridate’s parse functions handle a wide variety of formats and separators,
which simplifies the parsing process.
2. ggplot2
ggplot2 is a widely used R plotting package based on the grammar of graphics; all the graphs in this project are built with it.
3. ggthemes
ggthemes provides extra themes, geoms, and scales for ggplot2.
4. dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs such as group_by() and summarize() that are used throughout the evaluation.
5. DT
DT provides an R interface to the JavaScript DataTables library, which renders R data objects as interactive tables.
plyr is an R package that makes it simple to split data apart, do stuff to it, and
mash it back together. This is a common data-manipulation step. plyr makes it
easy to control the input and output data format from a syntactically consistent
set of functions.
7. tidyr
tidyr is a package that makes it easy to tidy your data. It is often used in
conjunction with dplyr.
8. stringr
stringr provides a consistent set of wrappers for common string operations.
9. scales
As the name suggests, with the help of graphical scales, we can automatically map the data to the correct scales with well-placed axes and legends.
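Color Vectors
The plots below pass a vector named colors to scale_fill_manual(). A minimal sketch of such a palette, with illustrative hex values (the original values are not shown in this report):

# Palette reused by the bar charts below; these hex codes are placeholders
colors <- c("#CC1011", "#665555", "#05A399", "#CFCACA", "#F5E840", "#0683C9", "#E075B0")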
Finally, coming to the evaluation, we start with trips taken every hour in order to infer the rush hours. We can clearly see that the rush hours fall between roughly 06:30 and 17:30.
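A bar chart along the following lines, built from the hour_data created earlier, shows this pattern (a sketch, since the original plotting code is not reproduced here):

ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Hour")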
Now, we group the data by month and hour to determine the peak hours month-wise.
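A minimal sketch of this grouping, consistent with the head(month_hour) call below:

month_hour <- data_cabrides %>%
  dplyr::group_by(month, hour) %>%
  dplyr::summarize(Total = n())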
head(month_hour)
View(month_hour)
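A stacked bar chart of these totals would show the pattern (the chart type is an assumption; the original figure is not reproduced here):

ggplot(month_hour, aes(hour, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors) +
  ggtitle("Trips by Hour and Month")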
From this, we can see that in every month the highest rate of rides falls between 16:00 and 17:00 or 17:00 and 18:00.
# Count the number of trips on each day of the month
day_group <- data_cabrides %>%
  dplyr::group_by(day) %>%
  dplyr::summarize(Total = n())
head(day_group)
  day     Total
  <fctr>  <int>
  1       41518
  2       46045
  3       45218
  4       40316
  5       45353
  6       45029
6 rows
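A bar chart of these daily totals (a sketch; the original figure is not reproduced here):

ggplot(day_group, aes(day, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Day")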
This graph represents trips on every day of the month. We can see that the first week has the maximum number of rides.
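The day_month_group data used next is the same kind of grouping, by month and day; a minimal sketch consistent with the plot below:

day_month_group <- data_cabrides %>%
  dplyr::group_by(month, day) %>%
  dplyr::summarize(Total = n())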
head(day_month_group)
ggplot(day_month_group, aes(day, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors)
We now plot the graph to find out the day of the week with the highest number of rides, month-wise, using the grouping sketched below.
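A minimal sketch of data_month_week, consistent with the head() call and plots below:

data_month_week <- data_cabrides %>%
  dplyr::group_by(month, dayofweek) %>%
  dplyr::summarize(Total = n())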
head(data_month_week)
ggplot(data_cabrides, aes(month, fill = dayofweek)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = colors)

ggplot(data_month_week, aes(month, Total, fill = dayofweek)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = colors)
After this, we can proceed to hourly rides per day by grouping the data by day and hour, as sketched below.
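A minimal sketch of this grouping, consistent with the head(day_and_hour) call below:

day_and_hour <- data_cabrides %>%
  dplyr::group_by(day, hour) %>%
  dplyr::summarize(Total = n())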
head(day_and_hour)
  day     hour    Total
  <fctr>  <fctr>  <int>
  1       3         461
  1       4         436
  1       5         654
6 rows
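A heat map is one way to visualize rides by day and hour (the chart type is an assumption; the original figure is not reproduced here):

ggplot(day_and_hour, aes(day, hour, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Hour and Day")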
The above graph shows that in the first week the number of rides is higher between 4 pm and 8 pm. The graph below helps us identify the days of the month with more rides, which in most cases are the first nine days of the month.
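A heat map sketch over day_month_group would serve here (again, the chart type is an assumption):

ggplot(day_month_group, aes(day, month, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Month and Day")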
Decision Making
This project of analyzing cab ride data can give the cab service company important insights for taking business decisions. These decisions could include the following:
1. Fare Manipulation
Since we can easily infer the rush hours and the not-so-busy hours, the fare can be adjusted accordingly.
2. Availability of Cabs
Availability of cabs can be adjusted according to the number of rides observed during different hours of the day and days of the month.
3. Discount Offers
To attract customers during low business days of the month, different
discount schemes can be introduced to urge customers to take rides.
Bibliography
1. www.geeksforgeeks.com
2. www.r-project.org
3. www.data.world
4. www.data.cityofnewyork.us
5. www.elitedatascience.com
6. Compiled material of Data Science & Data Mining