
Data Mining Project


CAB RIDES ANALYSIS
Using R programming

Prerna Kshatriya
B.Sc (Hons) Computer Science
Sem-VI, Year-III
Roll no: 2K17/CS/02
Data Science and Data Mining

College of Vocational Studies, University of Delhi


Contents
1. Introduction
2. Data Preprocessing
2.1 Data Cleaning
2.1.1 Initial Exploratory Analysis
2.1.2 Visualization of Data
2.1.2.1 Histograms
2.1.2.2 BoxPlots
2.1.3 Cleaning the Errors
2.2 Data Transformation
2.3 Data Reduction
3. Data Evaluation and Presentation
3.1 Libraries Used
3.2 Color Vectors
3.3 Analysis by Graphs
3.3.1 Hour wise
3.3.2 Hour-Month wise
3.3.3 Day wise
3.3.4 Day-Month wise
3.3.5 Month-Week wise
3.3.6 Day-Hour wise
4. Decision Making
4.1 Fare Manipulation
4.2 Availability of Cabs
4.3 Discount Offers
5. Bibliography

Introduction
For any kind of business in today's scenario, data collection and data analysis are
as important as providing the goods and services themselves. Data collection has no
meaning unless the data is processed and analyzed into a form that different people
can understand and apply to the real environment.

Raw data needs to be cleaned and represented in useful forms. Of the different
forms available, visual graphs are among the most effective. This project deals
with a cab service data set that is analyzed and processed into graphs using data
visualization in R programming. The visualization makes use of important R
libraries and concepts of Data Science and Data Mining. The results are graphs
that can be used to apply the analyzed insights to the business and make
constructive decisions.

Data Preprocessing
Data preprocessing is a data mining technique used to transform raw data into a
useful and efficient format.

Steps to perform data preprocessing:

1. Data Cleaning
2. Data Transformation
3. Data Reduction

Data Cleaning
Data cleaning is the process of preparing data for analysis by removing or
modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly
formatted. This data is usually not necessary or helpful when it comes to
analyzing data because it may hinder the process or provide inaccurate results. 

Steps taken for Data Cleaning:

1. Initial Exploratory Analysis
2. Visualization of Data
3. Cleaning the Errors

These steps are explained in detail below.

1. Initial Exploratory Analysis

The foremost step of the data cleaning process is an initial exploration of the
data frame imported into R.

setwd("D:/CabridesAnalysis/CabridesDataset")
apr_data <- read.csv("cabrides-raw-data-apr.csv",header =
TRUE, stringsAsFactors = FALSE)
5
5

may_data <- read.csv("cabrides-raw-data-may.csv"",header =


TRUE, stringsAsFactors = FALSE)
jun_data <- read.csv("cabrides-raw-data-jun.csv"",header =
TRUE, stringsAsFactors = FALSE)
jul_data <- read.csv("cabrides-raw-data-jul.csv"",header =
TRUE, stringsAsFactors = FALSE)
aug_data <- read.csv("cabrides-raw-data-aug.csv"",header =
TRUE, stringsAsFactors = FALSE)
sep_data <- read.csv("cabrides-raw-data-sep.csv"",header =
TRUE, stringsAsFactors = FALSE)
View(sep_data)
class(sep_data)

[1] "data.frame"

dim(sep_data)
[1] 1028136 4

sum(is.na(apr_data))

[1] 1

summary(apr_data)

   Date.Time              Lat             Lon             Base
 Length:564516      Min.   :40.07   Min.   :-74.77   Length:564516
 Class :character   1st Qu.:40.72   1st Qu.:-74.00   Class :character
 Mode  :character   Median :40.74   Median :-73.98   Mode  :character
                    Mean   :40.74   Mean   :-73.98
                    3rd Qu.:40.76   3rd Qu.:-73.97
                    Max.   :42.12   Max.   :-72.07

View(apr_data)

# Mark base B02512 as missing
apr_data$Base <- replace(apr_data$Base, apr_data$Base == "B02512", NA)
sum(is.na(apr_data))

[1] 35535

2. Visualization of Data

There are two types of plots that are especially useful during the cleaning
process: the histogram and the boxplot.

1. Histogram

The histogram is very useful in visualizing the overall distribution of a numeric
column. We can determine whether the distribution of the data is normal, unimodal,
bimodal, or of any other kind of interest. We can also use histograms to figure
out whether there are outliers in the particular numeric column under study.

hist(apr_data$Lat)

hist(apr_data$Lon)

2. BoxPlot

Boxplots are very useful because they show the median along with the first and
third quartiles. Boxplots are one of the best ways of spotting outliers in a
data frame.

boxplot(apr_data$Lat)

boxplot(apr_data$Lon)


3. Cleaning the Errors

Following the above two steps, we have come across some inconsistent or null
values which can significantly hamper the results of the analysis. So, we correct
all the errors which might affect us.

sum(is.na(apr_data))

[1] 35535

View(apr_data)

any(is.integer(apr_data))

[1] FALSE

sum(is.na(apr_data))

[1] 35535

# Replace missing values with 0, then inspect the non-missing Base values
apr_data[is.na(apr_data)] <- 0
na.omit(apr_data$Base)

[1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[29] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[57] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[85] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[113] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[141] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[169] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[197] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[225] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
10
10

[253] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[281] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[309] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[337] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[365] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[393] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[421] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[449] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[477] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[505] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[533] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[561] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[589] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[617] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[645] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[673] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[701] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
11
11

[729] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[757] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[785] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[813] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[841] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[869] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[897] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[925] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[953] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[981] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0"
[ reached getOption("max.print") -- omitted 563516 entries
]

# Rename the Lon column, then keep the first 100000 rows of each month
names(apr_data)[names(apr_data) == "Lon"] <- "Longitude"

apr_data <- apr_data[1:100000, ]
may_data <- may_data[1:100000, ]
jun_data <- jun_data[1:100000, ]
jul_data <- jul_data[1:100000, ]
aug_data <- aug_data[1:100000, ]
sep_data <- sep_data[1:100000, ]

# Combine the six months into one data frame
data_cabrides <- rbind(apr_data, may_data, jun_data, jul_data, aug_data, sep_data)
str(data_cabrides)

'data.frame': 600000 obs. of 4 variables:
 $ Date.Time: Factor w/ 260093 levels "4/1/2014 0:00",..: 11 17 21 28 33 33 38 44 54 58 ...
 $ Lat      : num 40.8 40.7 40.7 40.8 40.8 ...
 $ Lon      : num -74 -74 -74 -74 -74 ...
 $ Base     : Factor w/ 6 levels "B02512","B02598",..: 1 1 1 1 1 1 1 1 1 1 ...

View(data_cabrides)
class(data_cabrides)

[1] "data.frame"

dim(data_cabrides)

[1] 600000 4

summary(data_cabrides)

             Date.Time           Lat             Lon             Base
 4/7/2014 20:21   :    43   Min.   :39.66   Min.   :-74.77   B02512:205671
 4/7/2014 20:22   :    35   1st Qu.:40.72   1st Qu.:-74.00   B02598:394329
 7/2/2014 18:26:00:    33   Median :40.74   Median :-73.98   B02617:     0
 7/3/2014 18:11:00:    32   Mean   :40.74   Mean   :-73.97   B02682:     0
 5/3/2014 18:09:00:    31   3rd Qu.:40.76   3rd Qu.:-73.97   B02764:     0
 7/2/2014 18:40:00:    31   Max.   :41.37   Max.   :-72.30   na    :     0
 (Other)          :599795

# Copy Lat into a new Latitude column, then drop the first column
data_cabrides$Latitude <- data_cabrides$Lat
data_cabrides <- data_cabrides[, -1]

sum(is.na(data_cabrides))

[1] 0

summary(data_cabrides)

      Lat             Lon             Base           Latitude
 Min.   :39.66   Min.   :-74.77   B02512:205671   Min.   :39.66
 1st Qu.:40.72   1st Qu.:-74.00   B02598:394329   1st Qu.:40.72
 Median :40.74   Median :-73.98   B02617:     0   Median :40.74
 Mean   :40.74   Mean   :-73.97   B02682:     0   Mean   :40.74
 3rd Qu.:40.76   3rd Qu.:-73.97   B02764:     0   3rd Qu.:40.76
 Max.   :41.37   Max.   :-72.30   na    :     0   Max.   :41.37

View(data_cabrides)
any(is.factor(data_cabrides))

[1] FALSE

sum(is.na(data_cabrides))

[1] 0

Now we have combined cab ride data of 6 months, with 600,000 observations of 4
variables.

Data Transformation
This step is taken in order to transform the data into forms suitable for the
mining process.

As the variable Date.Time is in factor format, we have to convert it into a
date-time format.

# Parse Date.Time strings into POSIXct date-time values
data_cabrides$Date.Time <- as.POSIXct(data_cabrides$Date.Time,
                                      format = "%m/%d/%Y %H:%M:%S")
head(data_cabrides)

            Date.Time     Lat      Lon   Base
        <S3: POSIXct>   <dbl>    <dbl> <fctr>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512

Now that the data is in time format, we can create a new variable holding the
time in Hour-Min-Sec form only.

# Extract an HH:MM:SS time string into a separate Time column
data_cabrides$Time <- format(as.POSIXct(data_cabrides$Date.Time,
                                        format = "%m/%d/%Y %H:%M:%S"),
                             format = "%H:%M:%S")
head(data_cabrides)

            Date.Time     Lat      Lon   Base
        <S3: POSIXct>   <dbl>    <dbl> <fctr>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512

Now we have made a separate column for time for analysis.

For ggplot, we need to convert each time component individually into a factor.

# Re-parse Date.Time with lubridate
data_cabrides$Date.Time <- ymd_hms(data_cabrides$Date.Time)
head(data_cabrides)

            Date.Time     Lat      Lon   Base
        <S3: POSIXct>   <dbl>    <dbl> <fctr>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512

Data Reduction
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder when working with such volumes. To deal with this, we use data
reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.

Now we can pull out the individual time components and convert them into factors.

# Derive day-of-month and month factors from Date.Time
data_cabrides$day <- factor(day(data_cabrides$Date.Time))
data_cabrides$month <- factor(month(data_cabrides$Date.Time, label = TRUE))
head(data_cabrides)

            Date.Time     Lat      Lon   Base     Time    day
        <S3: POSIXct>   <dbl>    <dbl> <fctr>    <chr> <fctr>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512 00:11:00      1
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512 00:17:00      1
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512 00:21:00      1
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512 00:28:00      1
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512 00:33:00      1
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512 00:33:00      1

In the above table, day and month have been added as separate factor variables.

# Derive year and day-of-week factors
data_cabrides$year <- factor(year(data_cabrides$Date.Time))
data_cabrides$dayofweek <- factor(wday(data_cabrides$Date.Time, label = TRUE))
head(data_cabrides)

            Date.Time     Lat      Lon   Base     Time    day month   year dayofweek
        <S3: POSIXct>   <dbl>    <dbl> <fctr>    <chr> <fctr> <ord> <fctr>     <ord>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512 00:11:00      1   Apr   2014       Tue
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512 00:17:00      1   Apr   2014       Tue
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512 00:21:00      1   Apr   2014       Tue
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512 00:28:00      1   Apr   2014       Tue
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512 00:33:00      1   Apr   2014       Tue
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512 00:33:00      1   Apr   2014       Tue

# Derive hour, minute, and second factors from the Time column
data_cabrides$hour <- factor(hour(hms(data_cabrides$Time)))
data_cabrides$minute <- factor(minute(hms(data_cabrides$Time)))
data_cabrides$second <- factor(second(hms(data_cabrides$Time)))
head(data_cabrides)

            Date.Time     Lat      Lon   Base     Time    day month   year dayofweek   hour
        <S3: POSIXct>   <dbl>    <dbl> <fctr>    <chr> <fctr> <ord> <fctr>     <ord> <fctr>
1 2014-04-01 00:11:00 40.7690 -73.9549 B02512 00:11:00      1   Apr   2014       Tue      0
2 2014-04-01 00:17:00 40.7267 -74.0345 B02512 00:17:00      1   Apr   2014       Tue      0
3 2014-04-01 00:21:00 40.7316 -73.9873 B02512 00:21:00      1   Apr   2014       Tue      0
4 2014-04-01 00:28:00 40.7588 -73.9776 B02512 00:28:00      1   Apr   2014       Tue      0
5 2014-04-01 00:33:00 40.7594 -73.9722 B02512 00:33:00      1   Apr   2014       Tue      0
6 2014-04-01 00:33:00 40.7383 -74.0403 B02512 00:33:00      1   Apr   2014       Tue      0
6 rows | 1-10 of 12 columns

Similarly, all the time components are now separate variables in factor form. For
plotting, we create a new data set grouped by hour.

# Count total rides per hour of the day
hour_data <- data_cabrides %>%
  group_by(hour) %>%
  dplyr::summarize(Total = n())
head(hour_data)

  hour   Total
  <fctr> <int>
1 0      13024
2 1       7838
3 2       5201
4 3       6028
5 4       6890
6 5      10676

The data preprocessing shown above is carried out on apr_data and in a similar
manner on the other monthly files too; a reusable sketch is given below.
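Since the same steps repeat for every month, they could be wrapped in a helper
function. The sketch below is a hypothetical illustration: the function name
clean_month and its structure are not part of the original code, and it assumes
the raw CSV layout shown earlier.

# Hypothetical helper: read one monthly file and apply the cleaning steps above
clean_month <- function(path, n_rows = 100000) {
  df <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
  df <- df[1:n_rows, ]                          # data reduction: keep first n_rows
  df$Date.Time <- as.POSIXct(df$Date.Time, format = "%m/%d/%Y %H:%M:%S")
  df$Time <- format(df$Date.Time, format = "%H:%M:%S")  # separate time column
  df
}

files <- c("cabrides-raw-data-apr.csv", "cabrides-raw-data-may.csv",
           "cabrides-raw-data-jun.csv", "cabrides-raw-data-jul.csv",
           "cabrides-raw-data-aug.csv", "cabrides-raw-data-sep.csv")
data_cabrides <- do.call(rbind, lapply(files, clean_month))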

Data Evaluation and Presentation

Before moving on to data evaluation and presentation, the important tools and
functions of R programming which are used to get the results must be studied.

The packages which are used to execute the evaluation are stated below-

library(lubridate)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(DT)
library(plyr)
library(tidyr)
library(stringr)
library(scales)
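
If any of these packages are missing, they can be installed from CRAN first; a
minimal sketch:

# Install any of the required packages that are not already present
pkgs <- c("lubridate", "ggplot2", "ggthemes", "dplyr", "DT",
          "plyr", "tidyr", "stringr", "scales")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)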

The detailed elaboration of the packages used is given below-

1. lubridate

lubridate is an R package that makes it easier to work with dates and times.
lubridate’s parse functions handle a wide variety of formats and separators,
which simplifies the parsing process.
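
For instance, a minimal sketch of lubridate's parsers (the sample strings are
illustrative, not taken from the data set):

# Parser names encode the expected component order
mdy_hms("4/1/2014 00:11:00")        # month/day/year, as used in this data set
ymd("2014-04-01")                   # year-month-day
hour(mdy_hms("4/1/2014 17:30:00"))  # extract a single component: 17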

2. ggplot2

The ggplot2 package offers a powerful graphics language for creating elegant and
complex plots. ggplot2 allows you to create graphs that represent both univariate
and multivariate numerical and categorical data in a straightforward manner.
Grouping can be represented by color, symbol, size, and transparency. The creation
of trellis plots (i.e., conditioning) is relatively simple.

3. ggthemes

This is more of an extension to the main ggplot2 package.

4. dplyr

dplyr is a package which provides a set of tools for efficiently manipulating
datasets in R.
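
A minimal sketch of the grouped-summary pattern used throughout this project
(assuming data_cabrides has been prepared as above):

# Count rides per hour and list the busiest hours first
data_cabrides %>%
  group_by(hour) %>%
  dplyr::summarize(Total = n()) %>%
  arrange(desc(Total))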

5. DT

DT is an interface to the JavaScript library DataTables, based on the htmlwidgets
framework, for presenting rectangular R data objects (such as data frames and
matrices) as HTML tables.
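
A minimal sketch (rendering the first rows of the prepared data as an interactive
HTML table):

# Show the first 100 rows as a searchable, sortable HTML table
datatable(head(data_cabrides, 100))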

6. plyr

plyr is an R package that makes it simple to split data apart, do stuff to it, and
mash it back together. This is a common data-manipulation step. plyr makes it
easy to control the input and output data format from a syntactically consistent
set of functions.

7. tidyr

tidyr is a package that makes it easy to tidy your data. It is often used in
conjunction with dplyr.

8. stringr

The stringr package provides a cohesive set of functions designed to make working
with strings as easy as possible.

9. scales

As the name suggests, with the help of graphical scales, we can automatically map
the data to the correct scales with well-placed axes and legends.
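
A minimal sketch (formatting the y-axis of a ride-count plot with comma
separators; hour_data is the hourly summary built earlier):

# comma labels make large ride counts easier to read on the axis
ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = comma)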

To implement the visualization of data in the form of graphs, we need colors to
denote different factors. So, we create a vector of colors:

colors = c("#CC3399", "#660099", "#FF00CC", "#660099",
           "#009900", "#000099", "#FFFF33")

[1] "#CC3399" "#660099" "#FF00CC" "#660099" "#009900" "#000099" "#FFFF33"

Finally, coming to the evaluation, we shall start with trips per hour in order to
infer the rush hours:

ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  theme(legend.position = "none")

We can clearly see that the rush hours fall roughly between 06:30 and 17:30.

Now we group the data by month and hour to determine the peak hours month wise.

# Count total rides for every (month, hour) combination
month_hour <- data_cabrides %>%
  group_by(month, hour) %>%
  dplyr::summarize(Total = n())

head(month_hour)

  month hour   Total
  <ord> <fctr> <int>
1 May   0       2014
2 May   1       1164
3 May   2        739
4 May   3        910
5 May   4       1052
6 May   5       1716

View(month_hour)

Now we plot a graph to determine the peak hour times month wise.

ggplot(month_hour, aes(hour, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors)

From this, we can see that in every month the highest rate of rides falls between
hours 16 and 17 or 17 and 18.

Now we move on to grouping the data on the basis of day.

# Count total rides per day of the month
day_group <- data_cabrides %>%
  group_by(day) %>%
  dplyr::summarize(Total = n())
head(day_group)

  day    Total
  <fctr> <int>
1 1      41518
2 2      46045
3 3      45218
4 4      40316
5 5      45353
6 6      45029

ggplot(day_group, aes(day, Total)) +
  geom_bar(stat = "identity", fill = "pink", col = "green") +
  theme(legend.position = "none")

This graph represents trips per day. We can see that the first week has the
maximum number of rides.

# Count total rides for every (month, day) combination
day_month_group <- data_cabrides %>%
  group_by(month, day) %>%
  dplyr::summarize(Total = n())

head(day_month_group)

  month day    Total
  <ord> <fctr> <int>
1 May   1      10182
2 May   2      10861
3 May   3      10233
4 May   4       6268
5 May   5       8208
6 May   6       8869

We now plot the graph to find the day with the highest number of rides month wise.

ggplot(day_month_group, aes(day, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors)

After this we can proceed to weekly analysis of rides month wise.

# Count total rides for every (month, day-of-week) combination
data_month_week <- data_cabrides %>%
  group_by(month, dayofweek) %>%
  dplyr::summarise(Total = n())

head(data_month_week)

  month dayofweek Total
  <ord> <ord>     <int>
1 May   Sun        8570
2 May   Mon       10791
3 May   Tue       12247
4 May   Wed       13739
5 May   Thu       23396
6 May   Fri       16807

ggplot(data_cabrides, aes(month, fill = dayofweek)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = colors)

ggplot(data_month_week, aes(month, Total, fill = dayofweek)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = colors)

These graphs help us identify the day of the week with the maximum rides month
wise.

After this, we can proceed to hourly rides per day by grouping the data by day
and hour.

# Count total rides for every (day, hour) combination
day_and_hour <- data_cabrides %>%
  group_by(day, hour) %>%
  dplyr::summarize(Total = n())

head(day_and_hour)

  day    hour   Total
  <fctr> <fctr> <int>
1 1      0       1069
2 1      1        668
3 1      2        417
4 1      3        461
5 1      4        436
6 1      5        654

# Heat map of rides by day and hour; Total is continuous, so the
# default gradient fill scale applies
ggplot(day_and_hour, aes(day, hour, fill = Total)) +
  geom_tile(color = "black")

The above heat map shows that in the first week the number of rides is higher
between 4 pm and 8 pm.

ggplot(day_month_group, aes(day, month, fill = Total)) +
  geom_tile(color = "darkgrey")

This heat map helps us identify the days of the month with the most rides, which
in most cases are the first nine days of the month.

Decision Making
This project of analyzing the cab ride data can give the cab service company
important insights for making business decisions. These decisions could include
the following:

1. Fare Manipulation
Since we can easily infer the rush hours and the less busy hours, fares can be
manipulated accordingly (see the sketch after this list).

2. Availability of Cabs
The availability of cabs can be adjusted according to the number of rides during
different hours of the day and days of the month.

3. Discount Offers
To attract customers during the low-business days of the month, different
discount schemes can be introduced to encourage rides.
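
As a sketch of decision 1, the hourly totals could drive a simple fare
multiplier. The threshold and multiplier values below are hypothetical
illustrations, not results from the analysis:

# Hypothetical rule: hours busier than the 75th percentile get a surge
# multiplier, while the quietest hours get a small discount
fare_table <- hour_data %>%
  dplyr::mutate(multiplier = dplyr::case_when(
    Total > quantile(Total, 0.75) ~ 1.25,  # rush hours: surge pricing
    Total < quantile(Total, 0.25) ~ 0.90,  # quiet hours: discounted fare
    TRUE ~ 1.00                            # normal fare
  ))
head(fare_table)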

Bibliography
1. www.geeksforgeeks.com
2. www.r-project.org
3. www.data.world
4. www.data.cityofnewyork.us
5. www.elitedatascience.com
6. Compiled material of Data Science & Data Mining
