Data Mining Project
ANALYSIS
Roll no: 2K17/CS/02
Contents
1. Introduction
2. Data Preprocessing
2.1 Data Cleaning
2.1.1 Initial Exploratory Analysis
2.1.2 Visualization of Data
2.1.2.1 Histograms
2.1.2.2 BoxPlots
2.1.3 Cleaning the Errors
2.2 Data Transformation
2.3 Data Reduction
3. Data Evaluation and Presentation
3.1 Libraries Used
3.2 Color Vectors
3.3 Analysis by Graphs
3.3.1 Hour wise
3.3.2 Hour-Month wise
3.3.3 Day wise
3.3.4 Day-Month wise
3.3.5 Month-Week wise
3.3.6 Day-Hour wise
4. Decision Making
4.1 Fare Manipulation
4.2 Availability of Cabs
4.3 Discount Offers
5. Bibliography
Introduction
For any kind of business in today's scenario, data collection and data analysis are as important as providing the goods and services themselves. Data collection has no meaning unless the data is processed and analyzed into a form that can be understood by different people and applied in the real environment.
Raw data needs to be cleaned and represented in forms which are useful; of the different forms available, visual graphs are among the most beneficial.
This project deals with a cab service data set, which is analyzed and processed into graphs using data visualization in R. The visualization makes use of important R libraries and of concepts from Data Science and Data Mining. The results are graphs which can be used to apply the analyzed insights to the business and make constructive decisions.
Data Preprocessing
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. It consists of three steps:
1. Data Cleaning
2. Data Transformation
3. Data Reduction
Data Cleaning
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. Such data is usually not necessary or helpful during analysis, because it may hinder the process or produce inaccurate results.
1. Initial Exploratory Analysis
setwd("D:/CabridesAnalysis/CabridesDataset")
apr_data <- read.csv("cabrides-raw-data-apr.csv", header = TRUE, stringsAsFactors = FALSE)
[1] "data.frame"
dim(sep_data)
[1] 1028136 4
sum(is.na(apr_data))
[1] 1
summary(apr_data)
View(apr_data)
apr_data$Base <- replace(apr_data$Base, apr_data$Base == "B02512", NA)
sum(is.na(apr_data))
[1] 35535
2. Visualization of Data
There are two types of plots that are especially useful during the cleaning process: the histogram and the boxplot.
1. Histogram
hist(apr_data$Lat)
hist(apr_data$Lon)
2. BoxPlot
Boxplots are very useful because they show the median along with the first and third quartiles, and they are one of the best ways of spotting outliers in a data frame.
boxplot(apr_data$Lat)
3. Cleaning the Errors
Following the above two steps, we have come across some inconsistent or null values which could significantly hamper the result of the analysis, so we now correct all the errors which might affect us.
sum(is.na(apr_data))
[1] 35535
View(apr_data)
any(is.integer(apr_data))
[1] FALSE
sum(is.na(apr_data))
[1] 35535
[1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[29] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[57] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[85] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[113] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[141] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[169] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[197] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[225] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
10
10
[253] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[281] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[309] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[337] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[365] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[393] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[421] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[449] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[477] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[505] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[533] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[561] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[589] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[617] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[645] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[673] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[701] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
11
11
[729] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[757] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[785] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[813] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[841] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[869] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[897] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[925] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[953] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[981] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
"0" "0" "0" "0" "0" "0" "0"
[ reached getOption("max.print") -- omitted 563516 entries
]
# Keep only the first 100,000 rows of each month so the combined data stays manageable
apr_data <- apr_data[1:100000, ]
may_data <- may_data[1:100000, ]
jun_data <- jun_data[1:100000, ]
jul_data <- jul_data[1:100000, ]
aug_data <- aug_data[1:100000, ]
sep_data <- sep_data[1:100000, ]
data_cabrides <- rbind(apr_data, may_data, jun_data, jul_data, aug_data, sep_data)
str(data_cabrides)
View(data_cabrides)
class(data_cabrides)
[1] "data.frame"
dim(data_cabrides)
[1] 600000 4
summary(data_cabrides)
sum(is.na(data_cabrides))
[1] 0
any(is.factor(data_cabrides))
[1] FALSE
Now we have combined cab ride data for 6 months, with 600,000 observations of 4 variables.
Data Transformation
This step transforms the data into forms appropriate for the mining process.
data_cabrides$Date.Time <- as.POSIXct(data_cabrides$Date.Time, format = "%m/%d/%Y %H:%M:%S")
head(data_cabrides)
Now that the data is in a date-time format, we can create a new variable holding the time in hours, minutes and seconds only.
data_cabrides$Time <- format(as.POSIXct(data_cabrides$Date.Time, format = "%m/%d/%Y %H:%M:%S"), format = "%H:%M:%S")
head(data_cabrides)
Data Reduction
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume grows. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and to reduce data storage and analysis costs.
In the table below, day and month have been made into separate factor variables.

  Date.Time            Lat      Lon       Base    Time      day     month   year    dayofweek  hour
  <S3: POSIXct>        <dbl>    <dbl>     <fctr>  <chr>     <fctr>  <fctr>  <fctr>  <ord>      <fctr>
6 2014-04-01 00:33:00  40.7383  -74.0403  B02512  00:33:00  1       Apr     2014    Tue        0
6 rows
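The code that created these variables is a standard use of lubridate's accessor functions; a minimal sketch, assuming lubridate (loaded in the next section) and that the new columns match the table above:

# Derive calendar components from the parsed Date.Time and store them as factors
data_cabrides$day <- factor(lubridate::day(data_cabrides$Date.Time))
data_cabrides$month <- factor(lubridate::month(data_cabrides$Date.Time, label = TRUE))
data_cabrides$year <- factor(lubridate::year(data_cabrides$Date.Time))
data_cabrides$dayofweek <- factor(lubridate::wday(data_cabrides$Date.Time, label = TRUE))
data_cabrides$hour <- factor(lubridate::hour(data_cabrides$Date.Time))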
Similarly, the hour of the ride has been added as a different variable in factor form. Now, for plotting, we create a new data set grouped by hour, as sketched below.
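A minimal sketch of this grouping, assuming dplyr (loaded in the next section) and that the result is named hour_data:

# Count the number of trips in every hour of the day
hour_data <- data_cabrides %>%
  dplyr::group_by(hour) %>%
  dplyr::summarize(Total = n())
head(hour_data)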
  hour    Total
  <fctr>  <int>
  0       13024
  1        7838
  2        5201
  3        6028
  4        6890
  5       10676
6 rows
The data preprocessing shown above is carried out on the file apr_data and is carried out in a similar manner on the other monthly files too.
Data Evaluation and Presentation
The packages used to carry out the evaluation are stated below:
library(lubridate)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(DT)
library(plyr)
library(tidyr)
library(stringr)
library(scales)
1. lubridate
lubridate is an R package that makes it easier to work with dates and times.
lubridate’s parse functions handle a wide variety of formats and separators,
which simplifies the parsing process.
2. ggplot2
ggplot2 is a widely used R plotting package based on the grammar of graphics; all the graphs in this project are built with it.
3. ggthemes
ggthemes provides extra themes, geoms, and scales for ggplot2.
4. dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs such as group_by() and summarize() that are used throughout the evaluation.
5. DT
DT provides an R interface to the JavaScript DataTables library, which renders R data objects as interactive tables.
plyr is an R package that makes it simple to split data apart, do stuff to it, and
mash it back together. This is a common data-manipulation step. plyr makes it
easy to control the input and output data format from a syntactically consistent
set of functions.
7. tidyr
tidyr is a package that makes it easy to tidy your data. It is often used in
conjunction with dplyr.
8. stringr
stringr provides a consistent set of wrappers for common string operations.
9. scales
As the name suggests, with the help of graphical scales, we can automatically map the data to the correct scales with well-placed axes and legends.
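Color Vectors
The plots below pass a vector named colors to scale_fill_manual(). A minimal sketch of such a palette, with illustrative hex values (the original values are not shown in this report):

# Palette reused by the bar charts below; these hex codes are placeholders
colors <- c("#CC1011", "#665555", "#05A399", "#CFCACA", "#F5E840", "#0683C9", "#E075B0")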
Finally, coming to the evaluation, we start with trips taken every hour in order to infer the rush hours. We can clearly see that the rush hours fall between roughly 06:30 and 17:30.
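A bar chart along the following lines, built from the hour_data created earlier, shows this pattern (a sketch, since the original plotting code is not reproduced here):

ggplot(hour_data, aes(hour, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Hour")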
Now, we group the data by month and hour to determine the peak hours month-wise.
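A minimal sketch of this grouping, consistent with the head(month_hour) call below:

month_hour <- data_cabrides %>%
  dplyr::group_by(month, hour) %>%
  dplyr::summarize(Total = n())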
head(month_hour)
View(month_hour)
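A stacked bar chart of these totals would show the pattern (the chart type is an assumption; the original figure is not reproduced here):

ggplot(month_hour, aes(hour, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors) +
  ggtitle("Trips by Hour and Month")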
From this, we can see that in every month the highest rate of rides falls between 16:00 and 17:00 or 17:00 and 18:00.
# Count the number of trips on each day of the month
day_group <- data_cabrides %>%
  dplyr::group_by(day) %>%
  dplyr::summarize(Total = n())
head(day_group)
  day     Total
  <fctr>  <int>
  1       41518
  2       46045
  3       45218
  4       40316
  5       45353
  6       45029
6 rows
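A bar chart of these daily totals (a sketch; the original figure is not reproduced here):

ggplot(day_group, aes(day, Total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Trips Every Day")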
This graph represents trips on every day of the month. We can see that the first week has the maximum number of rides.
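The day_month_group data used next is the same kind of grouping, by month and day; a minimal sketch consistent with the plot below:

day_month_group <- data_cabrides %>%
  dplyr::group_by(month, day) %>%
  dplyr::summarize(Total = n())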
head(day_month_group)
ggplot(day_month_group, aes(day, Total, fill = month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = colors)
We now plot the graph to find out the day of the week with the highest number of rides, month-wise, using the grouping sketched below.
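A minimal sketch of data_month_week, consistent with the head() call and plots below:

data_month_week <- data_cabrides %>%
  dplyr::group_by(month, dayofweek) %>%
  dplyr::summarize(Total = n())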
head(data_month_week)
ggplot(data_cabrides, aes(month, fill = dayofweek)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = colors)

ggplot(data_month_week, aes(month, Total, fill = dayofweek)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = colors)
After this, we can proceed to hourly rides per day by grouping the data by day and hour, as sketched below.
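A minimal sketch of this grouping, consistent with the head(day_and_hour) call below:

day_and_hour <- data_cabrides %>%
  dplyr::group_by(day, hour) %>%
  dplyr::summarize(Total = n())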
head(day_and_hour)
  day     hour    Total
  <fctr>  <fctr>  <int>
  1       3         461
  1       4         436
  1       5         654
6 rows
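A heat map is one way to visualize rides by day and hour (the chart type is an assumption; the original figure is not reproduced here):

ggplot(day_and_hour, aes(day, hour, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Hour and Day")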
The above graph shows that in the first week the number of rides is higher between 4 pm and 8 pm. The graph below helps us identify the days of the month with more rides, which in most cases are the first nine days of the month.
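A heat map sketch over day_month_group would serve here (again, the chart type is an assumption):

ggplot(day_month_group, aes(day, month, fill = Total)) +
  geom_tile(color = "white") +
  ggtitle("Heat Map by Month and Day")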
Decision Making
This project of analyzing cab ride data can give the cab service company important insights for taking business decisions. These decisions could include the following:
1. Fare Manipulation
Since we can easily infer the rush hours and the not-so-busy hours, the fare can be adjusted accordingly.
2. Availability of Cabs
Availability of cabs can be adjusted according to the number of rides observed during different hours of the day and days of the month.
3. Discount Offers
To attract customers during low business days of the month, different
discount schemes can be introduced to urge customers to take rides.
Bibliography
1. www.geeksforgeeks.com
2. www.r-project.org
3. www.data.world
4. www.data.cityofnewyork.us
5. www.elitedatascience.com
6. Compiled material of Data Science & Data Mining