Clustering Documentation R Code

The document discusses performing hierarchical clustering on two datasets: an airlines dataset and a crime dataset. For the airlines data, 9 clusters were identified and the data was aggregated and summarized. For the crime data, 6 clusters were identified after normalizing the values. The clustered data was then saved as a CSV file.

1.) Perform clustering (both hierarchical and K-means clustering) for the airlines data to obtain the optimum number of clusters.

=> Inferences from the data set

The data set describes airline customer transactions, with columns such as Balance, Bonus_miles and Flight_miles_12mo.

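=> In RStudio: installing and loading the library "readxl", because the data set is in Excel format
install.packages("readxl")  # only needed once per R installation
library(readxl)

=> Loading the data set from the second sheet of the workbook
air <- read_xlsx("E:\\EastWestAirlines.xlsx", sheet = 2)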
=> Reducing the columns of the data set (dropping the first column) before summarizing the data
air1 <- air[, 2:12]

=> Summarizing the data to find the minimum and maximum value of each column
summary(air1)
    Balance          Qual_miles        cc1_miles      cc2_miles
 Min.   :      0   Min.   :    0.0   Min.   :1.00   Min.   :1.000
 1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:1.00   1st Qu.:1.000
 Median :  43097   Median :    0.0   Median :1.00   Median :1.000
 Mean   :  73601   Mean   :  144.1   Mean   :2.06   Mean   :1.015
 3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.:3.00   3rd Qu.:1.000
 Max.   :1704838   Max.   :11148.0   Max.   :5.00   Max.   :3.000
   cc3_miles      Bonus_miles      Bonus_trans   Flight_miles_12mo
 Min.   :1.000   Min.   :     0   Min.   : 0.0   Min.   :    0.0
 1st Qu.:1.000   1st Qu.:  1250   1st Qu.: 3.0   1st Qu.:    0.0
 Median :1.000   Median :  7171   Median :12.0   Median :    0.0
 Mean   :1.012   Mean   : 17145   Mean   :11.6   Mean   :  460.1
 3rd Qu.:1.000   3rd Qu.: 23801   3rd Qu.:17.0   3rd Qu.:  311.0
 Max.   :5.000   Max.   :263685   Max.   :86.0   Max.   :30817.0

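=> Normalizing the data so that every column has mean 0 and standard deviation 1
# Note: air1[, 2:11] leaves out the first column of air1 (Balance), so Balance is not used in the clustering
normalized_data <- scale(air1[, 2:11])
=> Summarizing the normalized data
summary(normalized_data)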
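
As a small check of what `scale()` does, the sketch below standardizes a made-up two-column data frame and compares one column with the z-score computed by hand. The values and the column names `miles` and `trans` are illustrative assumptions, not taken from the airlines file.

```r
# Toy data: two columns on very different scales (values are made up)
x <- data.frame(miles = c(0, 500, 1000, 1500, 2000),
                trans = c(1, 2, 3, 4, 5))

z <- scale(x)  # subtracts each column's mean and divides by its standard deviation

# The same z-scores computed by hand for the first column
manual <- (x$miles - mean(x$miles)) / sd(x$miles)

print(all.equal(as.numeric(z[, "miles"]), manual))  # do both approaches agree?
print(colMeans(z))      # column means after scaling
print(apply(z, 2, sd))  # column standard deviations after scaling
```

After scaling, both columns contribute on a comparable footing to the Euclidean distances used in the next step.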
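=> Computing the Euclidean distance between every pair of rows
d <- dist(normalized_data, method = "euclidean")
=> Hierarchical clustering with complete linkage
fit <- hclust(d, method = "complete")
=> Plotting the dendrogram
plot(fit)
plot(fit, hang = -1)  # hang = -1 makes all labels hang down from height 0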

=> Cluster Dendrogram
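
To illustrate what the `method = "complete"` argument of `hclust()` changes, here is a minimal sketch on three made-up points (not the airlines data). It compares the merge heights of complete linkage with those of single linkage; single linkage is shown only for contrast and is not used in this document.

```r
# Three made-up points on a line: 0, 1 and 5
pts <- matrix(c(0, 1, 5), ncol = 1)
d_toy <- dist(pts, method = "euclidean")  # pairwise distances: 1, 5, 4

fit_complete <- hclust(d_toy, method = "complete")
fit_single   <- hclust(d_toy, method = "single")

# The points 0 and 1 merge first, at height 1, under both linkages.
# Complete linkage then joins {0, 1} with 5 at the LARGEST pairwise distance (5),
# single linkage joins them at the SMALLEST pairwise distance (4).
print(fit_complete$height)  # 1 5
print(fit_single$height)    # 1 4
```

The "Height" axis of a dendrogram shows these merge heights, so with complete linkage it reflects the largest distance between any two members of the clusters being joined.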


=> Cutting the tree into 9 clusters and marking them on the dendrogram
groups <- cutree(fit, k = 9)  # cut tree into 9 clusters
rect.hclust(fit, k = 9, border = "blue")
=> Attaching the cluster membership to the data
membership <- as.matrix(groups)
final <- data.frame(membership, air1)
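
The cut-and-attach step can be tried on a tiny example. The sketch below uses five made-up values (two tight pairs and one outlier) and hypothetical object names ending in `_toy`; it only illustrates how `cutree()` returns one label per row.

```r
# Five made-up observations: two tight pairs and one outlier
toy <- data.frame(value = c(0, 1, 5, 6, 20))
fit_toy <- hclust(dist(scale(toy)), method = "complete")

# cutree() returns one cluster label per row, in the original row order
groups_toy <- cutree(fit_toy, k = 3)

# Attach the labels to the data and count the rows per cluster
final_toy <- data.frame(membership = groups_toy, toy)
print(final_toy)
print(table(final_toy$membership))
```

Because the labels come back in row order, they can be placed next to the original columns with `data.frame()` without any sorting or merging.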


=> Aggregating the data by cluster (mean of each column) and listing the result
aggregate(air1[, 2:11], by = list(final$membership), FUN = mean)
  Group.1 Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
1       1   89.60026  2.039119  1.007254  1.000777    15870.57    11.04870
2       2  472.40000  2.120000  1.000000  1.000000    31986.40    30.66000
3       3  648.73333  4.600000  1.000000  1.000000   112247.17    31.53333
4       4  118.20000  3.600000  1.000000  3.600000    79268.70    30.60000
5       5   66.66667  1.000000  3.000000  1.000000    20410.47    18.93333
6       6 7352.20000  1.760000  1.000000  1.000000    14299.56    11.48000
7       7    0.00000  3.200000  1.000000  5.000000   123246.20    23.00000
8       8  694.00000  2.500000  1.000000  1.000000    76325.00    75.50000
9       9    0.00000  2.500000  1.000000  1.000000    54943.50    63.00000
  Flight_miles_12mo Flight_trans_12 Days_since_enroll    Award?
1          312.3731        0.965544          4093.794 0.3575130
2         7867.2400       21.300000          4304.600 0.7800000
3         3739.0000       11.433333          6646.567 0.9333333
4          650.0000        2.100000          4891.600 0.4000000
5          692.6667        3.200000          4075.533 0.4000000
6         1225.6400        3.560000          4572.240 0.6400000
7          220.0000        0.600000          4058.400 0.8000000
8        26458.5000       49.000000          2602.000 1.0000000
9        13461.5000       49.500000          1798.500 1.0000000

=> Saving the clustered data as a CSV file
library(readr)
write_csv(final, "hclustoutput.csv")
getwd()  # shows the folder where the file was written
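
The task statement also asks for K-means, which the steps above do not cover. The following is a minimal sketch of one common approach, the elbow method with `kmeans()`, run on made-up data rather than on the airlines file. The object names `toy`, `wss` and `km`, the range `k = 1..8` and `nstart = 25` are all assumptions chosen for illustration.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up data: three well-separated groups of 50 points in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 8),  ncol = 2),
  matrix(rnorm(100, mean = 16), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..8 (the "elbow" curve)
wss <- sapply(1:8, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))

# The toy data was built with three groups, so fit the final model with k = 3
km <- kmeans(toy_scaled, centers = 3, nstart = 25)
print(table(km$cluster))
```

Calling `plot(1:8, wss, type = "b")` draws the curve; the number of clusters is usually read off at the point where adding another cluster no longer reduces the within-cluster sum of squares by much. On the airlines data the same calls would take `normalized_data` in place of `toy_scaled`.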
2.) Perform clustering for the crime data, identify the number of clusters formed and draw inferences.

=> Inferences from the data set
The data set lists states together with their crime figures.
=> Loading the library "readxl", because the data set is in Excel format
library(readxl)
=> Loading the data set
cdata <- read_excel(file.choose())
=> Reducing the columns (dropping the first column) before summarizing the data
data <- cdata[, 2:5]
=> Summarizing the data
summary(data)
     Murder          Assault         UrbanPop          Rape
 Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
 1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
 Median : 7.250   Median :159.0   Median :66.00   Median :20.10
 Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
 3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
 Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

=> Normalizing the data because the columns are on very different scales (Assault reaches 337, Murder only 17.4)
normalized_data <- scale(data)
=> Summarizing the normalized data
summary(normalized_data)
     Murder           Assault           UrbanPop             Rape
 Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874
 1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574
 Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000
 3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277
 Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444

=> Computing the distances between the rows with the "euclidean" method
d <- dist(normalized_data, method = "euclidean")
=> Clustering the data with the "complete" linkage method
fit <- hclust(d, method = "complete")
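
For reference, the Euclidean distance that `dist()` reports between two rows is the square root of the summed squared differences of their values. A small sketch with two made-up rows (the numbers and the row names `a` and `b` are illustrative only) checks this by hand:

```r
# Two made-up rows with four already-scaled values each
m <- rbind(a = c(1.0, -0.5, 0.0, 2.0),
           b = c(0.0,  0.5, 1.0, 0.0))

# Squared differences are 1, 1, 1 and 4, so the distance is sqrt(7)
by_hand <- sqrt(sum((m["a", ] - m["b", ])^2))
by_dist <- as.numeric(dist(m, method = "euclidean"))

print(by_hand)
print(all.equal(by_hand, by_dist))
```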

# display dendrogram
plot(fit)
=> Cluster Dendrogram (vertical axis: Height)

plot(fit, hang = -1)

=> Cluster Dendrogram with all labels hanging from height 0 (vertical axis: Height)

=> Cutting the tree into 6 clusters and marking them on the dendrogram
groups <- cutree(fit, k = 6)  # cut tree into 6 clusters
rect.hclust(fit, k = 6, border = "blue")
=> Cluster Dendrogram with the 6 clusters outlined in blue (vertical axis: Height)
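
The document picks 6 clusters by inspecting the dendrogram. One way to cross-check such a choice is the average silhouette width, sketched below. This is a swapped-in technique, not something the original analysis uses. The sketch assumes the `cluster` package is installed and uses the built-in `USArrests` data set as a stand-in, since its summary statistics appear to match the ones shown above; the object names are hypothetical.

```r
library(cluster)  # assumed to be installed; provides silhouette()

# Stand-in data: the built-in USArrests data set
x <- scale(USArrests)
d_x <- dist(x, method = "euclidean")
fit_x <- hclust(d_x, method = "complete")

# Average silhouette width for each candidate number of clusters
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(fit_x, k = k), d_x)
  mean(sil[, "sil_width"])
})

print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
print(ks[which.max(avg_sil)])  # k with the highest average silhouette width
```

A higher average silhouette width means rows sit closer to their own cluster than to the nearest other cluster, so it gives a numeric complement to reading the tree by eye.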

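=> Attaching the cluster membership to the data
membership <- as.matrix(groups)
final <- data.frame(membership, data)

=> Aggregating the data by cluster (mean of each column)
# data has 4 columns (Murder, Assault, UrbanPop, Rape); all of them are averaged per cluster
aggregate(data, by = list(final$membership), FUN = mean)
=> Saving the clustered data as a CSV file
library(readr)
# Note: same file name as in part 1, so an existing file in this folder is overwritten
write_csv(final, "hclustoutput.csv")

getwd()  # shows the folder where the file was written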
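
To draw inferences, it helps to look at the per-cluster means together with the states that fall into each cluster. The sketch below does this without the Excel file by assuming that the built-in `USArrests` data set is an acceptable stand-in (its summary statistics appear to match the ones shown above). The names `crime`, `fit_c`, `groups_c` and `cluster_means` are hypothetical.

```r
# Stand-in for the Excel file: the built-in USArrests data set
crime <- USArrests
fit_c <- hclust(dist(scale(crime)), method = "complete")
groups_c <- cutree(fit_c, k = 6)

# Mean of every variable per cluster, in the original (unscaled) units
cluster_means <- aggregate(crime, by = list(cluster = groups_c), FUN = mean)
print(cluster_means)

# Number of states per cluster and the state names in each one
print(table(groups_c))
print(split(rownames(crime), groups_c))
```

Reading the two outputs side by side shows, for example, which clusters combine high Assault and Murder means and which states belong to them.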
