Certificate

The document describes a clustering technique called genetic K-means clustering. It introduces clustering and different clustering techniques. Then it explains the basic K-means algorithm and genetic algorithm. After that, it proposes the genetic K-means hybrid algorithm which uses genetic algorithm to initialize cluster centroids for K-means clustering. Finally, it implements the proposed technique on the Iris dataset and evaluates its performance compared to standard K-means clustering.

Uploaded by

Vatsal Singh
Copyright
© All Rights Reserved

Certificate

I hereby declare that the work presented in this report entitled “Clustering Techniques in
Machine Learning”, in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering/Information Technology,
submitted in the Department of Computer Science & Engineering and Information Technology,
Jaypee University of Information Technology, Waknaghat, is an authentic record of my own
work carried out over a period from July 2022 to May 2023 under the supervision of Dr.
Monika Bharti, Assistant Professor (SG).
The matter embodied in the report has not been submitted for the award of any other degree or
diploma.

Vatsal Singh, 191286

This is to certify that the above statement made by the candidate is true to the best of my
knowledge.

Dr. Monika Bharti


Assistant Professor (SG)
Computer Science and Engineering/Information Technology
Dated:

Acknowledgement

The successful completion of any task would be incomplete without acknowledging the
people who made it possible and whose constant guidance and encouragement secured its
success.

First of all, I wish to acknowledge the benevolence of the omnipotent God, who gave me strength
and courage to overcome all obstacles and showed me the silver lining in the dark clouds.
With a profound sense of gratitude and heartiest regard, I express my sincere feelings of
indebtedness to my guide, Dr. Monika Bharti, for her positive attitude, excellent guidance,
constant encouragement, keen interest, invaluable co-operation, generous attitude and, above
all, her blessings. She has been a source of inspiration for me.

Last but not least, I would like to express my heartfelt thanks to my parents and my
friends, who with their thought-provoking views, veracity and wholehearted cooperation
helped me in doing this project.

Vatsal Singh

(191286)

Abstract

The present era has seen a rapid increase in the volume and diversity of data produced by
scientific endeavours and the business environment. Gathering and analysing such large
amounts of data is challenging. Data mining is a method for uncovering hidden relationships
and meaningful information in data; however, owing to their fundamental complexity, typical
data mining techniques cannot be applied directly to large datasets.

Data clustering is one of the most important problems in data mining and machine learning.
The task of clustering is to identify homogeneous groups among the items under study.
Designing clustering algorithms has recently attracted the interest of many researchers. The
main difficulty in clustering is the lack of prior knowledge about the given dataset. In
addition, these methods depend on input parameters, such as the number of clusters and the
number of nearest neighbours, and choosing these values incorrectly produces poor clustering
results. Their accuracy also degrades when the dataset contains clusters of complex shapes,
densities, and sizes, or noise and outliers. In this work, we present a novel method for
unsupervised clustering. Our approach has three phases. In the first phase, we use a genetic
algorithm, with crossover and mutation applied to the dataset, to determine the initial
cluster centroids. In the second phase, K-means clustering starts from the centroids produced
by the genetic algorithm and yields a set of clusters for the given dataset. In the third
phase, these clusters are evaluated using the Davies-Bouldin Index. We call this algorithm
the Genetic K-means Method. We present tests indicating that the proposed method is effective
at locating clusters with non-convex shapes and varying sizes and densities in the presence
of noise and outliers, and that it achieves greater accuracy than the K-means approach.
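The three phases described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the report's implementation: the function names, the row-index chromosome encoding, the sum-of-squared-error cost, the "keep the fitter half" selection, and the synthetic demo data are all assumptions made for the example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def sse(points, centroids):
    """GA cost (assumed): sum of squared distances to the nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


def ga_initial_centroids(points, k, pop_size=20, generations=30,
                         mutation_rate=0.2, seed=0):
    """Phase 1: evolve chromosomes of k row indices toward a low SSE (k >= 2)."""
    rng = random.Random(seed)
    n = len(points)

    def cost(chrom):
        return sse(points, [points[i] for i in chrom])

    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[:pop_size // 2]      # assumed selection: fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k)             # single-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:      # mutation: replace one gene
                child[rng.randrange(k)] = rng.randrange(n)
            if len(set(child)) == k:              # reject repeated row indices
                children.append(child)
        population = parents + children
    return [points[i] for i in min(population, key=cost)]


def kmeans(points, centroids, max_iter=100):
    """Phase 2: Lloyd's K-means iterations from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else old)   # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def davies_bouldin(points, labels, centroids):
    """Phase 3: Davies-Bouldin Index; lower is better. Assumes distinct centroids."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    return sum(max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k


# Synthetic demo data (not from the report): three well-separated 2-D blobs.
data_rng = random.Random(42)
centres = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in centres for _ in range(20)]

init = ga_initial_centroids(points, k=3)
labels, centroids = kmeans(points, init)
dbi = davies_bouldin(points, labels, centroids)
print("Davies-Bouldin Index:", round(dbi, 3))
```

The sketch keeps the genetic algorithm confined to choosing starting rows, so K-means itself is unchanged and only its initialization differs. The report's list of figures mentions roulette wheel selection; the simpler truncation selection used here is a stand-in, not a claim about the report's method.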

Table of Contents

Certificate
Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables

1. Introduction
    1.1 Introduction to Machine Learning
    1.2 Unsupervised Learning
    1.3 Types of Clustering
        1.3.1 Partitioning Methods
        1.3.2 Hierarchical Clustering
        1.3.3 Fuzzy Clustering
        1.3.4 Model Based Clustering
    1.4 Comparison of Clusters
        1.4.1 Euclidean Distance
        1.4.2 Manhattan Distance
        1.4.3 Edit Distance
        1.4.4 Hamming Distance
        1.4.5 Density Based Clustering
    1.5 Techniques to Find the Optimum Number of Clusters
        1.5.1 Elbow Method
        1.5.2 Gap Statistical Method
        1.5.3 Davies Bouldin Index
        1.5.4 Hamming Distance
    1.6 Methods to Find the Ideal Count of Clusters
        1.6.1 Elbow Method
        1.6.2 Average Silhouette Method
        1.6.3 Gap Statistical Method
        1.6.4 Davies Bouldin Index Method
        1.6.5 Dunn Index Method
    1.7 Basic K-means Algorithm
    1.8 Genetic Algorithm
    1.9 Organization of Thesis

2. Literature Survey
    2.1 Clustering
    2.2 Partitioning Clustering

3. Research Problem
    3.1 Problem Statement
    3.2 Research Gaps
    3.3 Objectives
    3.4 Methodology

4. System Design and Development
    4.1 Proposed Hybrid Technique
    4.2 Basic Genetic Algorithm
        4.2.1 Application of Genetic Algorithm
        4.2.2 Example of MaxOne Genetic Algorithm
    4.3 Proposed Algorithm (Genetic K-means)
    4.4 Conclusion

5. Implementation and Experimental Results
    5.1 Implementation of Proposed Technique
        5.1.1 Iris Dataset for Implementation
    5.2 Experimental Results
        5.2.1 Confusion Matrix of K-means Clustering
        5.2.2 Confusion Matrix of Genetic K-means Clustering
        5.2.3 Test for Performance of Accuracy
    5.3 Conclusion

6. Conclusion and Future Scope
    6.1 Conclusion
    6.2 Limitations
    6.3 Future Scope

References

List of Figures

Figure No.  Description

1.1   Inter and Intra Similarities of Cluster
1.2   Evaluation Graph of Elbow Method
1.3   Evaluation Graph of Silhouette Method
1.4   Evaluation Graph of Gap Statistical Method
1.5   Flowchart of K-means Clustering
1.6   Clustering on Iris Dataset
1.7   Genetic Algorithm Chromosomes and Population
1.8   Execution Steps of Genetic Algorithm
2.1   Clustering of Scattered Documents
4.1   Implementation Methodology for Clustering of Dataset
4.2   Process of Genetic Algorithm
4.3   Roulette Wheel Selection of Genetic Algorithm
4.4   Flowchart of Proposed Algorithm
5.1   Code of K-means on Iris Dataset
5.2   Result of K-means on Iris Dataset
5.3   Code of Genetic K-means Clustering on Iris Dataset
5.4   Result of Genetic K-means on Iris Dataset
5.5   Code of K-means on Wine Dataset
5.6   Result of K-means on Wine Dataset
5.7   Code of Genetic K-means on Wine Dataset
5.8   Result of Genetic K-means on Wine Dataset
5.9   Confusion Matrix obtained from K-means Algorithm
5.10  Confusion Matrix obtained from Genetic K-means Algorithm

List of Tables

Table No.  Description

1.1   Crossover Operation on Chromosome S1 and S3
1.2   Result of Crossover on Chromosome S1 and S3
1.3   Mutation Operation on Chromosome S1 and S3
1.4   Result of Mutation on Chromosome S1 and S3
4.1   Initialization of Chromosome for MaxOne Problem
4.2   Arrangement of Chromosome based on Fitness Value
4.3   Crossover of Chromosome S1 and S3
4.4   Crossover Result of Chromosome S1 and S3
4.5   Crossover of Chromosome S2 and S4
4.6   Crossover Result of Chromosome S2 and S4
4.7   Crossover of Chromosome S5 and S6
4.8   Crossover Result of Chromosome S5 and S6
4.9   Mutation Result of Chromosomes
4.10  Iris Dataset for Genetic K-means Clustering
4.11  Normalized Dataset for Genetic K-means Clustering
4.12  Selected Row Indices and Chromosomes
4.13  Calculated Distance and Assignment of Cluster
4.14  Clusters obtained for Fifteen Records
5.1   Accuracy obtained from K-means and Genetic Algorithm
5.2   Intra Cluster Distance using K-means Algorithm
5.3   Intra Cluster Distance using Proposed Algorithm
5.4   Inter Cluster Distance using K-means Algorithm
5.5   Inter Cluster Distance using Proposed Algorithm
