Certificate

The document describes a clustering technique called genetic K-means clustering. It introduces clustering and the main clustering techniques, then explains the basic K-means algorithm and the genetic algorithm. It then proposes a hybrid genetic K-means algorithm, which uses a genetic algorithm to initialize the cluster centroids for K-means clustering. Finally, it implements the proposed technique on the Iris dataset and compares its performance with standard K-means clustering.

Uploaded by

Vatsal Singh

Certificate

I hereby declare that the work presented in this report entitled “Clustering Techniques in
Machine Learning”, in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering/Information Technology,
submitted in the Department of Computer Science & Engineering and Information Technology,
Jaypee University of Information Technology, Waknaghat, is an authentic record of my own
work carried out over a period from July 2022 to May 2023 under the supervision of Dr.
Monika Bharti, Assistant Professor (SG).
The matter embodied in the report has not been submitted for the award of any other degree or
diploma.

Vatsal Singh, 191286

This is to certify that the above statement made by the candidate is true to the best of my
knowledge.

Dr. Monika Bharti

Assistant Professor (SG)
Computer Science and Engineering/Information Technology
Dated:

Acknowledgement

The successful completion of any task would be incomplete without acknowledging the
people who made it possible and whose constant guidance and encouragement secured the
success.

First of all, I wish to acknowledge the benevolence of the omnipotent God, who gave me strength
and courage to overcome all obstacles and showed me the silver lining in the dark clouds.
With a profound sense of gratitude and heartiest regard, I express my sincere feelings of
indebtedness to my guide, Dr. Monika Bharti, for her positive attitude, excellent guidance,
constant encouragement, keen interest, invaluable co-operation, generous attitude and, above
all, her blessings. She has been a source of inspiration for me.

Last but not least, I would like to express my heartfelt thanks to my parents and my
friends, who with their thought-provoking views, veracity and whole-hearted cooperation
helped me in doing this project.

Vatsal Singh
(191286)

Abstract

Recent years have seen a rapid increase in the volume and diversity of data produced by
scientific endeavours and the business environment. Gathering and analysing such large
amounts of data is challenging. Data mining is a method for uncovering hidden relationships
and meaningful information in data; however, owing to their fundamental complexity, typical
data mining techniques cannot be applied directly to large data.

One of the most crucial problems in data mining and machine learning is data clustering, the
task of identifying homogeneous groups among the items under study. Designing clustering
algorithms has recently attracted the interest of many researchers. The biggest difficulty in
clustering is that little is known in advance about the dataset provided. The choice of input
parameters, such as the number of clusters and the number of nearest neighbours, makes the
problem harder still: selecting these values incorrectly produces poor clustering results. In
addition, the accuracy of these methods is poor when the dataset contains clusters of
complicated shapes, densities and sizes, or noise and outliers.

In this research, we present a novel method for unsupervised clustering. Our strategy has
three phases. In the first phase, we use a genetic algorithm, with crossover and mutation
applied to the dataset, to determine the initial cluster centroids. In the second phase,
K-means clustering starts from the centroids produced by the genetic algorithm and yields a
set of clusters for the supplied dataset. In the third phase, these clusters are evaluated
using the Davies Bouldin Index. We call this algorithm the Genetic K-means Method. Our tests
demonstrate that the proposed method is effective at locating clusters of various non-convex
shapes, sizes and densities in the presence of noise and outliers, with greater accuracy, and
that it performs better than the K-means approach.
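
The three phases of the Genetic K-means Method described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. The chromosome encoding (k row indices into the dataset), the SSE-based fitness, truncation selection, single-point crossover, random-reset mutation, and all function and parameter names (`ga_initial_centroids`, `pop_size`, `p_mut`, and so on) are hypothetical choices made for the example. The demo data is synthetic, not the Iris dataset.

```python
import numpy as np


def sse(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())


def ga_initial_centroids(X, k, pop_size=20, generations=30, p_mut=0.1, rng=None):
    """Phase 1 (assumed design): evolve sets of k row indices, lower SSE is fitter."""
    rng = np.random.default_rng(rng)
    n = len(X)
    pop = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: sse(X, X[c]))       # best chromosomes first
        parents = pop[: pop_size // 2]             # truncation selection (assumption)
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, k) if k > 1 else 0
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])  # crossover
            for g in range(k):                     # mutation: reset a gene to a random row
                if rng.random() < p_mut:
                    child[g] = rng.integers(n)
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda c: sse(X, X[c]))
    return X[best].astype(float)


def kmeans(X, centroids, n_iter=100):
    """Phase 2: plain Lloyd's K-means starting from the given centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # An empty cluster keeps its previous centroid.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), centroids


def davies_bouldin(X, labels, centroids):
    """Phase 3: Davies Bouldin Index (lower means more compact, better separated)."""
    k = len(centroids)
    scatter = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
                        if np.any(labels == i) else 0.0 for i in range(k)])
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    worst = []
    for i in range(k):
        r = [(scatter[i] + scatter[j]) / sep[i, j] for j in range(k) if j != i and sep[i, j] > 0]
        worst.append(max(r) if r else 0.0)
    return float(np.mean(worst))


if __name__ == "__main__":
    # Synthetic demo data: three Gaussian blobs (not the Iris dataset).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in [(0, 0), (10, 10), (-10, 10)]])
    init = ga_initial_centroids(X, k=3, rng=0)
    labels, centroids = kmeans(X, init)
    print("Davies Bouldin Index:", davies_bouldin(X, labels, centroids))
```

Encoding a chromosome as row indices keeps every candidate centroid on an actual data point, which is one simple way to make crossover and mutation operate directly on the dataset. Other encodings, such as real-valued centroid coordinates, would work as well.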

Table of Contents

Certificate
Acknowledgement
Abstract
Table of Contents
List of Figures
List of Tables

1. Introduction
    1.1 Introduction to Machine Learning
    1.2 Unsupervised Learning
    1.3 Types of Clustering
        1.3.1 Partitioning Methods
        1.3.2 Hierarchical Clustering
        1.3.3 Fuzzy Clustering
        1.3.4 Model Based Clustering
    1.4 Comparison of Clusters
        1.4.1 Euclidean Distance
        1.4.2 Manhattan Distance
        1.4.3 Edit Distance
        1.4.4 Hamming Distance
        1.4.5 Density Based Clustering
    1.5 Techniques to Find the Optimum Number of Clusters
        1.5.1 Elbow Method
        1.5.2 Gap Statistical Method
        1.5.3 Davies Bouldin Index
        1.5.4 Hamming Distance
    1.6 Methods to Find the Ideal Count of Clusters
        1.6.1 Elbow Method
        1.6.2 Average Silhouette Method
        1.6.3 Gap Statistical Method
        1.6.4 Davies Bouldin Index Method
        1.6.5 Dunn Index Method
    1.7 Basic K-means Algorithm
    1.8 Genetic Algorithm
    1.9 Organization of Thesis
2. Literature Survey
    2.1 Clustering
    2.2 Partitioning Clustering
3. Research Problem
    3.1 Problem Statement
    3.2 Research Gaps
    3.3 Objectives
    3.4 Methodology
4. System Design and Development
    4.1 Proposed Hybrid Technique
    4.2 Basic Genetic Algorithm
        4.2.1 Application of Genetic Algorithm
        4.2.2 Example of MaxOne Genetic Algorithm
    4.3 Proposed Algorithm (Genetic K-means)
    4.4 Conclusion
5. Implementation and Experimental Results
    5.1 Implementation of Proposed Technique
        5.1.1 Iris Dataset for Implementation
    5.2 Experimental Results
        5.2.1 Confusion Matrix of K-means Clustering
        5.2.2 Confusion Matrix of Genetic K-means Clustering
        5.2.3 Test for Performance of Accuracy
    5.3 Conclusion
6. Conclusion and Future Scope
    6.1 Conclusion
    6.2 Limitations
    6.3 Future Scope
References

List of Figures

Figure No.  Description

1.1   Inter and Intra Similarities of Cluster
1.2   Evaluation Graph of Elbow Method
1.3   Evaluation Graph of Silhouette Method
1.4   Evaluation Graph of Gap Statistical Method
1.5   Flowchart of K-means Clustering
1.6   Clustering on Iris Dataset
1.7   Genetic Algorithm Chromosomes and Population
1.8   Execution Steps of Genetic Algorithm
2.1   Clustering of Scattered Documents
4.1   Implementation Methodology for Clustering of Dataset
4.2   Process of Genetic Algorithm
4.3   Roulette Wheel Selection of Genetic Algorithm
4.4   Flowchart of Proposed Algorithm
5.1   Code of K-means on Iris Dataset
5.2   Result of K-means on Iris Dataset
5.3   Code of Genetic K-means Clustering on Iris Dataset
5.4   Result of Genetic K-means on Iris Dataset
5.5   Code of K-means on Wine Dataset
5.6   Result of K-means on Wine Dataset
5.7   Code of Genetic K-means on Wine Dataset
5.8   Result of Genetic K-means on Wine Dataset
5.9   Confusion Matrix obtained from K-means Algorithm
5.10  Confusion Matrix obtained from Genetic K-means Algorithm

List of Tables

Table No.  Description

1.1   Crossover Operation on Chromosome S1 and S3
1.2   Result of Crossover on Chromosome S1 and S3
1.3   Mutation Operation on Chromosome S1 and S3
1.4   Result of Mutation on Chromosome S1 and S3
4.1   Initialization of Chromosome for MaxOne Problem
4.2   Arrangement of Chromosome based on Fitness Value
4.3   Crossover of Chromosome S1 and S3
4.4   Crossover Result of Chromosome S1 and S3
4.5   Crossover of Chromosome S2 and S4
4.6   Crossover Result of Chromosome S2 and S4
4.7   Crossover of Chromosome S5 and S6
4.8   Crossover Result of Chromosome S5 and S6
4.9   Mutation Result of Chromosomes
4.10  Iris Dataset for Genetic K-means Clustering
4.11  Normalized Dataset for Genetic K-means Clustering
4.12  Selected Row Indices and Chromosomes
4.13  Calculated Distance and Assignment of Cluster
4.14  Clusters obtained for Fifteen Records
5.1   Accuracy obtained from K-means and Genetic Algorithm
5.2   Intra Cluster Distance using K-means Algorithm
5.3   Intra Cluster Distance using Proposed Algorithm
5.4   Inter Cluster Distance using K-means Algorithm
5.5   Inter Cluster Distance using Proposed Algorithm
