Certificate
I hereby declare that the work presented in this report entitled “Clustering Techniques in
Machine Learning”, in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering/Information Technology,
submitted in the Department of Computer Science & Engineering and Information Technology,
Jaypee University of Information Technology, Waknaghat, is an authentic record of my own
work carried out over a period from July 2022 to May 2023 under the supervision of Dr.
Monika Bharti, Assistant Professor (SG).
The matter embodied in the report has not been submitted for the award of any other degree or
diploma.
This is to certify that the above statement made by the candidate is true to the best of my
knowledge.
Acknowledgement
The successful completion of any task would be incomplete without acknowledging the
people who made it possible and whose constant guidance and encouragement secured the
success.
First of all, I wish to acknowledge the benevolence of the omnipotent God, who gave me the strength
and courage to overcome all obstacles and showed me the silver lining in the dark clouds.
With a profound sense of gratitude and heartiest regard, I express my sincere feelings of
indebtedness to my guide, Dr. Monika Bharti, for her positive attitude, excellent guidance,
constant encouragement, keen interest, invaluable co-operation, generous attitude and, above
all, her blessings. She has been a source of inspiration for me.
Last but not least, I would like to express my heartfelt thanks to my parents and my
friends, who with their thought-provoking views, veracity and wholehearted cooperation
helped in doing this project.
Vatsal Singh
(191286)
Abstract
Recent years have seen a rapid increase in the volume and diversity of data produced by
scientific endeavours and the business environment. There are challenges associated with
gathering and analysing such large amounts of data. Data mining is a method for uncovering
hidden relationships and meaningful information inside data; however, owing to their
fundamental complexity, typical data mining techniques cannot be applied directly to large
datasets.
One of the most crucial problems in the fields of data mining and machine learning is data
clustering. Clustering is the task of identifying homogeneous groups among the items under
study. Recently, the design of clustering algorithms has attracted the interest of many
researchers. The biggest difficulty in clustering is that we lack prior knowledge about the
dataset that has been provided. Additionally, the selection of input parameters used in these
methods, including the number of clusters, the number of nearest neighbours, and other
elements, makes the clustering problem more difficult: selecting these values incorrectly
produces poor clustering results. Moreover, the accuracy of these methods is poor when the
dataset comprises clusters with complicated shapes, densities and sizes, or contains noise
and outliers. In this work, we propose a novel method for unsupervised clustering. Our
approach consists of three operational phases. In the first phase, we employ a genetic
algorithm, applying crossover and mutation to the dataset, to determine the initial cluster
centroids. In the second phase, these centroids are used to initialise K-means clustering,
which yields a set of clusters for the supplied dataset. In the third phase, these clusters
are evaluated using the Davies Bouldin Index. We call this algorithm the Genetic K-means
Method. We present experiments that demonstrate the effectiveness of the proposed method in
locating clusters of various non-convex shapes, sizes and densities in the presence of noise
and outliers, with greater accuracy. These experiments show that the proposed method
performs better than the K-means approach.
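The three phases described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this report. The abstract does not specify the chromosome encoding, the fitness function, or the selection scheme, so the sketch assumes that a chromosome is a list of k data points, that fitness is the within-cluster sum of squared errors, and that selection keeps the fitter half of the population. All function names and parameter defaults are hypothetical, and k is assumed to be at least 2.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sse(points, centroids):
    """Assumed GA fitness: sum of squared distances to the nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)

def genetic_init(points, k, rng, pop_size=20, generations=30, mutation_rate=0.1):
    """Phase 1: evolve k initial centroids (chromosome = list of k data points)."""
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ch: sse(points, ch))
        parents = population[: pop_size // 2]      # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k)              # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:       # mutation: swap in a random point
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = parents + children
    return min(population, key=lambda ch: sse(points, ch))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def kmeans(points, centroids, iters=50):
    """Phase 2: Lloyd's K-means iterations starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iters):
        labels = assign(points, centroids)
        new = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(old)                    # leave an empty cluster in place
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)

def davies_bouldin(points, centroids, labels):
    """Phase 3: Davies Bouldin Index; lower values mean better separation."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            return float("inf")                    # degenerate clustering
        scatter.append(sum(dist(p, centroids[j]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            if j != i:
                d = dist(centroids[i], centroids[j])
                if d == 0:
                    return float("inf")            # coincident centroids
                ratios.append((scatter[i] + scatter[j]) / d)
        total += max(ratios)
    return total / k

def genetic_kmeans(points, k, seed=0):
    """Run the three phases in order and return centroids, labels and the index."""
    rng = random.Random(seed)
    init = genetic_init(points, k, rng)
    centroids, labels = kmeans(points, init)
    return centroids, labels, davies_bouldin(points, centroids, labels)
```

Calling `genetic_kmeans(points, k)` on a list of coordinate tuples returns the final centroids, one cluster label per point, and the Davies Bouldin Index of the result. Using the genetic search only to pick starting centroids keeps the K-means phase unchanged, so the two approaches can be compared on the same index.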
Table of Contents
Certificate
Acknowledgement
Abstract
Table of Contents
1.4.5. Density Based Clustering
1.5.4. Hamming Distance
2. Literature Survey
3. Research Problem
3.3. Objectives
3.4. Methodology
4.3. Conclusion
5.3. Conclusion
6.2. Limitations
References
List of Figures
List of Tables