
DSA Presentation Group 6


Presented By Group 6

Data Science and Analysis

UNSUPERVISED LEARNING

HIMANI KHANDELWAL 201091070
RIJAB FATIMA 201091005
TANIA SHARMA 201091002
PARINA JAIN 201091066
AYUSHI WAKODE 201091034
PRATIKSHA NAIK 201091069
Introduction to Unsupervised Learning

Unsupervised learning is the training of a machine
using information that is neither classified nor
labeled, allowing the algorithm to act on that
information without guidance. Here the task of the
machine is to group unsorted information
according to similarities, patterns, and differences,
without any prior training on the data.
Unlike supervised learning, no teacher is provided,
which means no training will be given to the machine.
The machine must therefore find the
hidden structure in unlabeled data by itself.
EXAMPLE OF UNSUPERVISED LEARNING

For instance, suppose the machine is given an image
containing both dogs and cats, which it has never
seen before. The machine has no idea about the
features of dogs and cats, so it cannot categorize
the image as 'dogs and cats'. But it can categorize
the pictures according to their similarities, patterns,
and differences: the picture can easily be divided
into two parts, the first containing all pictures with
dogs in them and the second containing all pictures
with cats in them. Nothing was learned beforehand,
which means there was no training data or examples.
It allows the model to work on its own to
discover patterns and information that were
previously undetected. It mainly deals with
unlabeled data.
BLOCK DIAGRAM OF UNSUPERVISED LEARNING
Importance of unsupervised learning

Annotating large datasets is very costly, so we can label
only a few examples manually. Example: speech recognition.

There may be cases where we don't know how many classes the
data is divided into, or what those classes are. Example: data mining.

We may want to use clustering to gain some insight into the
structure of the data before designing a classifier.
ISSUES WITH UNSUPERVISED LEARNING

Unsupervised learning is intrinsically more difficult than supervised
learning, as it does not have corresponding outputs.

The result of an unsupervised learning algorithm might be less
accurate, as the input data is not labeled and the algorithm does
not know the exact output in advance.

It can be expensive, as it might require human intervention to
understand the patterns and correlate them with domain
knowledge, while in general we would like as little human
intervention as possible.

Unsupervised learning algorithms

In unsupervised learning, unlabeled data is fed to the machine
learning model; this unlabeled data is interpreted and processed
by algorithms that divide the objects into groups according to
the similarities and differences between them.
Unsupervised learning algorithms can be categorized into two
types of problems:

1. Clustering
2. Association
Association Rule Learning

Association rule learning is an unsupervised learning technique
that checks for the dependency of one data element on another
data element and maps them accordingly, so that it can be more
cost-effective.
It tries to discover interesting relations or associations between
the variables of the dataset, relying on various rules to find
interesting relations between variables in the database.
Association rule learning is an important approach in machine
learning, employed in market basket analysis, web usage mining,
continuous production, etc.
How Association Rules Work

There are a few key terms we need to be familiar with to understand
how association rules work.
Apriori: the algorithm we will use for building all the rules here.
Itemset: a collection of items. An n-itemset is a set of n items;
simply, it is the set of items purchased by customers.
Support: the percentage of all transactions in which X and Y occur
together.
Confidence: the percentage of transactions containing X that also
contain Y.
Lift: how many times more often X and Y occur together than
expected if they were statistically independent of each other.
Minlen: the minimum number of items in the rule.
Maxlen: the maximum number of items in the rule.
Target: indicates the type of association mined.
Frequent Itemset Generation: find the most frequent itemsets in the data
based on a predetermined minimum support and the minimum and maximum item counts.
Rule Generation:
LHS => RHS: the left-hand side and right-hand side are usually used to express
how often item A and item B occur together.

Association rule learning works on the concept of if-then statements, such as
"if A, then B".
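To make these metrics concrete, here is a minimal pure-Python sketch that computes support, confidence, and lift for the single rule {bread} => {milk} over a toy, made-up transaction set (item names and counts are purely illustrative):

```python
# Toy transaction set (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of all transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of the union divided by support of the LHS."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """How much more often LHS and RHS co-occur than if independent."""
    return confidence(lhs, rhs) / support(rhs)

rule_lhs, rule_rhs = {"bread"}, {"milk"}
print(f"support    = {support(rule_lhs | rule_rhs):.2f}")  # 0.60
print(f"confidence = {confidence(rule_lhs, rule_rhs):.2f}")  # 0.75
print(f"lift       = {lift(rule_lhs, rule_rhs):.2f}")  # 0.94
```

Here the lift is slightly below 1, meaning bread and milk co-occur a little less often than statistical independence would predict.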
Clustering

Clustering can be considered the most
important unsupervised learning problem; like
every other problem of this kind, it deals
with finding structure in a collection of
unlabeled data. Clustering is a method of
grouping objects into clusters such that
objects with the most similarities remain in a
group and have few or no similarities with the
objects of another group.

Cluster analysis finds the commonalities between the data objects and
categorizes them as per the presence and absence of those commonalities.
The Goals of Clustering

The goal of clustering is to determine the internal grouping in a set of
unlabeled data. But how do we decide what constitutes a good
clustering?

It can be shown that there is no absolute "best" criterion that
is independent of the final aim of the clustering.

Consequently, it is the user who should supply this criterion, in such a
way that the result of the clustering will suit their needs.

To find a particular clustering solution, we need to define the
similarity measures for the clusters; a small sketch of two common
measures follows.
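As an illustration of such user-supplied criteria, here is a minimal sketch (assuming NumPy is available) of two common measures, Euclidean distance and cosine similarity:

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance; smaller means more similar."""
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    """Angle-based similarity in [-1, 1]; larger means more similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x, y = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(euclidean(x, y))          # 2.236...
print(cosine_similarity(x, y))  # 1.0 (same direction, different magnitude)
```

Which measure suits a clustering task depends on whether magnitude or direction of the feature vectors matters for the application.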
Applications of Clustering

Clustering has a myriad of uses in a variety of industries. Some common
applications for clustering include the following:
market segmentation
social network analysis
search result grouping
medical imaging
image segmentation
anomaly detection

After clustering, each cluster is assigned a number called a cluster ID. Now, you can
condense the entire feature set for an example into its cluster ID. Representing a
complex example by a simple cluster ID makes clustering powerful. Extending the
idea, clustering data can simplify large datasets.
Clustering Methods

Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
K-Means Clustering
Hierarchical Clustering
Competitive Learning
Partitioning Clustering

It is a type of clustering that divides the data
into non-hierarchical groups. It is also known as
the centroid-based method. The most common
example of partitioning clustering is the K-Means
clustering algorithm.
In this type, the dataset is divided into a set of k
groups, where K defines the number of
pre-defined groups. The cluster centers are
chosen in such a way that the distance between
the data points of one cluster and their own
centroid is minimal compared to the distance to
any other cluster's centroid.
Density-Based Clustering

The density-based clustering method connects
highly dense areas into clusters, and
arbitrarily shaped distributions are formed as
long as the dense regions can be connected. The
algorithm does this by identifying different clusters
in the dataset and connecting the areas of high
density into clusters. The dense areas in data
space are separated from each other by sparser
areas. These algorithms can have difficulty
clustering the data points if the dataset has
varying densities and high dimensionality.
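A minimal sketch of density-based clustering using scikit-learn's DBSCAN on made-up points (the eps and min_samples values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # dense region A
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # dense region B
              [4.5, 4.5]])                           # isolated sparse point

# A point is "core" if it has min_samples neighbours within radius eps.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # e.g. [0 0 0 1 1 1 -1]; -1 marks a noise point
```

Note how the sparse point gets the label -1 (noise) rather than being forced into a cluster, which is characteristic of density-based methods.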
Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on
the probability that a data point belongs to a particular distribution. The grouping
is done by assuming some distribution, commonly the Gaussian distribution. An
example of this type is the Expectation-Maximization clustering algorithm, which uses
Gaussian Mixture Models (GMM).
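A minimal sketch of this approach, assuming scikit-learn's GaussianMixture (which fits a GMM by Expectation-Maximization) on synthetic two-Gaussian data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # samples from Gaussian 1
               rng.normal(6, 1, (50, 2))])   # samples from Gaussian 2

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X[:1])   # soft (probabilistic) membership
print(labels[:5], probs)
```

Unlike K-Means, predict_proba returns soft, probabilistic cluster memberships for each point.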
K-Means Clustering

K-Means Clustering is an
unsupervised learning algorithm
which groups an unlabeled
dataset into different clusters.
Here K defines the number of
pre-defined clusters that need to be
created in the process; for example,
if K=2, there will be two clusters.

The k-means clustering algorithm mainly performs two tasks:

1) Determines the best positions for the K center points, or centroids, by an
iterative process.
2) Assigns each data point to its closest k-center. The data points
near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities and is distant
from the other clusters.

Working of the K-Means Algorithm:

Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not come from the input
dataset).
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step: reassign each data point to the
new closest centroid.
Step-6: If any reassignment occurred, go to Step 4; otherwise FINISH.
Step-7: The model is ready.
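The steps above translate almost directly into code. A minimal NumPy sketch (illustrative only; empty-cluster handling and smarter initialization are deliberately omitted):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid moves (no reassignment happens).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)  # roughly near (0, 0) and (5, 5)
```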
K Nearest Neighbor Clustering

K-Nearest Neighbour is one of the
simplest machine learning algorithms,
based on the supervised learning
technique.
The K-NN algorithm assumes similarity
between the new case/data and the
available cases and puts the new case
into the category most similar to
the available categories.
The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be
easily classified into a well-suited category using the K-NN algorithm.
ALGORITHM:

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance from the new data point to the
stored data points.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these K neighbors, count the number of data points in each
category.

Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.

Step-6: Our model is ready.
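A minimal NumPy sketch of these steps (toy data; the labels "A" and "B" are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every stored point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the neighbours' categories.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # "A"
```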


Hierarchical Clustering
Hierarchical clustering is a popular method for grouping
objects. It creates groups so that objects within a group are
similar to each other and different from objects in other groups.
Clusters are visually represented in a hierarchical tree called a
dendrogram.

Hierarchical clustering has a couple of key benefits:

1. There is no need to pre-specify the number of clusters.
Instead, the dendrogram can be cut at the appropriate level to
obtain the desired number of clusters.
2. Data is easily summarized/organized into a hierarchy using
dendrograms. Dendrograms make it easy to examine and
interpret clusters.
Hierarchical Clustering Types
There are two main types of hierarchical clustering:
1. Agglomerative: Initially, each object is considered to be its own cluster.
According to a particular procedure, the clusters are then merged step by
step until a single cluster remains. At the end of the cluster merging process,
a cluster containing all the elements will be formed.
2. Divisive: The divisive method is the opposite of the agglomerative method.
Initially, all objects are considered to be in a single cluster. Then the division
process is performed step by step until each object forms a separate
cluster. The cluster division or splitting procedure is carried out according
to some principle, such as the maximum distance between neighboring objects in
the cluster.
Divisive Hierarchical Clustering

The divisive clustering algorithm is a top-down clustering approach: initially, all the points in
the dataset belong to one cluster, and splits are performed recursively as one moves down the
hierarchy.
Divisive Hierarchical Clustering Algorithm:
1. Partition the cluster into the two least similar clusters.
2. Proceed recursively to form new clusters until the desired number of clusters is obtained.
Agglomerative Hierarchical Clustering

Agglomerative clustering is the most
common type of hierarchical clustering,
used to group objects into clusters based on
their similarity.
The algorithm starts by treating each object
as a singleton cluster. Next, pairs of clusters
are successively merged until all clusters
have been merged into one big cluster
containing all objects.
Agglomerative clustering works in a "bottom-up" manner. That is, each object is
initially considered a single-element cluster (leaf). At each step of the algorithm,
the two clusters that are the most similar are combined into a new, bigger cluster
(node). This procedure is iterated until all points are members of one single big
cluster (root) (see figure below).
ALGORITHM:

Step-1: Treat each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.

Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster is left, giving the
sequence of clusterings illustrated in the images below.

Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
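A minimal sketch of agglomerative clustering with SciPy, which performs exactly this bottom-up merging and can draw the dendrogram (assumes scipy and matplotlib; the points and the cut at 3 clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])

Z = linkage(X, method="ward")                     # bottom-up merging
labels = fcluster(Z, t=3, criterion="maxclust")   # cut tree into 3 clusters
print(labels)

dendrogram(Z)   # visualize the merge hierarchy
plt.show()
```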
[Figures: Step 1 through Step 4 of the agglomerative merging process.]
Competitive Learning
Competitive learning is a form of
unsupervised learning in artificial
neural networks, in which nodes
compete for the right to respond to a
subset of the input data. A variant of
Hebbian learning, competitive
learning works by increasing the
specialization of each node in the
network. It is well suited to finding
clusters within data.
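A minimal NumPy sketch of the winner-take-all update at the heart of competitive learning (the node count, learning rate, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])   # two natural clusters
weights = rng.normal(2.5, 1.0, (2, 2))         # one weight vector per node
lr = 0.1                                       # learning rate (assumed)

for epoch in range(20):
    for x in rng.permutation(X):
        # Nodes compete: the closest weight vector wins the input.
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Only the winning node specializes: move it toward the input.
        weights[winner] += lr * (x - weights[winner])

print(weights)  # each node settles near one cluster centre
```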
Competitive Learning Algorithm

Competitive Learning Example: the Leader-Follower Algorithm
The leader algorithm is an incremental clustering algorithm generally used to cluster large
data sets. The algorithm is order-dependent and may form different clusters depending on
the order in which the data set is provided to it. The algorithm consists of the
following steps.

Step 1: Assign the first data item, P1, to cluster C1. This data item will be the leader of
cluster C1.

Step 2: Move to the next data item, say P2, and calculate its distance from the
leader P1. If the distance between P2 and leader P1 is less than a user-specified
threshold (t), then P2 is assigned to this cluster (cluster C1). If the distance
between leader P1 and data item P2 is more than the user-specified threshold t, then
form a new cluster C2 and assign P2 to it. P2 will be the leader of
cluster C2.

Step 3: For each remaining data item, the distance between the data point and the leader of
each cluster is calculated. If the distance between the data item and any of the leaders is
less than the user-specified threshold, the data point is assigned to that cluster. However, if the
distance between the data point and every cluster leader is more than the user-specified
threshold, a new cluster is created; the data point is assigned to that
cluster and considered its leader.

Step 4: Repeat Step 3 until all the data items are assigned to clusters.
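A minimal NumPy sketch of these steps (illustrative; the threshold t and the points are made up, and ties go to the nearest leader):

```python
import numpy as np

def leader_cluster(points, t):
    leaders = [points[0]]              # Step 1: first point leads cluster C1
    assignments = [0]
    for p in points[1:]:               # Steps 2-4: one pass over the data
        dists = [np.linalg.norm(p - leader) for leader in leaders]
        nearest = int(np.argmin(dists))
        if dists[nearest] < t:         # close enough: join existing cluster
            assignments.append(nearest)
        else:                          # too far from every leader: new cluster
            leaders.append(p)
            assignments.append(len(leaders) - 1)
    return leaders, assignments

points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.9], [0.1, 0.4]])
leaders, assignments = leader_cluster(points, t=1.0)
print(assignments)  # [0, 0, 1, 1, 0] with this ordering and threshold
```

Reordering the points can change the resulting clusters, which demonstrates the order dependence noted above.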
Applications of unsupervised learning
News Sections: Google News uses unsupervised learning to categorize
articles on the same story from various online news outlets. For
example, the results of a presidential election could be categorized
under their label for “US” news.
Computer vision: Unsupervised learning algorithms are used for visual
perception tasks, such as object recognition.
Medical imaging: Unsupervised machine learning provides essential
features to medical imaging devices, such as image detection,
classification and segmentation, used in radiology and pathology to
diagnose patients quickly and accurately.
Anomaly detection: Unsupervised learning models can comb through
large amounts of data and discover atypical data points within a
dataset. These anomalies can raise awareness around faulty equipment,
human error, or breaches in security.
Customer personas: Defining customer personas makes it easier to
understand common traits and business clients' purchasing habits.
Unsupervised learning allows businesses to build better buyer persona
profiles, enabling organizations to align their product messaging more
appropriately.
Recommendation Engines: Using past purchase behavior data,
unsupervised learning can help to discover data trends that can be used
to develop more effective cross-selling strategies. This is used to make
relevant add-on recommendations to customers during the checkout
process for online retailers.
