
Mid 2


1. Differentiate classification and prediction.

• Classification

• predicts categorical class labels (discrete or nominal)

• classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data.

• Examples

• Credit/loan approval: whether a loan applicant is safe or risky

• Medical diagnosis: if a tumor is cancerous or benign

• Fraud detection: if a transaction is fraudulent

• Web page categorization: which category it belongs to

• Typical algorithms: Decision Tree Induction, KNN classifier, Bayesian belief networks

• Numeric Prediction

• models continuous-valued functions, i.e., predicts unknown or missing values

• Applications

• Sales forecasting, weather forecasting.

• Linear regression and logarithmic regression are commonly used.
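As a quick illustration of the difference, here is a minimal, hedged sketch (scikit-learn assumed to be available; the tiny datasets are made up for illustration): a decision tree predicts a discrete class label, while linear regression predicts a continuous value.

```python
# Sketch only: classification predicts a categorical label, numeric prediction
# (regression) models a continuous-valued function. Data below is synthetic.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: e.g., credit/loan approval as a discrete class label
X_cls = [[25, 30000], [40, 80000], [35, 50000], [50, 120000]]   # [age, income]
y_cls = ["reject", "approve", "reject", "approve"]               # class labels
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[45, 90000]]))        # -> a categorical label

# Numeric prediction: e.g., sales forecasting as a continuous value
X_reg = [[1], [2], [3], [4]]             # month index
y_reg = [100.0, 120.0, 140.0, 160.0]     # sales figures
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                # -> a continuous value
```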

2. Where is a Bayesian belief network used?

Ans: A Bayesian belief network is used in applications where the input attributes are mutually
dependent on each other.

3. Discuss additional issues with k-Means.

Ans:

Sensitivity to Initial Conditions


K-means is sensitive to initial conditions. The algorithm randomly initializes the cluster
centroids at the beginning, and the final clustering results can vary depending on these initial
positions.
Different initializations can converge to different local optima, resulting in different clustering
outcomes. This makes the K-means algorithm less reliable and less reproducible.
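A small, hedged sketch of this sensitivity (scikit-learn assumed; the blob data is synthetic): running k-means with a single random initialization and different seeds can end in different local optima, visible as different final inertia (SSE) values.

```python
# Sketch: k-means run with one random initialization per seed; differing
# inertia values across seeds indicate convergence to different local optima.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=7)

for seed in (0, 1, 2, 3):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```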
Difficulty in Determining k
One of the drawbacks of the K-means algorithm is that we have to set the number of clusters
(k) in advance. Choosing an incorrect number of clusters can lead to inaccurate results.
Various methods are available to estimate the optimal k, such as silhouette
analysis or the elbow method, but they may not always provide a clear-cut answer.

If we use too small a k, we get clusters that are too broad. Conversely, an overly large k results in
clusters that are too specific.

It may require multiple runs to find the most suitable value of k, which can be time- and
resource-consuming; a short elbow-method sketch is shown below.
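A minimal elbow-method sketch (scikit-learn assumed; silhouette analysis via sklearn.metrics.silhouette_score works similarly for k ≥ 2): fit k-means for several candidate values of k and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply.

```python
# Elbow method sketch: inertia (within-cluster SSE) for a range of k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}")
# Plotting inertia vs. k and looking for the "elbow" suggests a suitable k.
```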
Inability to Handle Categorical Data
Another drawback of the K-means algorithm is its inability to handle categorical data. The
algorithm works with numerical data, where distances between data points can be calculated.
However, categorical data doesn’t have a natural notion of distance or similarity.
When categorical data is used with the K-means algorithm, it requires converting the
categories into numerical values, such as using one-hot encoding. One shortcoming of using
one-hot encoding is that it treats each feature independently and can degrade performance
since it can significantly increase data dimensionality.
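For example, here is a hedged sketch of one-hot encoding a categorical attribute before clustering (pandas assumed; the column names are illustrative), showing how a single categorical column expands into several binary columns and so increases dimensionality.

```python
# One-hot encoding sketch: the single "colour" column becomes three binary
# columns, which k-means can then treat as numeric features.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"],
                   "weight": [1.2, 0.8, 1.5, 0.9]})
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
#    weight  colour_blue  colour_green  colour_red
```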
Time Complexity
The time complexity of the algorithm is O(k · n · d · i), where k is the number of
clusters, n is the number of data points, d is the number of dimensions, and i is the
number of iterations.
So, even moderately large datasets can be challenging to handle if they're high-dimensional.

Explain the KNN classifier.

• The nearest-neighbor classifier represents each example as a data point in a d-dimensional space,
• where d is the number of attributes.
• Given a test example, we compute its proximity to the rest of the data points.
• The k nearest neighbors are the k data points closest to the test example.

Once the nearest-neighbor list is obtained, take the majority vote of the class labels among
the k nearest neighbors:

y' = argmax_v Σ_i I(v = y_i)

Where
– v is a class label
– y_i is the class label of one of the nearest neighbors
– I(·) is the indicator function that returns the value 1 if its argument is true and
0 otherwise.
Weight the vote according to distance
– weight factor w_i = 1/d(x', x_i)^2, so the weighted vote is y' = argmax_v Σ_i w_i · I(v = y_i)
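A small, self-contained sketch of the idea (NumPy only; the function and variable names are illustrative, not from any particular library): find the k nearest training points, then take a distance-weighted majority vote with w = 1/d^2.

```python
# Distance-weighted k-nearest-neighbour voting, as described above.
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    d = np.linalg.norm(X_train - x_test, axis=1)      # proximity to every point
    nearest = np.argsort(d)[:k]                       # indices of k nearest neighbors
    votes = {}
    for i in nearest:
        w = 1.0 / (d[i] ** 2 + 1e-12)                 # weight factor w = 1/d^2
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                  # weighted majority vote

X = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([1.5, 1.5]), k=3))   # -> "A"
```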

Ward's Method.

Text Clustering
• One popular text clustering algorithm is Ward's minimum-variance method.
• It is an agglomerative hierarchical clustering technique, and it tends to generate very
compact clusters.
• We can take either the Euclidean metric or the Hamming distance as the measure of
dissimilarity between feature vectors.
• The clustering method begins with n clusters, one for each text.
• At each stage two clusters are merged to generate a new cluster.
• The clusters Ck and Ci are merged to get a new cluster Cki by choosing the pair whose
merger gives the smallest increase in the sum-of-squares index E, described below.

Ward’s method starts with n clusters, each containing a single object. These n clusters are
combined to make one cluster containing all objects. At each step, the process makes a new
cluster that minimizes variance, measured by an index called E (also called the sum of
squares index).

At each step, the following calculations are made to find E:

1. Find the mean of each cluster.


2. Calculate the distance between each object in a particular cluster, and that
cluster’s mean.
3. Square the distances from Step 2.
4. Sum (add up) the squared values from Step 3.
5. Add up all the sums of squares from Step 4.
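A short sketch of these steps (NumPy and SciPy assumed; the helper name is made up): the function computes the sum-of-squares index E for a given partition, and SciPy's Ward linkage carries out the full sequence of minimum-variance merges.

```python
# Sum-of-squares index E for a set of clusters, plus Ward linkage via SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

def sum_of_squares_index(clusters):
    """clusters: list of (n_i, d) arrays, one array of points per cluster."""
    E = 0.0
    for c in clusters:
        mean = c.mean(axis=0)              # step 1: mean of the cluster
        E += ((c - mean) ** 2).sum()       # steps 2-5: squared distances, summed
    return E

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
print(sum_of_squares_index([X[:2], X[2:]]))   # E for a two-cluster partition
print(linkage(X, method="ward"))              # Ward's minimum-variance merge tree
```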

Different classes of Web Mining


1. Web Content Mining
2. Web Usage Mining
3. Web Structure Mining

Discuss hierarchical clustering.


Hierarchical clustering is an unsupervised machine
learning algorithm used to group data points into various clusters
based on the similarity between them. It is based on the idea of
creating a hierarchy of clusters, where each cluster is made up of
smaller clusters that can be further divided into even smaller
clusters. This hierarchical structure makes it easy to visualize the
data and identify patterns within the data.
Hierarchical clustering is of two types.

1. Agglomerative clustering
2. Divisive clustering

What is Agglomerative Clustering?


Agglomerative clustering is a type of data clustering method used in
unsupervised learning. It is an iterative process that groups similar
objects into clusters based on some measure of similarity.
Agglomerative clustering uses a bottom-up approach for dividing
data points into clusters. It starts with individual data points and
merges them into larger clusters until all of the objects are clustered
together.

The algorithm begins by assigning each object to its own cluster. It


then uses a distance metric to determine the similarity between
objects and clusters. If two clusters have similar elements, they are
merged together into a larger cluster. This continues until all objects
are grouped into one final cluster. For example, consider the
following image.

Agglomerative Clustering Example (image taken from a YouTube video on hierarchical clustering by Edureka)

 In this image, a population has been clustered into different


clusters. The individual persons are assigned their own cluster
at the bottom.
 As we move upwards, the clusters are merged to form bigger
clusters. For instance, the male and employed people are
clustered together and jobless male persons are put into a
different cluster. These clusters are then grouped together to
form a cluster representing all the males.
 Similarly, The individual females are assigned their own cluster
at the bottom. As we move upwards, the clusters are merged
to form bigger clusters. For instance, the female employed
people are clustered together and jobless females are put into
a different cluster. These clusters are then grouped together to
form a cluster representing all the females.
 At the top, the clusters representing males and females are
merged to form a single cluster representing the entire
population.

The advantage of agglomerative clustering over other techniques is


that it can handle large datasets effectively, as it only requires
comparisons between pairs of objects or clusters. This makes it
relatively efficient for large datasets.
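A hedged scikit-learn sketch of agglomerative clustering (synthetic data; the parameters are illustrative): each point starts in its own cluster and clusters are merged bottom-up until the requested number remains.

```python
# Bottom-up (agglomerative) clustering with Ward linkage.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels)        # the cluster label assigned to each data point
```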

What is Divisive Clustering?


Divisive clustering is also a type of hierarchical clustering that is
used to create clusters of data points. It is an unsupervised learning
algorithm that begins by placing all the data points in a single
cluster and then progressively splits the clusters until each data
point is in its own cluster. Divisive clustering is useful for analyzing
datasets that may have complex structures or patterns, as it can
help identify clusters that may not be obvious at first glance.

Divisive clustering works by first assigning all the data points to one
cluster. Then, it looks for ways to split this cluster into two or more
smaller clusters. This process continues until each data point is in its
own cluster. For example, consider the following image.
Divisive Clustering Example (image taken from a YouTube video on hierarchical clustering by Edureka)

 In the given example, the divisive clustering algorithm first


considers the entire population as a single cluster. Then, it
divides the population into two clusters containing males and
females.
 Next, the algorithm divides the clusters containing males and
females on the basis of whether they are employed or
unemployed.
 The algorithm keeps dividing the clusters into smaller clusters
until it reaches a point where the clusters consist of a single
person.
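One common way to realize the top-down idea in code is recursive bisection: start with all points in a single cluster and repeatedly split the largest cluster in two (here with 2-means). This is only an illustrative sketch of a divisive strategy, not the only possible one; scikit-learn and synthetic data are assumed.

```python
# Divisive (top-down) clustering sketch via repeated bisection with k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=40, centers=4, random_state=1)
clusters = [np.arange(len(X))]                     # start: one cluster of all indices

while len(clusters) < 4:                           # split until 4 clusters remain
    biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    idx = clusters.pop(biggest)
    halves = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X[idx])
    clusters.append(idx[halves == 0])
    clusters.append(idx[halves == 1])

print([len(c) for c in clusters])                  # sizes of the resulting clusters
```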

Advantages of Hierarchical Clustering


Hierarchical clustering is a widely used technique in data analysis,
which involves the grouping of objects into clusters based on their
similarity. This method of clustering is advantageous in a variety of
ways and can be used to solve various types of problems. Here are
10 advantages of hierarchical clustering:

1. Robustness: Hierarchical clustering is more robust than other


methods since it does not require a predetermined number of
clusters to be specified. Instead, it creates hierarchical clusters
based on the similarity between the objects, which makes it
more reliable and accurate.
2. Easy to interpret: Hierarchical clustering produces a tree-like
structure that is easy to interpret and understand. This makes
it ideal for data analysis as it can provide insights into the data
without requiring complex algorithms or deep learning
models.
3. Flexible: Hierarchical clustering is a flexible method that can be
used on any type of data. It can also be used with different
types of similarity functions and distance measures, allowing
for customization based on the application at hand.
4. Scalable: Hierarchical clustering is a scalable method that can
easily handle large datasets without becoming computationally
expensive or time-consuming. This makes it suitable for
applications such as customer segmentation where large
datasets need to be processed quickly and accurately.
5. Visualization: Hierarchical clustering produces a visual tree
structure that can be used to gain insights into the data quickly
and easily. This makes it an ideal choice for exploratory data
analysis as it allows researchers to gain an understanding of
the data at a glance.
6. Versatile: Hierarchical clustering can be used for both
supervised and unsupervised learning tasks, making it
extremely versatile in its range of applications.
7. Easier to apply: Since there are no parameters to specify in
hierarchical clustering, it is much easier to apply compared to
other methods such as k-means clustering or k-prototypes
clustering. This makes it ideal for novice users who need to
quickly apply clustering techniques with minimal effort.
8. Greater accuracy: Hierarchical clustering often tends to
produce superior results compared to other methods of
clustering due to its ability to create more meaningful clusters
based on similarities between objects rather than arbitrary
boundaries set by cluster centroids or other parameters.
9. Non-linearity: Agglomerative or divisive clustering can handle
non-linear datasets better than other methods, which makes it
suitable for cases where linearity cannot be assumed in the
dataset being analyzed.
10. Multiple-level output: By producing a hierarchical tree
structure, hierarchical clustering provides multiple levels of
output which allows users to view data at different levels of
detail depending on their needs. This flexibility makes it an
attractive choice in many situations where multiple levels of
analysis are required.

Disadvantages of Hierarchical Clustering


While hierarchical clustering is a powerful tool for discovering
patterns and relationships in data sets, it also has its drawbacks.
Here are 10 disadvantages of hierarchical clustering:
1. It is sensitive to outliers. Outliers have a significant influence
on the clusters that are formed, and can even cause incorrect
results if the data set contains these types of data points.
2. Hierarchical clustering is computationally expensive. The time
required to run the algorithm grows rapidly (typically at least quadratically) as the
number of data points increases, making it difficult to use for
large datasets.
3. The results of agglomerative or divisive clustering can
sometimes be difficult to interpret due to their
complexity. The dendrogram representation of the clusters can
be hard to understand and visualize, making it difficult to draw
meaningful conclusions from the results.
4. It does not guarantee optimal results or the best possible
clusterings. Since it is an unsupervised learning algorithm, it
relies on the researcher’s judgment and experience to assess
the quality of the results.
5. Extracting a flat partition from the hierarchy still requires choosing a
number of clusters (i.e., where to cut the dendrogram), which
may not be known beforehand. This makes it difficult to use in
certain applications where this information is not available.
6. The results produced by hierarchical clustering may be
dependent on the order in which the data points are
processed, making it difficult to reproduce or generalize them
for other datasets with similar characteristics.
7. The algorithm does not provide any flexibility when dealing
with multi-dimensional data sets since all variables must be
treated equally in order for accurate results to be obtained.
8. Agglomerative or divisive clustering is prone to producing
overlapping clusters, where different groups of data points may
share common characteristics and thus be grouped together
even though they should not belong to the same cluster.
9. Hierarchical clustering requires manual intervention for
selecting the appropriate number of clusters, which can be
time-consuming and prone to errors if done incorrectly or
without proper knowledge about the data set being analyzed.
10. It cannot handle categorical variables effectively, since
they lack a natural numeric distance and thus will
fail to produce meaningful clusters if used directly in hierarchical
clustering algorithms.

Applications of Hierarchical Clustering


Hierarchical clustering is a type of unsupervised machine learning
that can be used for many different applications. It is used to group
similar data points into clusters, which can then be used for further
analysis. Here are 10 applications of hierarchical clustering:

1. Customer segmentation: Agglomerative or divisive clustering


can be used to group customers into different clusters based
on their demographic, spending, and other characteristics. This
can be used to better understand customer behavior and to
target marketing campaigns.
2. Image segmentation: Hierarchical clustering can be used to
segment images into different regions, which can then be used
for further analysis.
3. Text analysis: Hierarchical clustering can be used to group text
documents based on their content, which can then be used for
text mining or text classification tasks.
4. Gene expression analysis: Hierarchical clustering can be used
to group genes based on their expression levels, which can
then be used to better understand gene expression patterns.
5. Anomaly detection: Hierarchical clustering can be used to
detect anomalies in data, which can then be used for fraud
detection or other tasks.
6. Recommendation systems: Hierarchical clustering can be used
to group users based on their preferences, which can then be
used to recommend items to them.
7. Risk assessment: Agglomerative or divisive clustering can be
used to group different risk factors in order to better
understand the overall risk of a portfolio.
8. Network analysis: Hierarchical clustering can be used to group
nodes in a network based on their connections, which can then
be used to better understand network structures.
9. Market segmentation: Hierarchical clustering can be used to
group markets into different segments, which can then be used
to target different products or services to them.
10. Outlier detection: Hierarchical clustering can be used to
detect outliers in data, which can then be used for further
analysis.

Web Structure Mining.

Methods of Text Clustering

Ans: Ward's minimum-variance method
K-means
Hierarchical methods
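A brief, hedged sketch tying these together (scikit-learn assumed; the four tiny documents are made up): documents are turned into TF-IDF vectors and then clustered both with Ward's minimum-variance method and with k-means.

```python
# Text clustering sketch: TF-IDF features + Ward's method and k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, KMeans

docs = ["data mining and clustering",
        "hierarchical clustering of text",
        "football match results",
        "latest football scores"]

X = TfidfVectorizer().fit_transform(docs).toarray()   # dense array for Ward linkage

ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Ward:   ", ward_labels)
print("k-means:", kmeans_labels)
```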
