
Unit-V

Classification & Clustering

Classification:
Classification in data mining is the process of assigning data points to
distinct, predefined classes. It makes it easier to organize data sets of
diverse kinds: whether the database is small and simple or large and
complex, classification provides a consistent way of labelling every record.

Types of Classification in Data Mining

Below are the key types of data mining classifications:

Supervised Classification

This type of data mining classification fits when you are looking for a
specific target value to make classifications and predictions in data mining.
The target is known in advance and may take two or more possible values.

Unsupervised Classification

Here, no attention is paid to predetermined attributes, and the target value
plays almost no role. Unsupervised classification is used only to map out
hidden relations and structures in the data.

Semi-Supervised Classification

This classification of data mining is the middle ground between the supervised
and unsupervised classification of data mining. Here, a combination of labelled
and unlabelled datasets is used during the training period.

Reinforcement Classification

This data mining classification type involves trial and error to determine the
best way to react to a situation. It also allows the software agent to adjust its
behaviour based on feedback from the environment.

Below are the major classification techniques in data mining:


Generative Classification

It models how the data of each individual class is generated. This type of
classifier makes assumptions about and estimates the distribution of each class,
learns from that generated model, and predicts the labels of unseen data.

Discriminative Classification

This classifies all the rows of the data directly by learning the boundary
between classes rather than modelling each class separately; the quality of the
available data largely determines how accurate the classification is.

Data Mining Classification as Per the Type of Knowledge Mined

Data mining systems can also be classified according to the kind of knowledge
they mine. This means the classification relies on a few functionalities, namely:

 Correlation And Association Analysis


 Classification and Prediction in data mining
 Characterization
 Discrimination
 Evolution Analysis
 Outlier Analysis

Clustering:
Clustering is the process of grouping a set of abstract objects into classes of
similar objects.

 A cluster of data objects can be treated as one group.


 While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
 The main advantage of clustering over classification is that it is
adaptable to changes and helps single out useful features that
distinguish different groups.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as
market research, pattern recognition, data analysis, and image
processing.
 Clustering can also help marketers discover distinct groups in their
customer base. And they can characterize their customer groups
based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities and gain
insight into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in
an earth observation database. It also helps in the identification of
groups of houses in a city according to house type, value, and
geographic location.
 Clustering also helps in classifying documents on the web for
information discovery.
 Clustering is also used in outlier detection applications such as
detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain
insight into the distribution of data to observe characteristics of
each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −

 Scalability − We need highly scalable clustering algorithms to deal
with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be
capable of being applied to any kind of data such as interval-based
(numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. It
should not be bounded to distance measures that tend to find only
small spherical clusters.
 High dimensionality − The clustering algorithm should not only be able
to handle low-dimensional data but also high-dimensional data.
 Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and
may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable,
comprehensible, and usable.

Clustering Methods

Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning
method constructs ‘k’ partitions of the data. Each partition will represent a
cluster and k ≤ n. It means that it will classify the data into k groups,
which satisfy the following requirements −

 Each group contains at least one object.


 Each object must belong to exactly one group.

Points to remember −

 For a given number of partitions (say k), the partitioning method
will create an initial partitioning.
 Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to another.

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −

 Agglomerative Approach
 Divisive Approach

Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start
with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keeps on doing so until
all of the groups are merged into one or until the termination condition
holds.

Divisive Approach

This approach is also known as the top-down approach. In this, we start
with all of the objects in the same cluster. In each successive iteration, a
cluster is split up into smaller clusters. This is done until each object is in
its own cluster or the termination condition holds. This method is rigid,
i.e., once a merging or splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of
hierarchical clustering −
 Perform careful analysis of object linkages at each hierarchical
partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and
then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighbourhood exceeds some threshold, i.e., for each data point within a
given cluster, the radius of a given cluster has to contain at least a
minimum number of points.
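
As an illustration of this density-based idea, the short sketch below uses scikit-learn's
DBSCAN implementation on invented 2-D points; the eps (neighbourhood radius) and
min_samples values are arbitrary example choices, not recommendations.

```python
# Minimal density-based clustering sketch using scikit-learn's DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points (all values invented).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(5, 2)),   # sparse noise points
])

# Grow clusters while the neighbourhood of radius eps contains >= min_samples points.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("Cluster labels found:", set(labels))     # label -1 marks noise points
```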

Grid-based Method

In this, the objects together form a grid. The object space is quantized
into a finite number of cells that form a grid structure.

Advantages

 The major advantage of this method is fast processing time.


 It is dependent only on the number of cells in each dimension in the
quantized space.

Model-based methods

In this method, a model is hypothesized for each cluster to find the best
fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data
points.

This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outliers or noise into
account. It therefore yields robust clustering methods.

Constraint-based Method

In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user
expectation or the properties of desired clustering results. Constraints
provide us with an interactive way of communicating with the clustering
process. Constraints can be specified by the user or the application
requirement.

Decision Trees:
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.

A decision tree for the concept buy_computer indicates whether a
customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple
and fast.

Decision Tree Induction Algorithm

A machine learning researcher, J. Ross Quinlan, developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he
presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a
greedy approach. In this algorithm, there is no backtracking; the trees
are constructed in a top-down recursive divide-and-conquer manner.
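
ID3 and C4.5 themselves are not shipped with common Python libraries, but the same
greedy, top-down induction can be sketched with scikit-learn's DecisionTreeClassifier
using an entropy (information gain) split criterion, which is similar in spirit. The tiny
buy_computer-style dataset below is invented for illustration.

```python
# Hypothetical buy_computer-style data: each row is (age_group, student),
# encoded numerically; the label says whether the customer bought a computer.
from sklearn.tree import DecisionTreeClassifier, export_text

# age: 0 = youth, 1 = middle_aged, 2 = senior; student: 0 = no, 1 = yes
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0,      1,      1,      1,      0,      1]        # buys_computer

# criterion="entropy" uses information gain, the measure behind ID3/C4.5.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(export_text(tree, feature_names=["age", "student"]))  # the induced tree
print(tree.predict([[0, 1]]))                               # a young student
```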

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training
data due to noise or outliers. The pruned trees are smaller and less
complex.
Tree Pruning Approaches

There are two approaches to prune a tree −

 Pre-pruning − The tree is pruned by halting its construction early.


 Post-pruning - This approach removes a sub-tree from a fully grown
tree.

Cost Complexity

The cost complexity is measured by the following two parameters −

 Number of leaves in the tree, and


 Error rate of the tree.

Covering Rules:

IF-THEN Rules

A rule-based classifier makes use of a set of IF-THEN rules for
classification. We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes
THEN buy_computer = yes
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part (the condition) consists of one or more attribute
tests, and these tests are logically ANDed.
 The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is
satisfied.
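
To make this concrete, a rule's antecedent can be viewed as a boolean test on a tuple. The
sketch below encodes R1 as a plain Python predicate; the dictionary layout of a tuple is an
invented convention for illustration.

```python
# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
def r1_antecedent(tuple_):
    """Return True if the ANDed attribute tests of R1 are satisfied."""
    return tuple_["age"] == "youth" and tuple_["student"] == "yes"

def r1(tuple_):
    """Apply rule R1: return the consequent only when the antecedent holds."""
    return "yes" if r1_antecedent(tuple_) else None   # None = rule does not fire

customer = {"age": "youth", "student": "yes", "income": "medium"}
print(r1(customer))   # -> "yes"
```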

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN
rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −


 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically
ANDed.
 The leaf node holds the class prediction, forming the rule
consequent.

Rule Induction Using Sequential Covering Algorithm

The sequential covering algorithm can be used to extract IF-THEN rules from
the training data. We do not need to generate a decision tree first. In
this algorithm, each rule for a given class covers many of the tuples of
that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As
per the general strategy, the rules are learned one at a time. Each time a
rule is learned, the tuples covered by that rule are removed and the
process continues with the remaining tuples. This contrasts with decision
tree induction, where the path to each leaf corresponds to a rule and the
whole set of rules is effectively learned simultaneously.

Rule Pruning

Rules are pruned for the following reasons −

 The assessment of quality is made on the original set of training
data. The rule may perform well on the training data but less well on
subsequent data. That is why rule pruning is required.
 A rule is pruned by removing a conjunct. The rule R is pruned if the
pruned version of R has greater quality, as assessed on an
independent set of tuples.

FOIL is a simple and effective method for rule pruning. For a
given rule R,

FOIL_Prune = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered
by R, respectively.
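
A minimal sketch of this FOIL_Prune computation, with pos and neg simply passed in as
counts obtained from an independent set of tuples; the numbers used are invented.

```python
def foil_prune(pos, neg):
    """FOIL_Prune = (pos - neg) / (pos + neg); a higher value means a better rule."""
    return (pos - neg) / (pos + neg)

# A rule covering 40 positive and 10 negative validation tuples.
print(foil_prune(40, 10))   # 0.6

# A pruned version covering 38 positives and only 2 negatives scores higher,
# so the pruned rule would be preferred.
print(foil_prune(38, 2))    # 0.9
```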

Task prediction

Task prediction in data mining refers to the process of predicting the nature or type of a task
that should be performed based on the available data and the goals of the data mining
process. This prediction helps in determining the appropriate techniques, algorithms, and
methods that should be applied to the data to achieve the desired outcomes.

Here’s how task prediction typically works in data mining:


1. Data Understanding: First, the data mining practitioner or analyst needs to
understand the dataset they are working with. This involves exploring the data to
identify its characteristics, variables, and potential relationships.
2. Goal Definition: Clearly defining the goals of the data mining project is crucial. This
includes understanding what insights or predictions are sought from the data.
3. Task Prediction: Based on the data characteristics and the project goals, the next step
is to predict the appropriate data mining tasks. This can include tasks such as
classification, regression, clustering, association rule mining, anomaly detection, or
others.
4. Selection of Techniques: Once the task is predicted, the appropriate data mining
techniques and algorithms can be selected. For example, if classification is predicted
as the task, algorithms like decision trees, random forests, or support vector machines
might be considered.
5. Implementation and Evaluation: After selecting the techniques, they are
implemented on the data. The results are evaluated to assess how well they meet the
project goals. This evaluation might involve metrics specific to the chosen task (e.g.,
accuracy for classification, RMSE for regression).
6. Iterative Process: Data mining is often an iterative process where initial results may
lead to refinement of the task prediction, choice of techniques, or even the data
preprocessing steps.

Statistical Classification

Statistical classification in data mining is a fundamental technique used to categorize data
into predefined classes or categories based on a set of training data containing observations
(or instances) with known class labels. It is a supervised learning approach where the goal is
to build a model that can accurately predict the class label of new, unseen data based on its
attributes.

Key Concepts in Statistical Classification:

1. Training Data: This is the initial dataset used to build and train the classification
model. It consists of instances where each instance is described by a set of attributes
(features) and is associated with a known class label.
2. Class Labels: These are the predefined categories or classes that the data instances
belong to. For example, in email spam detection, classes could be "spam" and "not
spam".
3. Attributes or Features: These are the characteristics or variables that describe each
instance. They are used as input to the classification model to predict the class label.
4. Classification Model: The model is created using the training data and a chosen
classification algorithm. The model learns patterns and relationships between the
attributes and class labels in the training data.
5. Classification Algorithms: There are various algorithms used for building
classification models, each with its strengths and weaknesses. Common algorithms
include:
o Decision Trees: Trees where each node represents a test on an attribute, each
branch represents an outcome of the test, and each leaf node represents a class
label.
o Naive Bayes: Based on Bayes' theorem, assumes independence between
features given the class.
o Logistic Regression: Models the probability of a certain class using a logistic
function.
o Support Vector Machines (SVM): Constructs hyperplanes in a high-
dimensional space to separate instances of different classes.

Steps in Statistical Classification:

1. Data Preprocessing: This includes handling missing values, normalizing or
standardizing data, and possibly reducing dimensionality through techniques like PCA
(Principal Component Analysis).
2. Feature Selection: Choosing relevant features that contribute most to predicting the
class labels can improve the performance of the classification model.
3. Model Training: Using the training data, the selected classification algorithm is
applied to learn patterns and relationships between features and class labels.
4. Model Evaluation: The performance of the classification model is assessed using
evaluation metrics such as accuracy, precision, recall, F1-score, ROC curves, etc.
Cross-validation techniques may also be employed to ensure the model's
generalizability.
5. Model Testing: Finally, the trained model is tested on unseen data (test set) to
evaluate its performance in predicting class labels for new instances.
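
The training, evaluation, and testing steps above can be sketched end to end with
scikit-learn; the Iris dataset and the choice of a Gaussian Naive Bayes model are
illustrative only.

```python
# End-to-end sketch: split, train, evaluate a statistical classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = GaussianNB().fit(X_train, y_train)      # model training
y_pred = model.predict(X_test)                  # model testing on unseen data

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))    # precision, recall, F1 per class
```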

Applications of Statistical Classification:

Statistical classification is widely used in various fields, including:

 Medical Diagnosis: Classifying patients into different disease categories based on
symptoms and test results.
 Sentiment Analysis: Determining the sentiment (positive, negative, neutral) of text
data such as reviews or social media posts.
 Image Recognition: Identifying objects or patterns within images based on their
features.
 Credit Risk Assessment: Classifying loan applicants into low-risk and high-risk
categories based on their financial history and other attributes.

Bayesian network

A Bayesian network in data mining is a probabilistic graphical model that represents a set of
variables and their conditional dependencies via a directed acyclic graph (DAG). It is
particularly useful for modelling uncertain relationships between variables and for making
predictions or inferences based on data.

Here are some key aspects and uses of Bayesian networks in data mining:

1. Structure: The Bayesian network consists of nodes (variables) and directed edges
(dependencies). Nodes represent variables (e.g., attributes or features in a dataset),
and edges represent probabilistic dependencies between variables.
2. Conditional Probability: Each node in a Bayesian network is associated with a
conditional probability table (CPT) that quantifies the probability of the node given its
parent nodes (nodes that have edges pointing into it).
3. Inference: Bayesian networks allow for efficient inference about the probability
distribution over variables, given observed evidence. This is achieved through
algorithms such as variable elimination or belief propagation, which calculate
probabilities based on the network structure and CPTs.
4. Learning: Bayesian networks can be learned from data using various methods such as
constraint-based learning (using conditional independence tests), score-based learning
(optimizing a scoring metric like Bayesian Information Criterion), or hybrid methods.
5. Predictive Modeling: Once a Bayesian network is learned from data, it can be used
for predictive modeling. For example, given some evidence (observed values of
certain variables), the network can predict the probabilities or values of other
variables.
6. Causal Inference: Bayesian networks can also be used to infer causal relationships
between variables, although establishing causality requires careful modeling and
interpretation of the network.
7. Decision Support: They are valuable in decision support systems where uncertainty
is inherent, such as medical diagnosis, risk assessment, fault diagnosis, and more.
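
A tiny, self-contained sketch of the inference idea: two binary variables, Rain → WetGrass,
with hand-specified (invented) probabilities, and the posterior P(Rain | WetGrass = true)
computed by direct enumeration of the joint distribution. In practice a dedicated library
(e.g., pgmpy) would be used instead.

```python
# Bayesian network sketch: Rain -> WetGrass, inference by enumeration.
# All probabilities below are invented for illustration.
P_rain = {True: 0.2, False: 0.8}               # prior P(Rain)
P_wet_given_rain = {True: 0.9, False: 0.1}     # CPT: P(WetGrass=true | Rain)

def posterior_rain_given_wet():
    """Compute P(Rain=true | WetGrass=true) via Bayes' theorem."""
    joint_true = P_rain[True] * P_wet_given_rain[True]      # P(Rain, Wet)
    joint_false = P_rain[False] * P_wet_given_rain[False]   # P(~Rain, Wet)
    evidence = joint_true + joint_false                     # P(Wet)
    return joint_true / evidence

print(posterior_rain_given_wet())   # 0.18 / 0.26 ≈ 0.692
```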

Instance based methods:

Instance-based methods, also known as memory-based learning or lazy learning methods, are
a class of algorithms in data mining and machine learning that make predictions based on
similarities between new instances (data points) and instances seen during training. Instead of
constructing a general model of the data, these methods store the training instances and use
them directly to make predictions for new data points. Here are some key characteristics and
examples of instance-based methods:

Characteristics:

1. No explicit model creation: Instance-based methods do not build an explicit model
during the training phase. Instead, they retain the entire training dataset or a subset of
it.
2. Lazy learning: They are often referred to as lazy learners because they delay
generalization until a new instance needs to be classified or predicted. This contrasts
with eager learners (like decision trees or neural networks) that build a model before
being presented with new instances.
3. Similarity measure: Prediction or classification is based on the similarity measure
between the new instance and instances in the training set. Common similarity
measures include Euclidean distance, cosine similarity, and Pearson correlation
coefficient, depending on the nature of the data.
4. Localized decision making: Instance-based methods can adapt well to local
structures in the data and are capable of handling complex decision boundaries.
5. Suitable for non-parametric data: They are particularly useful when the underlying
data distribution is not well-defined or when it changes over time, as they do not make
assumptions about the distribution.
Examples:

1. k-Nearest Neighbors (k-NN): One of the most popular instance-based methods,
where the prediction for a new instance is based on the majority class among its k
nearest neighbors in the feature space (see the sketch after this list).
2. Case-Based Reasoning (CBR): In CBR, solutions to new problems are retrieved by
comparing them to previously stored cases (instances) and adapting the solution of
similar cases.
3. Locally Weighted Learning (LWL): This method assigns weights to instances in the
training set based on their proximity to the query instance, with closer instances
having a higher influence on the prediction.
4. Prototype Methods: These methods involve selecting representative examples
(prototypes) from the training data and using them to classify new instances.
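
As an illustration of k-NN (item 1 above), the sketch below uses scikit-learn's
KNeighborsClassifier; the dataset and the choice k = 5 are illustrative.

```python
# Lazy learning sketch: k-NN stores the training data and classifies a new
# point by majority vote among its k closest neighbours.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)                   # "training" just stores the instances

print("Test accuracy:", knn.score(X_test, y_test))
print("Prediction for one unseen flower:", knn.predict(X_test[:1]))
```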

Advantages:

 Adaptability: Instance-based methods can handle complex decision boundaries and
can adapt well to local data structures.
 Interpretability: Predictions are often easier to interpret because they are based
directly on similar instances.
 Incremental learning: They can easily incorporate new instances into the existing
model without the need for retraining the entire dataset.

Challenges:

 Computational cost: Prediction can be computationally expensive, especially with
large datasets, because it requires comparing the new instance to all stored instances.
 Storage requirements: Storing the entire training dataset or a large subset of it can be
memory-intensive, particularly with high-dimensional data.
 Sensitivity to noise and irrelevant features: Instance-based methods can be
sensitive to noisy or irrelevant features, which may affect the similarity measure and
hence the predictions.

Instance-based methods are particularly useful in scenarios where the relationship between
input and output variables is complex or where the data distribution is not well understood.
Their flexibility and reliance on similarity measures make them valuable tools in various
applications such as classification, regression, and pattern recognition.

Linear models:

Linear models are fundamental tools in data mining and machine learning for regression and
classification tasks. They assume a linear relationship between the input variables (features)
and the output variable (target). Here are key aspects of linear models in data mining:

Characteristics:

1. Linear Relationship: Linear models assume that the relationship between the input
variables X = (X1, X2, ..., Xp) and the output variable Y can be approximated by a
linear function:

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

where β0, β1, ..., βp are coefficients (parameters) to be estimated, and ε represents the
error term.

2. Parameter Estimation: Linear models estimate the coefficients β using
techniques like Ordinary Least Squares (OLS), which minimizes the sum of squared
residuals between the predicted and actual values (see the sketch after this list).
3. Interpretability: Linear models provide interpretable coefficients that indicate the
strength and direction of the relationship between each input variable and the output.
4. Scalability: They are computationally efficient and scalable to large datasets when
compared to more complex models like neural networks or ensemble methods.
5. Assumptions: Linear models assume that the relationship between variables is linear,
errors are normally distributed, and there is no multicollinearity (high correlation
between predictors).
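
A minimal sketch of Ordinary Least Squares estimation for the model above, using NumPy's
least-squares solver on synthetic data; the true coefficients are invented so that the recovered
estimates can be checked.

```python
# OLS sketch: estimate beta in Y = beta0 + beta1*X1 + beta2*X2 + error.
import numpy as np

rng = np.random.RandomState(0)
n = 200
X = rng.normal(size=(n, 2))                      # two predictors X1, X2
true_beta = np.array([1.5, -2.0, 0.7])           # [beta0, beta1, beta2], invented
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.5, size=n)

X_design = np.column_stack([np.ones(n), X])      # add an intercept column
# lstsq minimizes the sum of squared residuals between predicted and actual values.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated coefficients:", beta_hat)       # close to [1.5, -2.0, 0.7]
```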

Types of Linear Models:

1. Simple Linear Regression: Involves predicting a continuous output variable Y
based on a single input variable X.

Y = β0 + β1X + ε

2. Multiple Linear Regression: Predicts a continuous output Y based on multiple
input variables X1, X2, ..., Xp.

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε

3. Logistic Regression: Used for binary classification tasks where the output Y is
binary (0 or 1). It models the probability of the output belonging to a certain class
using the logistic function.

logit(P(Y=1|X)) = β0 + β1X1 + β2X2 + ... + βpXp

4. Linear Discriminant Analysis (LDA): A technique used for classification that
models the distribution of the predictors separately in each class and uses Bayes'
theorem to estimate the posterior probabilities.

Advantages:

 Interpretability: Coefficients provide insights into the relationship between
variables.
 Efficiency: Computationally efficient and suitable for large datasets.
 Ease of implementation: Relatively straightforward to implement and understand.
Challenges:

 Linear Assumption: Limited to linear relationships between variables.


 Underfitting: Can struggle with complex relationships and non-linear patterns in the
data.
 Sensitive to outliers: Outliers can disproportionately affect parameter estimation.

Applications:

 Prediction: Predicting continuous or categorical outcomes based on input variables.


 Classification: Binary or multiclass classification tasks using logistic regression or
linear discriminant analysis.
 Understanding Relationships: Identifying important features and their impact on the
target variable.

Cluster/2:

In traditional data mining terminology, "Cluster 2" does not have a specific standard
meaning; it usually just denotes one of the clusters produced by a clustering algorithm.
Clustering generally refers to the process of grouping similar data points together into
clusters or segments based on some similarity or distance measure. The following points
clarify clustering in data mining:

Clustering in Data Mining:

Clustering is an unsupervised learning technique where the goal is to partition a set of data
points into clusters such that points within the same cluster are more similar to each other
than to points in other clusters. Here are some key points:

1. Objective: The main objective of clustering is to discover inherent groupings in the
data without prior knowledge of group memberships.
2. Similarity Measure: Clustering algorithms typically use a similarity measure (e.g.,
Euclidean distance, cosine similarity) to assess how similar or dissimilar data points
are.
3. Cluster Centers: Depending on the algorithm, clusters may be represented by
centroids (e.g., k-means), medoids (e.g., k-medoids), or other types of representations.
4. Types of Clustering Algorithms:
o Partitioning Algorithms: Like k-means, where data points are grouped into k
clusters based on similarity.
o Hierarchical Algorithms: Build a hierarchy of clusters, which can be
represented as a tree (dendrogram).
o Density-based Algorithms: Such as DBSCAN, which identify clusters as
regions of high density separated by regions of low density.
5. Applications: Clustering finds applications in various fields such as customer
segmentation, anomaly detection, image segmentation, and more.

"Cluster 2" Interpretation:


If "Cluster 2" specifically refers to something in your context that is not standard
terminology, it could potentially denote the second cluster identified by a clustering
algorithm (e.g., k-means clustering assigns data points to clusters numbered 1, 2, etc.).
Alternatively, it might refer to a specific subset of data points or a partition of data within a
particular application domain.

Example:

Suppose you have a dataset of customer shopping behaviors. Using clustering, you might
identify several clusters such as "Cluster 1" representing frequent high-spending customers,
"Cluster 2" representing occasional low-spending customers, and so on. Each cluster
(including "Cluster 2") would have distinct characteristics that can be used for targeted
marketing or other business strategies.

Cobweb:

The COBWEB algorithm is an incremental conceptual clustering method used in machine
learning and data mining. It’s designed to generate a hierarchical clustering of data instances,
organizing them into a classification tree. Here’s a brief overview of how it works:

Key Concepts

1. Conceptual Clustering: Unlike traditional clustering methods that focus on grouping
similar instances, conceptual clustering aims to group instances that share common
descriptions or concepts.
2. Incremental Learning: COBWEB can update its clusters incrementally as new data
arrives, making it suitable for dynamic datasets.

How COBWEB Works

1. Categorization Tree: COBWEB builds a tree where each node represents a concept,
and the root node represents the most general concept. The tree is built incrementally
as data instances are added.
2. Category Utility (CU): This is a heuristic measure used by COBWEB to evaluate the
quality of a clustering. It balances intra-class similarity (instances within a cluster are
similar) and inter-class dissimilarity (instances in different clusters are dissimilar).
3. Four Operations: When a new instance is added, COBWEB considers four
operations at each node:
o Insert: Insert the instance into an existing category.
o Create: Create a new category for the instance.
o Merge: Merge two existing categories to better accommodate the instance.
o Split: Split an existing category to better accommodate the instance.

Advantages

 Incremental Learning: COBWEB can handle continuous streams of data and update
clusters dynamically.
 Hierarchical Structure: It provides a detailed hierarchy of concepts, which can be
useful for understanding the relationships between different clusters.
Disadvantages

 Scalability: COBWEB may not scale well with very large datasets.
 Sensitivity to Noise: The algorithm can be sensitive to noise and outliers in the data.

K-means:

K-means is a widely used clustering algorithm in data mining and machine learning. It
partitions a dataset into k clusters, where each data point belongs to the cluster with the
nearest mean. Here's a detailed explanation of how K-means works and its characteristics:

Key Concepts

1. Clusters: Groups of data points that are similar to each other.


2. Centroids: The center of each cluster. K-means aims to minimize the distance
between data points and their respective centroids.

How K-means Works

1. Initialization: Select k initial centroids randomly or using some heuristic.
2. Assignment Step: Assign each data point to the nearest centroid, forming k
clusters.
3. Update Step: Recalculate the centroids as the mean of all data points in each cluster.
4. Convergence: Repeat the assignment and update steps until the centroids no longer
change significantly or a maximum number of iterations is reached.

Algorithm Steps

1. Choose the number of clusters k.
2. Initialize the centroids randomly or using a method like k-means++ for better initial
placement.
3. Repeat until convergence:
o Assign each data point to the nearest centroid.
o Recalculate the centroids as the mean of the assigned data points.
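
A short sketch of these steps using scikit-learn's KMeans (k = 2 and the toy points are
illustrative; k-means++ initialization is used, as in step 2).

```python
# K-means sketch: partition toy 2-D points into k = 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one group near (1, 1)
              [8.0, 8.2], [7.9, 7.8], [8.1, 8.0]])    # another near (8, 8)

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # assignment + update steps until convergence

print("Cluster labels:", labels)
print("Centroids:\n", kmeans.cluster_centers_)
```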

Advantages

 Simplicity: Easy to implement and understand.


 Speed: Generally fast and efficient for small to medium-sized datasets.
 Scalability: Can be used on large datasets with appropriate optimizations.

Disadvantages

 Choosing k: The number of clusters k must be specified in advance, which can
be challenging without domain knowledge.
 Local Optima: The algorithm can converge to local optima depending on the initial
placement of centroids.
 Assumption of Spherical Clusters: K-means assumes that clusters are spherical and
of roughly equal size, which may not always be the case.
 Sensitivity to Outliers: Outliers can significantly affect the placement of centroids.
Applications

 Customer Segmentation: Grouping customers based on purchasing behavior,
demographics, etc.
 Image Compression: Reducing the number of colors in an image by clustering
similar colors.
 Document Clustering: Organizing documents into topics or categories.

Hierarchical methods:

Hierarchical clustering methods are a family of clustering algorithms that build a hierarchy of
clusters, which can be represented as a tree structure called a dendrogram. These methods do
not require the number of clusters to be specified in advance, making them flexible for
exploratory data analysis.

Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up): This method starts with each data point as a single cluster and
iteratively merges the closest pairs of clusters until all points belong to a single cluster or a
stopping criterion is met.
2. Divisive (Top-Down): This method starts with all data points in a single cluster and
recursively splits the most heterogeneous clusters until each data point is in its own cluster
or a stopping criterion is met.

Steps in Hierarchical Clustering

Agglomerative Clustering

1. Start with each data point as a single cluster.


2. Compute the distance (similarity) matrix for all pairs of clusters.
3. Merge the two closest clusters based on the distance matrix.
4. Update the distance matrix to reflect the distances between the new cluster and the
remaining clusters.
5. Repeat steps 3 and 4 until all points are merged into a single cluster or the desired number
of clusters is achieved.

Divisive Clustering

1. Start with all data points in a single cluster.


2. Choose a cluster to split (often the one with the highest variance or largest size).
3. Partition the selected cluster into two sub-clusters.
4. Repeat steps 2 and 3 until each data point is its own cluster or the desired number of
clusters is achieved.

Linkage Criteria

Different linkage criteria determine how the distance between clusters is calculated:

1. Single Linkage (Minimum Linkage): Distance between the closest points of the clusters.
2. Complete Linkage (Maximum Linkage): Distance between the farthest points of the clusters.
3. Average Linkage: Average distance between all pairs of points in the two clusters.
4. Ward's Method: Minimizes the total within-cluster variance.
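
A minimal agglomerative clustering sketch with SciPy, using Ward's method as the linkage
criterion; the toy points are invented, and the resulting dendrogram is cut into two clusters
with fcluster.

```python
# Agglomerative hierarchical clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

Z = linkage(X, method="ward")                      # build the merge hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters

print(labels)   # e.g. [1 1 1 2 2 2]
```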

Advantages

 No Need to Specify Number of Clusters: The dendrogram can be cut at the desired level to
get the appropriate number of clusters.
 Hierarchical Structure: Provides a detailed view of the data, showing how clusters are
formed at each step.
 Versatility: Can handle various types of data and similarity measures.

Disadvantages

 Computational Complexity: Agglomerative methods can be computationally expensive,
especially for large datasets.
 Sensitivity to Noise and Outliers: Hierarchical clustering can be sensitive to outliers, which
can skew the clustering process.
 Scalability: Less scalable compared to partition-based methods like K-means.

Mining real data: preprocessing data from a real medical domain

Preprocessing data from a real medical domain is a crucial step in the data mining process.
Proper preprocessing ensures that the data is clean, consistent, and ready for analysis. Here
are the steps typically involved in preprocessing medical data:

Steps in Data Preprocessing

1. Data Collection
o Sources: Electronic Health Records (EHRs), medical images, clinical trials,
wearable devices, etc.
o Formats: Structured (tables, spreadsheets), semi-structured (XML, JSON),
and unstructured (text, images).
2. Data Cleaning
o Handling Missing Values: Impute missing values using methods like
mean/mode/median imputation, k-nearest neighbors (KNN), or more advanced
techniques.
o Removing Duplicates: Identify and remove duplicate records.
o Outlier Detection: Identify and handle outliers using statistical methods or
machine learning models.
3. Data Transformation
o Normalization/Standardization: Scale numerical features to a standard range
(e.g., 0-1) or to have zero mean and unit variance.
o Encoding Categorical Variables: Convert categorical variables to numerical
format using one-hot encoding, label encoding, or binary encoding.
o Text Data Processing: Tokenization, stemming, lemmatization, and removing
stop words for text fields.
4. Data Integration
o Combining Data from Multiple Sources: Merge datasets from different
sources into a single dataset.
o Resolving Inconsistencies: Ensure that the merged data is consistent (e.g.,
resolving differences in measurement units or formats).
5. Data Reduction
o Feature Selection: Select the most relevant features using techniques like
correlation analysis, recursive feature elimination (RFE), or feature
importance from models.
o Dimensionality Reduction: Reduce the number of features using methods
like Principal Component Analysis (PCA) or t-SNE.
6. Data Discretization
o Bin Continuous Variables: Discretize continuous variables into bins (e.g.,
age groups).
7. Handling Imbalanced Data
o Resampling Techniques: Use oversampling (SMOTE) or undersampling to
balance class distribution.
o Synthetic Data Generation: Generate synthetic data to augment the dataset.
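
A compact sketch of a few of these steps (duplicate removal, imputation, encoding,
standardization) with pandas and scikit-learn on a small, invented EHR-like table; the
column names, values, and imputation/encoding choices are illustrative only.

```python
# Preprocessing sketch for a small, invented EHR-like dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":       [54, 61, None, 47, 61],
    "sex":       ["M", "F", "F", "M", "F"],
    "glucose":   [110, 145, 98, None, 145],
    "diagnosis": ["neg", "pos", "neg", "neg", "pos"],
})

df = df.drop_duplicates()                                    # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())             # impute missing values
df["glucose"] = df["glucose"].fillna(df["glucose"].median())
df = pd.get_dummies(df, columns=["sex"])                     # encode categorical variable

num_cols = ["age", "glucose"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # zero mean, unit variance

print(df)
```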

Data mining techniques to create a comprehensive and accurate model of data:

Creating a comprehensive and accurate model in data mining involves selecting appropriate
techniques that suit the nature of your data and the specific problem you're addressing. Here
are some commonly used data mining techniques:

1. Classification

Classification involves predicting the categorical label of new data instances based on past
observations. Common algorithms include:

 Decision Trees: Splits data into subsets based on the value of input features.
 Random Forest: An ensemble method that uses multiple decision trees to improve
accuracy and prevent overfitting.
 Support Vector Machines (SVM): Finds the hyperplane that best separates different
classes.
 Naive Bayes: Based on Bayes’ theorem, assumes independence between features.
 Neural Networks: Complex models inspired by the human brain, suitable for large
datasets.

2. Regression

Regression is used to predict a continuous value. Common algorithms include:

 Linear Regression: Models the relationship between the dependent and independent
variables as a linear function.
 Polynomial Regression: Extends linear regression by considering polynomial
relationships.
 Ridge/Lasso Regression: Linear regression variants that include regularization to
prevent overfitting.
 Support Vector Regression (SVR): Extension of SVM for regression problems.

3. Clustering

Clustering involves grouping data points into clusters such that points within a cluster are
more similar to each other than to those in other clusters. Common algorithms include:

 K-means: Partitions data into k clusters based on the distance to the centroids.
 Hierarchical Clustering: Builds a tree of clusters either by merging or splitting
existing clusters.
 DBSCAN: Density-based clustering that can identify clusters of varying shapes and
sizes.
 Gaussian Mixture Models (GMM): Assumes that the data is generated from a
mixture of several Gaussian distributions.

4. Association Rule Learning

Association rule learning finds interesting relationships (associations) among variables in
large databases. Common algorithms include:

 Apriori Algorithm: Finds frequent itemsets and generates association rules.


 Eclat Algorithm: Uses a depth-first search to find frequent itemsets.
 FP-Growth: Uses a tree structure to find frequent itemsets without candidate
generation.
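
A small frequent-itemset sketch, assuming the mlxtend library is available for the Apriori
algorithm; the one-hot transaction table and the support/confidence thresholds are invented.

```python
# Apriori sketch on a tiny one-hot encoded transaction table (mlxtend assumed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

transactions = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 1],
}).astype(bool)

frequent = apriori(transactions, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)

print(frequent)                                                      # frequent itemsets
print(rules[["antecedents", "consequents", "support", "confidence"]])
```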

5. Anomaly Detection

Anomaly detection identifies rare items, events, or observations that raise suspicions by
differing significantly from the majority of the data. Common algorithms include:

 Isolation Forest: Randomly partitions the data and isolates anomalies.


 One-Class SVM: Fits a boundary around the normal data points and identifies
anomalies outside this boundary.
 Local Outlier Factor (LOF): Measures the local density deviation of a data point
with respect to its neighbors.

6. Dimensionality Reduction

Dimensionality reduction techniques reduce the number of random variables under
consideration. Common algorithms include:

 Principal Component Analysis (PCA): Projects data to a lower-dimensional space
by identifying the principal components.
 t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces dimensions while
preserving the structure of data.
 Autoencoders: Neural network-based approach for learning a compressed
representation of data.

Creating a Comprehensive and Accurate Model


1. Understand the Problem: Clearly define the problem you are trying to solve and
understand the business context.
2. Data Collection and Integration: Gather data from multiple sources and integrate it
into a single, cohesive dataset.
3. Data Preprocessing: Clean and preprocess the data to handle missing values,
outliers, and inconsistencies. Transform and normalize the data as needed.
4. Exploratory Data Analysis (EDA): Use statistical and visualization tools to
understand the data distribution, identify patterns, and generate insights.
5. Feature Engineering: Create new features from the raw data that can improve the
model's performance. This may include creating interaction terms, polynomial
features, or domain-specific features.
6. Model Selection: Choose appropriate algorithms based on the problem type
(classification, regression, clustering, etc.) and the characteristics of the data.
7. Model Training: Train multiple models and use techniques like cross-validation to
evaluate their performance.
8. Model Evaluation: Use appropriate metrics to evaluate the model's performance. For
classification, metrics like accuracy, precision, recall, and F1-score are common. For
regression, metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute
Error), and R² (coefficient of determination) are used.
9. Model Tuning: Optimize hyperparameters using techniques like grid search or
random search to improve the model's performance.
10. Ensemble Methods: Combine multiple models to create an ensemble model that
often performs better than individual models.
11. Model Validation: Validate the model on a separate test set to ensure it generalizes
well to unseen data.
12. Deployment: Deploy the model to a production environment where it can be used to
make predictions on new data.
13. Monitoring and Maintenance: Continuously monitor the model's performance and
update it as necessary to handle new data and changing conditions.
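
Steps 7-10 above (training, evaluation, and tuning) can be sketched with scikit-learn's
cross-validated grid search; the dataset, the random forest model, and the parameter grid
are illustrative.

```python
# Cross-validated hyperparameter tuning sketch (steps 7-10 above).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)                       # model training + tuning

print("Best parameters:", search.best_params_)
print("Score on held-out test set:", search.score(X_test, y_test))   # validation
```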

text mining:

Text mining, also known as text data mining or text analytics, is a process used in data
mining to derive high-quality information from text. It involves transforming unstructured
text data into a structured format to identify patterns, trends, and insights. Here are some key
aspects:

1. Text Preprocessing: Cleaning and preparing text data by removing noise such as
punctuation, stop words, and converting text to lower case.
2. Tokenization: Splitting text into smaller units called tokens (e.g., words, phrases).
3. Stemming and Lemmatization: Reducing words to their base or root form to
standardize variations.
4. Feature Extraction: Converting text into numerical features using methods like TF-
IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
5. Text Classification: Categorizing text into predefined classes using algorithms like
Naive Bayes, SVM, or neural networks.
6. Sentiment Analysis: Determining the sentiment or emotional tone of text, such as
positive, negative, or neutral.
7. Topic Modeling: Identifying topics or themes within a large set of documents using
techniques like Latent Dirichlet Allocation (LDA).
8. Named Entity Recognition (NER): Extracting specific entities like names, dates, or
locations from text.

Text mining is widely used in various applications, including customer feedback analysis,
social media monitoring, document summarization, and more.
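
A short sketch of the preprocessing and feature-extraction steps (lowercasing, stop-word
removal, TF-IDF) using scikit-learn; the example sentences are invented.

```python
# Text preprocessing + TF-IDF feature extraction sketch.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The delivery was fast and the product works great.",
    "Terrible support, the product stopped working after a week.",
    "Great product, great support, fast delivery.",
]

# lowercase=True and stop_words='english' cover two of the cleaning steps above.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(docs)            # documents -> numerical features

print(vectorizer.get_feature_names_out())         # extracted vocabulary (recent scikit-learn)
print(tfidf.toarray().round(2))                   # TF-IDF matrix (docs x terms)
```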

Text Classification:

Text classification is a critical task in data mining that involves assigning predefined categories
or labels to text data. It can be used in a variety of applications such as spam detection,
sentiment analysis, topic categorization, and more. Here’s an overview of the process:

Steps in Text Classification:

1. Data Collection:
o Gather text data from various sources like emails, social media, customer
reviews, etc.
2. Text Preprocessing:
o Tokenization: Splitting the text into individual words or tokens.
o Lowercasing: Converting all characters to lower case to ensure uniformity.
o Stop Words Removal: Removing common words that do not contribute much
to the meaning (e.g., "the", "is").
o Stemming/Lemmatization: Reducing words to their root form (e.g.,
"running" to "run").
o Removing Punctuation and Special Characters: Cleaning the text to retain
only meaningful tokens.
3. Feature Extraction:
o Bag of Words (BoW): Representing text as a collection of words and their
frequencies.
o TF-IDF (Term Frequency-Inverse Document Frequency): A statistic that
reflects how important a word is to a document in a collection.
o Word Embeddings: Representing words in dense vectors that capture
semantic relationships (e.g., Word2Vec, GloVe).
4. Model Selection and Training:
o Choose an appropriate machine learning model. Common algorithms include:
 Naive Bayes: Simple and effective for text classification.
 Support Vector Machines (SVM): Effective for high-dimensional
spaces.
 Logistic Regression: Good for binary classification problems.
 Decision Trees and Random Forests: Useful for interpretability and
handling complex relationships.
 Neural Networks: Deep learning models such as Convolutional
Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for
more complex patterns and large datasets.
5. Model Evaluation:
o Split the data into training and testing sets to evaluate the model’s
performance.
o Use metrics like accuracy, precision, recall, and F1 score to assess the model.
6. Model Tuning:
o Optimize the model by tuning hyperparameters and improving the
preprocessing steps.
7. Deployment:
o Integrate the trained model into applications for real-time or batch processing.
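
The steps above can be chained into a single scikit-learn pipeline. The tiny spam/ham
dataset below is invented, and Multinomial Naive Bayes is used as an example classifier.

```python
# Text classification sketch: TF-IDF features + Multinomial Naive Bayes.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win a free prize now", "limited offer, claim your reward",
               "meeting moved to 3pm", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)                     # model training

print(clf.predict(["claim your free reward today"]))   # expected: ['spam']
print(clf.predict(["the report is attached"]))         # expected: ['ham']
```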

Applications of Text Classification:

 Spam Detection: Classifying emails as spam or not spam.


 Sentiment Analysis: Determining the sentiment (positive, negative, neutral) of
customer reviews or social media posts.
 Topic Categorization: Categorizing news articles into topics like sports, politics,
technology, etc.
 Language Detection: Identifying the language of a given piece of text.
 Intent Detection: Understanding the intent behind customer queries in chatbots and
virtual assistants.

Text classification is a powerful tool in data mining that enables organizations to extract
meaningful insights and automate the processing of large volumes of text data.

Web mining:

Web mining is a specific area of data mining focused on discovering patterns and extracting
useful information from web data, which includes web content, web structure, and web usage
data. Here's a breakdown of the main types of web mining:

1. Web Content Mining:


o Definition: The process of extracting useful information from the content of
web pages.
o Techniques:
 Text mining
 Image mining
 Video mining
 Audio mining
o Applications:
 Sentiment analysis
 Topic extraction
 Summarization
 Opinion mining
2. Web Structure Mining:
o Definition: The process of analyzing the structure of web hyperlinks to
identify patterns.
o Techniques:
 Link analysis
 Social network analysis
 Graph theory
o Applications:
 PageRank algorithm
 HITS algorithm
 Community detection
 Web graph clustering
3. Web Usage Mining:
o Definition: The process of analyzing user behavior through web server logs
and clickstreams.
o Techniques:
 Log analysis
 Session tracking
 Pattern recognition
o Applications:
 Personalization and recommendation systems
 User profiling
 Web analytics
 Conversion optimization

Applications of Web Mining:

 E-commerce: Personalizing shopping experiences and improving product
recommendations.
 Search Engines: Enhancing search algorithms and improving search results
relevance.
 Social Media: Analyzing user interactions and sentiments to drive marketing
strategies.
 Cybersecurity: Detecting unusual patterns to prevent fraud and cyber-attacks.
 Market Research: Gathering insights about market trends and consumer preferences.

Tools and Technologies:

 Programming Languages: Python, R, Java


 Libraries: Beautiful Soup, Scrapy, NLTK, TensorFlow, Scikit-learn
 Platforms: Google Analytics, IBM Watson, Apache Hadoop, Apache Spark
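
A minimal web content mining sketch with Beautiful Soup (listed among the libraries above);
the HTML snippet is inlined so no network access is needed, and the tags extracted are
illustrative.

```python
# Web content mining sketch: extract text and link structure with Beautiful Soup.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Product review</h1>
  <p>The camera quality is excellent, battery life is average.</p>
  <a href="https://example.com/specs">Specifications</a>
  <a href="https://example.com/reviews">More reviews</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())                          # page title text
print(soup.p.get_text())                           # review text for content mining
print([a["href"] for a in soup.find_all("a")])     # hyperlinks for structure mining
```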

Web mining is crucial for leveraging the vast amount of data available on the web to gain
actionable insights and support decision-making in various domains.

data mining software:

Data mining software tools are essential for extracting patterns, relationships, and insights
from large datasets. Here are some popular data mining software tools, along with their
features and applications:

1. RapidMiner

 Features:
o Visual workflow design
o Integrated data preparation, machine learning, deep learning, and text mining
o Support for various data sources (databases, spreadsheets, cloud storage)
o Extensive library of pre-built functions and algorithms
 Applications:
o Predictive analytics
o Customer segmentation
o Risk management
o Fraud detection

2. KNIME (Konstanz Information Miner)

 Features:
o Modular data pipelining concept
o Integration with various data sources and formats
o Extensive library of pre-built nodes for data preprocessing, transformation,
and analysis
o Support for machine learning and deep learning models
 Applications:
o Data cleaning and transformation
o Predictive modeling
o Text mining
o Bioinformatics

3. WEKA (Waikato Environment for Knowledge Analysis)

 Features:
o Collection of machine learning algorithms for data mining tasks
o Graphical user interfaces and command-line interface
o Support for data preprocessing, classification, regression, clustering, and
association rules
o Integration with various data sources and formats
 Applications:
o Educational purposes
o Research and development
o Prototyping and experimentation

4. SAS Enterprise Miner

 Features:
o Advanced analytics and predictive modeling
o Automated model building and validation
o Interactive data exploration and visualization
o Integration with other SAS products and solutions
 Applications:
o Marketing optimization
o Credit scoring
o Customer retention analysis
o Fraud detection

5. Oracle Data Mining (ODM)


 Features:
o Embedded in Oracle Database
o Advanced analytics and predictive modeling
o Automated data preparation and feature selection
o Integration with SQL and PL/SQL
 Applications:
o Customer relationship management (CRM)
o Financial analysis
o Market basket analysis
o Predictive maintenance

6. IBM SPSS Modeler

 Features:
o Visual data mining and predictive analytics
o Support for various data sources and formats
o Extensive library of pre-built functions and algorithms
o Integration with IBM Watson and other IBM products
 Applications:
o Customer analytics
o Operational optimization
o Risk management
o Fraud detection

7. Apache Mahout

 Features:
o Scalable machine learning algorithms
o Support for collaborative filtering, clustering, and classification
o Integration with Apache Hadoop and Apache Spark
o Flexible and extensible architecture
 Applications:
o Large-scale machine learning
o Recommender systems
o Data clustering and classification
o Text and web mining

8. R and Python

 Features:
o Extensive libraries and packages for data mining and machine learning
o Flexible and powerful scripting languages
o Integration with various data sources and formats
o Strong community support and documentation
 Applications:
o Academic research
o Data analysis and visualization
o Machine learning and predictive modeling
o Statistical analysis
9. Tableau

 Features:
o Interactive data visualization and exploration
o Support for various data sources and formats
o Drag-and-drop interface
o Integration with advanced analytics and machine learning models
 Applications:
o Business intelligence
o Data reporting and dashboarding
o Visual analytics
o Data storytelling

10. Microsoft SQL Server Analysis Services (SSAS)

 Features:
o Multidimensional and tabular data models
o Data mining algorithms and tools
o Integration with SQL Server and other Microsoft products
o Advanced analytics and predictive modeling
 Applications:
o Business intelligence
o Data warehousing
o Predictive analytics
o Performance management

These tools offer a range of features and capabilities to support various data mining tasks,
from data preprocessing and transformation to predictive modeling and visualization.
