
Data mining and machine learning
Classification (decision trees)
● A decision tree is a supervised machine learning algorithm used for
classification and regression tasks.
● It represents decisions and their possible consequences in a tree-like
structure with nodes, branches, and leaves.
● The tree splits data into subsets based on feature values, guiding predictions
for categorical or numerical outcomes.
● Decision trees can also be expressed in logic as implication sentences.
● In principle, they can represent any propositional logic sentence.
● Each row in the truth table of a sentence can be represented as a path in the tree.
Cont.
Constructing Decision Trees
● In general, constructing the smallest possible decision tree is an intractable problem.
● However, algorithms exist for constructing reasonably small trees.
● Basic idea: test the most important attribute first, i.e., the attribute that makes the most difference for the classification of an example.
● This attribute can be determined through information theory (e.g., information gain).
● Testing it first will hopefully yield the correct classification after few tests.
Decision Trees
● Decision Trees are a type of supervised learning algorithm used for
classification and regression tasks. They model decisions and their possible
consequences, including chance event outcomes, resource costs, and utility.
In a decision tree, each internal node represents a test on an attribute, each
branch represents the outcome of the test, and each leaf node represents a
class label (for classification) or a continuous value (for regression).
● Typical applications include customer segmentation, medical diagnosis, etc.
Structure of Decision Trees
● Root Node: The top node of the tree, representing the entire dataset and the
initial decision to be made.
● Internal Nodes: Nodes that represent decisions or tests on attributes. These
nodes have one or more child nodes.
● Branches: Edges that connect nodes, representing the outcome of a test and
leading to the next node or decision.
● Leaf Nodes: Terminal nodes that represent the final classification or
regression outcome.
Decision Tree
● Decision Tree Algorithm
○ Steps:
■ Choose the best feature to split on
■ Split the dataset into subsets
■ Recursively repeat for each subset
● Splitting Criteria
○ Gini Index: Measures impurity
○ Information Gain: Measures the reduction in entropy
○ Chi-Square: Measures statistical significance
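As a concrete illustration of these steps, the sketch below fits a small decision tree with scikit-learn (assumed to be installed), using the Gini index as the splitting criterion; the Iris dataset, depth limit, and other settings are illustrative choices, not part of the original slides.

```python
# Minimal decision-tree sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" splits by the Gini index; criterion="entropy" uses information gain.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # one test per internal node, one class per leaf
```

Switching criterion="gini" to criterion="entropy" corresponds to splitting by information gain instead of the Gini index.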
Clustering
● Clustering is an unsupervised learning technique used to group similar data
points together based on their inherent patterns or features.
● Unlike classification, clustering does not require labeled data, making it useful
for exploratory data analysis.
● The goal is to maximize intra-cluster similarity (points within a cluster are
similar) and minimize inter-cluster similarity (points in different clusters are
dissimilar).
Cont.
● Cluster: A group of data points that are more similar to each other than to
points in other clusters.
● Centroid: The central point or mean of a cluster, often used to represent the
cluster.
● Distance Metrics: Methods to measure similarity or dissimilarity between data
points.
● There are multiple clustering methods; in this chapter we will look at the following:
○ K-means
○ Fuzzy C-means
○ Hierarchical Clustering
○ DBSCAN
K-means
● K-Means clustering is a partitioning method that divides a dataset into K
distinct, non-overlapping subsets (clusters) where each data point belongs to
the cluster with the nearest mean.
● The K-Means algorithm starts by initializing K centroids randomly. Each data
point is assigned to the nearest centroid, and the centroids are updated to be
the mean of the points in their cluster. This process repeats until the centroids
no longer change significantly.
Cont.
● Steps (a code sketch follows these steps):
a. Initialize K Centroids Randomly:
■ Start by choosing K initial centroids randomly from the dataset. These centroids serve as the initial guess for the cluster centers.
b. Assign Each Data Point to the Nearest Centroid:
■ For each data point, compute the distance to each centroid
and assign the data point to the cluster whose centroid is
closest. Typically, the Euclidean distance is used, but other
distance measures can be applied depending on the
context.
c. Update Centroids:
■ After all data points have been assigned to clusters,
recompute the centroids as the mean of all data points in
each cluster. This step adjusts the centroids to better
represent the data points in their respective clusters.
d. Repeat Steps B and C Until Convergence:
■ Repeat the assignment and update steps until the centroids
no longer change significantly, indicating that the algorithm
has converged. Convergence can also be determined by
setting a maximum number of iterations or a threshold for
centroid movement.
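The steps above translate almost directly into code. Below is a minimal NumPy sketch of the K-Means loop; the tolerance, iteration cap, seed, and toy data are illustrative assumptions, and in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # (a) Initialize K centroids by picking K distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (b) Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (c) Recompute each centroid as the mean of the points assigned to it
        #     (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # (d) Stop when the centroids no longer move significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

# Example: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```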
Cont.
● Example:
a. Application: Segmenting customers based on purchasing behavior.
■ If we have data on customer purchases, K-Means can group customers into clusters
based on similarities in their purchase patterns.
● Advantages:
a. Simple and easy to implement.
b. Scalable to large datasets.
● Disadvantages:
a. Requires specifying the number of clusters (K) in advance.
b. Sensitive to initial centroid placement.
Hierarchical Clustering
● Hierarchical clustering builds a hierarchy of clusters by either merging small clusters into
larger ones (agglomerative) or splitting large clusters into smaller ones (divisive).
● The agglomerative approach starts with each data point as a single cluster and iteratively
merges the closest pairs of clusters until only one cluster remains or a specified number of
clusters is reached. The divisive approach starts with all data points in one cluster and splits
the least similar clusters recursively.
● Types:
○ Agglomerative Clustering: Bottom-up approach.
■ Start with each data point as a separate cluster.
■ Iteratively merge the closest pairs of clusters until only one cluster remains or a specified number of
clusters is reached.
○ Divisive Clustering: Top-down approach.
■ Start with all data points in a single cluster.
■ Recursively split the least similar clusters until each data point is in its own cluster or a specified number
of clusters is achieved
Steps for Agglomerative Clustering:
● Start with each data point as a single cluster.
● Merge the closest pair of clusters.
● Repeat until all points are in a single cluster or a predefined number of
clusters is achieved.
● Example:
○ Application: Hierarchical clustering can be used in bioinformatics to group genes with similar
expression patterns.
Hierarchical Clustering
Example
● Application: Bioinformatics for Grouping Genes with Similar Expression Patterns
Hierarchical clustering can be used to group genes that exhibit similar expression patterns. This
can help in identifying genes that are co-expressed and potentially involved in the same biological
pathways.
● Collect Data:
○ Obtain gene expression data for a set of genes across different conditions or time points.
● Start with Each Gene as a Separate Cluster:
○ Treat each gene as its own cluster.
● Merge the Closest Clusters:
○ Use a distance measure (e.g., Euclidean distance) to determine the similarity between clusters and merge the closest
pairs.
● Build a Dendrogram:
○ Create a dendrogram to visualize the merging process and the hierarchy of clusters.
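A minimal sketch of this workflow using SciPy's hierarchical-clustering utilities (assumed to be installed); the gene-expression matrix is synthetic and purely illustrative.

```python
# Agglomerative clustering sketch using SciPy (assumed installed).
# The "gene expression" matrix below is synthetic, purely for illustration.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 6))          # 20 genes x 6 conditions

# Bottom-up merging: every gene starts as its own cluster and the closest
# pairs are merged; Ward linkage merges the pair that least increases variance.
Z = linkage(expression, method="ward", metric="euclidean")

# Cut the hierarchy into, e.g., 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the merge hierarchy (needs matplotlib for display).
```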
Hierarchical Clustering
● Advantages
○ No Need to Specify the Number of Clusters in Advance:
■ The algorithm does not require the number of clusters to be defined beforehand.
○ Produces a Dendrogram:
■ A dendrogram provides a visual representation of the clustering process and the
relationships between clusters.
● Disadvantages
○ Computationally Intensive:
■ Hierarchical clustering can be computationally expensive, especially for large datasets.
○ Sensitive to Noise and Outliers:
■ The presence of noise and outliers can significantly affect the clustering results.
Fuzzy C-Means
● Fuzzy C-Means (FCM) is a soft clustering algorithm where each data point
can belong to multiple clusters with varying degrees of membership.
● Unlike hard clustering methods (e.g., K-Means), where a data point belongs
exclusively to one cluster, FCM assigns membership scores to each point for
all clusters based on proximity to cluster centroids.
Cont.
● Membership Matrix:
○ A matrix that represents the degree of membership of each data point to every cluster.
○ Membership values range from 0 to 1, with the sum of memberships for a data point across
all clusters equal to 1.
● Centroids:
○ Represent the center of a cluster, computed as a weighted average of data points, using
membership values as weights.
● Objective Function:
○ The algorithm minimizes a cost function that balances cluster compactness and membership distribution:
J_m = Σ_{i=1..N} Σ_{j=1..C} (u_ij)^m · ||x_i − c_j||²
○ Here u_ij is the membership of point x_i in cluster j, c_j is the centroid of cluster j, and m > 1 is the fuzzification parameter.
Cont.

● Fuzzification Parameter (m):
○ Determines the level of cluster overlap.
○ Higher m increases fuzziness, with data points shared more equally across clusters.
Cont.
● Initialization:
○ Randomly initialize the membership matrix 𝑈 with values between 0 and 1, ensuring rows sum to 1.
● Compute Centroids:
○ Update the centroid of each cluster as the membership-weighted mean:
c_j = Σ_i (u_ij)^m · x_i / Σ_i (u_ij)^m
● Update Membership Matrix:
○ Update the membership of each point using the inverse-distance rule:
u_ij = 1 / Σ_{k=1..C} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1))
● Repeat:
○ Iterate steps 2 and 3 until the membership matrix stabilizes or changes fall below a defined
threshold.
● Output:
○ Final centroids and membership matrix.
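The update loop described above can be sketched in a few lines of NumPy; the fuzzifier m = 2, tolerance, iteration cap, and toy data are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: random membership matrix U with rows summing to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Centroids: membership-weighted mean of the data points.
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.linalg.norm(U_new - U) < tol:    # stop when memberships stabilize
            return centroids, U_new
        U = U_new
    return centroids, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, U = fuzzy_c_means(X, c=2)
print(centroids)     # final cluster centers
print(U[:3])         # soft memberships of the first three points
```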
DBSCAN
● DBSCAN is a density-based clustering algorithm that groups data points based
on their proximity and density. Unlike partition-based algorithms like K-Means,
DBSCAN identifies clusters as areas of high data density separated by areas of
low density. It also handles noise by labeling sparse regions as outliers.
● Density-Based: Clusters are formed in regions of high point density.
● Noise Handling: Points in low-density areas are categorized as noise or outliers.
● Non-Parametric: Does not require specifying the number of clusters beforehand.
● Arbitrary Shape Clusters: Can detect clusters of varying shapes and sizes, unlike
algorithms that assume spherical clusters.
Cont.
● Initialization:
○ Select an arbitrary starting point.
● Density Check:
○ Calculate the number of points within the ε-radius neighborhood of the selected point.
○ If the point is a core point (its neighborhood contains at least minPts points), a new cluster is started.
○ If not, the point is labeled as noise (temporarily).
● Cluster Expansion:
○ For each core point, include all its directly reachable points (points within 𝜖).
○ Expand the cluster by recursively visiting neighbors of neighbors.
● Repeat:
○ Continue until all points have been visited and assigned to a cluster or labeled as noise.
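A minimal DBSCAN sketch with scikit-learn (assumed installed); the ε (eps) and minPts (min_samples) values, like the synthetic data, are illustrative and must be tuned for real datasets.

```python
# DBSCAN sketch using scikit-learn (assumed installed).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),   # dense cluster
    rng.normal(loc=4.0, scale=0.3, size=(100, 2)),   # second dense cluster
    rng.uniform(low=-2, high=6, size=(10, 2)),       # sparse noise points
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", int((labels == -1).sum()))  # label -1 marks noise
```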
Cont.
DBSCAN is a powerful clustering algorithm for identifying clusters of varying
shapes and sizes while handling noise and outliers. Its ability to work without
predefining the number of clusters makes it a versatile tool in unsupervised
learning tasks. However, careful tuning of parameters is essential for its success.
Association rules
● Association Rules are a set of if-then rules that help uncover relationships,
patterns, or correlations between variables in large datasets. These rules are
commonly used in market basket analysis to understand purchasing behavior.
● It is a powerful tool for uncovering hidden patterns in data. By leveraging
metrics like support, confidence, and lift, data scientists can identify
meaningful relationships that drive decision-making across various domains,
from business to healthcare.
Cont.
● Itemsets
○ A collection of one or more items.
○ Example: In a supermarket transaction, an itemset might be {milk, bread}.
● Support
○ Measures the frequency of an itemset in the dataset:
Support(A) = (number of transactions containing A) / (total number of transactions)
Cont.
● Confidence
○ Measures how often the rule holds, i.e., the likelihood that the consequent appears in transactions that contain the antecedent:
Confidence(A → B) = Support(A ∪ B) / Support(A)
● Lift
○ Measures the strength of an association compared to random chance:
Lift(A → B) = Confidence(A → B) / Support(B)
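The three metrics can be computed directly from a list of transactions, as in the toy sketch below; the transactions are invented for illustration, and libraries such as mlxtend provide Apriori-based rule mining for large datasets.

```python
# Toy market-basket example: compute support, confidence, and lift by hand.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain the whole itemset."""
    return sum(itemset <= t for t in transactions) / N

antecedent, consequent = {"milk"}, {"bread"}
supp_both = support(antecedent | consequent)
confidence = supp_both / support(antecedent)   # P(bread | milk)
lift = confidence / support(consequent)        # > 1 suggests a positive association

print(f"support={supp_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```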
Advanced supervised machine learning algorithms
● Supervised machine learning algorithms rely on labeled datasets to train
models and make predictions. Among advanced supervised learning
algorithms, Naive Bayes, k-Nearest Neighbors (k-NN), and Support Vector
Machines (SVM) are widely used due to their effectiveness and versatility in
various applications.
Support Vector Machines (SVM)
● Support Vector Machines (SVM) are supervised learning models used for
classification and regression analysis. They are particularly effective in high-
dimensional spaces and are known for their robustness in handling non-linear
data by using kernel methods.
● Typical applications include image classification, text categorization, etc.
SVM for Classification
● Hyperplane: A decision boundary that separates classes.
● Support Vectors: Data points that are closest to the hyperplane and influence
its position.
SVM Kernel Trick
● Kernel Function: Transforms data into a higher-dimensional space to make it
linearly separable.
● Common Kernels: Linear, polynomial, radial basis function (RBF)
Soft Margin vs. Hard Margin
● Hard Margin SVM: A Hard Margin Support Vector Machine (SVM) assumes that the
data is perfectly linearly separable. It aims to find a hyperplane that not only
separates the data into two classes but does so with the maximum margin, meaning
the distance between the hyperplane and the nearest data points of each class (the
support vectors) is maximized.
○ Perfect Separation: The hard margin SVM requires that all data points are correctly classified, i.e.,
there is no overlap or misclassification.
○ Maximized Margin: It finds the hyperplane that maximizes the margin between the two classes,
ensuring the largest possible distance between the closest data points (support vectors) from each
class.
● Limitations:
○ Sensitivity to Outliers: Because it requires perfect separation, the hard margin SVM is highly sensitive
to outliers. A single misclassified or noisy data point can dramatically affect the hyperplane.
Soft Margin vs. Hard Margin
● Soft Margin SVM: A Soft Margin Support Vector Machine (SVM) allows for
some misclassifications to handle noisy data and achieve a better
generalization on unseen data. It introduces slack variables that permit some
data points to be within the margin or on the wrong side of the hyperplane.
○ Flexibility: Unlike hard margin SVM, soft margin SVM does not require the data to be perfectly
separable. It allows for a trade-off between maximizing the margin and minimizing
classification errors.
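In scikit-learn's SVC (assumed installed), the C parameter controls this trade-off: a large C approximates a hard margin, while a small C yields a softer, more tolerant margin. The sketch below is illustrative; the dataset and the values of C are arbitrary choices.

```python
# Soft-margin SVM sketch with scikit-learn (assumed installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the soft margin: large C tolerates few misclassifications
# (approaching a hard margin), small C allows a wider, more tolerant margin.
# kernel="rbf" applies the kernel trick for non-linearly separable data.
for C in (0.1, 1.0, 100.0):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    model.fit(X_train, y_train)
    print(f"C={C:<6} test accuracy={model.score(X_test, y_test):.3f}")
```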
Bayesian Regression
● Bayesian regression is a statistical method that applies Bayes' theorem to
update the probability estimates of the model parameters as more data
becomes available. It allows the incorporation of prior knowledge or beliefs
into the model, which can be particularly useful in supervised learning
contexts.
● In supervised learning, the goal is to learn a function that maps inputs
(features) to outputs (labels) based on a given dataset. Bayesian regression
fits this framework by updating our beliefs about the relationship between
features and labels as more data is observed.
Bayes' Theorem
P(A|B) = P(B|A) · P(A) / P(B)
● P(A|B): Posterior probability
● P(B|A): Likelihood
● P(A): Prior probability
● P(B): Evidence
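One common realization of Bayesian regression is scikit-learn's BayesianRidge, which places Gaussian priors on the weights and returns predictive uncertainty alongside the mean; the synthetic data below is purely illustrative.

```python
# Bayesian linear regression sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=100)   # noisy linear target

model = BayesianRidge()          # Gaussian priors on the regression weights
model.fit(X, y)

# Posterior predictive mean and standard deviation for new inputs:
X_new = np.array([[0.0], [2.0]])
mean, std = model.predict(X_new, return_std=True)
print("coef:", model.coef_, "predictions:", mean, "uncertainty:", std)
```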
k-Nearest Neighbors
● k-NN is a non-parametric, instance-based learning algorithm that classifies
data points based on the majority label of their nearest neighbors in feature
space.
● Distance Metrics: Measure similarity using metrics like Euclidean, Manhattan, or Minkowski distances.
● Value of k: Determines how many neighbors are considered for classification.
○ Small k: Sensitive to noise.
○ Large k: May smooth out patterns.
Cont.
Algorithm steps:
1. Compute the distance between the query point and all training points.
2. Identify the k nearest neighbors.
3. Assign the most common label (majority voting) among these neighbors to the query point.
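These three steps map directly onto a few lines of NumPy, as in the sketch below; k = 3, the Euclidean metric, and the toy dataset are illustrative choices (in practice scikit-learn's KNeighborsClassifier would typically be used).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Compute the distance from the query point to all training points.
    dists = np.linalg.norm(X_train - query, axis=1)      # Euclidean distance
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two labeled groups in 2-D feature space.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # -> "A"
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # -> "B"
```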
Intro to ensemble learning algorithms
● Multiple hypotheses (an ensemble) are generated, and their predictions are combined.
● By using multiple hypotheses, the likelihood of misclassification is hopefully lower.
● Ensembles also enlarge the hypothesis space.
● Boosting is a frequently used ensemble method (see the sketch below).
● Each example in the training set has an associated weight.
● The weights of incorrectly classified examples are increased, and a new hypothesis is generated from this new weighted training set.
● The final hypothesis is a weighted-majority combination of all the generated hypotheses.
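The weight-adjustment scheme described above is what AdaBoost implements. The sketch below uses scikit-learn's AdaBoostClassifier (assumed installed), whose default weak learner is a depth-1 decision stump; the dataset and number of boosting rounds are illustrative.

```python
# Boosting sketch with AdaBoost (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round re-weights the examples the previous stumps misclassified; the
# final prediction is a weighted-majority vote over all 50 stumps.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print("test accuracy:", boosted.score(X_test, y_test))
```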
Random Forest
● Random Forest is an ensemble learning method that combines multiple
decision trees to improve accuracy and reduce overfitting.
● Each tree in the forest is built using a random subset of the data and features,
and the final prediction is made by averaging (for regression) or majority
voting (for classification) the predictions from all the individual trees.
Cont.
Random Forest builds a large number of decision trees and merges their results.
Here’s how it works in detail:
1. Bootstrapping (Random Sampling):
● The algorithm generates multiple subsets of the data by sampling with
replacement (bootstrap sampling).
● Each subset is used to grow a decision tree, meaning that each tree sees a
slightly different dataset.
2. Feature Randomness (Random Feature Selection):
● For each decision tree node, a random subset of features is chosen. This
adds more diversity to the trees and prevents overfitting by ensuring trees
do not rely too heavily on any single feature.
Cont.
3. Decision Tree Creation:
● Each decision tree is grown independently, and typically, the tree is
grown without pruning (allowing it to grow to its maximum depth).
4. Voting/Averaging:
● In classification, each tree in the forest produces a class label, and the
class that receives the most votes is the final prediction.
● In regression, the prediction is the average of the individual tree outputs.
Key Concepts in Random Forest
1. Out-of-Bag (OOB) Error:
● Since each tree is trained on a bootstrap sample of the data, some
samples will not be included in the training set of a tree. These samples
are called out-of-bag. The OOB error estimate is the average error on
these OOB samples across all trees.
2. Feature Importance:
● Random Forest can provide insights into the importance of each feature
in making predictions. Features that lead to better splits across many
trees are considered more important.
Cont.
3. Overfitting Control:
● Although individual decision trees tend to overfit, Random Forest
mitigates this problem by averaging over many trees and limiting the
growth of each tree through random sampling.
4. Bagging:
● Random Forest is an example of the Bagging ensemble technique.
Bagging reduces variance by averaging over multiple models.
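A minimal Random Forest sketch with scikit-learn (assumed installed), showing bootstrap sampling, random feature selection, the OOB error estimate, and feature importances discussed above; the dataset and hyperparameters are illustrative.

```python
# Random Forest sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree is grown on a bootstrap sample and considers a random subset of
# features ("sqrt" of the total) at every split; oob_score=True estimates
# generalization accuracy from the out-of-bag samples.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", oob_score=True, random_state=0
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("largest feature importance:", forest.feature_importances_.max())
```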
XGBoost Algorithm
● XGBoost (Extreme Gradient Boosting) is an advanced implementation of
gradient boosting designed for speed and performance. It combines the
predictions of several base models to produce a powerful ensemble model.
● XGBoost builds trees sequentially, where each new tree aims to correct the
errors made by the previous trees. It uses a loss function to measure the
errors and applies gradient descent to minimize this loss. The algorithm also
includes regularization terms to prevent overfitting and improve
generalization.
XGBoost Algorithm
● What is Boosting?
○ Boosting is a powerful ensemble technique in machine learning that combines multiple weak learners to
create a strong learner. The idea behind boosting is to sequentially apply the weak learning algorithm to
repeatedly modified versions of the data, focusing more on the examples that previous models
misclassified.
● Weak Learners:
○ These are models that perform slightly better than random guessing. Typically, simple models like
decision stumps (single-level decision trees) are used as weak learners.
● Sequential Learning:
○ Boosting builds models sequentially. Each new model is trained to correct the errors made by the
previous models.
● Weight Adjustment:
○ Boosting adjusts the weights of the training data points. Initially, all data points are weighted equally. In
subsequent rounds, data points that were misclassified by the previous models are given higher
weights, so the new model focuses more on them.
XGBoost Algorithm
1. Initialize with a Base Model:
○ Start with an initial prediction model, often a simple one like the mean value of the target variable for regression problems.
2. Compute the Residuals (Errors) of the Model:
○ Calculate the difference between the actual target values and the predictions made by the current model. These
differences (residuals) represent the errors the model needs to correct.
3. Fit a New Tree to the Residuals:
○ Train a new decision tree (weak learner) on the residuals. This tree's purpose is to predict the residuals (errors) from the
previous model.
4. Update the Model by Adding the New Tree:
○ Add the predictions of the new tree to the original model's predictions. This updated model should now make better
predictions as it corrects the errors from the previous step.
5. Repeat Steps 2-4 Until a Stopping Criterion is Met:
○ Continue iterating through steps 2 to 4, adding more trees to the ensemble, until a stopping criterion is reached. Stopping
criteria can include:
■ A predefined number of trees.
■ A minimum improvement in performance.
■ A maximum number of iterations.
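A minimal sketch of this loop using the xgboost package (assumed to be installed); the dataset and hyperparameter values are illustrative only.

```python
# Gradient-boosting sketch with the xgboost package (assumed installed).
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,     # stopping criterion: a predefined number of trees
    learning_rate=0.1,    # shrinks each new tree's contribution
    max_depth=4,          # depth of each weak learner
    reg_lambda=1.0,       # L2 regularization term against overfitting
)
model.fit(X_train, y_train)   # trees are added sequentially to fit residual errors
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```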
XGBoost Algorithm
● Advantages
○ High Performance:
■ XGBoost is optimized for speed and performance, handling large datasets efficiently.
○ Flexibility:
■ Can be used for both regression and classification problems.
○ Regularization:
■ Includes regularization parameters to prevent overfitting.
● Disadvantages
○ Complexity:
■ Can be complex to tune due to many hyperparameters.
○ Computationally Intensive:
■ Requires significant computational resources for very large datasets.
XGBoost Algorithm
● XGBoost iteratively improves model performance by focusing on correcting
errors from previous iterations. Its robust framework, combined with
regularization and speed optimizations, makes it a powerful tool in machine
learning. However, it requires careful tuning and substantial computational
resources, especially for large datasets.
