Data mining and machine learning
Classification (decision trees)
● A decision tree is a supervised machine learning algorithm used for
classification and regression tasks.
● It represents decisions and their possible consequences in a tree-like
structure with nodes, branches, and leaves.
● The tree splits data into subsets based on feature values, guiding predictions
for categorical or numerical outcomes.
● Decision trees can also be expressed in logic as implication sentences.
● In principle, they can express any propositional logic sentence.
● Each row in the truth table of a sentence can be represented as a path in the
tree.
Constructing Decision Trees
● In general, constructing the smallest possible decision tree is an intractable
problem.
● However, algorithms exist for constructing reasonably small trees.
● Basic idea: test the most important attribute first, i.e., the attribute that
makes the most difference for the classification of an example.
● The most important attribute can be determined through information theory, as
illustrated in the sketch below.
● Hopefully, this yields the correct classification after only a few tests.
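To make the information-theoretic idea concrete, here is a minimal sketch in plain Python. The entropy and information_gain helpers and the toy "wait for a table" data are illustrative assumptions, not from the slides; the attribute with the highest gain would be tested first.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Reduction in entropy obtained by splitting on `attribute`."""
    base = entropy(labels)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Hypothetical toy data: will a customer wait for a table?
examples = [{"hungry": "yes", "raining": "no"},
            {"hungry": "no",  "raining": "no"},
            {"hungry": "yes", "raining": "yes"},
            {"hungry": "no",  "raining": "yes"}]
labels = ["wait", "leave", "wait", "leave"]

print(information_gain(examples, labels, "hungry"))   # 1.0 -> perfectly informative
print(information_gain(examples, labels, "raining"))  # 0.0 -> uninformative
```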
Decision Trees
● Decision Trees are a type of supervised learning algorithm used for
classification and regression tasks. They model decisions and their possible
consequences, including chance event outcomes, resource costs, and utility.
In a decision tree, each internal node represents a test on an attribute, each
branch represents the outcome of the test, and each leaf node represents a
class label (for classification) or a continuous value (for regression).
● Example applications: customer segmentation, medical diagnosis, etc.
Structure of Decision Trees
● Root Node: The top node of the tree, representing the entire dataset and the
initial decision to be made.
● Internal Nodes: Nodes that represent decisions or tests on attributes. These
nodes have one or more child nodes.
● Branches: Edges that connect nodes, representing the outcome of a test and
leading to the next node or decision.
● Leaf Nodes: Terminal nodes that represent the final classification or
regression outcome.
Decision Tree
● Decision Tree Algorithm
○ Steps:
■ Choose the best feature to split on
■ Split the dataset into subsets
■ Recursively repeat for each subset
● Splitting Criteria (compared in the sketch below):
○ Gini Index: Measures impurity.
○ Information Gain: Measures the reduction in entropy.
○ Chi-Square: Measures the statistical significance of a split.
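As a quick illustration of these criteria in practice, the sketch below fits one tree with the Gini index and one with entropy (information gain). It assumes scikit-learn is installed and uses its bundled Iris dataset; the dataset and depth limit are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for criterion in ("gini", "entropy"):  # the two impurity-based splitting criteria
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print(criterion, "test accuracy:", tree.score(X_test, y_test))
```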
Clustering
● Clustering is an unsupervised learning technique used to group similar data
points together based on their inherent patterns or features.
● Unlike classification, clustering does not require labeled data, making it useful
for exploratory data analysis.
● The goal is to maximize intra-cluster similarity (points within a cluster are
similar) and minimize inter-cluster similarity (points in different clusters are
dissimilar).
Cont.
● Cluster: A group of data points that are more similar to each other than to
points in other clusters.
● Centroid: The central point or mean of a cluster, often used to represent the
cluster.
● Distance Metrics: Methods to measure similarity or dissimilarity between data
points.
● There are multiple clustering methods; we will see the following in this
chapter:
○ K-Means
○ Fuzzy C-Means
○ Hierarchical Clustering
○ DBSCAN
K-means
● K-Means clustering is a partitioning method that divides a dataset into K
distinct, non-overlapping subsets (clusters) where each data point belongs to
the cluster with the nearest mean.
● The K-Means algorithm starts by initializing K centroids randomly. Each data
point is assigned to the nearest centroid, and the centroids are updated to be
the mean of the points in their cluster. This process repeats until the centroids
no longer change significantly.
Cont.
● Steps (sketched in code below):
○ Initialize K centroids randomly.
○ Assign each data point to the nearest centroid.
○ Update each centroid to the mean of the points assigned to it.
○ Repeat the assignment and update steps until the centroids no longer change
significantly.
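A minimal NumPy sketch of these steps. The k_means function, the convergence check, and the toy two-blob data are illustrative assumptions, not an optimized implementation.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize K centroids by picking K distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each data point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 2)), rng.standard_normal((50, 2)) + 5])
centroids, labels = k_means(X, k=2)
print(centroids)
```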
Fuzzy C-means
● Fuzzy C-Means is a soft clustering method: each data point belongs to every
cluster with a degree of membership, controlled by the fuzziness parameter 𝑚.
○ 𝑚 determines the level of cluster overlap.
○ Higher 𝑚 increases fuzziness, with data points shared more equally across clusters.
Cont.
● Initialization:
○ Randomly initialize the membership matrix 𝑈 with values between 0 and 1, ensuring each row sums to 1.
● Compute Centroids:
○ Update the centroid of each cluster as the membership-weighted mean of the data:
c_j = Σ_i (u_ij^m · x_i) / Σ_i u_ij^m.
● Update Memberships:
○ Recompute each membership from the distances to the centroids:
u_ij = 1 / Σ_k ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1)).
● Repeat:
○ Alternate the centroid and membership updates until the memberships stabilize
(see the sketch below).
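A compact NumPy sketch of this update loop. The fuzzy_c_means function, the choice m = 2, the tolerance, and the small constant guarding against division by zero are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iters=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: random membership matrix U, each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        Um = U ** m
        # Compute centroids: membership-weighted means of the data points.
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Update memberships from the distances to the centroids.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
        U_new = 1.0 / (dist ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        # Repeat until the memberships stabilize.
        if np.abs(U_new - U).max() < tol:
            return centroids, U_new
        U = U_new
    return centroids, U
```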
● Lift (association rule mining)
○ Measures the strength of an association compared to random chance.
Advanced supervised machine learning algorithms
● Supervised machine learning algorithms rely on labeled datasets to train
models and make predictions. Among advanced supervised learning
algorithms, Naive Bayes, k-Nearest Neighbors (k-NN), and Support Vector
Machines (SVM) are widely used due to their effectiveness and versatility in
various applications.
Support Vector Machines (SVM)
● Support Vector Machines (SVM) are supervised learning models used for
classification and regression analysis. They are particularly effective in
high-dimensional spaces and are known for their robustness in handling
non-linear data by using kernel methods.
● Example applications: image classification, text categorization, etc.
SVM for Classification
● Hyperplane: A decision boundary that separates classes.
● Support Vectors: Data points that are closest to the hyperplane and influence
its position.
SVM Kernel Trick
● Kernel Function: Transforms data into a higher-dimensional space to make it
linearly separable.
● Common Kernels: Linear, polynomial, radial basis function (RBF)
Soft Margin vs. Hard Margin
● Hard Margin SVM: A Hard Margin Support Vector Machine (SVM) assumes that the
data is perfectly linearly separable. It aims to find a hyperplane that not only
separates the data into two classes but does so with the maximum margin, meaning
the distance between the hyperplane and the nearest data points of each class (the
support vectors) is maximized.
○ Perfect Separation: The hard margin SVM requires that all data points are correctly classified, i.e.,
there is no overlap or misclassification.
○ Maximized Margin: It finds the hyperplane that maximizes the margin between the two classes,
ensuring the largest possible distance between the closest data points (support vectors) from each
class.
● Limitations:
○ Sensitivity to Outliers: Because it requires perfect separation, the hard margin SVM is highly sensitive
to outliers. A single misclassified or noisy data point can dramatically affect the hyperplane.
Soft Margin vs. Hard Margin
● Soft Margin SVM: A Soft Margin Support Vector Machine (SVM) allows for
some misclassifications to handle noisy data and achieve a better
generalization on unseen data. It introduces slack variables that permit some
data points to be within the margin or on the wrong side of the hyperplane.
○ Flexibility: Unlike hard margin SVM, soft margin SVM does not require the data to be perfectly
separable. It allows for a trade-off between maximizing the margin and minimizing
classification errors.
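The scikit-learn sketch below shows how the kernel and the soft-margin trade-off fit together, assuming the library is installed and using synthetic, non-linearly separable data: an RBF kernel handles the non-linear boundary, a small C gives a softer margin, and a very large C approximates a hard margin. The parameter values are illustrative only.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for C in (0.1, 1.0, 1000.0):  # a very large C approximates a hard margin
    clf = SVC(kernel="rbf", C=C, gamma="scale")
    clf.fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```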
Bayesian Regression
● Bayesian regression is a statistical method that applies Bayes' theorem to
update the probability estimates of the model parameters as more data
becomes available. It allows the incorporation of prior knowledge or beliefs
into the model, which can be particularly useful in supervised learning
contexts.
● In supervised learning, the goal is to learn a function that maps inputs
(features) to outputs (labels) based on a given dataset. Bayesian regression
fits this framework by updating our beliefs about the relationship between
features and labels as more data is observed.
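As a worked sketch of this idea, assume a Gaussian prior on the weights with precision alpha and Gaussian observation noise with precision beta, both fixed for simplicity (the toy data and values below are illustrative). The posterior over the weights is then available in closed form; its mean gives predictions and its covariance expresses the remaining uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data generated from y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=30)
t = 2 * x + 1 + 0.3 * rng.standard_normal(30)

Phi = np.column_stack([np.ones_like(x), x])    # design matrix with a bias column
alpha, beta = 1.0, 1.0 / 0.3**2                # prior precision, noise precision

# Gaussian prior x Gaussian likelihood -> Gaussian posterior over the weights:
#   S_N = (alpha*I + beta * Phi^T Phi)^-1,   m_N = beta * S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

print("posterior mean of [intercept, slope]:", m_N)
print("posterior covariance:\n", S_N)
```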
Bayes' Theorem
● Bayes' theorem updates the posterior belief about the model parameters θ given
the observed data D:
○ P(θ | D) = P(D | θ) · P(θ) / P(D), where P(θ) is the prior, P(D | θ) the
likelihood, and P(D) the evidence.
k-Nearest Neighbors (k-NN)
● k-NN classifies a query point using the labels of its closest training points
(see the sketch below):
1. Compute the distance from the query point to every training point.
2. Select the k training points with the smallest distances (the neighbors).
3. Assign the most common label (majority voting) among these neighbors
to the query point.
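A minimal NumPy sketch of these three steps. The knn_predict helper, the Euclidean distance metric, k = 3, and the toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # 1. Compute the distance from the query point to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # 2. Select the k closest training points.
    neighbor_idx = np.argsort(distances)[:k]
    # 3. Assign the most common label among these neighbors (majority vote).
    return Counter(y_train[neighbor_idx]).most_common(1)[0][0]

# Hypothetical toy data with two clearly separated classes.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> "A"
```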
Intro to ensemble learning algorithms
● Multiple hypotheses (an ensemble) are generated, and their predictions are
combined.
● By using multiple hypotheses, the likelihood of misclassification is hopefully
lower.
● An ensemble also enlarges the hypothesis space.
● Boosting is a frequently used ensemble method (see the sketch below):
○ Each example in the training set has an associated weight.
○ The weights of incorrectly classified examples are increased, and a new
hypothesis is generated from this reweighted training set.
○ The final hypothesis is a weighted-majority combination of all the generated
hypotheses.
Random Forest
● Random Forest is an ensemble learning method that combines multiple
decision trees to improve accuracy and reduce overfitting.
● Each tree in the forest is built using a random subset of the data and features,
and the final prediction is made by averaging (for regression) or majority
voting (for classification) the predictions from all the individual trees.
Cont.
Random Forest builds a large number of decision trees and merges their results.
Here’s how it works in detail:
1. Bootstrapping (Random Sampling):
● The algorithm generates multiple subsets of the data by sampling with
replacement (bootstrap sampling).
● Each subset is used to grow a decision tree, meaning that each tree sees a
slightly different dataset.
2. Feature Randomness (Random Feature Selection):
● For each decision tree node, a random subset of features is chosen. This
adds more diversity to the trees and prevents overfitting by ensuring trees
do not rely too heavily on any single feature.
Cont.
3. Decision Tree Creation:
● Each decision tree is grown independently, and typically, the tree is
grown without pruning (allowing it to grow to its maximum depth).
4. Voting/Averaging:
● In classification, each tree in the forest produces a class label, and the
class that receives the most votes is the final prediction.
● In regression, the prediction is the average of the individual tree outputs.
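The sketch below mirrors steps 1 to 4 directly, assuming scikit-learn for the individual trees and the bundled Iris data: bootstrap samples, random feature selection at each split via max_features, independently grown unpruned trees, and a majority vote. The number of trees is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees = 25
trees = []

for i in range(n_trees):
    # 1. Bootstrapping: sample the training set with replacement.
    idx = rng.choice(len(X), size=len(X), replace=True)
    # 2.-3. Random feature subset at each split; tree grown unpruned (no max_depth).
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 4. Voting: each tree predicts a class label; the majority wins.
all_preds = np.array([t.predict(X) for t in trees])       # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("training accuracy of the ensemble:", (majority == y).mean())
```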
Key Concepts in Random Forest
1. Out-of-Bag (OOB) Error:
● Since each tree is trained on a bootstrap sample of the data, some
samples will not be included in the training set of a tree. These samples
are called out-of-bag. The OOB error estimate is the average error on
these OOB samples across all trees.
2. Feature Importance:
● Random Forest can provide insights into the importance of each feature
in making predictions. Features that lead to better splits across many
trees are considered more important.
Cont.
3. Overfitting Control:
● Although individual decision trees tend to overfit, Random Forest
mitigates this problem by averaging over many trees and limiting the
growth of each tree through random sampling.
4. Bagging:
● Random Forest is an example of the Bagging ensemble technique.
Bagging reduces variance by averaging over multiple models.
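With scikit-learn's RandomForestClassifier (assumed available; the dataset and number of trees below are illustrative), the OOB error estimate and feature importances described above are exposed directly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)       # 1 - OOB error
# Impurity-based importance of each feature, averaged over all trees.
print("largest feature importance:", forest.feature_importances_.max())
```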
XGBoost Algorithm
● XGBoost (Extreme Gradient Boosting) is an advanced implementation of
gradient boosting designed for speed and performance. It combines the
predictions of several base models to produce a powerful ensemble model.
● XGBoost builds trees sequentially, where each new tree aims to correct the
errors made by the previous trees. It uses a loss function to measure the
errors and applies gradient descent to minimize this loss. The algorithm also
includes regularization terms to prevent overfitting and improve
generalization.
XGBoost Algorithm
● What is Boosting?
○ Boosting is a powerful ensemble technique in machine learning that combines multiple weak learners to
create a strong learner. The idea behind boosting is to sequentially apply the weak learning algorithm to
repeatedly modified versions of the data, focusing more on the examples that previous models
misclassified.
● Weak Learners:
○ These are models that perform slightly better than random guessing. Typically, simple models like
decision stumps (single-level decision trees) are used as weak learners.
● Sequential Learning:
○ Boosting builds models sequentially. Each new model is trained to correct the errors made by the
previous models.
● Weight Adjustment:
○ Boosting adjusts the weights of the training data points. Initially, all data points are weighted equally. In
subsequent rounds, data points that were misclassified by the previous models are given higher
weights, so the new model focuses more on them.
XGBoost Algorithm
1. Initialize with a Base Model:
○ Start with an initial prediction model, often a simple one like the mean value of the target variable for regression problems.
2. Compute the Residuals (Errors) of the Model:
○ Calculate the difference between the actual target values and the predictions made by the current model. These
differences (residuals) represent the errors the model needs to correct.
3. Fit a New Tree to the Residuals:
○ Train a new decision tree (weak learner) on the residuals. This tree's purpose is to predict the residuals (errors) from the
previous model.
4. Update the Model by Adding the New Tree:
○ Add the predictions of the new tree to the original model's predictions. This updated model should now make better
predictions as it corrects the errors from the previous step.
5. Repeat Steps 2-4 Until a Stopping Criterion is Met:
○ Continue iterating through steps 2 to 4, adding more trees to the ensemble, until a stopping criterion is reached. Stopping
criteria can include:
■ A predefined number of trees.
■ A minimum improvement in performance.
■ A maximum number of iterations.
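A stripped-down sketch of steps 1 to 5, assuming squared-error loss (so the negative gradient is simply the residual) and using scikit-learn regression trees as the weak learners; real XGBoost adds second-order gradients, regularization, and many engineering optimizations. The data and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

learning_rate, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean())   # 1. Initialize with a base model (the mean).
trees = []

for _ in range(n_rounds):          # 5. Repeat until the stopping criterion (fixed number of trees).
    residuals = y - pred           # 2. Compute the residuals of the current model.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)         # 3. Fit a new tree to the residuals.
    pred += learning_rate * tree.predict(X)   # 4. Update the model by adding the new tree.
    trees.append(tree)

print("training MSE:", np.mean((y - pred) ** 2))
```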
XGBoost Algorithm
● Advantages
○ High Performance:
■ XGBoost is optimized for speed and performance, handling large datasets efficiently.
○ Flexibility:
■ Can be used for both regression and classification problems.
○ Regularization:
■ Includes regularization parameters to prevent overfitting.
● Disadvantages
○ Complexity:
■ Can be complex to tune due to many hyperparameters.
○ Computationally Intensive:
■ Requires significant computational resources for very large datasets.
XGBoost Algorithm
● XGBoost iteratively improves model performance by focusing on correcting
errors from previous iterations. Its robust framework, combined with
regularization and speed optimizations, makes it a powerful tool in machine
learning. However, it requires careful tuning and substantial computational
resources, especially for large datasets.
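For reference, a typical usage sketch with the xgboost Python package, assuming it is installed; the synthetic data and parameter values are illustrative starting points, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic regression data stands in for a real dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=300,    # number of boosting rounds (trees added sequentially)
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=4,         # depth of each weak learner
    reg_lambda=1.0,      # L2 regularization term to curb overfitting
)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```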