Python Theory Notes

The document outlines various machine learning techniques including regression, classification, clustering, and others, along with their applications. It also distinguishes between AI, ML, and deep learning, and discusses Python libraries useful for machine learning tasks. Additionally, it covers supervised and unsupervised learning, model evaluation methods, and the importance of training and testing accuracy.


Techniques in Machine Learning

• Regression: Predict continuous numerical outputs (e.g., house prices, temperature forecasts).

• Classification: Categorize data (e.g., spam email detection).

• Clustering: Group data points (e.g., customer segmentation in marketing).

• Other Techniques:

Associations: Discover relationships between variables (e.g., market basket analysis).

Anomaly Detection: Identify unusual patterns (e.g., fraud detection).

Sequence Mining: Find patterns in sequences (e.g., DNA sequence analysis).

Dimension Reduction: Simplify data while retaining key information (e.g., Principal Component Analysis).

Recommendation Systems: Suggest items (e.g., movie recommendations on Netflix).

Key Differences: AI, ML, Deep Learning

• Artificial Intelligence (AI): Broad field, includes computer vision, language processing, creativity.

• Machine Learning (ML): Algorithms like regression, classification, neural networks.

• Deep Learning: Subset of ML, involves multi-layered neural networks for complex
tasks like image recognition.

Example:

• AI: A robot planning a route to a destination.

• ML: Predicting traffic based on historical data.

• Deep Learning: Identifying traffic signs in images.

Python Libraries for ML

1. NumPy: Numerical computations.

2. SciPy: Signal processing, optimization, statistics.

3. Matplotlib: Plotting graphs and visualizations.

4. Pandas: Data manipulation and analysis.

5. Scikit-learn: ML algorithms for regression, classification, and clustering.

Real-World Usage: In stock price prediction, Pandas is used for data manipulation,
Matplotlib for visualizing trends, and Scikit-learn for creating models.
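A minimal sketch of this workflow, assuming a hypothetical CSV file prices.csv with "day" and "price" columns (the file name and column names are illustrative only):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Pandas: load and manipulate the data (prices.csv is a hypothetical file)
df = pd.read_csv("prices.csv")

# Matplotlib: visualize the trend
plt.scatter(df["day"], df["price"])
plt.xlabel("Day")
plt.ylabel("Price")
plt.show()

# Scikit-learn: create a simple model and predict a future value
model = LinearRegression()
model.fit(df[["day"]], df["price"])
print(model.predict([[101]]))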

Supervised vs. Unsupervised Learning

1. Supervised Learning: Supervised learning is a type of machine learning where the model is trained on labeled data. The algorithm learns from input-output pairs, where each input has a corresponding correct output (label). The goal is to map inputs to outputs by learning the underlying pattern in the labeled dataset.
Requires a dataset with both inputs (features) and outputs (labels).
Objective: Predict the label or output for new, unseen inputs.
Common techniques: Regression (e.g., predicting sales), Classification (e.g., classifying emails as spam or not).
Use case:
▪ Diagnosing diseases based on symptoms.
▪ Spam Detection:
✓ Inputs: Email content.
✓ Output (Label): "Spam" or "Not Spam".
✓ Supervised learning algorithms learn the characteristics of spam
emails and apply the learned model to classify new emails.

2. Unsupervised Learning: Unsupervised learning is a type of machine learning where the model works with unlabeled data. The algorithm tries to identify patterns, structures, or groupings in the data without predefined labels or categories.

Works on unlabeled data. Only the input data is available; no associated outputs.

Objective: Explore the data structure and discover hidden patterns.

Techniques: Clustering, Density Estimation, Dimensionality Reduction, Market Basket Analysis.
Use case:
▪ Customer segmentation in e-commerce.
✓ Inputs: Customer demographics and purchase behavior.
✓ Output: Clusters or groups (e.g., "High-spending customers,"
"Occasional buyers").
✓ The algorithm groups customers based on similar characteristics
without predefined labels.

Regression in Python

1. Simple Linear Regression:

o Use scipy.stats.linregress(x, y) to calculate regression parameters like slope, intercept, and p-value.

Practical Task:

• Use any dataset (e.g., data1) and run regression on two variables.

• Example scenario: Analyzing the relationship between advertising spend and sales.
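A minimal sketch of this task with scipy.stats.linregress, using made-up advertising-spend and sales figures (the numbers are illustrative, not real data):

from scipy import stats

ad_spend = [10, 20, 30, 40, 50]   # x: advertising spend
sales = [25, 45, 62, 88, 105]     # y: sales

result = stats.linregress(ad_spend, sales)
print("slope:", result.slope)
print("intercept:", result.intercept)
print("r-value:", result.rvalue)
print("p-value:", result.pvalue)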

Regression Algorithms in ML

1. Ordinal Regression: Ordinal regression predicts an ordinal target variable, where the
values have a natural order but no fixed interval (e.g., rankings, ratings).
Use Case: Predicting customer satisfaction levels (e.g., "Very Unsatisfied,"
"Unsatisfied," "Neutral," "Satisfied," "Very Satisfied")
Industry: Market research and survey analysis.

2. Poisson Regression: Poisson regression is used for modeling count data and events that
occur at a constant rate over time or space.

Use Case: Predicting the number of accidents at a traffic junction in a week.


Industry: Traffic management, healthcare (e.g., predicting patient arrivals at a
clinic).
3. Fast Forest Quantile Regression: An ensemble-based regression technique used to
predict the quantiles (percentiles) of a dependent variable's distribution.
Use Case: Estimating the range of house prices based on features like location,
size, and amenities.
Industry: Real estate valuation.
4. Linear Regression: A basic regression technique that models the relationship between
dependent and independent variables as a straight line.
Use Case: Predicting a student's exam scores based on study hours.
Industry: Education analytics.
5. Polynomial Regression: Extends linear regression by fitting a polynomial equation to
the data for modeling non-linear relationships.
Use Case: Modeling the trajectory of a projectile in physics.
Industry: Engineering and scientific research.

6. Lasso Regression: Linear regression with L1 regularization that minimizes the absolute sum of coefficients, effectively performing feature selection (see the sketch after this list).

Use Case: Identifying the most impactful features affecting house prices.
Industry: Data science and predictive analytics.
7. Step-Wise Regression: A method of selecting a regression model by adding or
removing predictors based on statistical significance.

Use Case: Building a sales prediction model by iteratively selecting relevant
marketing factors.
Industry: Business and sales analytics.

8. Ridge Regression: Linear regression with L2 regularization that penalizes large coefficients to handle multicollinearity and overfitting (also illustrated in the sketch after this list).

Use Case: Predicting stock prices with highly correlated features like market
indices.
Industry: Financial modeling.

9. Bayesian Linear Regression: A probabilistic regression approach that incorporates uncertainty in the predictions by treating model parameters as random variables.

Use Case: Forecasting weather with uncertainty intervals.


Industry: Meteorology and climate science.

10. Neural Network Regression: Uses neural networks to model complex and non-linear
relationships between variables.

Use Case: Predicting energy consumption based on historical data and weather
conditions.
Industry: Energy management and IoT.

11. Decision Forest Regression: An ensemble method combining multiple decision trees
to improve prediction accuracy and robustness.

Use Case: Predicting loan defaults based on customer profiles.


Industry: Banking and finance.
12. Boosted Decision Tree Regression: Improves decision forest regression by iteratively
correcting errors in previous models to boost overall performance.
Use Case: Forecasting sales demand for seasonal products.
Industry: Retail and supply chain.
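As a brief illustration of two of the regularized techniques above, the following sketch contrasts Lasso (L1) and Ridge (L2) regression in scikit-learn on synthetic data (the alpha values are illustrative, not recommended settings):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends strongly on the first feature, weakly on the second
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_)   # some coefficients shrink exactly to 0 (feature selection)
print("Ridge coefficients:", ridge.coef_)   # all coefficients shrink, but none become exactly 0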

Scatterplot Visualization

• Use Matplotlib for visualizing data and model predictions

Real-World Example: Scatterplots help visualize the correlation between marketing spend
and customer engagement.

Model Evaluation

Model evaluation is crucial for assessing how well a regression model fits the data and its
ability to make accurate predictions on unseen data. It involves comparing predicted values to
actual values using various techniques and metrics.

Train and Test on the Same Dataset

• The model is trained and evaluated using the same dataset. Predictions are compared
against actual outcomes within the same data.
• Advantages:
➢ Simple and quick to implement.
➢ Provides an initial understanding of the model's behavior.
• Disadvantages:
➢ Overfitting: The model might memorize the dataset instead of generalizing
patterns, leading to inflated accuracy scores.
➢ No Generalization: Results may not hold true for unseen data.
• Example:
Training a linear regression model on sales data and predicting sales within the same dataset. A
high accuracy might mislead if the model fails to predict new sales data accurately.
• Steps
✓ We run the regression on the entire dataset, which is the process of training the model
✓ Then, we select a small subset of the dataset for testing

✓ We compare the actual values to the values predicted by the regression line
✓ There are several metrics to check the accuracy. For example, average absolute error

Train and Test Model:

➢ Always evaluate models on unseen data to avoid overfitting.


➢ Definition: The dataset is divided into two subsets:
Training Set: Used to train the model.
Testing Set: Used to evaluate the model's performance.
➢ Process:
➢ Split the data, typically 80% for training and 20% for testing.
➢ Train the model using the training set.
➢ Evaluate its performance by predicting the testing set values and comparing
them with actual values.
➢ Improve reliability using k-fold Cross Validation: Rotate test/train splits to
compute average accuracy.
➢ Advantages:
➢ Provides a better understanding of the model's generalization ability.
➢ Reduces overfitting risk by testing on unseen data.
➢ Disadvantages:
➢ Model performance can vary depending on the specific split.
➢ May lead to biased results if the split is not representative of the overall data
➢ Example: Predicting house prices using a train/test split ensures the model's predictions
are reliable for new properties.

K-fold cross-validation: K-fold cross-validation is a method used to evaluate the performance of a machine learning model. It divides the dataset into k equal-sized subsets (folds). Each fold is used as a test set exactly once, while the remaining k-1 folds form the training set. The process is repeated k times, and the model's performance is averaged over all iterations to provide a more reliable estimate of its accuracy.

Steps in K-Fold Cross Validation

➢ Divide the Data: Split the entire dataset into k equal-sized subsets or folds.
➢ Train and Test: Use k−1 folds as the training set and the remaining one as the test
set. Train the model on the training set and evaluate its accuracy on the test set.

➢ Repeat k times: Rotate the test set among the k folds, ensuring each subset is used
as a test set once.
➢ Calculate Performance: Record the evaluation metric (e.g., accuracy, RMSE) for
each iteration.
➢ Average Results: Compute the mean of the metrics from all k iterations to
determine the final model performance.

Purpose of K-Fold Cross Validation

Reliability: Provides a more reliable measure of model performance by ensuring that every data point is used for both training and testing.
Reduces Bias: Avoids the risk of overfitting or underfitting due to a single train-test split.
Handles Variability: Accounts for data variability, especially in small datasets.

Example

Scenario: Evaluating a regression model predicting house prices.

1. Dataset: 100 house records.


2. k = 5 (5-Fold Cross Validation):
o Step 1: Split into 5 folds of 20 records each.
o Step 2: Train the model on 80 records and test on 20.
o Step 3: Rotate the folds and repeat.
o Step 4: Record the accuracy (e.g., R²) for each fold.
o Step 5: Compute the average R² across all 5 folds.

Result: Average R² = 0.85, indicating good model performance.
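A minimal sketch of 5-fold cross-validation for a regression model with scikit-learn; the dataset here is synthetic (make_regression) rather than real house records:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 100 synthetic records standing in for the house data
X, y = make_regression(n_samples=100, n_features=4, noise=10, random_state=0)

# cv=5 splits the data into 5 folds, training on 4 and testing on 1, rotating each time
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R² per fold:", scores)
print("Average R²:", scores.mean())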

Accuracy: Accuracy is a metric used to evaluate how well a machine learning
model performs. It measures the percentage of correct predictions made by the
model.

➢ Training Accuracy: Indicates how accurate the model is when tested on the training
dataset (the data used to train the model). High training accuracy shows the model has
learned patterns in the training data well. High training accuracy can indicate
overfitting.
➢ Testing Accuracy: Indicates how accurate the model is when tested on a new, unseen
dataset (test dataset). Testing accuracy is a measure of the model's ability to generalize
to new data. Measures generalization capability.

Why Compare Training and Testing Accuracy?

➢ Identify Overfitting:
A model with very high training accuracy but low testing accuracy is likely
overfitting. It means the model is too tailored to the training dataset and cannot
generalize to new data.
Example: A student memorizing practice questions instead of understanding the
concepts may perform well on the practice test but poorly on new questions.
➢ Generalization:
Testing accuracy indicates whether the model can perform well on data it has
not seen before. A high testing accuracy is desirable as it reflects the model's
robustness and generalization ability.

High Training Accuracy isn’t Always Desirable

Overfitting:

✓ A high training accuracy might suggest the model is overly dependent on the nuances
and noise in the training data.
✓ Such a model will struggle with new data, leading to poor testing accuracy.

Balanced Accuracy:

✓ A balance between training and testing accuracy is preferred, indicating the model has
learned the data patterns effectively without overfitting.

Low Test Accuracy

Low testing accuracy usually indicates that the model has not learned patterns that generalize: the training data may not be sufficiently varied, or the dataset may not represent the broader problem well. Note that if the test set overlaps with or is part of the training set, the measured testing accuracy is unreliable, since it no longer reflects performance on truly unseen data.

Desirable Testing Accuracy

✓ A high testing accuracy reflects the model's ability to generalize its learning to new
datasets.

✓ This is why models are often tuned and validated on separate test datasets to ensure
robustness.

Example

• Scenario: Predicting house prices.

✓ Training Accuracy: 98% (the model fits perfectly to the training data).
✓ Testing Accuracy: 70% (the model struggles with new data).
✓ Conclusion: The model is overfitting, and adjustments like reducing model complexity
or using regularization are needed.

Multiple Regression

• Predict a dependent variable using multiple independent variables.

• Example:

▪ Dependent Variable (y): Median house price.


▪ Independent Variables (X): Total rooms, median income.
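A minimal sketch of multiple regression, assuming a small hypothetical housing table (the numbers are made up for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

housing = pd.DataFrame({
    "total_rooms":   [4, 6, 5, 8, 3],
    "median_income": [3.2, 5.1, 4.0, 7.5, 2.8],
    "median_price":  [150, 260, 210, 390, 130],
})

X = housing[["total_rooms", "median_income"]]   # independent variables
y = housing["median_price"]                     # dependent variable

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)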

Train/Test Split in Python

➢ We use the scikit-learn library for this


➢ Since we have already imported the train_test_split function, we can use it directly as train_test_split()
➢ If we only import sklearn, we need to call it as sklearn.model_selection.train_test_split()

Parameters

Parameters are variables or inputs that define the behavior of a function, model, or algorithm.
These allow customization or control over how the operation is performed. They serve as input
values that determine the function's operation and output.

Types

➢ Function Parameters: Inputs provided to a function that control its execution.


➢ Model Parameters: Variables learned from data during training.
➢ Hyperparameters: Configurations set before training to control the training process or
model structure.

1. *arrays: These are the input datasets you want to split. Typically, you provide features (like
X) and target labels (like y) as arguments.

Examples:

• X, y: Feature matrix and target array.


• Multiple arrays can be passed if you need to split several datasets consistently.

2. test_size: Specifies the proportion of the dataset to include in the test split.

• Range: A float value between 0.0 and 1.0.


For example, test_size=0.25 means 25% of the data will be used for testing.

• Default Value: If left as None, it is automatically set to 0.25 if train_size is also None.

3. train_size: Specifies the proportion of the dataset to include in the training split.

• Range: A float value between 0.0 and 1.0.


For example, train_size=0.75 means 75% of the data will be used for training.

• Default Value: If left as None, the complement of test_size is used (e.g., if


test_size=0.25, then train_size=0.75).

4. random_state: Controls the randomness of data shuffling before splitting the dataset. It acts
as a "seed" for reproducibility.

• How it Works: If you use the same random_state value across multiple runs, the split
will remain consistent.

• Example:

random_state=42 ensures the same split every time the code is run.

5. shuffle: Determines whether the data should be shuffled before splitting.

• Default Value: True (data will be shuffled).

• Why Shuffle?: Shuffling ensures that the split is random, which is especially important
when the data has an inherent order or structure.

6. stratify: Ensures that the training and test splits preserve the proportion of classes in a
classification problem.

• Input: This parameter takes the target variable (e.g., y) to perform the stratification.

• When to Use: Use this when dealing with imbalanced classes to avoid over-
representation of a specific class in either split.
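A minimal sketch showing these parameters together, on a small synthetic classification dataset (all settings are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # 25% of the rows go to the test set
    random_state=42,   # reproducible shuffle
    shuffle=True,      # shuffle before splitting (the default)
    stratify=y,        # preserve the class proportions in both splits
)
print(X_train.shape, X_test.shape)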

Classification

Classification is a supervised machine learning technique used to categorize data into predefined classes or groups. The model is trained on a labeled dataset, meaning each input is paired with a corresponding output label. The goal is to predict the class label of new, unseen data.

• Output: A categorical variable (e.g., "Yes" or "No").

• Types:

Binary Classification: Two possible outcomes (e.g., spam email detection: "Spam" or "Not Spam").
Multi-Class Classification: More than two categories (e.g., identifying flower species: Setosa, Versicolor, or Virginica).

Real-Life Examples

• Email filtering: Determining if an email is spam or not.

• Fraud detection: Classifying transactions as fraudulent or legitimate.

• Healthcare: Predicting if a patient has a disease based on symptoms.

Classification Algorithms

1. Decision Trees

2. Naive Bayes

3. Linear Discriminant Analysis

4. k-Nearest Neighbour (kNN)

5. Logistic Regression

6. Neural Networks

7. Support Vector Machines (SVM)

Decision Trees

A decision tree is a classification algorithm that splits data based on features to predict
outcomes. Each decision leads to a node, and the tree continues until reaching a "leaf" node
with a classification.

Example: Suppose we have data on patients who were prescribed either drug A or drug B, and we want to find out which drug a new patient should be prescribed.

Solution: We build a decision tree on the patient attributes, as described in the next section.

Building a Decision Tree

• We want to basically determine which attribute should be used for classification in each
subsequent stage
• We will see what to do when we only have to compare two attributes, sex and
cholesterol, to see which is better
• If we have more attributes, we need to compare all simultaneously using the same
process
• Once the primary attribute is determined and the data is split, we then have to
recursively use the same process at each stage till we get to the leaves, or terminal nodes
• Algorithm Example
➢ We assume that we have 14 patients
➢ We decide which attribute is the most predictive
➢ Suppose, in our example, we choose cholesterol as the attribute
➢ Then, we see what the branches look like for both cholesterol and sex

• Use criteria like entropy or Gini index to measure impurity and determine splits.
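A minimal sketch of the drug-prescription example with scikit-learn's DecisionTreeClassifier; the encoded features and labels below are hypothetical (sex: 0/1, cholesterol: 0 = normal, 1 = high):

from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 0]]   # [sex, cholesterol] per patient
y = ["A", "A", "B", "B", "A", "B"]                      # prescribed drug

tree = DecisionTreeClassifier(criterion="entropy")      # entropy-based splits ("gini" is the default)
tree.fit(X, y)

print(tree.predict([[1, 0]]))   # classify a new patient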

Entropy

➢ By looking at the data, it appears as though the sex attribute has more predictiveness
and less entropy
➢ Entropy is the measurement of impurity in a particular node. The lower the entropy, the purer the node.
➢ It is defined as
E = -p(A)·log₂ p(A) - p(B)·log₂ p(B)
➢ A perfectly pure node has entropy 0, while a perfectly impure node has entropy 1 (lower entropy = purer nodes).

➢ Information Gain: Difference in entropy before and after splitting.

The sex attribute gives the better split, so we go one level below it for further classification.

Gini Index

• Measures impurity for splits. It is defined as
Gini = 1 - Σ p(i)²
where p(i) is the probability of an object being chosen from class i.

• Lower Gini = purer nodes.

• Used in CART (Classification and Regression Trees).
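A minimal sketch computing both impurity measures for a single node that contains, say, 9 "drug A" and 5 "drug B" patients (illustrative counts):

import math

def entropy(p_a, p_b):
    # E = -p(A)·log2 p(A) - p(B)·log2 p(B); 0·log(0) is treated as 0
    return -sum(p * math.log2(p) for p in (p_a, p_b) if p > 0)

def gini(p_a, p_b):
    # Gini = 1 - sum of squared class probabilities
    return 1 - (p_a ** 2 + p_b ** 2)

p_a, p_b = 9 / 14, 5 / 14
print("Entropy:", entropy(p_a, p_b))   # ≈ 0.94 (fairly impure)
print("Gini:", gini(p_a, p_b))         # ≈ 0.46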

Real-Life Example

• Predicting whether a patient should be prescribed Drug A or B based on features like


cholesterol level and age.

Comparison: Entropy vs Gini

Metric       | Entropy        | Gini Index
Scale        | 0 (pure) to 1  | 0 (pure) to 0.5
Computation  | Logarithmic    | Quadratic

Advantages of Decision Trees

✓ Easy to interpret and visualize.


✓ Handles numerical and categorical data.
✓ Can identify feature importance.
✓ Robust to missing values and outliers.

Disadvantages of Decision Trees

✓ Prone to overfitting (can be mitigated by pruning).


✓ Sensitive to small changes in data.
✓ May produce biased results with imbalanced data.

k-Nearest Neighbours (kNN)

➢ A non-parametric, instance-based learning algorithm used for classification and
regression.
➢ A new data point is classified by the majority vote of its k nearest neighbors based on
distance metrics like Euclidean or Manhattan.

Choosing k

• Smaller k: High variance, low bias.

• Larger k: High bias, low variance.

• Usually, higher values of k will be better if there are more outliers

• Also, odd values for k are recommended to avoid ties

• Use cross-validation to select an optimal k.

Applications

• Missing data imputation.

• Recommendation systems (e.g., movie recommendations).

• Medical diagnostics (e.g., cancer risk prediction).

• Finance Profiling

• Pattern Recognition

Advantages

• Simple and easy to implement.

• Adaptable to different problem domains.

Disadvantages

• Computationally intensive for large datasets.

• Sensitive to irrelevant features and high dimensionality.

How is the test done?

✓ Suppose we have two categories for the output variable, say A and B, and we have a new data point which we want to classify as either A or B
✓ We use the scatterplot for the given data, and then plot the new point in the same scatterplot

✓ First, we choose the value of k (typically 5 is used, but any value can be used)
✓ Then, we observe the k nearest points to the data point (nearest on the basis of distance)
✓ To calculate distance, we can use the Euclidean distance, Manhattan distance, or the
Minkowski distance
✓ We calculate the distance of the new point from all the points and select the 5 shortest
distances amongst them
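A minimal sketch of this procedure with scikit-learn's KNeighborsClassifier, using hypothetical two-dimensional points labelled A or B:

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [2, 1], [6, 7], [7, 8], [8, 8], [7, 6]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

# k = 5 with Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)

print(knn.predict([[5, 5]]))   # majority vote of the 5 nearest neighbours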

Evaluation Metrics in Classification

Evaluation metrics are measures used to assess the performance of a classification model by
comparing its predictions against actual outcomes. They help determine how well the model
distinguishes between different classes and guide improvements.

The most widely known metrics are:

1. Classification Accuracy
2. Jaccard Index
3. Area under the curve
4. F1 Score
5. Logarithmic Loss
6. Confusion Matrix

Classification Accuracy

Ratio of correct predictions to the total number of input samples.
It's straightforward to calculate but can be misleading, especially on imbalanced datasets.
Real-life Example: If a model predicts 90 out of 100 email classifications correctly
(spam vs. not spam), the accuracy is 90%.

Jaccard Index

Measures the similarity between predicted and actual output values as a ratio of their
intersection to their union.
Real-life Example: Used in recommendation systems to compare user preferences.
In set A we take all predicted values, and in set B we take all actual test output y values.
The index is defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
An index closer to 0 indicates lower accuracy, while a value closer to 1 indicates higher accuracy.

Confusion Matrix

A table summarizing the true positives (TP), false positives (FP), true negatives (TN),
and false negatives (FN).

Real-life Example: Evaluates a disease prediction model's performance, distinguishing
between correctly and incorrectly diagnosed patients.

It computes 4 kinds of variables – true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The diagonal elements (TP and TN) are then used to calculate accuracy.

F1 Score

The harmonic mean of precision and recall.


Formulas:
▪ Precision = TP / (TP + FP)
▪ Recall = TP / (TP + FN)
▪ F1 = 2 × (Precision × Recall) / (Precision + Recall)
Real-life Example: Used in fraud detection systems where both false positives and
false negatives are critical.

Logarithmic Loss

Measures the uncertainty of predictions for multi-class classification by assigning probabilities between 0 and 1.
Real-life Example: Multi-class sentiment analysis in customer reviews.
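A minimal sketch of these metrics with scikit-learn, using hypothetical predicted and actual labels for a binary fraud-detection task (1 = fraud, 0 = legitimate) and made-up predicted probabilities for the log loss:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score, log_loss, precision_score,
                             recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probability of class 1

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Jaccard:", jaccard_score(y_true, y_pred))
print("Log loss:", log_loss(y_true, y_prob))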

Ensemble Learning Models

Ensemble learning models are machine learning techniques that combine multiple individual
models to improve overall performance, accuracy, and robustness. Instead of relying on a single
model, these techniques aggregate the outputs of several models, often referred to as "weak
learners," to create a stronger, more accurate predictive model. This is equivalent to consulting several people before making a decision in an unknown situation, and hence it is also called a committee of experts.

Real-life Example: Weather forecasting uses multiple models for more reliable predictions.

Key Features of Ensemble Learning

1. Improved Performance: By leveraging the strengths of multiple models, ensemble methods reduce errors like bias and variance.

2. Diversity: Different models or variations of the same model are used to ensure better
generalization.

3. Robustness: More resilient to outliers and noise in data.

4. Error Reduction: Reduces overfitting (high variance) and underfitting (high bias) when
properly implemented.

Types of Ensemble Learning Models

1. Bagging (Bootstrap Aggregating)

• Process:

✓ Subsamples of the dataset are drawn with replacements.


✓ Each subset is used to train an independent model.
✓ Final prediction is the average (regression) or majority vote (classification) of
all models.

• Key Focus: Reduces variance by averaging predictions.

• Examples: Random Forest: Constructs multiple decision trees and averages their
outputs.

• Real-life Example: Predicting house prices based on diverse features like location, size,
and condition.

2. Boosting

• Process:

✓ Models are trained sequentially, with each model focusing on correcting the
errors made by its predecessor.
✓ Data points misclassified by previous models are given higher weights.
✓ It starts by assigning equal weights to all the data points in the original dataset, and each subsequent learner increases or decreases the weights on the points depending on whether the point has been correctly or incorrectly classified

• Key Focus: Reduces bias by iteratively improving predictions.

• Examples:

AdaBoost: Adjusts weights dynamically for misclassified instances.
Gradient Boosting: Minimizes the residual errors of previous models using gradient descent.

• Real-life Example: Credit scoring systems for assessing loan risks.

3. Stacking

• Process:

✓ Combines the predictions of different types of base models (e.g., decision trees,
logistic regression).
✓ A meta-model (e.g., a logistic regression model) is trained on the outputs of the
base models to make the final prediction.

• Key Focus: Leverages the strengths of diverse models.

• Examples: Combining decision trees, support vector machines (SVM), and neural
networks.

• Real-life Example: Predicting customer churn using demographic, behavioral, and transactional data.
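A minimal sketch of stacking with scikit-learn's StackingClassifier, combining a decision tree and an SVM under a logistic-regression meta-model on synthetic data (all choices are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # meta-model trained on the base models' outputs
)
stack.fit(X, y)
print("Training accuracy:", stack.score(X, y))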

Popular Ensemble Learning Algorithms

Random Forests: An ensemble of decision trees, each trained on random subsets, averaging
results for robust predictions.

It runs using the following steps:

I. Selecting a random subset and using that as the training set to construct a decision tree
II. Repeating the procedure for several subsets, creating several decision trees
III. Averaging over all the outputs to create a final output, thus minimizing the error

The higher the number of trees created, the more resilient the model is against overfitting

Advantages:

Solves regression and classification problems.


Handles large datasets efficiently and avoids overfitting.
It enhances accuracy and is resistant to overfitting
It does not fail due to missing values
It has an aspect of cross-validation built into the structure

Real-life Example: Predicting property prices based on historical data.

Process

➢ Select k data points from the training set, and make a subset
➢ Build a decision tree
➢ Choose n, the number of decision trees to build
➢ Repeat steps 1 and 2
➢ For new data points, find the prediction of each decision tree, and choose the category
that wins the maximum votes from amongst the given decision trees

Parameters:

✓ n_estimators: Number of trees.


✓ criterion: Split quality metric (Gini/Entropy).
✓ Real-life Example: Predicting species of plants using physical characteristics
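A minimal sketch of these parameters with RandomForestClassifier, predicting plant species from scikit-learn's bundled iris measurements:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees
    criterion="gini",   # split quality metric ("gini" or "entropy")
    random_state=0,
)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))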

Support Vector Machine (SVM): A robust machine learning algorithm used for classification,
regression, and clustering tasks. It is especially powerful for classification problems.

Core Principle: Identifies a hyperplane in an N-dimensional space to separate data points into
distinct categories. It can be used for Linear Classification, Non-Linear Classification,
Regression, and Clustering.

Key Concepts

Hyperplane:
• A hyperplane is the boundary separating different classes in the data.
• For 2D data, it is a line; for 3D, it is a plane; in higher dimensions, it is a
hyperplane.
• Multiple hyperplanes might separate classes, but the goal is to find the "best"
hyperplane.

Best Hyperplane:

• The optimal hyperplane maximizes the margin (distance) between itself and the
nearest data points from each class, known as support vectors.

Hard Margin:

• Used for linearly separable data.

• In this case, we can select two parallel hyperplanes such that the distance
between them is maximized

• The distance between them is called the margin and the maximum margin
hyperplane is the one that is halfway between them

• Maximizes the margin with no tolerance for misclassified points.

• Completely determined by the support vectors which are the nearest datapoints

Soft Margin:

• Applied when data is not linearly separable.

• Allows misclassifications but assigns penalties for misclassified points using hinge loss.

• Controlled by the parameter C.

Choosing the best Hyperplane

➢ The hyperplane chosen is the one which maximizes the width of the margin, that is the
distance between the hyperplane, and the first data point on each side
➢ Suppose we are given the data points (x1, y1), (x2, y2), …, (xn, yn)
➢ Let w be the vector perpendicular (normal) to the hyperplane, and b be the offset of the hyperplane along w
➢ The distance between a data point xi and the decision boundary is then |w·xi + b| / ‖w‖

The Parameter C:

• Balances the trade-off between maximizing margin and minimizing classification


errors.

• Low C: Wider margins, tolerates misclassifications, less prone to overfitting but may
underfit.

• High C: Narrower margins, fits the data closely, less tolerant to misclassification but
risks overfitting.

Optimization of C

Methods include:

I. Grid Search: Exhaustively evaluates parameters.


II. Random Search: Tests randomly selected values.
III. Genetic Algorithm/Particle Swarm Optimization: Uses heuristic optimization
techniques.
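A minimal sketch of the grid-search approach to tuning C with scikit-learn's GridSearchCV (the candidate values and dataset are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Exhaustively evaluate each candidate C with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["C"])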

Kernels and non-linearity:

A kernel is a mathematical function used in Support Vector Machines (SVM) to transform data
into a higher-dimensional space. The transformation enables SVM to handle non-linear
relationships in the data by making it linearly separable in the transformed space.

▪ Kernels work by projecting data into a space where it becomes linearly separable
▪ This method of projecting the data into a higher dimensional space is called the kernel
trick
▪ This can usually lead to higher amounts of calculations, as transforming each point
takes time
▪ However, SVMs have a kernel function that computes the similarity between the data
points in the higher dimensional space without actually having to compute the
coordinates of each point

Types of Kernels:

➢ Linear Kernel: For linearly separable data.


➢ Polynomial Kernel: Handles more complex patterns but risks overfitting with higher
degrees.
➢ Gaussian/RBF Kernel: Effective for non-linear data, requires careful hyperparameter
tuning.
➢ Sigmoid Kernel: Common in neural networks.

For data that is mostly linearly separable, the linear kernel is sufficient, as it is much easier to compute. For complex, non-linear data, the polynomial and the Gaussian/RBF kernel are used. However, the polynomial kernel is prone to overfitting with higher-degree polynomials, while the RBF kernel requires careful tuning of its hyperparameters. The sigmoid kernel is mostly used in neural networks.

The transformation is achieved via the kernel trick, avoiding explicit computations of new
coordinates.
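A minimal sketch comparing a linear and an RBF kernel on deliberately non-linear data (two concentric circles); the hyperparameters are illustrative:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("Linear kernel accuracy:", linear_svm.score(X, y))   # limited on circular data
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # separates the circles well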
