Python Theory Notes
• Other Techniques:
• Deep Learning: Subset of ML, involves multi-layered neural networks for complex
tasks like image recognition.
Example:
Real-World Usage: In stock price prediction, Pandas is used for data manipulation,
Matplotlib for visualizing trends, and Scikit-learn for creating models.
Regression in Python
Practical Task:
• Use any dataset (e.g., data1) and run regression on two variables.
• Example scenario: Analyzing the relationship between advertising spend and sales.
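A minimal sketch of this task in Python (the toy numbers below stand in for a real dataset such as data1 and are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: advertising spend (in $1000s) vs. resulting sales (in units)
ad_spend = np.array([[5], [10], [15], [20], [25], [30]])
sales = np.array([22, 41, 58, 83, 99, 121])

model = LinearRegression()
model.fit(ad_spend, sales)                       # run the regression on the two variables
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales at 18k spend:", model.predict([[18]]))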
1. Ordinal Regression: Ordinal regression predicts an ordinal target variable, where the
values have a natural order but no fixed interval (e.g., rankings, ratings).
Use Case: Predicting customer satisfaction levels (e.g., "Very Unsatisfied,"
"Unsatisfied," "Neutral," "Satisfied," "Very Satisfied")
Industry: Market research and survey analysis.
2. Poisson Regression: Poisson regression is used for modeling count data and events that
occur at a constant rate over time or space.
Use Case: Modeling the number of calls arriving at a call center per hour (count data).
Industry: Operations analytics and telecommunications.
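Poisson regression is available in scikit-learn as PoissonRegressor; a minimal sketch on synthetic count data (the rate function below is invented for illustration):

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))           # e.g. an exposure variable
y = rng.poisson(lam=np.exp(0.3 * X[:, 0]))     # event counts whose rate depends on X

model = PoissonRegressor()
model.fit(X, y)
print("coefficient:", model.coef_, "intercept:", model.intercept_)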
7. Step-Wise Regression: A method of selecting a regression model by adding or
removing predictors based on statistical significance.
Use Case: Predicting stock prices with highly correlated features like market
indices.
Industry: Financial modeling.
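Classic step-wise selection adds or removes predictors based on p-values; scikit-learn does not implement that exact procedure, but its SequentialFeatureSelector performs a similar greedy forward/backward selection, sketched here on synthetic data:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="forward")   # add predictors one at a time
selector.fit(X, y)
print("selected features:", selector.get_support(indices=True))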
10. Neural Network Regression: Uses neural networks to model complex and non-linear
relationships between variables.
Use Case: Predicting energy consumption based on historical data and weather
conditions.
Industry: Energy management and IoT.
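A minimal sketch of neural-network regression with scikit-learn's MLPRegressor on synthetic data (the layer sizes are arbitrary choices):

from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X, y)                              # learns a non-linear mapping from X to y
print("R^2 on training data:", model.score(X, y))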
11. Decision Forest Regression: An ensemble method combining multiple decision trees
to improve prediction accuracy and robustness.
Real-World Example: Scatterplots help visualize the correlation between marketing spend
and customer engagement.
Model Evaluation
Model evaluation is crucial for assessing how well a regression model fits the data and its
ability to make accurate predictions on unseen data. It involves comparing predicted values to
actual values using various techniques and metrics.
• The model is trained and evaluated using the same dataset. Predictions are compared
against actual outcomes within the same data.
• Advantages:
➢ Simple and quick to implement.
➢ Provides an initial understanding of the model's behavior.
• Disadvantages:
➢ Overfitting: The model might memorize the dataset instead of generalizing
patterns, leading to inflated accuracy scores.
➢ No Generalization: Results may not hold true for unseen data.
• Example:
Training a linear regression model on sales data and predicting sales within the same dataset. A
high accuracy might mislead if the model fails to predict new sales data accurately.
• Steps
✓ We run the regression on the entire dataset, which is the process of training the model
✓ Then, we select a small subset of the dataset for testing
➢ Divide the Data: Split the entire dataset into k equal-sized subsets or folds.
➢ Train and Test: Use k−1 folds as the training set and the remaining one as the test
set. Train the model on the training set and evaluate its accuracy on the test set.
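A sketch of this k-fold procedure using scikit-learn's cross_val_score on synthetic data (k = 5 is chosen only for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5)   # 5 folds: train on 4, test on 1
print("score per fold:", scores)
print("mean score:", scores.mean())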
Example
➢ Training Accuracy: Indicates how accurate the model is when tested on the training
dataset (the data used to train the model). High training accuracy shows the model has
learned patterns in the training data well. High training accuracy can indicate
overfitting.
➢ Testing Accuracy: Indicates how accurate the model is when tested on a new, unseen
dataset (the test dataset). Testing accuracy measures the model's ability to generalize
to new data.
➢ Identify Overfitting:
A model with very high training accuracy but low testing accuracy is likely
overfitting. It means the model is too tailored to the training dataset and cannot
generalize to new data.
Example: A student memorizing practice questions instead of understanding the
concepts may perform well on the practice test but poorly on new questions.
➢ Generalization:
Testing accuracy indicates whether the model can perform well on data it has
not seen before. A high testing accuracy is desirable as it reflects the model's
robustness and generalization ability.
Overfitting:
✓ A high training accuracy might suggest the model is overly dependent on the nuances
and noise in the training data.
✓ Such a model will struggle with new data, leading to poor testing accuracy.
Balanced Accuracy:
✓ A balance between training and testing accuracy is preferred, indicating the model has
learned the data patterns effectively without overfitting.
If the test set overlaps with or is drawn from the training set, the measured testing accuracy
is inflated and does not reflect true generalization. Genuinely low testing accuracy usually
means the model was not trained on sufficiently representative data, or that the dataset does
not represent the broader problem well.
✓ A high testing accuracy reflects the model's ability to generalize its learning to new
datasets.
✓ This is why models are often tuned and validated on separate test datasets to ensure
robustness.
Example
✓ Training Accuracy: 98% (the model fits perfectly to the training data).
✓ Testing Accuracy: 70% (the model struggles with new data).
✓ Conclusion: The model is overfitting, and adjustments like reducing model complexity
or using regularization are needed.
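A small sketch that reproduces this pattern with an unrestricted decision tree on synthetic data (the exact numbers will differ from the 98%/70% above):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=0)                # no depth limit, free to memorize
model.fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))    # typically close to 1.0
print("testing accuracy:", model.score(X_test, y_test))       # noticeably lower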
Multiple Regression
• Example:
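A minimal sketch of multiple regression; the predictors (advertising spend, price, number of stores) and the toy sales figures are made up for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "price":    [9.9, 9.5, 9.0, 8.5, 8.0],
    "n_stores": [5, 6, 8, 9, 12],
    "sales":    [100, 150, 210, 260, 330],
})
X = data[["ad_spend", "price", "n_stores"]]   # several predictors instead of one
y = data["sales"]

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)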
Parameters
Parameters are variables or inputs that define the behavior of a function, model, or algorithm.
These allow customization or control over how the operation is performed. They serve as input
values that determine the function's operation and output.
Types
1. *arrays: These are the input datasets you want to split. Typically, you provide features (like
X) and target labels (like y) as arguments.
Examples:
2. test_size: Specifies the proportion of the dataset to include in the test split.
• Default Value: If left as None, it is automatically set to 0.25 if train_size is also None.
4. random_state: Controls the randomness of data shuffling before splitting the dataset. It acts
as a "seed" for reproducibility.
• How it Works: If you use the same random_state value across multiple runs, the split
will remain consistent.
• Example:
random_state=42 ensures the same split every time the code is run.
• Why Shuffle?: Shuffling ensures that the split is random, which is especially important
when the data has an inherent order or structure.
6. stratify: Ensures that the training and test splits preserve the proportion of classes in a
classification problem.
• Input: This parameter takes the target variable (e.g., y) to perform the stratification.
• When to Use: Use this when dealing with imbalanced classes to avoid over-
representation of a specific class in either split.
• Types:
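A sketch tying these parameters together on a synthetic, imbalanced classification dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)  # imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y,                 # *arrays: the features and the target labels
    test_size=0.25,       # 25% of the rows go to the test split
    random_state=42,      # reproducible shuffling
    stratify=y,           # keep the 80/20 class proportions in both splits
)
print(X_train.shape, X_test.shape)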
Real-Life Examples
Classification Algorithms
1. Decision Trees
2. Naive Bayes
5. Logistic Regression
6. Neural Networks
A decision tree is a classification algorithm that splits data based on features to predict
outcomes. Each decision leads to a node, and the tree continues until reaching a "leaf" node
with a classification.
Example: Suppose we have data on patients who were prescribed either drug A or drug B, and
we want to determine which drug a new patient should be prescribed.
Solution:
• We want to basically determine which attribute should be used for classification in each
subsequent stage
• We will see what to do when we only have to compare two attributes, sex and
cholesterol, to see which is better
• If we have more attributes, we need to compare all simultaneously using the same
process
• Once the primary attribute is determined and the data is split, we then have to
recursively use the same process at each stage till we get to the leaves, or terminal nodes
• Algorithm Example
➢ We assume that we have 14 patients
➢ We decide which attribute is the most predictive
➢ Suppose, in our example, we choose cholesterol as the attribute
➢ Then, we see what the branches look like for both cholesterol and sex
• Use criteria like entropy or Gini index to measure impurity and determine splits.
Entropy
➢ By looking at the data, it appears as though the sex attribute has more predictiveness
and less entropy
➢ Entropy is the measurement of impurity in a particular node; the lower the entropy,
the purer the node
➢ For a node containing two classes A and B with proportions p(A) and p(B), it is defined as
E = -p(A)·log₂(p(A)) - p(B)·log₂(p(B))
➢ A perfectly pure node has entropy 0, while a perfectly impure (50/50) node has entropy 1
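As a quick worked illustration (the 9/5 split below is assumed purely for the arithmetic, not taken from the notes): a node containing 9 drug-A and 5 drug-B patients out of 14 has
E = -(9/14)·log₂(9/14) - (5/14)·log₂(5/14) ≈ 0.940,
so it is quite impure, whereas a node containing only drug-A patients has E = 0.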
The sex attribute gives the better split, so we use it first and go one level below for further classification
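A minimal sketch of such a tree in scikit-learn; the toy data, column names, and 0/1 encodings are assumptions for illustration:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "sex":         [0, 0, 1, 1, 0, 1, 1, 0],
    "cholesterol": [1, 0, 1, 1, 0, 0, 1, 0],
    "drug":        ["A", "B", "A", "A", "B", "B", "A", "B"],
})
X = data[["sex", "cholesterol"]]
y = data["drug"]

tree = DecisionTreeClassifier(criterion="entropy")   # split using the entropy measure above
tree.fit(X, y)
new_patient = pd.DataFrame({"sex": [1], "cholesterol": [0]})
print(tree.predict(new_patient))                     # predicted drug for a new patient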
Real-Life Example
Choosing k
Applications
• Finance Profiling
• Pattern Recognition
Advantages
Disadvantages
✓ Suppose we have two categories for the output variable, say A and B, and a new
data point that we want to classify as either A or B
✓ We plot the given data on a scatterplot, and then plot the new point on the same plot
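The "Choosing k" heading and the scatterplot description above suggest the algorithm being discussed is k-nearest neighbours; under that assumption, a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Two-feature data with two categories (A = 0, B = 1)
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)      # k = 5
knn.fit(X, y)
print(knn.predict([[0.5, -0.2]]))              # classify a new point by its nearest neighbours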
Evaluation metrics are measures used to assess the performance of a classification model by
comparing its predictions against actual outcomes. They help determine how well the model
distinguishes between different classes and guide improvements.
1. Classification Accuracy
2. Jaccard Index
3. Area under the curve
4. F1 Score
5. Logarithmic Loss
6. Confusion Matrix
Classification Accuracy
Jaccard Index
Measures the similarity between predicted and actual output values as a ratio of their
intersection to their union.
Real-life Example: Used in recommendation systems to compare user preferences.
In set A we take all predicted values, and in set B we take all test output y values
The index is defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
A value closer to 0 indicates lower accuracy, while a value closer to 1 indicates higher
accuracy.
Confusion Matrix
A table summarizing the true positives (TP), false positives (FP), true negatives (TN),
and false negatives (FN) produced by the model.
The diagonal elements (TP and TN) are used to calculate accuracy.
F1 Score
Logarithmic Loss
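A sketch that computes the metrics above with scikit-learn, using made-up predicted and actual labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             jaccard_score, log_loss)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                     # predicted labels
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]     # predicted probability of class 1

print("accuracy:       ", accuracy_score(y_true, y_pred))
print("Jaccard index:  ", jaccard_score(y_true, y_pred))
print("F1 score:       ", f1_score(y_true, y_pred))
print("log loss:       ", log_loss(y_true, y_prob))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))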
Ensemble learning models are machine learning techniques that combine multiple individual
models to improve overall performance, accuracy, and robustness. Instead of relying on a single
model, these techniques aggregate the outputs of several models, often referred to as "weak
learners," to create a stronger, more accurate predictive model. This is equivalent to consulting
several people before making a decision in an unfamiliar situation, and hence it is also
called a committee of experts.
Real-life Example: Weather forecasting uses multiple models for more reliable predictions.
2. Diversity: Different models or variations of the same model are used to ensure better
generalization.
4. Error Reduction: Reduces overfitting (high variance) and underfitting (high bias) when
properly implemented.
1. Bagging
• Process:
✓ Multiple models are trained independently on random (bootstrap) subsets of the
data, and their predictions are averaged (regression) or combined by majority
vote (classification).
• Examples: Random Forest: Constructs multiple decision trees and averages their
outputs.
• Real-life Example: Predicting house prices based on diverse features like location, size,
and condition.
2. Boosting
• Process:
✓ Models are trained sequentially, with each model focusing on correcting the
errors made by its predecessor.
✓ Data points misclassified by previous models are given higher weights.
✓ Boosting starts by assigning equal weights to all the data points in the original
dataset; each subsequent learner increases or decreases the weights on the points
depending on whether each point was correctly or incorrectly classified.
• Examples:
3. Stacking
• Process:
✓ Combines the predictions of different types of base models (e.g., decision trees,
logistic regression).
✓ A meta-model (e.g., a logistic regression model) is trained on the outputs of the
base models to make the final prediction.
• Examples: Combining decision trees, support vector machines (SVM), and neural
networks.
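A rough sketch of stacking in scikit-learn, with a decision tree and an SVM as base models and logistic regression as the meta-model (the particular base models are an illustrative choice):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(),     # meta-model trained on the base models' outputs
)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))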
Random Forests: An ensemble of decision trees, each trained on random subsets, averaging
results for robust predictions.
I. Select a random subset of the data and use it as the training set to construct a
decision tree
II. Repeat the procedure for several subsets, creating several decision trees
III. Average over all the outputs to create a final output, thus minimizing the error
The higher the number of trees created, the more resilient the model is against overfitting
Advantages:
Process
➢ Select k data points from the training set, and make a subset
➢ Build a decision tree
➢ Choose n, the number of decision trees to build
➢ Repeat steps 1 and 2
➢ For new data points, find the prediction of each decision tree, and choose the category
that wins the maximum votes from amongst the given decision trees
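A sketch of this process with scikit-learn's RandomForestClassifier on synthetic data; n_estimators corresponds to n, the number of decision trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees, each on a random subset
forest.fit(X, y)
print(forest.predict(X[:3]))   # each tree votes; the majority class wins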
Parameters:
Core Principle: Identifies a hyperplane in an N-dimensional space to separate data points into
distinct categories. It can be used for Linear Classification, Non-Linear Classification,
Regression, and Clustering.
Key Concepts
Hyperplane:
• A hyperplane is the boundary separating different classes in the data.
• For 2D data, it is a line; for 3D, it is a plane; in higher dimensions, it is a
hyperplane.
• Multiple hyperplanes might separate classes, but the goal is to find the "best"
hyperplane.
Best Hyperplane:
• The optimal hyperplane maximizes the margin (distance) between itself and the
nearest data points from each class, known as support vectors.
Hard Margin:
• In this case, we can select two parallel hyperplanes such that the distance
between them is maximized
• The distance between them is called the margin and the maximum margin
hyperplane is the one that is halfway between them
• Completely determined by the support vectors which are the nearest datapoints
➢ The hyperplane chosen is the one which maximizes the width of the margin, that is the
distance between the hyperplane, and the first data point on each side
➢ Suppose we are given the data points (x1, y1), (x2, y2), ..., (xn, yn)
➢ Let w be the vector perpendicular (normal) to the hyperplane, and b its offset term
➢ The distance between a data point xi and the decision boundary is then |w · xi + b| / ||w||
The Parameter C:
• High C: Narrower margins, fits the data closely, less tolerant to misclassification but
risks overfitting.
Optimization of C
Methods include:
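The notes' list of methods is not reproduced here; one common approach (an assumption, not taken from the notes) is a cross-validated grid search over candidate C values:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)                       # evaluates each C by cross-validation
print("best C:", search.best_params_["C"])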
A kernel is a mathematical function used in Support Vector Machines (SVM) to transform data
into a higher-dimensional space. The transformation enables SVM to handle non-linear
relationships in the data by making it linearly separable in the transformed space.
▪ Kernels perform by projecting data into a space where it becomes linearly separable
▪ This method of projecting the data into a higher dimensional space is called the kernel
trick
▪ Explicitly transforming each point into the higher-dimensional space can be
computationally expensive
▪ However, SVMs have a kernel function that computes the similarity between the data
points in the higher dimensional space without actually having to compute the
coordinates of each point
Types of Kernels:
The transformation is achieved via the kernel trick, avoiding explicit computations of new
coordinates.
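A sketch of the kernel trick in practice: concentric-circles data is not linearly separable in its original two dimensions, so a linear kernel fails while an RBF kernel separates the classes via the implicit higher-dimensional mapping:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)                      # similarity computed without explicit coordinates
print("linear kernel accuracy:", linear.score(X, y))   # poor: no separating line exists
print("RBF kernel accuracy:   ", rbf.score(X, y))      # near perfect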