Data science tutorial
https://www.geeksforgeeks.org/data-science-with-python-tutorial/
https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/
Missing values can occur due to data collection errors, system failures, or incomplete surveys.
Handling them properly is crucial to maintain data quality and prevent bias in analysis or machine
learning models.
✅ 1) Removing Missing Data
If a column has too many missing values (e.g., more than 50%), it is usually better to drop the whole column.
Rows with missing values can be removed if they are few and dropping them won't affect the results, as in the sketch below.
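A minimal sketch of both approaches on an illustrative DataFrame (the column names and the 50% threshold are assumptions for the example):
import numpy as np
import pandas as pd

# Illustrative DataFrame: 'sparse_col' is mostly missing, 'price' has one gap
df = pd.DataFrame({
    'price': [250000, 275000, np.nan, 300000],
    'sparse_col': [np.nan, np.nan, np.nan, 7]
})

# Drop columns with more than 50% missing values
df = df.loc[:, df.isna().mean() <= 0.5]

# Drop the remaining rows that contain any missing value
df = df.dropna()
print(df)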
✅ 2) Imputation (Filling Missing Values)
Mean Imputation: Replace missing values with the column mean.
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
Median Imputation: Replace missing values with the column median (more robust to outliers).
df['column_name'].fillna(df['column_name'].median(), inplace=True)
Mode Imputation: Use the most frequent value for categorical/numerical data.
df['column_name'].fillna(df['column_name'].mode()[0], inplace=True)
Constant Imputation: Fill missing categorical values with a placeholder label.
df['category_column'].fillna("Unknown", inplace=True)
✅ 3) Predictive Imputation
Use Machine Learning models (e.g., Regression, KNN) to predict missing values.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
✅ 4) Forward / Backward Fill (useful for time-series data)
df.ffill(inplace=True)  # forward fill: propagate the last valid value
df.bfill(inplace=True)  # backward fill: use the next valid value
I'll generate a dataset with missing values and demonstrate different imputation techniques.
1️⃣ Original Dataset (with missing values)
ID   Price    Rooms
1    250000   3
2    275000   3
3    NaN      2
4    300000   NaN
5    320000   4
6    NaN      4
7    350000   NaN
8    370000   3
9    NaN      2
10   400000   5
2️⃣ After Mean Imputation
ID   Price       Rooms
3    323571.43   2.00
4    300000.00   3.25
6    323571.43   4.00
7    350000.00   3.25
9    323571.43   2.00
✅ Use Case: Works well for normally distributed data but can be affected by outliers.
3️⃣ After Median Imputation
ID   Price    Rooms
3    320000   2
4    300000   3
6    320000   4
7    350000   3
9    320000   2
✅ Use Case: Preferable when the data is skewed or contains outliers, since the median is not pulled by extreme values.
🔹 Summary
K-Nearest Neighbors (KNN) imputation replaces missing values by considering the nearest neighbors
of an instance and using their values to estimate the missing data.
🔹 Step-by-Step Working
1. For a missing value, identify the K closest data points based on the other available features.
2. For numerical features, the missing value is replaced with the mean or weighted average of the K neighbors.
3. For categorical features, the most frequent (mode) value among the neighbors is used.
🔹 Example
ID   Price    Rooms
1    250000   3
2    275000   3
3    NaN      2
4    300000   NaN
5    320000   4
✅ When data has patterns: Works well if missing values depend on relationships with other
variables.
✅ When handling outliers: Unlike mean imputation, KNN uses local information, reducing the effect
of extreme values.
✅ For complex datasets: It captures underlying distributions better than simple mean/median
imputation.
❌ Limitations: Computationally expensive on large datasets, sensitive to feature scaling, and the result depends on the choice of K.
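A minimal sketch of KNN imputation on the small example table above, assuming scikit-learn's KNNImputer and the inferred column names Price and Rooms:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Example table from above (NaN marks the missing entries)
df = pd.DataFrame({
    'Price': [250000, 275000, np.nan, 300000, 320000],
    'Rooms': [3, 3, 2, np.nan, 4]
})

# Each missing value is estimated from its 2 nearest rows (K=2 because the table is tiny)
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)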
https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/
Scaling is the process of transforming numerical features so that they have a consistent range or
distribution. It ensures that no feature dominates others due to differences in magnitude.
For example, in a dataset with house prices (in millions) and number of bedrooms, the large
numerical difference can cause biased model predictions if not scaled properly.
🔹 Importance of Scaling
Many models (e.g., Linear Regression, KNN, SVM, Neural Networks) perform better when
numerical values are in the same range.
In deep learning and logistic regression, scaling helps gradient-based optimization converge faster by keeping the weight updates balanced across features.
Algorithms like KNN, K-Means Clustering, PCA use Euclidean distance, which is sensitive to
scale.
1️⃣ Min-Max Scaling (Normalization)
Formula: X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}
Best for: Data with known min/max values (e.g., image pixel values, sensor data).
2️⃣ Standardization (Z-score Scaling)
Formula: X_{scaled} = \frac{X - \mu}{\sigma}
Best for: Normally distributed data (e.g., financial data, test scores).
3️⃣ Robust Scaling
Formula: X_{scaled} = \frac{X - \text{median}}{IQR}
Best for: Data containing outliers.
🔹 Sample Dataset (before scaling)
Price    Bedrooms   Size
100000   1          600
200000   2          800
300000   3          1000
400000   4          1200
500000   5          1400
Output Example (applying each scaler to the sample dataset above):
Min-Max Scaling → every column is rescaled to [0, 1] (e.g., Price 100000 → 0.0, 500000 → 1.0).
Standardization → every column is centered at 0 with unit variance.
Robust Scaling → every column is centered on its median and scaled by its interquartile range.
Each method has its advantages:
✅ Min-Max Scaling: Preserves the original distribution but is sensitive to outliers.
✅ Standardization: Best for normally distributed data.
✅ Robust Scaling: Works well with outliers.
🔹 Comparison Plot (Python)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Sample dataset
data = {
    'Price': [100000, 200000, 300000, 400000, 500000],
    'Bedrooms': [1, 2, 3, 4, 5],
    'Size': [600, 800, 1000, 1200, 1400]
}
df = pd.DataFrame(data)

# Apply the three scalers
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()
df_minmax = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)
df_standard = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
df_robust = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)

# Create subplots: one horizontal boxplot per version of the data
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].boxplot(df.values, vert=False)
axes[0, 0].set_title("Original Data")
axes[0, 0].set_yticklabels(df.columns)
axes[0, 1].boxplot(df_minmax.values, vert=False)
axes[0, 1].set_title("Min-Max Scaled")
axes[0, 1].set_yticklabels(df_minmax.columns)
axes[1, 0].boxplot(df_standard.values, vert=False)
axes[1, 0].set_title("Standardized")
axes[1, 0].set_yticklabels(df_standard.columns)
axes[1, 1].boxplot(df_robust.values, vert=False)
axes[1, 1].set_title("Robust Scaled")
axes[1, 1].set_yticklabels(df_robust.columns)

# Adjust layout
plt.tight_layout()
plt.show()
Label encoding is a technique used to convert categorical values (text labels) into numerical values so
that machine learning models can process them.
How It Works
Blue → 0
Green → 1
Red → 2
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Encode each label as an integer (alphabetical order: Blue=0, Green=1, Red=2)
encoder = LabelEncoder()
df['Color_Encoded'] = encoder.fit_transform(df['Color'])
print(df)
🔹 Output:
Color Color_Encoded
0 Red 2
1 Blue 0
2 Green 1
3 Blue 0
4 Red 2
✅ Use label encoding when the categorical variable is ordinal (e.g., Low < Medium < High).
🚫 Avoid label encoding for non-ordinal data, as models may assume a numerical relationship (e.g.,
Red > Blue doesn’t make sense).
If the categories are nominal (unordered), One-Hot Encoding is usually a better option.
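As a quick illustration of that alternative, here is a minimal one-hot encoding sketch with pandas get_dummies, reusing the Color example above:
import pandas as pd

# Same illustrative colors as above
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One-hot encoding: one binary column per category, no implied ordering
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)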
https://c3.ai/introduction-what-is-machine-learning/regression-performance/
https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/
A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. It
works by splitting the data into branches based on feature values, forming a tree-like structure.
1. How It Works
1. Start with the entire dataset at the root node.
2. Choose the best feature to split the data (based on criteria like Gini Impurity, Entropy, or Mean Squared Error).
3. Repeat the splitting recursively for each branch until:
o A stopping condition is met (e.g., max depth, minimum samples per leaf).
2. Splitting Criteria
For Classification:
Gini Impurity: Measures how often a randomly chosen element would be misclassified.
Entropy: Measures the impurity (uncertainty) of a node.
For Regression:
Mean Squared Error (MSE): Measures the variance of the target values within a node.
To address issues such as overfitting, methods like pruning and Random Forests (ensemble learning) are used.
Let's say we want to predict if a person will buy a laptop based on age and income.
Age   Income   Buys Laptop
22    High     No
30    Low      Yes
40    Medium   Yes
35    High     No
50    Low      Yes
(Decision tree diagram: the root node splits into Yes / No branches; the Yes branch splits further on [Income?] into High vs. Low/Medium, while the No branch predicts Buys: Yes.)
The goal of splitting a node in a Decision Tree is to create purer child nodes. The best split is the one
that minimizes Gini Impurity or Entropy.
Gini Impurity: \text{Gini} = 1 - \sum_{i=1}^{C} p_i^2, where p_i is the proportion of samples of class i in the node.
Optimal Value: 0 (indicating a pure node, where all samples belong to one class).
If no perfect split is possible, we choose the split that results in the lowest Gini or Entropy.
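A minimal scikit-learn sketch of the laptop example above (the integer encoding of Income and the max_depth setting are illustrative assumptions):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Laptop-purchase example from above; Income encoded as Low=0, Medium=1, High=2
df = pd.DataFrame({
    'Age': [22, 30, 40, 35, 50],
    'Income': [2, 0, 1, 2, 0],
    'Buys': ['No', 'Yes', 'Yes', 'No', 'Yes']
})

# Fit a small tree using Gini impurity as the splitting criterion
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)
tree.fit(df[['Age', 'Income']], df['Buys'])

# Inspect the learned splits
print(export_text(tree, feature_names=['Age', 'Income']))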
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for classification and regression.
It works by finding the K closest data points (neighbors) to a given input and making predictions
based on them.
A large K smooths the decision boundary but may ignore local patterns.
Step 1: Choose the number of neighbors K.
Step 2: Compute the distance (e.g., Euclidean) between the input and all points in the dataset.
Step 3: Select the K closest points and make the prediction:
For classification: Use majority voting (the most common class among the K neighbors).
For regression: Use the average of the K neighbors' values.
X1   X2   Class
2    4    A
4    6    A
7    8    B
9    10   B
Too small (K = 1) → Overfitting (sensitive to noise and outliers).
Too large (K = dataset size) → Underfitting (loses the ability to capture local patterns).
✅ Advantages: Simple and intuitive, no training phase (lazy learner), works for both classification and regression.
❌ Disadvantages: Slow on large datasets (it computes the distance to every point), sensitive to feature scaling and irrelevant features, and performance depends on the choice of K.
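A minimal sketch that classifies a new point with the toy table above (the query point and K = 3 are illustrative choices):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset from above: two features and a class label
X = np.array([[2, 4], [4, 6], [7, 8], [9, 10]])
y = np.array(['A', 'A', 'B', 'B'])

# K = 3 neighbors with Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict the class of a new point by majority vote among its 3 nearest neighbors
print(knn.predict([[5, 5]]))  # -> ['A']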
Unsupervised Learning
Unsupervised Learning is a type of machine learning where the model learns patterns and structures
in the data without labeled outputs. Unlike supervised learning, which requires labeled data,
unsupervised learning finds hidden patterns or structures in the dataset.
Clustering algorithms group similar data points into clusters based on their similarity.
Common algorithms: K-Means, Hierarchical Clustering, DBSCAN.
Common algorithms:
Common methods:
o One-Class SVM: Learns the normal data distribution and flags deviations.
✅ Advantages: No labeled data required; can discover hidden patterns and structure in the data.
❌ Disadvantages: Results are harder to evaluate and interpret, and they depend heavily on the chosen algorithm and its parameters.
KMeans++ is an improved method for initializing cluster centroids in K-Means clustering. Instead of
selecting centroids randomly, it strategically selects them to be far apart, leading to faster
convergence and better clustering results.
📌 Why is KMeans++ Needed?
The standard K-Means algorithm initializes centroids randomly. If centroids start close together, the
algorithm:
KMeans++ fixes this issue by ensuring that the initial centroids are spread out, improving the
efficiency of the algorithm.
1. Pick the first centroid randomly from the data points.
2. Calculate the Euclidean distance between each remaining data point and the nearest already chosen centroid.
3. Choose the next centroid with probability proportional to that squared distance, so points far from the existing centroids are more likely to be selected.
4. Repeat until all K centroids have been chosen, then run standard K-Means.
🔢 Example Walkthrough
🔴 Issue: If initial centroids are close, K-Means may converge slowly or fail to find the best clusters.
🟢 Benefit: Initial centroids are far apart → Better separation → Faster convergence.
💡 Advantages of KMeans++: Faster and more reliable convergence, better final clusters (lower inertia), and it is the default initialization in scikit-learn's KMeans.
Would you like a Python code demo showing the difference between random initialization vs
KMeans++? 🚀
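A minimal sketch of that comparison on illustrative synthetic data (the make_blobs parameters and n_init=1 are assumptions chosen to expose the difference):
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters (illustrative values)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Random initialization vs. KMeans++ initialization (single run each)
km_random = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(X)
km_plus = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=0).fit(X)

print("Inertia (random init):   ", km_random.inertia_)
print("Inertia (k-means++ init):", km_plus.inertia_)
print("Iterations (random init):   ", km_random.n_iter_)
print("Iterations (k-means++ init):", km_plus.n_iter_)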
Classification Metrics 📊
Classification metrics evaluate the performance of classification models (e.g., Logistic Regression,
Decision Trees, SVM, Neural Networks). These metrics help determine how well a model predicts
categories (e.g., spam or not spam, disease or no disease).
🔹 1. Accuracy
Definition: The proportion of all predictions that are correct.
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Example:
95% accuracy sounds good, but if 95% of the dataset is "No Disease" and 5% is "Disease," a
dumb model predicting "No Disease" always will still get 95% accuracy!
🔹 2. Precision
Definition: Out of all predicted positives, how many were actually positive?
\text{Precision} = \frac{TP}{TP + FP}
✅ Useful when False Positives are costly (e.g., Fraud Detection, Spam Filtering).
🚫 Not useful when False Negatives matter more (e.g., Disease Detection).
🔹 3. Recall (Sensitivity)
Definition: Out of all actual positives, how many did we correctly predict?
\text{Recall} = \frac{TP}{TP + FN}
🔹 4. ROC Curve & AUC
ROC Curve: Plots the True Positive Rate (Recall) against the False Positive Rate at different thresholds.
📌 Summary Table
Metric | Best For | Formula
Accuracy | Balanced datasets | \frac{TP + TN}{TP + TN + FP + FN}
Precision | When false positives are costly | \frac{TP}{TP + FP}
Recall | When false negatives are costly | \frac{TP}{TP + FN}
A confusion matrix is a table used to evaluate the performance of a classification model. It helps
visualize how well the model distinguishes between different classes, showing the counts of true and
false predictions.
📌 Example
Suppose we are predicting whether an email is Spam (1) or Not Spam (0), and the model produces
the following predictions:
3️⃣ Recall (Sensitivity / True Positive Rate) (How many actual positives were predicted correctly?):
\text{Recall} = \frac{TP}{TP + FN} = \frac{50}{50 + 5} = 90.9\%
5️⃣ Specificity (True Negative Rate) (How many actual negatives were predicted correctly?):
\text{Specificity} = \frac{TN}{TN + FP}
📌 Summary Table
Metric | Best For | Formula
Precision | Avoiding false positives | \frac{TP}{TP + FP}
Recall | Avoiding false negatives | \frac{TP}{TP + FN}
Specificity | Correctly identifying negatives | \frac{TN}{TN + FP}
Would you like a Python code example to compute the confusion matrix? 🚀
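A minimal sketch, assuming scikit-learn and small illustrative label vectors (1 = Spam, 0 = Not Spam):
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))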
🔹 Key Components:
o Data Visualization
o Statistical Analysis
o Programming: Python, R, SQL
o Big Data Tools: Hadoop, Spark
🔹 Definition: Machine Learning is a subset of Data Science that focuses on developing algorithms
that allow computers to learn patterns from data and make predictions or decisions without being
explicitly programmed.
🔹 Types of ML:
1️⃣ Supervised Learning → Uses labeled data (e.g., spam detection, fraud detection).
2️⃣ Unsupervised Learning → Finds patterns in unlabeled data (e.g., customer segmentation).
3️⃣ Reinforcement Learning → Learning through rewards & penalties (e.g., self-driving cars).
🔹 Common Algorithms: Linear/Logistic Regression, Decision Trees, KNN, SVM, K-Means.
🔹 Common Libraries: Scikit-learn, XGBoost
Aspect | Data Science | Machine Learning
Scope | Broader (includes ML, stats, data wrangling, visualization) | Narrower (focuses on building predictive models)
Example | Analyzing customer behavior trends | Predicting future sales using past data
📌 Data Science → Prepares the data, performs analysis, and visualizes trends.
📌 Machine Learning → Uses the processed data to train models and make predictions.
Example:
A Data Scientist might analyze customer purchasing behavior, while a Machine Learning Engineer
builds a model that predicts future purchases based on this data.
Would you like examples in Python for Data Science & ML? 🚀
Real-world data is rarely perfect—it often contains missing values, outliers, inconsistencies, and
noise. Proper data handling is crucial for accurate analysis and model performance.
2️⃣ Outliers
👉 Issue: Extreme values can skew averages and distort model training.
3️⃣ Duplicates
👉 Issue: Multiple identical records inflate data size and distort insights.
🔹 Causes: Data merging, scraping errors.
🔹 Handling Methods: Remove exact duplicates, e.g., with pandas drop_duplicates(), as in the sketch below.
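A minimal sketch on an illustrative DataFrame:
import pandas as pd

# Hypothetical dataset containing one duplicated record
df = pd.DataFrame({
    'CustomerID': [1, 2, 2, 3],
    'Purchase': [250, 300, 300, 150]
})

# Drop exact duplicate rows, keeping the first occurrence
df_clean = df.drop_duplicates()
print(df_clean)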
📌 Key Takeaways: Always inspect real-world data for missing values, outliers, and duplicates before analysis or modeling.
Time Series Analysis (TSA) involves analyzing data points collected over time to identify trends,
patterns, and dependencies. It is widely used in finance, weather forecasting, stock market analysis,
demand forecasting, and healthcare.
📌 Example:
Stock prices have a trend but are also affected by market cycles.
📌 Example (Python)
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# Illustrative daily series with a trend and a weekly pattern (replace with real data)
idx = pd.date_range("2020-01-01", periods=365, freq="D")
df = pd.DataFrame({"Value": [100 + 0.3 * i + 5 * ((i % 7) - 3) for i in range(365)]}, index=idx)

# Resample to monthly means and plot the trend
df_resampled = df.resample('M').mean()
plt.figure(figsize=(10,5))
plt.plot(df_resampled["Value"], label="Monthly mean")
plt.legend()
plt.show()

# Stationarity check (Augmented Dickey-Fuller test)
result = adfuller(df["Value"])
print("ADF p-value:", result[1])

# Decompose into trend, seasonal, and residual components
decomposition = seasonal_decompose(df["Value"], model="additive", period=7)
decomposition.plot()
plt.show()

# Fit ARIMA(1,1,1) and forecast the next 10 days
model = ARIMA(df["Value"], order=(1, 1, 1))
model_fit = model.fit()
forecast = model_fit.forecast(steps=10)
plt.plot(df["Value"], label="Actual")
plt.plot(forecast, label="Forecast")
plt.legend()
plt.show()
Would you like help with implementing forecasting models for your project? 🚀
https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to handle
sequential data while solving the problem of vanishing gradients in standard RNNs.
1. Forget Gate (fₜ): Decides what information should be discarded from the cell state.
o Uses a sigmoid activation to determine whether to keep (1) or forget (0) past
information.
2. Input Gate (iₜ): Decides what new information should be added to the cell state.
o Uses sigmoid activation for selection and tanh activation for value adjustment.
3. Output Gate (oₜ): Determines the next hidden state (hₜ), which will be passed to the next
time step.
Mathematical Representation
Forget Gate: f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
Input Gate: i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
Cell State Update: C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
Output Gate: o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)
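A minimal Keras sketch of an LSTM layer on toy sequence data (the shapes, layer sizes, and random data are illustrative assumptions, not from the notes above):
import numpy as np
import tensorflow as tf

# Toy data: 100 sequences, 10 time steps, 1 feature each (illustrative)
X = np.random.rand(100, 10, 1)
y = np.random.rand(100, 1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 1)),
    tf.keras.layers.LSTM(32),   # 32 memory cells, each with forget/input/output gates
    tf.keras.layers.Dense(1)    # regression head: predict the next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)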
Applications of LSTM: Time-series forecasting, speech recognition, text generation, and machine translation.
L1 and L2 regularization are techniques used to prevent overfitting by penalizing large coefficients in
machine learning models.
1️⃣ L1 Regularization (Lasso)
Mathematical Formula: J(\theta) = \text{MSE} + \lambda \sum_{j} |\theta_j|
Key Features:
o Shrinks some coefficients exactly to zero, performing implicit feature selection.
o Use it when you expect sparse features (i.e., many features are irrelevant).
2️⃣ L2 Regularization (Ridge)
Mathematical Formula: J(\theta) = \text{MSE} + \lambda \sum_{j} \theta_j^2
Key Features:
o Shrinks coefficients toward zero but rarely makes them exactly zero.
o Use it when you need a stable model that does not remove features.
Would you like an example of using L1 and L2 in Logistic Regression or Linear Regression? 🚀
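A minimal scikit-learn sketch comparing the two penalties on illustrative synthetic data (the alpha values are arbitrary):
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative regression data where only 4 of 10 features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients but keeps all features

print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))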
L1 regularization adds a penalty to the loss function to enforce sparsity in the model:
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |\theta_j|
where:
m = number of samples
n = number of features
\lambda = regularization strength
2. Example Calculation
Without L1 Regularization
\text{MSE} = 4
J(\theta) = 4
L1 regularization minimizes the cost function by reducing the values of θj\theta_j. When the
optimization algorithm (e.g., Gradient Descent) tries to reduce the penalty term:
\sum |\theta_j|
Some weights become exactly zero (unlike L2 regularization, which just makes them small)
Feature | Weight before L1 | Weight after L1
x_1 | 3 | 2.5
x_2 | 5 | 0
x_3 | 7 | 6
Here, x_2's weight has become zero, meaning it is removed from the model.
Key Takeaways: L1 regularization drives some weights to exactly zero (automatic feature selection), while L2 only shrinks them, so L1 is preferred when many features are irrelevant.
A learning curve is a plot that shows how a model's performance (accuracy, loss, etc.) changes as
the size of training data increases. It helps in diagnosing underfitting, overfitting, and model
improvements.
1. Training Curve (Blue) → Shows how well the model fits the training data.
2. Validation/Test Curve (Orange) → Shows how well the model generalizes to unseen data.
1️⃣ Underfitting (High Bias)
Symptoms: Both training and validation accuracy are low and close together.
Solution:
Increase model complexity or add features.
Reduce regularization.
2️⃣ Overfitting (High Variance)
Symptoms: High training accuracy with a large gap to the validation accuracy.
Solution: Add more training data, apply regularization, or simplify the model.
3️⃣ Good Fit
Symptoms: Training and validation curves converge at a high accuracy with a small gap.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Illustrative classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define model
model = LogisticRegression(max_iter=1000)

# Accuracy vs. training-set size (5-fold cross-validation)
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label="Training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation accuracy")
plt.title("Learning Curve")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.grid()
plt.show()
Error Type | Definition | Example
Type I Error (False Positive) | Rejecting a true null hypothesis (incorrectly detecting an effect that doesn't exist). | A medical test wrongly detects a disease in a healthy person.
Type II Error (False Negative) | Failing to reject a false null hypothesis (missing an actual effect). | A medical test fails to detect a disease in a sick person.
✅ Possible outcomes: True Positive, True Negative, False Positive (Type I error), False Negative (Type II error).
Let’s simulate a binary classification problem using sklearn and compute these errors:
type1_error = fp  # false positives (Type I errors)
type2_error = fn  # false negatives (Type II errors)
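A fuller sketch of that simulation, assuming illustrative synthetic data from make_classification and a logistic regression model:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Illustrative binary classification data
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Unpack the confusion matrix: Type I = FP, Type II = FN
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
type1_error = fp
type2_error = fn
print("Type I errors (false positives):", type1_error)
print("Type II errors (false negatives):", type2_error)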
🔹 Key Takeaways
Lowering Type I Error (false positives) makes the test more conservative, but may increase
Type II Error.
Lowering Type II Error (false negatives) makes the test more sensitive, but may increase Type
I Error.
The trade-off between Type I & Type II Errors is controlled by the significance level (α) in
hypothesis testing.
Would you like a deeper dive into how to balance these errors? 😊
The ROC Curve is a graphical representation that illustrates the performance of a classification
model at different threshold values. It is especially useful for binary classification problems.
📌 Definitions
True Positive Rate (TPR, Recall): \text{TPR} = \frac{TP}{TP + FN}
False Positive Rate (FPR): \text{FPR} = \frac{FP}{FP + TN}
A model with perfect predictions would have an ROC curve that reaches the top-left corner
(TPR=1, FPR=0).
AUC (Area Under the Curve) represents the overall performance of the classifier:
o AUC = 1 → Perfect classifier
o AUC = 0.5 → No better than random guessing
o AUC < 0.5 → Worse than random (flipping labels may improve results!)
Let's compute and plot an ROC curve for a logistic regression classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Illustrative binary classification data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# ROC curve and AUC from the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc='lower right')
plt.grid()
plt.show()
Would you like an example using a different classifier (e.g., Decision Tree, Random Forest)? 😊
For each minority class sample, it finds the k nearest neighbors using Euclidean distance.
A new sample is created along the line connecting the minority sample and one of its
nearest neighbors.
This is done by choosing a random point between the original point and its neighbor using:
x_{new} = x_i + \lambda \times (x_{neighbor} - x_i), where \lambda is a random number in [0, 1].
The process is repeated until the minority and majority classes are balanced.
📊 Example Calculation: for a minority sample x_i = (2, 3) with nearest neighbor (4, 5) and a random \lambda = 0.5, the synthetic sample is (2, 3) + 0.5 \times (2, 2) = (3, 4).
🔧 Python Implementation
import numpy as np
from imblearn.over_sampling import SMOTE

# Sample dataset (imbalanced): class 1 is the minority
# (two extra majority points are added here so SMOTE has enough neighbors to work with)
X = np.array([[2, 3], [4, 5], [5, 6], [8, 8], [9, 9], [10, 10]])  # Feature values
y = np.array([1, 1, 0, 0, 0, 0])                                  # Class labels

# Applying SMOTE (k_neighbors=1 because the minority class is tiny)
smote = SMOTE(k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(X_resampled, y_resampled)
🔹 Advantages of SMOTE