MachineLearning

Cheatsheets: https://www.emmading.com/free-data-science-interview-resources?cid=091a9f1e-48c3-4e52-9936-4a568ab0cc30

EDA also covers finding relationships between variables; see the detailed EDA section below.


The Data Science Lifecycle consists of several structured steps, starting from obtaining raw data to
deploying a model and monitoring its performance. Below is a detailed breakdown of the entire
lifecycle of a Data Science project:

1. Problem Definition & Business Understanding

Before diving into the data, it's crucial to define the problem statement and understand business
objectives.

Key Steps:

 Identify the business problem (e.g., predicting customer churn, fraud detection).

 Define KPIs (Key Performance Indicators) to measure success.

 Engage with stakeholders (business teams, clients, etc.) to understand requirements.

 Define constraints (e.g., computing resources, accuracy expectations).

📌 Example: A bank wants to predict whether a customer will default on a loan. The success metric
could be improving loan approval accuracy while minimizing false positives.

2. Data Collection (Raw Data Acquisition)

Gather raw data from various sources. The quality of data affects model performance.

Data Sources:

 Databases: SQL, NoSQL (MySQL, PostgreSQL, MongoDB).

 APIs & Web Scraping: Using REST APIs, BeautifulSoup, Scrapy.

 Files: CSV, Excel, JSON, XML.

 Logs & Sensor Data: System logs, IoT data.

 Third-Party Data: Public datasets (Kaggle, UCI Machine Learning, Open Data).

📌 Example: For a loan default prediction model, data might be collected from bank transaction
records, credit scores, customer demographics, and employment history.
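A minimal Python sketch of pulling raw data from a flat file and a SQL database with pandas; the file name, table, and connection string below are hypothetical placeholders.

import pandas as pd
from sqlalchemy import create_engine

transactions = pd.read_csv("bank_transactions.csv")                 # flat-file source (hypothetical)
engine = create_engine("postgresql://user:password@host/bankdb")    # hypothetical connection string
customers = pd.read_sql("SELECT * FROM customer_profiles", engine)  # hypothetical database table

raw = transactions.merge(customers, on="customer_id", how="left")   # combine sources on a shared key
print(raw.shape)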

3. Data Exploration & Cleaning (Data Preprocessing)

Raw data is often messy! Cleaning and structuring the data is essential.

Key Steps:

✅ Handling Missing Data: Impute (mean/median/mode) or drop missing values.


✅ Handling Outliers: Use box plots, Z-scores, or IQR methods to detect and remove them.
✅ Data Type Conversion: Convert categorical to numerical (one-hot encoding, label encoding).
✅ Dealing with Duplicates: Remove redundant entries.
✅ Data Transformation: Scaling (MinMax, StandardScaler), normalization, log transformation.
📌 Example: If a customer’s income is missing, we might fill it using the median salary for similar
profiles.
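A short pandas/scikit-learn sketch of these cleaning steps, assuming a DataFrame df with hypothetical columns such as income, loan_amount, and employment_type.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loans.csv")                                # hypothetical dataset

# Missing data: fill income with the median
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: cap income using the IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Duplicates: remove redundant rows
df = df.drop_duplicates()

# Categorical to numerical: one-hot encode employment type
df = pd.get_dummies(df, columns=["employment_type"], drop_first=True)

# Scaling: standardize numeric features
df[["income", "loan_amount"]] = StandardScaler().fit_transform(df[["income", "loan_amount"]])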

4. Exploratory Data Analysis (EDA)

EDA helps understand patterns, relationships, and distributions within data.

Key Techniques:

📊 Univariate Analysis – Histograms, box plots, KDE plots.


📉 Bivariate Analysis – Correlation heatmaps, scatter plots.
📈 Multivariate Analysis – Pairplots, PCA for dimensionality reduction.
📌 Feature Engineering – Creating new variables (e.g., credit-to-income ratio).

📌 Example: We might discover that customers with higher credit scores rarely default—a key insight
for model training.
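A brief sketch of these EDA techniques with seaborn and matplotlib, continuing the hypothetical loans DataFrame from the cleaning step (credit_score and defaulted are assumed column names).

import seaborn as sns
import matplotlib.pyplot as plt

# Univariate: distribution of credit scores
sns.histplot(df["credit_score"], kde=True)
plt.show()

# Bivariate: correlation heatmap of the numeric features
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Relationship check: average credit score by default status
print(df.groupby("defaulted")["credit_score"].mean())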

5. Feature Selection & Engineering

Choosing the right features improves model performance.

Feature Selection Techniques:

 Statistical Methods: Correlation, Chi-square test, ANOVA.

 Dimensionality Reduction: PCA, t-SNE, LDA.

 Domain Knowledge: Using business insights to engineer new features.

📌 Example: Instead of using raw salary and loan amount, we create a new feature: Debt-to-Income
Ratio.
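A small sketch of both ideas: building the debt-to-income ratio and dropping one feature from every highly correlated pair. The column names and the 0.9 threshold are illustrative assumptions, computed on raw (unscaled) columns.

import numpy as np

# New domain-driven feature
df["debt_to_income"] = df["loan_amount"] / df["income"]

# Correlation-based filter: drop one feature from each highly correlated pair
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))   # keep the upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)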

6. Model Selection & Training

Choose the right model based on the problem type.

Types of Models:

 Supervised Learning:

o Classification: Logistic Regression, Random Forest, XGBoost, Neural Networks.

o Regression: Linear Regression, Decision Trees, Gradient Boosting.

 Unsupervised Learning:

o Clustering: K-Means, DBSCAN.

o Anomaly Detection: Isolation Forest, Autoencoders.

 Deep Learning:

o CNNs for image classification, LSTMs for time series.


📌 Example: For loan default prediction (binary classification), we might start with Logistic Regression
and later try Random Forest for better performance.
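A minimal scikit-learn sketch of this baseline-then-improve approach, assuming the cleaned DataFrame df and a hypothetical binary target column defaulted.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["defaulted"])      # features
y = df["defaulted"]                     # hypothetical binary target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)                        # simple baseline
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)  # stronger model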

7. Model Evaluation

Evaluate the model’s performance using appropriate metrics.

Common Metrics:

 Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.

 Regression: RMSE (Root Mean Squared Error), R² (Coefficient of Determination).

 Clustering: Silhouette Score, Davies–Bouldin Index.

📌 Example: A high recall is crucial in fraud detection since missing fraud cases is costly.
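A short sketch of computing these classification metrics with scikit-learn, reusing the forest model and test split from the training step.

from sklearn.metrics import classification_report, roc_auc_score

y_pred = forest.predict(X_test)
print(classification_report(y_test, y_pred))                                   # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1]))   # ranking quality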

8. Hyperparameter Tuning & Optimization

Optimize the model by tuning hyperparameters.

Techniques:

 Grid Search: Exhaustive search over parameter combinations.

 Random Search: Randomly selects parameter values for faster tuning.

 Bayesian Optimization: Uses probability-based methods for efficient tuning.

 AutoML: Automated hyperparameter tuning using libraries like TPOT, AutoKeras.

📌 Example: Optimizing the number of trees in a Random Forest model to improve accuracy.
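A minimal grid-search sketch for the Random Forest example; the parameter grid is an illustrative assumption.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [100, 200, 400], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # best hyperparameters and cross-validated score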

9. Model Deployment

Deploy the trained model into a real-world environment.

Deployment Methods:

🚀 Web Services: Flask, FastAPI, Django REST API.


☁️ Cloud Deployment: AWS SageMaker, Google AI Platform, Azure ML.
📱 Edge Deployment: Deploying on mobile devices or IoT devices.

📌 Example: A fraud detection model might be integrated into a banking system to flag suspicious
transactions in real time.
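A minimal FastAPI sketch of serving such a model as a web service; the model file name and input fields are hypothetical.

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("loan_default_model.joblib")   # hypothetical trained model file

class Applicant(BaseModel):
    credit_score: float
    debt_to_income: float

@app.post("/predict")
def predict(applicant: Applicant):
    # Score one applicant and return the probability of default
    proba = model.predict_proba([[applicant.credit_score, applicant.debt_to_income]])[0, 1]
    return {"default_probability": float(proba)}

Started with, for example, uvicorn main:app, the /predict endpoint could then be called by the banking system for each incoming application.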

10. Monitoring & Maintenance

Once deployed, the model must be monitored for performance degradation.

Key Tasks:
📉 Drift Detection: Monitor if input data distribution changes over time.
📊 Performance Tracking: Log model accuracy, precision, recall, etc.
🔄 Retraining: Periodically retrain the model with new data.

📌 Example: If a credit scoring model starts underperforming, we update it with the latest customer
data.
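One simple way to sketch a drift check is a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature with recent production data; here live_data is a hypothetical DataFrame of new observations.

from scipy.stats import ks_2samp

stat, p_value = ks_2samp(X_train["credit_score"], live_data["credit_score"])
if p_value < 0.05:
    print("Possible drift in credit_score distribution - consider retraining")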

11. Model Explainability & Ethical Considerations

Ensure transparency and fairness in AI models.

Techniques:

 SHAP & LIME: Explain how the model makes decisions.

 Bias Detection: Check if the model unfairly favors certain groups.

 Regulatory Compliance: GDPR, AI Ethics Guidelines.

📌 Example: If a loan approval model discriminates based on gender, we need to adjust feature
weights.
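A sketch of explaining the Random Forest with SHAP, assuming the shap package is installed and reusing forest and X_test from the training step; the exact return shape of shap_values can vary between shap versions.

import shap

explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X_test)
# For binary classifiers some shap versions return one array per class;
# plot the values for the positive class.
values = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(values, X_test)   # which features push predictions up or down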

12. Continuous Improvement

The model lifecycle never stops!

 Gather feedback from business users.

 Improve the feature set with new data.

 Optimize performance with better models or tuning.

📌 Example: A recommendation system, like Netflix's movie recommendations, continuously updates based on user interactions.

Exploratory Data Analysis (EDA) - Detailed

EDA is about understanding your data before modeling. It helps uncover patterns, spot anomalies,
and guide feature selection.

Practical Steps:

✅ Understand Data Types

 Identify numerical (e.g., age, salary) vs. categorical (e.g., gender, country) columns.

 Check if any column is wrongly classified (e.g., dates stored as text).

✅ Check for Missing Values

 Identify missing values in each column.


 Decide whether to remove rows/columns or fill missing data (e.g., mean/median for
numbers, mode for categories).

Scenario → Best Strategy

 <5% missing values → Mode, mean, or median imputation
 >50% missing in a column → Drop the column
 Categorical missing values → Fill with mode or "Unknown"
 Time-series data → Forward fill, backward fill, or interpolation
 High correlation with other features → Predictive imputation (ML models)
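A one-line-per-strategy sketch of the table above; all column names are hypothetical.

df["income"] = df["income"].fillna(df["income"].median())         # <5% missing, numeric
df["employment_type"] = df["employment_type"].fillna("Unknown")   # categorical
df = df.drop(columns=["mostly_empty_column"])                     # >50% missing in a column
df["sensor_reading"] = df["sensor_reading"].interpolate()         # time-series gaps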

✅ Look for Duplicates

 Check for repeated rows and remove them if necessary.

✅ Identify Outliers

 Use box plots or histograms to detect extreme values.

 Decide whether to remove, transform, or cap them.

✅ Check Distribution of Data

 Use histograms or KDE plots to understand how numerical data is spread.

 See if data is skewed (right or left).

✅ Analyze Relationships Between Variables

 Use scatter plots for numerical variables (e.g., income vs. spending).

 Use correlation heatmaps to see which variables are strongly related.

 Use pivot tables or group-by functions to analyze categorical variables.

✅ Segment Data for Better Insights

 Group data based on categories (e.g., analyze customer behavior by region or age group).

✅ Check for Imbalanced Data (for Classification Problems)

 See if one class dominates the others (e.g., 90% "No Fraud" vs. 10% "Fraud").

 Consider balancing techniques (oversampling, undersampling).
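A minimal oversampling sketch with the imbalanced-learn package (assumed to be installed; SMOTE is another common choice), reusing the training split from earlier.

from imblearn.over_sampling import RandomOverSampler

X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print(y_train.value_counts())   # original class balance
print(y_res.value_counts())     # balanced classes after oversampling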

Feature Selection & Engineering
Once you understand your data, the next step is choosing the right features and creating new ones
to improve model performance.

Feature Selection (Keeping Only Useful Variables)

✅ Remove Unnecessary Columns

 Drop columns that don’t contribute (e.g., ID numbers, unnecessary text fields).
✅ Check for High Correlation

 Remove redundant features (e.g., if "height" and "BMI" are strongly correlated, keep one).

✅ Use Statistical Tests

 Use techniques like ANOVA, chi-square tests, or mutual information to pick the best
features.

✅ Use Automated Methods

 Try feature selection methods like Recursive Feature Elimination (RFE) or LASSO Regression.
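A short scikit-learn sketch of both automated approaches, using an L1-penalized logistic regression as the LASSO-style selector for this classification example; the number of features to keep and the C value are assumptions.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive Feature Elimination: keep the 10 strongest predictors
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X_train, y_train)
print(X_train.columns[rfe.support_])

# L1 (LASSO-style) penalty: only features with non-zero coefficients survive
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_train, y_train)
print(X_train.columns[(l1.coef_ != 0).ravel()])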

Feature Engineering (Creating New Features)


✅ Convert Categorical Data to Numeric

 Use One-Hot Encoding (turn categories into multiple columns).

 Use Label Encoding (assign numbers to categories).

✅ Create Meaningful Features

 Combine existing columns (e.g., "loan amount" ÷ "income" to get debt-to-income ratio).

 Extract insights from dates (e.g., "year of joining" → "years of experience").

✅ Transform Features for Better Distribution

 Apply log transformation if a feature is highly skewed.

 Use scaling techniques (MinMaxScaler, StandardScaler) for better model performance.

✅ Handle Text Data (If Needed)

 Convert text into numerical data using TF-IDF or Word Embeddings.
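A minimal TF-IDF sketch with scikit-learn, assuming a hypothetical free-text column such as a loan purpose description.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500, stop_words="english")
text_features = tfidf.fit_transform(df["loan_purpose_text"])   # hypothetical text column
print(text_features.shape)                                     # (rows, text feature count)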
