DSBDA2

This document cleans and preprocesses a dataset with missing values. It handles missing data through imputation, identifies outliers, transforms variables, and calculates statistics like skew.

Uploaded by

403 Chaudhari Sanika Sagar

import pandas as pd

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from scipy.stats import zscore, skew, shapiro, probplot
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("Test_Data.csv")

data

           age     sex        bmi  health_gradient smoker     region  children
0    40.000000    male  29.900000      35760.40000     no  southwest       2.0
1    47.000000    male  32.300000      49034.63000     no  southwest       1.0
2    54.000000  female  28.880000      45038.93760     no  northeast       2.0
3          NaN    male  30.568094          0.00000     no  northeast       3.0
4    59.130049    male  33.132854      64912.13924    yes  northeast       4.0
..         ...     ...        ...              ...    ...        ...       ...
487  51.000000    male  27.740000      39244.88760     no  northeast       5.0
488  33.000000    male  42.400000      59326.08000     no  southwest       5.0
489  47.769999    male  29.064615      40353.79402     no  northeast       5.0
490  41.530738  female  24.260852      24444.53324     no  southeast       5.0
491  36.000000    male  33.400000      40160.16000    yes  southwest       5.0

[492 rows x 7 columns]

data.isnull().sum()

age 1
sex 0
bmi 1
health_gradient 0
smoker 0
region 0
children 1
dtype: int64

for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head())

Column: age
0 40.000000
1 47.000000
2 54.000000
3 NaN
4 59.130049
Name: age, dtype: float64

Column: sex
0 male
1 male
2 female
3 male
4 male
Name: sex, dtype: object

Column: bmi
0 29.900000
1 32.300000
2 28.880000
3 30.568094
4 33.132854
Name: bmi, dtype: float64

Column: health_gradient
0 35760.40000
1 49034.63000
2 45038.93760
3 0.00000
4 64912.13924
Name: health_gradient, dtype: float64

Column: smoker
0 no
1 no
2 no
3 no
4 yes
Name: smoker, dtype: object

Column: region
0 southwest
1 southwest
2 northeast
3 northeast
4 northeast
Name: region, dtype: object

Column: children
0 2.0
1 1.0
2 2.0
3 3.0
4 4.0
Name: children, dtype: float64

# Handle missing categorical (string) values with the mode.
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
# fit_transform computes the most frequent value of each categorical column
# in data_categorical and then replaces missing values with that value.

# Handle missing numeric values with the mean.
handle_missing_values_numeric_mean = SimpleImputer(strategy='mean')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
# fit_transform computes the mean of each numeric column in data_numeric
# and then replaces missing values with that mean.

# Handle missing numeric values with the median.
handle_missing_values_numeric_median = SimpleImputer(strategy='median')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
# fit_transform computes the median of each numeric column in data_numeric
# and then replaces missing values with that median. Because the mean imputer
# above has already filled every numeric NaN, this step leaves the data unchanged.

print(data)

           age     sex        bmi  health_gradient smoker     region  children
0    40.000000    male  29.900000      35760.40000     no  southwest       2.0
1    47.000000    male  32.300000      49034.63000     no  southwest       1.0
2    54.000000  female  28.880000      45038.93760     no  northeast       2.0
3    38.844276    male  30.568094          0.00000     no  northeast       3.0
4    59.130049    male  33.132854      64912.13924    yes  northeast       4.0
..         ...     ...        ...              ...    ...        ...       ...
487  51.000000    male  27.740000      39244.88760     no  northeast       5.0
488  33.000000    male  42.400000      59326.08000     no  southwest       5.0
489  47.769999    male  29.064615      40353.79402     no  northeast       5.0
490  41.530738  female  24.260852      24444.53324     no  southeast       5.0
491  36.000000    male  33.400000      40160.16000    yes  southwest       5.0

[492 rows x 7 columns]
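
The imputation order above can be checked on a small frame. The following is a minimal pandas-only sketch, not the notebook's SimpleImputer code: the toy values and the names toy, mean_filled and median_after_mean are hypothetical. It fills categorical gaps with the mode and numeric gaps with the mean, then confirms that a median fill applied afterwards has nothing left to replace.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from Test_Data.csv.
toy = pd.DataFrame({
    "age": [40.0, 47.0, np.nan, 54.0],
    "sex": pd.Series(["male", None, "male", "female"], dtype=object),
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Mode for categorical columns, mean for numeric columns.
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])
mean_filled = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# A median fill applied afterwards finds no NaN left to replace.
median_after_mean = mean_filled.fillna(mean_filled.median())
unchanged = mean_filled.equals(median_after_mean)

toy[numeric_cols] = mean_filled
print(toy)
print("median step changed nothing:", unchanged)
```

If median imputation is the intended result for a column, it would need to run on the data before the mean fill, or on a separate copy.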

# Calculate z-scores for the numeric columns.
z_scores = zscore(data.select_dtypes(include='number'), axis=0)

# Identify outliers: values more than 3 standard deviations from the column mean.
outliers = (z_scores > 3) | (z_scores < -3)

# Mask outliers in the DataFrame (outlier cells become NaN).
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)

for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())

Column: age
0 40.000000
1 47.000000
2 54.000000
3 38.844276
4 59.130049
Name: age, dtype: float64

Column: bmi
0 29.900000
1 32.300000
2 28.880000
3 30.568094
4 33.132854
Name: bmi, dtype: float64

Column: health_gradient
0 35760.40000
1 49034.63000
2 45038.93760
3 0.00000
4 64912.13924
Name: health_gradient, dtype: float64

Column: children
0 NaN
1 NaN
2 NaN
3 NaN
4 4.0
Name: children, dtype: float64
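
The z-score masking step can be illustrated on a single hypothetical column. This is a sketch under stated assumptions: the values are invented, and the z-score is computed by hand with the population standard deviation (ddof=0), which is the default that scipy.stats.zscore uses. When a column is dominated by one value, a legitimate but rare value can land beyond 3 standard deviations and be masked.

```python
import pandas as pd

# Hypothetical column: nineteen 5.0 values and one 1.0.
toy = pd.DataFrame({"children": [5.0] * 19 + [1.0]})

# Population z-score (ddof=0), matching scipy.stats.zscore's default.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Flag values more than 3 standard deviations from the mean.
outliers = z.abs() > 3
masked = toy.mask(outliers)  # flagged cells become NaN

print(z["children"].iloc[-1])           # z-score of the 1.0 entry
print(masked["children"].isna().sum())  # number of masked cells
```

Whether such values are true outliers is a judgment call; for a count-like column it may be worth inspecting the masked rows before discarding them.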

skew_before = data_no_outliers['age'].skew()
print(f"\nSkewness before transformation: {skew_before}")

Skewness before transformation: 0.0453252970458881

data

           age     sex        bmi  health_gradient smoker     region  children
0    40.000000    male  29.900000      35760.40000     no  southwest       2.0
1    47.000000    male  32.300000      49034.63000     no  southwest       1.0
2    54.000000  female  28.880000      45038.93760     no  northeast       2.0
3    38.844276    male  30.568094          0.00000     no  northeast       3.0
4    59.130049    male  33.132854      64912.13924    yes  northeast       4.0
..         ...     ...        ...              ...    ...        ...       ...
487  51.000000    male  27.740000      39244.88760     no  northeast       5.0
488  33.000000    male  42.400000      59326.08000     no  southwest       5.0
489  47.769999    male  29.064615      40353.79402     no  northeast       5.0
490  41.530738  female  24.260852      24444.53324     no  southeast       5.0
491  36.000000    male  33.400000      40160.16000    yes  southwest       5.0

[492 rows x 7 columns]

data['health_gradient'] = np.sqrt(data['health_gradient'])

data

           age     sex        bmi  health_gradient smoker     region  children
0    40.000000    male  29.900000       189.104204     no  southwest       2.0
1    47.000000    male  32.300000       221.437644     no  southwest       1.0
2    54.000000  female  28.880000       212.223791     no  northeast       2.0
3    38.844276    male  30.568094         0.000000     no  northeast       3.0
4    59.130049    male  33.132854       254.778608    yes  northeast       4.0
..         ...     ...        ...              ...    ...        ...       ...
487  51.000000    male  27.740000       198.103225     no  northeast       5.0
488  33.000000    male  42.400000       243.569456     no  southwest       5.0
489  47.769999    male  29.064615       200.882538     no  northeast       5.0
490  41.530738  female  24.260852       156.347476     no  southeast       5.0
491  36.000000    male  33.400000       200.400000    yes  southwest       5.0

[492 rows x 7 columns]
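
One way to judge a square-root transformation is to compare skewness before and after on the same column. The sketch below assumes a small hypothetical right-skewed sample (not the health_gradient values) and uses pandas' Series.skew. Note that np.sqrt is only defined for non-negative inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; not the health_gradient values.
values = pd.Series([1.0, 4.0, 9.0, 16.0, 100.0])

# Skewness of the raw values and of their square roots.
skew_before = values.skew()
skew_after = np.sqrt(values).skew()

print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
```

For this sample the square root pulls the long right tail in, so the skewness drops while staying positive. Applying the same before-and-after comparison to the transformed column itself would show whether the transformation helped on the real data.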
