6.outlier Code - Jupyter Notebook

The notebook demonstrates several ways to handle outliers in a dataset. It loads MBA data from a CSV file, computes lower and upper boundaries from the interquartile range (IQR), and uses them to identify outliers in the gmat column. It then removes the rows whose value falls below the lower or above the upper boundary, or alternatively replaces the outlying values with the boundary value or with the mean.

Uploaded by

venkatesh m
Copyright
© All Rights Reserved

In [1]: import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [2]: mba = pd.read_csv("D:\\Course\\Python\\Datasets\\mba.csv")

In [3]: mba.boxplot()

...

In [4]: mba.boxplot(column='gmat')

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x204a6a40f88>

In [5]: # Another way to draw the boxplot, using seaborn (already imported above)
sns.boxplot(x=mba['gmat'])

...

In [9]: # Process to Remove Outliers

In [6]: mba.describe()

...

In [6]: # q1 = 690
# q3 = 730
# iqr = q3 - q1
# low = q1 - 1.5 * iqr
# high = q3 + 1.5 * iqr

In [7]: q1 = mba['gmat'].quantile(0.25)

In [8]: q1

Out[8]: 690.0

In [9]: q3 = mba['gmat'].quantile(0.75)
#q3 = 730

In [10]: q3

Out[10]: 730.0

In [13]: iqr = q3-q1
iqr

Out[13]: 40.0

In [14]: # Lower Boundary
low = q1-1.5*iqr
low

Out[14]: 630.0

In [15]: # Upper Boundary
high = q3+1.5*iqr
high

Out[15]: 790.0
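
The quartile and boundary cells above can be collected into one small helper. This is only a sketch: the function name `iqr_bounds`, the parameter `k`, and the five sample scores are illustrative and do not come from the notebook. The scores were chosen so that q1 = 690 and q3 = 730, matching the values shown above.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (low, high), i.e. q1 - k*IQR and q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical scores (not the mba.csv data), picked so q1 = 690 and q3 = 730.
scores = pd.Series([690, 690, 710, 730, 730], name="gmat")
low, high = iqr_bounds(scores)
print(low, high)
```

With the real data the call would presumably be `iqr_bounds(mba['gmat'])`; keeping `k` as a parameter makes it easy to try a wider or narrower fence than 1.5.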

In [16]: mba.shape

Out[16]: (773, 3)

In [17]: # Keep only the rows whose gmat lies strictly between the lower and upper boundaries
mba1 = mba.loc[(mba['gmat'] > low) & (mba['gmat'] < high)]

In [15]: mba.shape

Out[15]: (773, 3)

In [14]: mba1.shape # 15 rows are removed

Out[14]: (758, 3)

In [20]: sns.boxplot(x=mba1['gmat'])

...
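
The strict `>` and `<` comparisons in Method 1 also discard rows whose value equals a boundary. That is probably why this method drops 15 rows while the index-based method further down lists only 12. The sketch below contrasts the two choices with `Series.between`. The five-row frame is made up for illustration, and the string values for `inclusive` assume pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical data: 630 and 790 sit exactly on the boundaries.
df = pd.DataFrame({"gmat": [600, 630, 700, 790, 800]})
low, high = 630.0, 790.0

# Strict comparison, as in Method 1: boundary values are dropped as well.
strict = df.loc[df["gmat"].between(low, high, inclusive="neither")]

# Inclusive comparison: only values outside [low, high] are dropped.
inclusive = df.loc[df["gmat"].between(low, high, inclusive="both")]

print(len(strict), len(inclusive))
```

Which variant is appropriate depends on whether a value lying exactly on the boundary should count as an outlier.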

In [18]: # Method 2: drop the outliers by their index labels

In [23]: out = mba[(mba['gmat'] < low) | (mba['gmat'] > high)].index

In [24]: out

Out[24]: Int64Index([189, 337, 392, 403, 478, 491, 653, 766, 768, 770, 771, 772], dtype='int64')

In [25]: mba2 = mba.drop(out)

In [26]: mba2

...

In [27]: mba.drop(out,inplace=True)

In [28]: mba

...
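
A minimal sketch of the index-based approach on made-up data. It shows that `drop` without `inplace=True` returns a new frame and leaves the original untouched. The frame `df` and the names `dropped` and `renumbered` are illustrative, and the `reset_index` step is an optional extra that the notebook does not perform.

```python
import pandas as pd

# Hypothetical data, not the mba.csv file.
df = pd.DataFrame({"gmat": [600, 630, 700, 790, 800]})
low, high = 630.0, 790.0

# Index labels of the rows outside the boundaries.
out = df[(df["gmat"] < low) | (df["gmat"] > high)].index

# drop() returns a new frame; df itself keeps all five rows.
dropped = df.drop(out)

# Optional: renumber the remaining rows 0..n-1 after dropping.
renumbered = dropped.reset_index(drop=True)

print(list(dropped.index), list(renumbered.index))
```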

Method 3
Replace the Outliers

A lower outlier is replaced with the lower boundary.

An upper outlier is replaced with the upper boundary.

In [37]: mba = pd.read_csv("D:\\Course\\Python\\Datasets\\mba.csv")
mba

...

In [30]: mba[(mba['gmat'] < low)]

...

In [33]: low = q1-1.5*iqr
low

Out[33]: 630.0

In [34]: mba[(mba["gmat"]<low)]

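...

In [41]: # Select only the gmat values below the lower boundary (not every column)
out1 = mba.loc[mba["gmat"] < low, "gmat"].values
out1

...

In [42]: # Assign the result back rather than replacing in place on a column selection
mba['gmat'] = mba['gmat'].replace(out1, low)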

In [43]: mba

...

In [49]: sns.boxplot(x=mba['gmat'])

...
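
The cells above replace only the lower outliers. One way to cap both sides in a single step is `Series.clip`, which is a different technique from the `replace` call used in the notebook. The sketch below uses made-up values, and the column name `gmat_capped` is illustrative.

```python
import pandas as pd

# Hypothetical data, not the mba.csv file.
df = pd.DataFrame({"gmat": [600.0, 630.0, 700.0, 790.0, 800.0]})
low, high = 630.0, 790.0

# Values below low become low, values above high become high;
# everything in between is left unchanged.
df["gmat_capped"] = df["gmat"].clip(lower=low, upper=high)

print(df)
```

Writing the result to a new column keeps the original scores available for comparison.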

To replace with mean or median


In [46]: mean = mba['gmat'].mean()

In [47]: mean

Out[47]: 711.4230271668823

In [50]: mba = pd.read_csv("D:\\Course\\Python\\Datasets\\mba.csv")
mba

...

In [55]: mba['gmat'] = mba['gmat'].replace(out1, mean)

In [56]: mba

Out[56]:
     Datasrno  workex        gmat
0           1      21  720.000000
1           2     107  640.000000
2           3      57  740.000000
3           4      99  690.000000
4           5     208  710.000000
...       ...     ...         ...
768       769      88  711.423027
769       770     132  670.000000
770       771      28  711.423027
771       772      10  711.423027
772       773      52  711.423027

773 rows × 3 columns
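
The heading mentions the median, but only the mean is shown. The sketch below is one possible way to fill outliers on both sides with the median, using a boolean mask and `Series.mask` instead of `replace`. The data are made up, and computing the statistic from the non-outlying values only is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd

# Hypothetical data, not the mba.csv file.
df = pd.DataFrame({"gmat": [600.0, 680.0, 700.0, 720.0, 800.0]})
low, high = 630.0, 790.0

# True for rows outside the boundaries on either side.
is_outlier = (df["gmat"] < low) | (df["gmat"] > high)

# Assumption: take the median of the non-outlying values only.
# Swap .median() for .mean() to fill with the mean instead.
center = df.loc[~is_outlier, "gmat"].median()

# mask() replaces the values where the condition is True.
df["gmat_filled"] = df["gmat"].mask(is_outlier, center)

print(df)
```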
