Learn Data Analysis With Python
Introduction
Anyone applying for a data analyst or data scientist role needs to know at least one of
the programming languages used in such roles, for example R, Python, or Scala. To fulfill
Data preparation is the most important step in winning the battle of a data analysis project.
This document will have information about how data cleaning (missing values, outliers,
Raw data is full of impurities such as outliers, missing values, and duplicates. To clean this data
Missing Data
Missing data is one of the most common issues, and there are many methods to handle it. Let us find
This is a CSV file with two missing values in the Grade column. Let us now find out practical
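import pandas as pd
# Load the grade records and preview the first 10 rows
df = pd.read_csv("graderecord.csv")
df.head(10)
# Drop every row that contains at least one missing value
df_no_missing = df.dropna()
df_no_missing
# Alternatively, replace every missing value with 0 (returns a new DataFrame)
df.fillna(0)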
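Code: Replace missing values with the mean, median, or mode of the column
# Use only one of the three fills below; each assigns the result back to the column
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())
df["Grade"] = df["Grade"].fillna(df["Grade"].median())
# mode() returns a Series, so take its first value
df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])
# Keep only the rows where Grade is not missing
df[df["Grade"].notnull()]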
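Since graderecord.csv is not reproduced here, the following minimal sketch uses a small hypothetical DataFrame (the names, grades, and variable names are invented for illustration) to compare the three fill strategies side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for graderecord.csv: two grades are missing
records = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "Grade": [80.0, np.nan, 90.0, 80.0, np.nan],
})

grade = records["Grade"]

# Each strategy returns a new Series; the original column is left untouched
filled_mean = grade.fillna(grade.mean())
filled_median = grade.fillna(grade.median())
filled_mode = grade.fillna(grade.mode()[0])  # mode() returns a Series

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

The mean is pulled toward extreme values, so the median or mode can be a safer fill when the column is skewed.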
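Code: If a column contains only empty values, drop the completely empty column
import numpy as np
# Add a column that contains only missing values
df['newcol'] = np.nan
df.head()
# Drop columns in which every value is missing
df.dropna(axis=1, how="all")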
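The effect of how="all" is easier to see next to the default how="any". This is a small sketch on an invented frame (the `frame`, `trimmed`, and `strict` names are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Grade has one gap, newcol is entirely empty
frame = pd.DataFrame({
    "Grade": [80.0, np.nan, 90.0],
    "newcol": [np.nan, np.nan, np.nan],
})

# how="all" drops a column only if every value in it is missing
trimmed = frame.dropna(axis=1, how="all")
print(trimmed.columns.tolist())  # ['Grade']

# how="any" (the default) also drops Grade because it has one gap
strict = frame.dropna(axis=1, how="any")
print(strict.columns.tolist())  # []
```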
Outlier Treatment
For a 95% confidence interval (an alpha of 5%), the two-sided z-score is 1.96. In other words, when the data is
approximately normally distributed, about 95% of it lies within 1.96 standard deviations of the mean. So we can
import pandas as pd
# Load the grade data
df = pd.read_csv("gradedata.csv")
df
# Mean and standard deviation of the grade column
meangrade = df['grade'].mean()
stdgrade = df['grade'].std()
df
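One way to apply the 1.96 standard deviation rule is sketched below. It assumes roughly normally distributed grades and uses an invented grade list in place of gradedata.csv; the `lower`, `upper`, `inliers`, and `outliers` names are illustrative, not taken from the original code.

```python
import pandas as pd

# Hypothetical grades standing in for gradedata.csv; 5 is far below the rest
grades = pd.DataFrame({"grade": [70, 72, 75, 78, 80, 82, 85, 88, 90, 5]})

meangrade = grades["grade"].mean()
stdgrade = grades["grade"].std()

# Bounds at 1.96 standard deviations on either side of the mean
lower = meangrade - 1.96 * stdgrade
upper = meangrade + 1.96 * stdgrade

# Split the rows into those inside and outside the bounds
inside = grades["grade"].between(lower, upper)
inliers = grades[inside]
outliers = grades[~inside]

print(outliers)
```

Note that the outlier itself inflates the standard deviation, so on small samples this rule can miss values that a more robust method would flag.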
Method 2: In this method, we use the interquartile range (IQR). IQR is the difference between
the 75% quantile (Q3) and the 25% quantile (Q1), that is, IQR = Q3 - Q1. Any value lower than Q1 - 1.5*IQR or
# First and third quartiles of the grade column
q1 = df['grade'].quantile(.25)
q3 = df['grade'].quantile(.75)
iqr = q3 - q1
df
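The IQR rule can be sketched as follows. The upper fence of Q3 + 1.5*IQR is the conventional counterpart to the lower fence and is assumed here; the grade list and the `lower_fence`, `upper_fence`, `cleaned`, and `flagged` names are invented for illustration.

```python
import pandas as pd

# Hypothetical grades standing in for gradedata.csv
grades = pd.DataFrame({"grade": [70, 72, 75, 78, 80, 82, 85, 88, 90, 5]})

q1 = grades["grade"].quantile(0.25)
q3 = grades["grade"].quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles (assumed upper fence)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep the rows inside the fences and flag the rest as outliers
within = grades["grade"].between(lower_fence, upper_fence)
cleaned = grades[within]
flagged = grades[~within]

print(flagged)
```

Because quartiles are barely affected by a single extreme value, this method tends to be more robust than the standard deviation rule.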
Finding Duplicates
Using Python libraries, we can identify duplicate rows as well as unique rows of a data set.
import pandas as pd
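Emp = ['Jane','Johny','Boby','Jane','Mary','Jony','Melica','Melica']
Salary = [9500,7800,7600,9500,7700,7800,9900,10000]
SalaryList = zip(Emp,Salary)
# Build the DataFrame (the 'Salary' column name is an assumption; 'Emp' is used below)
df = pd.DataFrame(list(SalaryList), columns=['Emp','Salary'])
# Flag rows that repeat an earlier row
df.duplicated()
# Drop fully duplicated rows
df.drop_duplicates()
# Drop rows with a repeated Emp value, keeping the last occurrence
df.drop_duplicates(['Emp'], keep="last")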
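To make the difference between these calls concrete, this sketch rebuilds the same employee data and checks what each call keeps. The `Salary` column name and the `emp_df`, `dup_mask`, `by_row`, and `by_emp` names are assumptions made for illustration.

```python
import pandas as pd

emp = ["Jane", "Johny", "Boby", "Jane", "Mary", "Jony", "Melica", "Melica"]
salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
emp_df = pd.DataFrame({"Emp": emp, "Salary": salary})

# True only where the whole row repeats an earlier row (the second Jane, 9500)
dup_mask = emp_df.duplicated()

# Removes the repeated Jane row; both Melica rows stay because the salaries differ
by_row = emp_df.drop_duplicates()

# Compares the Emp column only and keeps the last row for each name
by_emp = emp_df.drop_duplicates(subset=["Emp"], keep="last")

print(dup_mask.sum(), len(by_row), len(by_emp))  # 1 7 6
```

Restricting the comparison to a subset of columns is a judgment call: here it discards Melica's 9900 record, which may or may not be a true duplicate.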