Lesson 2 - Data Preprocessing
Learning Objectives
▪ Data acquisition
▪ Typecasting
Loading a .csv File in Python
Below is the code for loading data from a .csv file:
Code
df = pandas.read_csv("/home/simpy/Datasets/BostonHousing.csv")
Writing Data to a .csv File
Below is the code for writing the data frame to an existing .csv file:
Code
df.to_csv("/home/simpy/Datasets/BostonHousing.csv")
Loading an .xlsx File in Python
Below is the code for loading an .xlsx file in Python:
Code
df = pandas.read_excel("/home/simpy/Datasets/BostonHousing.xlsx")
Writing Data to an .xlsx File
Below is the code for writing the data frame to an existing .xlsx file:
Code
df.to_excel("/home/simpy/Datasets/BostonHousing.xlsx")
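As a sketch of the full round trip, the read and write calls above can be exercised end to end; the Boston Housing path is course-specific, so a throwaway temporary file and a tiny invented frame stand in for it here:

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in data frame (hypothetical values, not the real dataset).
df = pd.DataFrame({"crim": [0.006, 0.027], "chas": [0, 0]})

# Write it to a .csv file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "housing.csv")
df.to_csv(path, index=False)   # index=False avoids an extra unnamed column
df_back = pd.read_csv(path)

print(df_back.equals(df))      # the round trip preserves the data
```

Note that without index=False, to_csv also writes the row index, which read_csv then loads as an extra "Unnamed: 0" column.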
Assisted Practice
Data Exploration Duration: 5 mins.
Problem Statement: Extract data from the given SalaryGender CSV file and store the data from
each column in a separate NumPy array.
Objective: Import the dataset (.csv) from your local system into your Python notebook.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter the
username and password in the respective fields, and click Login.
Data Exploration Techniques
▪ Dimensionality Check
▪ Type of Dataset
▪ Slicing and Indexing
▪ Identifying Unique Elements
▪ Value Extraction
▪ Feature Mean
▪ Feature Median
▪ Feature Mode

Dimensionality Check
The shape attribute returns a two-item tuple (number of rows, number of columns) for a data frame. For a Series, it returns a one-item tuple.
Code
df.shape
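A minimal illustration of the shape attribute on toy data (not the Boston set):

```python
import pandas as pd

# A small frame with 3 rows and 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s = df["a"]  # a single column is a Series

print(df.shape)  # (3, 2) -> two-item tuple: rows, columns
print(s.shape)   # (3,)   -> one-item tuple for a Series
```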
Data Exploration Techniques (Contd.)
Type of Dataset
You can use type( ) in Python to return the type of an object.
Checking the type of the data frame:
Code
type(df)
Checking the type of a column ('chas') within the data frame:
Code
type(df['chas'])
Data Exploration Techniques (Contd.)
Slicing and Indexing
You can use the : operator, with the start index on its left and the end index on its right, to output the corresponding slice.
Slicing a list: list = [1,2,3,4,5]
Code
df.iloc[:,1:3]
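The same : operator works on a list and inside df.iloc; a small sketch with a toy frame (column names invented for illustration):

```python
import pandas as pd

# Slicing a list: the end index is exclusive.
lst = [1, 2, 3, 4, 5]
print(lst[1:3])  # [2, 3]

# Slicing a data frame positionally with iloc: all rows, columns 1 and 2.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
sliced = df.iloc[:, 1:3]
print(list(sliced.columns))  # ['b', 'c']
```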
Data Exploration Techniques (Contd.)
Identifying Unique Elements
Using unique( ) on the column of interest will return a NumPy array with the unique values of the column.
Extracting all unique values out of the 'crim' column:
Code
df['crim'].unique()
Data Exploration Techniques (Contd.)
Value Extraction
Using the values attribute on the column of interest will return a NumPy array with all the values of the column.
Extracting all values out of the 'crim' column:
Code
df['crim'].values
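Both unique( ) and values return NumPy arrays; a sketch on a toy column standing in for 'crim':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"crim": [0.1, 0.1, 0.2, 0.3, 0.2]})

uniq = df["crim"].unique()  # unique values only, in order of appearance
vals = df["crim"].values    # every value, duplicates included

print(uniq)       # [0.1 0.2 0.3]
print(len(vals))  # 5
```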
Data Exploration Techniques (Contd.)
Feature Mean
Using mean( ) on the data frame will return the mean of the data frame across all the columns.
Code
df.mean()
Data Exploration Techniques (Contd.)
Feature Median
Using median( ) on the data frame will return the median values of the data frame across all the columns.
Code
df.median()
Data Exploration Techniques (Contd.)
Feature Mode
Using mode( ) on the data frame will return the mode values of the data frame across the columns (axis=0) or across the rows (axis=1).
Code
df.mode()
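The three statistics side by side on a toy column, with values chosen so each result is easy to verify by hand:

```python
import pandas as pd

df = pd.DataFrame({"score": [1, 2, 2, 3, 7]})

print(df["score"].mean())     # 3.0 -> (1+2+2+3+7)/5
print(df["score"].median())   # 2.0 -> middle of the sorted values
print(df["score"].mode()[0])  # 2   -> most frequent value
```

Note that mode( ) returns a Series (or data frame), since several values can tie for most frequent.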
Let's now consider multiple features and understand the effect of one over the other with respect to correlation (using seaborn):
Code
sns.heatmap(df.corr())
plt.yticks(rotation=0)
plt.xticks(rotation=90)
The heatmap makes the minimum and maximum correlations easy to spot at a glance.
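The heatmap is just a colored view of df.corr(); the underlying correlation matrix can be inspected directly with pandas alone (toy columns shown, seaborn not required):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],    # perfectly correlated with x
                   "z": [4, 3, 2, 1]})   # perfectly anti-correlated with x

corr = df.corr()
print(corr.loc["x", "y"])  # ~ 1.0 (maximum correlation)
print(corr.loc["x", "z"])  # ~ -1.0 (minimum correlation)
```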
Assisted Practice
Data Exploration Duration: 15 mins.
Problem Statement: Suppose you are a public school administrator. Some schools in your state of Tennessee
are performing below average academically. Your superintendent, under pressure from frustrated parents and
voters, approached you with the task of understanding why these schools are underperforming. To improve
school performance, you need to learn more about these schools and their students, just as a business needs to
understand its own strengths and weaknesses and its customers. The data includes various demographic, school
faculty, and income variables.
Objective: Perform exploratory data analysis which includes: determining the type of the data, correlation
analysis over the same. You need to convert the data into useful information:
▪ Read the data in pandas data frame
▪ Describe the data to find more details
▪ Find the correlation between ‘reduced_lunch’ and ‘school_rating’
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that
are generated. Click on the Launch Lab button. On the page that appears, enter the username and password in
the respective fields, and click Login.
Unassisted Practice
Data Exploration Duration: 15 mins.
Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the
production of its cars within a dataset. Based on some of the feedback given by their customers, they are
coming up with a new model. As a result, they have to explore the current dataset to derive further insights out of it.
Objective: Import the dataset, and explore the dimensionality, type, and average value of the horsepower across all the
cars. Also, identify a few of the most correlated features, which would help in the modification.
Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-
world problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Data Import
The first step is to import the data as a part of exploration.
Code
df1 = pandas.read_csv("mtcars.csv")
Data Exploration
The shape property is usually used to get the current shape of an
array/df.
df1.shape
Type of Dataset
type(df1)
Feature Mean
df1['hp'].mean()
Identifying Correlation
Code
sns.heatmap(df1.corr())
plt.yticks(rotation=0)
plt.xticks(rotation=90)
Identifying Correlation Using a Heatmap
A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors.
The process of manually converting or mapping data from one raw format into another format is called data wrangling. This
includes munging and data visualization.
Different Tasks in Data Wrangling
▪ Discovering
▪ Structuring
▪ Cleaning
▪ Enrichment
▪ Validating
Need for Data Wrangling
Following are the problems that can be avoided with wrangled data:
Inconsistent data
Consider the dataset below, imported as df1 within Python, having some missing values.
Detecting missing values:
Code
df1.isna().any()
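A quick self-contained sketch of the detection step on a toy frame with one NaN:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Which columns contain at least one missing value?
print(df1.isna().any())       # a: True, b: False

# How many missing values per column?
print(df1.isna().sum()["a"])  # 1
```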
Missing Value Treatment
Mean Imputation: Replace the missing value with the variable's mean.
Code
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_df = mean_imputer.fit_transform(df1.values)
df1 = pd.DataFrame(data=imputed_df, columns=cols)
df1
Note: sklearn.preprocessing.Imputer, shown in older material, is deprecated; recent scikit-learn versions provide sklearn.impute.SimpleImputer instead.
Missing Value Treatment (Contd.)
Median Imputation: Replace the missing value with the variable's median.
Code
from sklearn.impute import SimpleImputer
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_df = median_imputer.fit_transform(df1.values)
df1 = pd.DataFrame(data=imputed_df, columns=cols)
df1
Note: Mean imputation/median imputation is model dependent and is valid only on numerical data.
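The same mean/median imputation can also be done with pandas alone via fillna, which is handy when scikit-learn is not available; a sketch on a toy column (the 'hp' name is just illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"hp": [100.0, np.nan, 140.0]})

# Replace the NaN with the column mean (120.0) or median (also 120.0 here).
mean_filled = df1["hp"].fillna(df1["hp"].mean())
median_filled = df1["hp"].fillna(df1["hp"].median())

print(mean_filled.tolist())  # [100.0, 120.0, 140.0]
```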
Outlier Values in a Dataset
[Scatter plot of X1 (x-axis, 5.0 to 15.0) against Y3 (y-axis, 5 to 13): most points cluster together, while one point lies far from the rest, marked "OUTLIER?".]
Note: Outliers skew the data when you are trying to do any type of average.
Dealing with an Outlier
Outlier Detection
Code
sns.boxplot(x=df1['Assignment'])
Outliers: values < 60
Outlier Treatment
Code
filter=df1['Assignment'].values>60
df1_outlier_rem=df1[filter]
df1_outlier_rem
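A self-contained sketch of the boolean-filter approach, using invented 'Assignment' scores and the cutoff of 60 from above:

```python
import pandas as pd

df1 = pd.DataFrame({"Assignment": [75, 82, 40, 91, 68]})

# Keep only rows whose score clears the cutoff; 40 is treated as the outlier.
filt = df1["Assignment"].values > 60
df1_outlier_rem = df1[filt]

print(df1_outlier_rem["Assignment"].tolist())  # [75, 82, 91, 68]
```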
Assisted Practice
Data Wrangling Duration: 15 mins.
Problem Statement: Load the load_diabetes datasets internally from sklearn and check for any missing value or
outlier data in the ‘data’ column. If any irregularities found treat them accordingly.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Unassisted Practice
Data Wrangling Duration: 5 mins.
Problem Statement: Mtcars, the automobile company in the United States, has planned to rework on optimizing
the horsepower of their cars, as most of the customer feedback was centered around horsepower. However, while
developing an ML model with respect to horsepower, the efficiency of the model was compromised. Irregularity might
be one of the causes.
Objective: Check for missing values and outliers within the horsepower column and remove them.
Note: This practice is not graded. It is only intended for you to apply the knowledge you have gained to solve real-
world problems.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Check for Irregularities
Code
df1['hp'].isna().any()
Code
sns.boxplot(x=df1['hp'])
Outlier Treatment
Data with hp > 250 is the outlier data. Therefore, you can filter it out accordingly.
Code
filter = df1['hp']<250
df1_out_rem = df1[filter]
sns.boxplot(x=df1_out_rem['hp'])
Data Preprocessing
Topic 3: Data Manipulation
Functionalities of Data Object in Python
A data object is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and
columns.
head( )
tail( )
values( )
groupby( )
Concatenation
Merging
Functionalities of Data Object in Python (Contd.)
head( ) returns the first n rows of the data object
Code
df=pd.Series(np.arange(1,51))
print(df.head(6))
Functionalities of Data Object in Python (Contd.)
tail( ) returns the last n rows of the data object
Code
df=pd.Series(np.arange(1,51))
print(df.tail(6))
Functionalities of Data Object in Python (Contd.)
The values attribute returns the actual data in the series as an array
Code
df=pd.Series(np.arange(1,51))
print(df.values)
Functionalities of Data Object in Python (Contd.)
groupby( ): The data frame is grouped according to the 'Team' and 'Rank' columns
Code
import pandas as pd
world_cup={'Team':['West Indies','West Indies','India','Australia','Pakistan','Sri Lanka','Australia','Australia','Australia','India','Australia'],
'Rank':[7,7,2,1,6,4,1,1,1,2,1],
'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}
df=pd.DataFrame(world_cup)
print(df.groupby(['Team','Rank']).groups)
Functionalities of Data Object in Python (Contd.)
Concatenation: concat( ) appends one data frame to another, row-wise (axis=0) or column-wise (axis=1)
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],
'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.concat([df1,df2],axis=1))
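Note that axis=1 pastes the frames side by side, aligned on the row index, which duplicates the 'Team' column here; axis=0 stacks them row-wise, which is usually what is wanted for two tables with the same columns. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"Team": ["India", "Australia"], "Points": [874, 787]})
df2 = pd.DataFrame({"Team": ["Zimbabwe"], "Points": [656]})

# axis=0: rows appended; ignore_index renumbers the combined index.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked["Team"].tolist())  # ['India', 'Australia', 'Zimbabwe']

# axis=1: columns appended, aligned on the row index.
side = pd.concat([df1, df2], axis=1)
print(side.shape)  # (2, 4)
```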
Functionalities of Data Object in Python (Contd.)
tail( )
values( )
groupby( )
Concatenation
Merging
Functionalities of Data Object in Python (Contd.)
Merging: merge( ) combines data frames based on the values of common columns
Code
import pandas
champion_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
match_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'World_cup_played':[11,10,11,9,8],
'ODIs_played':[733,988,712,679,662]}
df1=pandas.DataFrame(champion_stats)
df2=pandas.DataFrame(match_stats)
print(df1)
print(df2)
print(pandas.merge(df1,df2,on='Team'))
Functionalities of Data Object in Python (Contd.)
Joins are used to combine records from two or more tables in a database. Below
are the four most commonly used joins:
Left Join
Returns all rows from the left table, even if there are no matches in the right table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='left'))
Right Join
Preserves the unmatched rows from the second (right) table, joining them with NULL in the shape of the first (left) table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='right'))
Inner Join
Selects all rows from both participating tables where there is a match between the columns
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='inner'))
Full Outer Join
Returns all records when there is a match in either the left (table1) or the right (table2) table
Code
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[874,787,753,673,855]}
chokers={'Team':['South Africa','New Zealand','Zimbabwe'],
'ICC_rank':[1,5,9],'Points':[895,764,656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.merge(df1,df2,on='Team',how='outer'))
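All four joins differ only in the how= argument; a compact sketch comparing their row counts on trimmed-down toy tables:

```python
import pandas as pd

df1 = pd.DataFrame({"Team": ["India", "Australia", "Pakistan"],
                    "Points": [874, 787, 673]})
df2 = pd.DataFrame({"Team": ["India", "South Africa"],
                    "Rank": [2, 1]})

left = pd.merge(df1, df2, on="Team", how="left")    # 3 rows: all of df1
right = pd.merge(df1, df2, on="Team", how="right")  # 2 rows: all of df2
inner = pd.merge(df1, df2, on="Team", how="inner")  # 1 row: only 'India'
outer = pd.merge(df1, df2, on="Team", how="outer")  # 4 rows: every team once

print(len(left), len(right), len(inner), len(outer))  # 3 2 1 4
```

Unmatched cells (e.g. South Africa's Points in the outer join) are filled with NaN.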
Typecasting
It converts the data type of an object to the required data type.
str( )
Returns a string from any numeric object or converts any number to a string
int( )
Returns an integer object from any number or string
float( )
Returns a floating-point number from a number or a string
Typecasting Using int( ), float( ), and str( )
A few typecast data types:
Code
int(12.32)
Code
float(23)
Code
int('43')
Code
float('21.43')
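The conversions above, runnable with their expected results as assertions:

```python
# int() truncates toward zero rather than rounding.
assert int(12.32) == 12
assert float(23) == 23.0
assert int('43') == 43
assert float('21.43') == 21.43
# str() works in the other direction.
assert str(21.43) == '21.43'
print("all conversions check out")
```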
Assisted Practice
Data Manipulation Duration: 10 mins.
Problem Statement: As a macroeconomic analyst at the Organization for Economic Cooperation and Development
(OECD), your job is to collect relevant data for analysis. It looks like you have three countries in the north_america data
frame and one country in the south_america data frame. As these are in two separate plots, it's hard to compare the
average labor hours between North America and South America. If all the countries were into the same data frame, it
would be much easier to do this comparison.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Unassisted Practice
Data Manipulation Duration: 10 mins.
Problem Statement: SFO Public Department, referred to as SFO, has captured all the salary data of its employees
from the years 2011-2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to
rationalize employee cost to save the payroll budget. You have to do data manipulation and answer the questions below:
1. How much total salary cost has increased from year 2011 to 2014?
2. Who was the top earning employee across all the years?
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and password that are
generated. Click on the Launch Lab button. On the page that appears, enter the username and password in the
respective fields, and click Login.
Answer 1
Check the mean salary cost per year and see how it has increased per year.
Code
salary = pd.read_csv('Salaries.csv')
mean_year = salary.groupby('Year').mean()['TotalPayBenefits']
print(mean_year)
Answer 2
Code
top_sal = salary.groupby('EmployeeName').sum()['TotalPayBenefits']
print(top_sal.sort_values(ascending=False).head(1))
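A self-contained sketch of the same groupby-sum-sort pattern on synthetic salary rows (names and figures invented, not the SFO data):

```python
import pandas as pd

salary = pd.DataFrame({
    "EmployeeName": ["A", "B", "A", "C"],
    "TotalPayBenefits": [100.0, 250.0, 120.0, 90.0],
})

# Total pay per employee across all years, highest first.
top_sal = salary.groupby("EmployeeName").sum()["TotalPayBenefits"]
print(top_sal.sort_values(ascending=False).head(1))  # B, with 250.0
```

Sorting with ascending=False puts the top earner first, so head(1) reads it off directly.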
Key Takeaways
Knowledge Check
What is the output of the below Python code?
import numpy as np
percentiles = [98, 76.37, 55.55, 69, 88]
first_subject = np.array(percentiles)
print(first_subject.dtype)
a. float32
b. float
c. int32
d. float64
Problem Statement: From the raw data below create a data frame:
'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73], 'preTestScore': [4, 24, 31, ".", "."],'postTestScore': ["25,000", "94,000", 57, 62, 70]
Access: Click the Labs tab in the left side panel of the LMS. Copy or note the username and password that are
generated. Click the Launch Lab button. On the page that appears, enter the username and password in the
respective fields and click Login.
Thank You