Unit 5 Python
Data cleaning is the act of first identifying any issues or bad data, and then
systematically correcting those issues. If the data is unfixable, the bad elements
must be removed to properly clean the data.
Data preparation is the process of cleaning and transforming raw data prior
to processing and analysis. It is an important preliminary step and often
involves reformatting data, making corrections to data, and combining datasets to
enrich the data.
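As a rough illustration of those three activities, here is a small pandas sketch; the column names emp_id, salary, and dept and all the values are hypothetical, not taken from any dataset in this unit.
import pandas as pd
import numpy as np
# hypothetical raw data with a badly formatted column and a sentinel value
raw = pd.DataFrame({'emp_id': [1, 2, 3],
                    'salary': ['50,000', '60,000', '-1']})
# reformatting: strip the thousands separator and convert to a numeric type
raw['salary'] = raw['salary'].str.replace(',', '').astype(float)
# correcting: treat the sentinel value -1 as missing data
raw['salary'] = raw['salary'].replace(-1, np.nan)
# combining: merge a second dataset to enrich the first
dept = pd.DataFrame({'emp_id': [1, 2, 3], 'dept': ['HR', 'IT', 'Sales']})
prepared = raw.merge(dept, on='emp_id')
print(prepared)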
Missing data can occur when no information is provided for one or more
items or for a whole unit. Missing data is a very big problem in real-life
scenarios. In pandas, missing values can be detected with isnull() and notnull(),
filled with fillna(), replace(), or interpolate(), and removed with dropna().
Program:
import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
# detecting missing values with isnull() (assumed completion)
print(df.isnull())
Program:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
# filling missing values with 0 using fillna() (assumed completion)
print(df.fillna(0))
Program:
# importing pandas package
import pandas as pd
# importing numpy package
import numpy as np
# making data frame from csv file
data = pd.read_csv("employees.csv")
# replace every NaN value in the dataframe with the value -99
data = data.replace(to_replace=np.nan, value=-99)
print(data)
Output:
Program:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
(np.nan, 2.0, np.nan, np.nan),
(2.0, 3.0, np.nan, 9.0),
(np.nan, 4.0, -4.0, 16.0)],
columns=list('abcd'))
print('Original dataset\n',df)
inter=df.interpolate(method='linear', limit_direction='both', axis=0,limit=1)
print('After applying\n',inter)
Output:
To remove rows or columns that contain null values, the dropna() method is used.
First, we create a DataFrame with some missing values:
Program 1:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
print(df)
Output:
Now we drop rows with at least one NaN (null) value using dropna():
Program 2:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
# dropping rows with at least one NaN value
print(df.dropna())
Output:
Now we drop columns with at least one missing value:
Program 3:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[60, 67, 68, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(dict)
# dropping columns (axis=1) with at least one NaN value
print(df.dropna(axis=1))
Output:
DATA TRANSFORMATION
1. Removing Duplicates
Program 1:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)
Output:
Program 2:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data.duplicated())
Output:
Program 3:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data.drop_duplicates())
Output:
2. Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based
on the values in an array, Series, or column in a DataFrame. Consider the
following hypothetical data collected about some kinds of meat:
Program:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
print(data)
Output:
Suppose you wanted to add a column indicating the type of animal that
each food came from. Let’s write down a mapping of each distinct meat type to
the kind of animal:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object
containing a mapping, but here we have a small problem in that some of the
meats above are capitalized and others are not. Thus, we also need to convert
each value to lower case:
Program:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
print(data)
Output:
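Because map also accepts a plain function, the same animal column could have been built in a single step. A small sketch of that alternative (not part of the listing above):
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow',
'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}
# passing a function to map: lowercase each food name, then look up the animal
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)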
3. Replacing Values
Filling in missing data with the fillna method can be thought of as a special
case of more general value replacement. While map, as you’ve seen above, can
be used to modify a subset of values in an object, replace provides a simpler and
more flexible way to do so.
Program 1:
import pandas as pd
data = pd.Series([1., -999., 2., -999., -1000., 3.])
print(data)
Output:
The -999 values might be sentinel values for missing data. To replace
these with NA values that pandas understands, we can use replace, producing a
new Series:
Program 2:
import pandas as pd
import numpy as np
data = pd.Series([1., -999., 2., -999., -1000., 3.])
d=data.replace(-999, np.nan)
print(d)
Output:
If you want to replace multiple values at once, you instead pass a list and then
the substitute value:
Program 3:
import pandas as pd
import numpy as np
data = pd.Series([1., -999., 2., -999., -1000., 3.])
d=data.replace([-999, -1000], np.nan)
print(d)
Output:
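replace can also take a dict, which allows a different substitute for each value. A small sketch of that more flexible form, reusing the Series from the programs above:
import pandas as pd
import numpy as np
data = pd.Series([1., -999., 2., -999., -1000., 3.])
# dict argument: replace -999 with NaN and -1000 with 0 in one call
d = data.replace({-999: np.nan, -1000: 0})
print(d)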
STRING MANIPULATION
Python has long been a popular data munging language in part due to its
ease-of-use for string and text processing. Most text operations are made simple
with the string object’s built-in methods.
pandas adds to the mix by enabling you to apply string and regular
expression methods concisely to whole arrays of data, additionally handling the
annoyance of missing data.
Example:
val = 'a,b, guido'
print(val.split(','))
Output:
['a', 'b', ' guido']
Example:
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
print(pieces)
Output:
['a', 'b', 'guido']
Example:
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
# joining the cleaned pieces with a '::' delimiter
print('::'.join(pieces))
Output:
a::b::guido
Example:
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
first, second, third = pieces
# concatenation, substring test, index() and find()
print(first + '::' + second + '::' + third)
print('guido' in val)
print(val.index(','))
print(val.find(':'))
Output:
a::b::guido
True
1
-1
String methods
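As a quick illustration (not tied to any particular dataset in this unit), a few of the built-in string methods in action on the same sample string:
val = 'a,b, guido'
print(val.count(','))         # 2 -> number of occurrences of ','
print(val.replace(',', ';'))  # 'a;b; guido' -> substitute one substring for another
print(val.upper())            # 'A,B, GUIDO'
print(val.startswith('a'))    # True
print(val.strip())            # 'a,b, guido' (nothing to strip here)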
Regular expressions
Program:
# import RE module
import re
Output:
[2,5]
Program:
# import the RE module
import re
target_string = "Jessa salary is 8000$"
# compile regex pattern
# pattern to match any word (alphanumeric) character
str_pattern = r"\w"
pattern = re.compile(str_pattern)
# match regex pattern at start of the string
res = pattern.match(target_string)
# match character
print(res.group())
# Output 'J'
# search regex pattern anywhere inside string
# pattern to search any digit
res = re.search(r"\d", target_string)
print(res.group())
# Output 8
# pattern to find all digits
res = re.findall(r"\d", target_string)
print(res)
# Output ['8', '0', '0', '0']
# regex to split string on whitespaces
res = re.split(r"\s", target_string)
print("All tokens:", res)
# Output ['Jessa', 'salary', 'is', '8000$']
# regex for replacement
# replace space with hyphen
res = re.sub(r"\s", "-", target_string)
# string after replacement:
print(res)
Output:
J
8
['8', '0', '0', '0']
All tokens: ['Jessa', 'salary', 'is', '8000$']
Jessa-salary-is-8000$
Program:
import pandas as pd
import numpy as np
# dictionary of names and email addresses (one value missing)
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
# creating a Series from the dictionary and printing it
data = pd.Series(data)
print(data)
Output:
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object
Program 1:
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob':
'rob@gmail.com', 'Wes': np.nan}
df=pd.Series(data)
print(data)
print("\n")
print(pd.isnull(df))
Output:
{'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob':
'rob@gmail.com', 'Wes': nan}
Dave False
Steve False
Rob False
Wes True
dtype: bool
Program 2:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
df = pd.Series(data)
print(data)
print("\n")
print(pd.isnull(df))
print("\n")
# vectorized string method: check which addresses contain 'gmail'
# (this call is inferred from the True/True/False/NaN output shown below)
print(df.str.contains('gmail'))
Output:
{'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob':
'rob@gmail.com', 'Wes': nan}
Dave False
Steve False
Rob False
Wes True
dtype: bool
Dave False
Steve True
Rob True
Wes NaN
dtype: object
Vectorized string methods
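These array-oriented string operations are exposed through the Series.str accessor, which skips missing values automatically. A brief sketch using the same email Series as above:
import pandas as pd
import numpy as np
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan})
# vectorized string methods pass NaN through instead of raising an error
print(data.str.upper())     # upper-case every address
print(data.str.len())       # length of every address
print(data.str.split('@'))  # split each address on '@'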
PLOTTING WITH PANDAS
Pandas is one of the most popular Python packages used in data science.
It offers powerful and flexible data structures (DataFrame and Series) for
manipulating and analyzing data, and visualization is one of the best ways to
interpret that data.
Python has many popular plotting libraries that make visualization easy,
such as matplotlib, seaborn, and plotly. Pandas has great integration with
matplotlib: a DataFrame can be plotted directly using its plot() method, so the
first step is to build a DataFrame.
Plots
There are a number of plots available to interpret data, and each type of
graph serves a different purpose. Some of the common plots are line plots, bar
plots, scatter plots, and histograms.
1. Line Plot:
A line chart is a graphical representation of data as a series of points
joined by a line, which may be straight or curved depending on the data being
studied. Line charts are the simplest way of representing quantitative data
involving two variables.
kind='line', x='some_column', y='some_column', color='somecolor', ax=some_axes
Program:
# importing required library
import pandas as pd
import matplotlib.pyplot as plt
Output:
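A minimal line-plot sketch, assuming the same hypothetical student-marks data that the bar-plot example below uses:
# importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
# hypothetical student data (same dictionary as the bar-plot example below)
data_dict = {'name': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
'age': [20, 20, 21, 20, 21, 20],
'math_marks': [100, 90, 91, 98, 92, 95],
'physics_marks': [90, 100, 91, 92, 98, 95],
'chem_marks': [93, 89, 99, 92, 94, 92]}
df = pd.DataFrame(data_dict)
# line plot of math marks for each student
df.plot(kind='line', x='name', y='math_marks', color='blue')
plt.show()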
2. Bar Plot:
A bar chart is a basic visualization for comparing values between data
groups and representing categorical data with rectangular bars. This plot may
include the count of a specific category or any defined value, and the lengths of
the bars correspond to the values they represent.
kind='bar', x='some_column', y='some_column', color='somecolor'
Program:
# importing required library
import pandas as pd
import matplotlib.pyplot as plt
# A dictionary which represents data
data_dict = { 'name':['p1','p2','p3','p4','p5','p6'],
'age':[20,20,21,20,21,20],
'math_marks':[100,90,91,98,92,95],
'physics_marks':[90,100,91,92,98,95],
'chem_marks' :[93,89,99,92,94,92]
}
# creating a DataFrame from the dictionary
df = pd.DataFrame(data_dict)
# bar plot of physics marks for each student
df.plot(kind = 'bar', x = 'name', y = 'physics_marks', color = 'green')
plt.show()
Output:
3. Histogram
A histogram and a bar graph look quite similar, but there is a minor difference
between them: a histogram is used to represent a distribution, while a bar chart
is used to compare different entities. A histogram is generally used to plot the
frequency of values falling within a set of value ranges.
Program:
# importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
# Create DataFrame (hypothetical marks values, assumed for illustration)
df = pd.DataFrame({'math_marks': [100, 90, 91, 98, 92, 95],
                   'physics_marks': [90, 100, 91, 92, 98, 95]})
# histogram showing the frequency of the marks values
df.plot(kind='hist')
plt.show()
4. Density Plot:
Density plots can be made using pandas, seaborn, etc.; here we will generate
density plots using pandas. We will be using two datasets from the seaborn
library, namely 'car_crashes' and 'tips'.
Syntax: pandas.DataFrame.plot.density | pandas.DataFrame.plot.kde
Given the dataset 'car_crashes', let's use a density plot to find the most
common speeding value among the recorded car crashes.
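A minimal sketch for this example, assuming the 'speeding' column of seaborn's built-in car_crashes dataset is the variable of interest:
# importing required libraries
import seaborn as sns
import matplotlib.pyplot as plt
# load the car_crashes dataset that ships with seaborn
data = sns.load_dataset('car_crashes')
# density (KDE) plot of the 'speeding' column
data['speeding'].plot.density(color='green')
plt.title('Density plot for speeding')
plt.show()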
Output:
Using the density plot, we can figure out that a speed between 4 and 5 (kmph)
was the most common among the car crashes in the dataset, because it lies in
the high-density (high-peak) region.
5. Scatter Plot:
These plots are similar to line plots but here the coordinates of each point
are defined by two dataframe columns. The presentation is usually a filled circle.
These circles are not connected to each other via lines like in the line plot. This
helps in understanding the correlation between two variables.
kind='scatter', x='some_column', y='some_column', color='somecolor'
Program:
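A minimal scatter-plot sketch, assuming the same hypothetical student-marks data used in the bar-plot example above:
# importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
# hypothetical student data (same values as the bar-plot example)
data_dict = {'name': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
'math_marks': [100, 90, 91, 98, 92, 95],
'physics_marks': [90, 100, 91, 92, 98, 95]}
df = pd.DataFrame(data_dict)
# scatter plot of physics marks against math marks
df.plot(kind='scatter', x='math_marks', y='physics_marks', color='red')
plt.show()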