
PYTHON UNIT - V

Data Cleaning and Preparation: Handling Missing Data - Data Transformation:
Removing Duplicates, Transforming Data Using a Function or Mapping,
Replacing Values, Detecting and Filtering Outliers - String Manipulation:
Vectorized String Functions in pandas. Plotting with pandas: Line Plots, Bar
Plots, Histograms and Density Plots, Scatter or Point Plots.

Data Cleaning and Preparation:

Data cleaning is the act of first identifying any issues or bad data, then
systematically correcting them. If the data is unfixable, the bad elements must
be removed to properly clean the dataset.

Data preparation is the process of cleaning and transforming raw data prior
to processing and analysis. It is an important step that often involves
reformatting data, correcting errors, and combining datasets to enrich the data.

HANDLING MISSING DATA IN PANDAS

Missing data can occur when no information is provided for one or more
items or for a whole unit. Missing data is a very common problem in real-life
datasets.

Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in a Pandas
DataFrame:
- isnull()
- notnull()
- dropna()
- fillna()
- replace()
- interpolate()
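As a quick sketch of how the detection functions combine with ordinary DataFrame operations, the per-column count of missing values can be obtained by chaining isnull() with sum() (using the same hypothetical score table as the examples that follow):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# isnull() yields a Boolean DataFrame; sum() treats True as 1,
# so the result is the number of missing values in each column
print(df.isnull().sum())
```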

B V RAJU COLLEGE Page 1



1. Checking for missing values using isnull() and notnull()

(a) Checking for missing values using isnull()

In order to check for null values in a Pandas DataFrame, we use the isnull()
function. This function returns a DataFrame of Boolean values which are True
for NaN values.
Program:
import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)
print('original data frame\n',df)

# using isnull() function


check=df.isnull()
print('displaying null values\n',check)

Output:


(b) Checking for missing values using notnull()

In order to check for null values in a Pandas DataFrame, we use the notnull()
function. This function returns a DataFrame of Boolean values which are False
for NaN values.

Program:
import pandas as pd
import numpy as np
# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)
print('original data frame\n',df)

# using notnull() function


check=df.notnull()
print('displaying non-null values\n',check)
Output:


2. Filling missing values using fillna(), replace() and interpolate()

In order to fill null values in a dataset, we use the fillna(), replace() and
interpolate() functions. These functions replace NaN values with some value of
their own. All of these functions help in filling null values in a DataFrame.
The interpolate() function is also used to fill NA values in the dataframe,
but it uses various interpolation techniques to compute the missing values.

(a) Filling null values with a single value

Program:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# filling missing value using fillna()


check=df.fillna(0)
print('After filling\n',check)
Output:
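fillna() is not limited to a single scalar. As a sketch (again using the hypothetical score table), a dict can supply a different fill value per column, and ffill() propagates the last valid observation forward:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# per-column fills: map column name -> fill value;
# columns not named in the dict are left untouched
by_column = df.fillna({'First Score': 0,
                       'Second Score': df['Second Score'].mean()})
print(by_column)

# forward fill: each NaN takes the previous valid value in its column
forward = df.ffill()
print(forward)
```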


(b) Filling null values using the replace() method

Program:
# importing pandas and numpy packages
import pandas as pd
import numpy as np
# making data frame from csv file
data = pd.read_csv("employees.csv")
# replace NaN values in the dataframe with -99
data = data.replace(to_replace = np.nan, value = -99)
print(data)
Output:

(c) Filling null values using the interpolate() method

The interpolate() method replaces the NULL values based on a specified
method.

Syntax: dataframe.interpolate(method, axis, inplace, limit, limit_direction)


Parameter: Method
Values: 'linear' (default), 'barycentric', 'cubic', 'time', 'zero', 'pad'
Description: Optional. Specifies the method to use when replacing NULL values.

Parameter: Axis
Values: 0 (default), 1, 'index', 'columns'
Description: Optional. The axis to fill the NULL values along.

Parameter: Inplace
Values: True, False (default)
Description: Optional. If True, the replacing is done on the current DataFrame;
if False, a copy with the replacements is returned.

Parameter: Limit
Values: Number, None (default)
Description: Optional. Specifies the maximum number of NULL values to fill
(if a method is specified).

Parameter: limit_direction
Values: 'forward' (default), 'backward', 'both'
Description: Optional. If the method is backfill or bfill, the default is
'backward'. Specifies the direction of the filling.


Program:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
(np.nan, 2.0, np.nan, np.nan),
(2.0, 3.0, np.nan, 9.0),
(np.nan, 4.0, -4.0, 16.0)],
columns=list('abcd'))

print('Original dataset\n',df)
inter=df.interpolate(method='linear', limit_direction='both', axis=0,limit=1)
print('After applying\n',inter)

Output:

3. Dropping missing values using dropna()

In order to drop null values from a dataframe, we use the dropna() function.
This function drops rows/columns of the dataset with null values in
different ways.


Program 1:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

print(df)
Output:

Now we drop rows that contain at least one NaN (null) value:

Program 2:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np


# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# using dropna() function


print(df.dropna())

Output:

Now we drop columns that have at least one missing value:

Program 3:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)


# using dropna() function


print(df.dropna(axis = 1))

Output:
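dropna() also accepts parameters that make the dropping less aggressive. As a sketch: how='all' drops only the rows in which every value is missing, and thresh=n keeps rows holding at least n non-null values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, np.nan, 80, 98]})

# drop only the rows where every value is NaN (row index 1 here)
print(df.dropna(how='all'))

# keep rows with at least 2 non-null values
print(df.dropna(thresh=2))
```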

DATA TRANSFORMATION

Filtering, cleaning, and other transformations are another class of


important operations.

1. Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons.

Program 1:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)

Output:


The DataFrame method duplicated returns a Boolean Series indicating
whether each row is a duplicate or not:

Program 2:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data.duplicated())

Output:

Relatedly, drop_duplicates returns a DataFrame with the rows for which
the duplicated array is False, i.e. with the duplicate rows filtered out:

Program 3:
import pandas as pd
data =pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data.drop_duplicates())

Output:
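Both duplicated and drop_duplicates consider all columns by default. As a sketch of the two most useful options: subset restricts the comparison to particular columns, and keep='last' retains the last occurrence of each duplicate instead of the first (the v1 column is added here only to make it visible which rows survive):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# duplicates judged on column 'k1' alone: only the first 'one'
# row and the first 'two' row survive
print(data.drop_duplicates(subset=['k1']))

# judged on both key columns, keeping the last of each duplicate pair
print(data.drop_duplicates(subset=['k1', 'k2'], keep='last'))
```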


2. Transforming Data Using a Function or Mapping

For many data sets, you may wish to perform some transformation based
on the values in an array, Series, or column in a DataFrame. Consider the
following hypothetical data collected about some kinds of meat:

Program:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
print(data)

Output:

Suppose you wanted to add a column indicating the type of animal that
each food came from. Let’s write down a mapping of each distinct meat type to
the kind of animal:


meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object
containing a mapping, but here we have a small problem in that some of the
meats above are capitalized and others are not. Thus, we also need to convert
each value to lowercase:

Program:
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham','nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
print(data)

Output:
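The same transformation can be written in one step by passing a function to map; a sketch of the equivalent lambda version, on a shortened copy of the food data:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'nova lox'],
                     'ounces': [4, 6, 6]})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# one function does both the lowercasing and the dictionary lookup
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)
```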


3. Replacing Values
Filling in missing data with the fillna method can be thought of as a special
case of more general value replacement. While map, as you’ve seen above, can
be used to modify a subset of values in an object, replace provides a simpler and
more flexible way to do so.

Program 1:
import pandas as pd
data = pd.Series([1., -999., 2., -999., -1000., 3.])
print(data)

Output:

The -999 values might be sentinel values for missing data. To replace
these with NA values that pandas understands, we can use replace, producing a
new Series:
Program 2:
import pandas as pd
import numpy as np
data = pd.Series([1., -999., 2., -999., -1000., 3.])
d=data.replace(-999, np.nan)
print(d)

Output:


If you want to replace multiple values at once, you instead pass a list and
then the substitute value:

Program 3:
import pandas as pd
import numpy as np
data = pd.Series([1., -999., 2., -999., -1000., 3.])
d=data.replace([-999, -1000], np.nan)
print(d)

Output:
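To use a different replacement for each sentinel value, replace also accepts a dict; a sketch:

```python
import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# dict form: each key is replaced by its own value
d = data.replace({-999: np.nan, -1000: 0})
print(d)
```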

STRING MANIPULATION
Python has long been a popular data munging language in part due to its
ease-of-use for string and text processing. Most text operations are made simple
with the string object’s built-in methods.

For more complex pattern matching and text manipulation, regular
expressions may be needed.

pandas adds to the mix by enabling you to apply string and regular
expressions concisely on whole arrays of data, additionally handling the
annoyance of missing data.

String Object Methods


In many string munging and scripting applications, built-in string methods


are sufficient. As an example, a comma-separated string can be broken into
pieces with split:

Example:
val = 'a,b, guido'
print(val.split(','))

Output:
['a', 'b', ' guido']

split is often combined with strip to trim whitespace (including newlines):

Example:
val = 'a,b, guido'
pieces = [x.strip() for x in val.split(',')]
print(pieces)

Output:
['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon


delimiter using addition:

val = 'a,b, guido'


pieces = [x.strip() for x in val.split(',')]
first, second, third = pieces
print(first + '::' + second + '::' + third)

Output:
a::b::guido

Using Python's in keyword is the best way to detect a substring, though
index and find can also be used:

Example:

val = 'a,b, guido'


pieces = [x.strip() for x in val.split(',')]
first, second, third = pieces
print(first + '::' + second + '::' + third)
print('guido' in val)
print(val.index(','))
print(val.find(':'))

Output:

a::b::guido
True
1
-1

String methods
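The original notes present a reference table of built-in string methods at this point; as a sketch, a few of the most commonly used ones behave as follows:

```python
val = 'a,b, guido'

# count occurrences of a substring
print(val.count(','))                 # 2

# replace one pattern with another
print(val.replace(',', '::'))         # a::b:: guido

# case conversion
print(val.upper())                    # A,B, GUIDO

# join a sequence of strings using this string as the delimiter
print('::'.join(['a', 'b', 'guido'])) # a::b::guido

# startswith / endswith return Booleans
print(val.startswith('a'), val.endswith('guido'))  # True True
```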


Regular expressions

Regular expressions provide a flexible way to search or match string


patterns in text. A single expression, commonly called a regex, is a string formed
according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings.

The re module functions fall into three categories: pattern matching,
substitution, and splitting. Naturally these are all related; a regex describes
a pattern to locate in the text, which can then be used for many purposes.

Program:

# import RE module
import re

target_str = "My roll number is 25"


res = re.findall(r"\d", target_str)
# extract matching values
print(res)

Output:
['2', '5']

Regular expression methods


Program:
# import the RE module
import re
target_string = "Jessa salary is 8000$"
# compile regex pattern
# pattern to match any character
str_pattern = r"\w"
pattern = re.compile(str_pattern)
# match regex pattern at start of the string
res = pattern.match(target_string)
# match character
print(res.group())
# Output 'J'
# search regex pattern anywhere inside string
# pattern to search any digit
res = re.search(r"\d", target_string)
print(res.group())
# Output 8
# pattern to find all digits
res = re.findall(r"\d", target_string)
print(res)
# Output ['8', '0', '0', '0']
# regex to split string on whitespaces
res = re.split(r"\s", target_string)
print("All tokens:", res)
# Output ['Jessa', 'salary', 'is', '8000$']
# regex for replacement
# replace space with hyphen
res = re.sub(r"\s", "-", target_string)
# string after replacement:
print(res)
Output:
J
8
['8', '0', '0', '0']
All tokens: ['Jessa', 'salary', 'is', '8000$']
Jessa-salary-is-8000$

Vectorized string functions in pandas

Pandas builds on this and provides a comprehensive set of vectorized


string operations that become an essential piece of the type of munging
required when working with (read: cleaning up) real-world data.

Program:

import pandas as pd
import numpy as np

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
print(pd.Series(data))

Output:
Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Wes NaN
dtype: object

String and regular expression methods can be applied (passing a lambda
or other function) to each value using data.map, but it will fail on the NA
values. To cope with this, Series has concise methods for string operations
that skip NA values. These are accessed through Series's str attribute; for
example, we could check whether each email address has 'gmail' in it with
str.contains:

Program 1:
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
df=pd.Series(data)
print(data)
print("\n")

print(pd.isnull(df))

Output:
{'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob':
'rob@gmail.com', 'Wes': nan}
Dave False
Steve False
Rob False
Wes True
dtype: bool

Program 2:

from pandas import DataFrame,Series


import pandas as pd
import numpy as np

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
df=pd.Series(data)
print(data)
print("\n")
print(pd.isnull(df))
print(df.str.contains('gmail'))

Output:
{'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob':
'rob@gmail.com', 'Wes': nan}
Dave False
Steve False
Rob False
Wes True
dtype: bool
Dave False
Steve True
Rob True
Wes NaN

dtype: object
Vectorized string methods
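The original notes present a reference table of vectorized string methods at this point; as a sketch, a few of them applied to the same email Series (each one skips the NA entry rather than raising):

```python
import re
import pandas as pd
import numpy as np

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# vectorized case conversion and slicing, NA-safe
print(data.str.upper())
print(data.str[:4])

# regex through the str accessor: split each address into its parts
pattern = r'([a-z0-9._%+-]+)@([a-z0-9.-]+)\.([a-z]{2,4})'
print(data.str.findall(pattern, flags=re.IGNORECASE))
```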

PLOTTING WITH PANDAS

Pandas is one of the most popular Python packages used in data science.
Pandas offers powerful, flexible data structures (DataFrame & Series) to
manipulate and analyze data. Visualization is the best way to interpret the
data.

Python has many popular plotting libraries that make visualization easy,
such as matplotlib, seaborn, and plotly. Pandas has great integration with
matplotlib. We can plot a dataframe using the plot() method, but first we need
a dataframe to plot.

We can create a dataframe by simply passing a dictionary to
the DataFrame() method of the pandas library.

Plots


There are a number of plots available to interpret the data. Each graph is
used for a purpose. Some of the plots are Line Plots, Bar Plots, Scatter Plots, and
Histograms, etc.
1. Line Plot:
A line chart is a graphical representation of data as a series of points
joined continuously by a line; the line can be either straight or curved
depending on the data being researched. Line charts are the simplest way of
representing quantitative data between two variables.
kind='line', x='some_column', y='some_column', color='somecolor', ax='someaxes'
Program:
# importing required library
import pandas as pd
import matplotlib.pyplot as plt

# A dictionary which represents data


data_dict = { 'name':['p1','p2','p3','p4','p5','p6'],
'age':[20,20,21,20,21,20],
'math_marks':[100,90,91,98,92,95],
'physics_marks':[90,100,91,92,98,95],
'chem_marks' :[93,89,99,92,94,92]
}

# creating a data frame object


df = pd.DataFrame(data_dict)
# show the dataframe
# by default head() shows the
# first five rows from the top
print(df.head())

#Get current axis


ax = plt.gca()
# line plot for math marks
df.plot(kind = 'line',x = 'name',y = 'math_marks',color = 'green',ax = ax)
# line plot for physics marks


df.plot(kind = 'line',x = 'name',y = 'physics_marks',color = 'blue',ax = ax)

# line plot for chemistry marks


df.plot(kind = 'line',x = 'name',y = 'chem_marks',color = 'black',ax = ax)

# set the title


plt.title('LinePlots')
# show the plot
plt.show()

Output:

2.Bar Plot:
A bar chart is a basic visualization for comparing values between data
groups and representing categorical data with rectangular bars. This plot may
include the count of a specific category or any defined value, and the lengths of
the bars correspond to the values they represent.

Similarly, we have to specify some parameters for the plot() method to get
the bar plot:

kind='bar', x='some_column', y='some_column', color='somecolor'
Program:
# importing required library
import pandas as pd
import matplotlib.pyplot as plt
# A dictionary which represents data
data_dict = { 'name':['p1','p2','p3','p4','p5','p6'],
'age':[20,20,21,20,21,20],
'math_marks':[100,90,91,98,92,95],
'physics_marks':[90,100,91,92,98,95],
'chem_marks' :[93,89,99,92,94,92]
}

# creating a data frame object


df = pd.DataFrame(data_dict)

# show the dataframe
# by default head() shows the
# first five rows from the top
print(df.head())

# bar plot
df.plot(kind = 'bar',x = 'name',y = 'physics_marks',color = 'green')

# set the title


plt.title('BarPlot')

# show the plot


plt.show()


Output:

3. Histogram
The histogram and the bar graph are quite similar, but there is a minor
difference between them. A histogram is used to represent a distribution,
while a bar chart is used to compare different entities. A histogram is
generally used to plot the frequency of values compared against a set of value
ranges.

In order to plot a histogram in pandas, the DataFrame can call the hist()
function. It will return a histogram of each numeric column in the pandas
DataFrame.

Program:

# Create Pandas DataFrame


import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({


'Maths': [80.4, 50.6, 70.4, 50.2, 80.9],


'Physics': [70.4, 50.4, 60.4, 90.1, 90.1],
'Chemistry': [40, 60.5, 70.8, 90.88, 40],
'Students': ['Student1', 'Student2', 'Student3', 'Student4', 'Student5']
})
print(df)

# Plot the histogram from DataFrame


df.hist()
Output:

4. Density or KDE Plot

A density plot is a type of data visualization tool. It is a variation of the
histogram that uses 'kernel smoothing' while plotting the values, producing a
continuous and smooth version of a histogram inferred from the data.
Density plots use Kernel Density Estimation (KDE), which estimates a
probability density function. The region of the plot with a higher peak is the
region with the maximum number of data points residing between those values.


Density plots can be made using pandas, seaborn, etc.; here we will generate
density plots using Pandas, with the 'car_crashes' dataset from the Seaborn
library.
Syntax: pandas.DataFrame.plot.density | pandas.DataFrame.plot.kde

where pandas -> the dataset of type 'pandas dataframe'
DataFrame -> the column for which the density plot is to be drawn
plot -> keyword directing to draw a plot/graph for the given column
density -> for plotting a density graph
kde -> to plot a density graph using the Kernel Density Estimation function

Given the dataset 'car_crashes', let's use a density plot to find out the most
common speed at which car crashes happened.

# importing the libraries


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# loading the dataset


# from seaborn library
data = sns.load_dataset('car_crashes')

# viewing the dataset


print(data.head(4))

# plotting the density plot


# for 'speeding' attribute
# using plot.density()
data.speeding.plot.density(color='green')
plt.title('Density plot for Speeding')
plt.show()

Output:


Using the density plot, we can figure out that a speed between 4-5 was the
most common in the car crashes dataset, because that is the high-density
(high-peak) region.

5.Scatter Plot:
These plots are similar to line plots but here the coordinates of each point
are defined by two dataframe columns. The presentation is usually a filled circle.
These circles are not connected to each other via lines like in the line plot. This
helps in understanding the correlation between two variables.

To get the scatter plot of a dataframe, all we have to do is call
the plot() method with some parameters:

kind='scatter', x='some_column', y='some_column', color='somecolor'
Program:


# importing required library


import pandas as pd
import matplotlib.pyplot as plt
# A dictionary which represents data
data_dict = { 'name':['p1','p2','p3','p4','p5','p6'],
'age':[20,20,21,20,21,20],
'math_marks':[100,90,91,98,92,95],
'physics_marks':[90,100,91,92,98,95],
'chem_marks' :[93,89,99,92,94,92]
}
# creating a data frame object
df = pd.DataFrame(data_dict)
# show the dataframe
# by default head() shows the
# first five rows from the top
print(df.head())
# scatter plot
df.plot(kind ='scatter',x ='math_marks',y ='physics_marks', color = 'red')
# set the title
plt.title('ScatterPlot')
# show the plot
plt.show()
Output:
