
EXPLORATORY DATA ANALYSIS

UNIT II EDA USING PYTHON

Data Manipulation using Pandas – Pandas Objects – Data Indexing and
Selection – Operating on Data – Handling Missing Data – Hierarchical
Indexing – Combining datasets – Concat, Append, Merge and Join –
Aggregation and grouping – Pivot Tables – Vectorized String Operations.

2.1 Data manipulation with Pandas


Pandas is a newer package built on top of NumPy, and it provides an
efficient implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, often with
heterogeneous types and/or missing data. As well as offering a convenient
storage interface for labeled data, Pandas implements a number of
powerful data operations familiar to users of both database frameworks
and spreadsheet programs.

As we saw, NumPy's ndarray data structure provides essential features

for the type of clean, well-organized data typically seen in numerical


computing tasks. While it serves this purpose very well, its limitations
become clear when we need more flexibility (e.g., attaching labels to data,
working with missing data, etc.) and when attempting operations that do
not map well to element-wise broadcasting (e.g., groupings, pivots, etc.),
each of which is an important piece of analyzing the less structured data
available in many forms in the world around us. Pandas, and in
particular its Series and DataFrame objects, builds on the NumPy array
structure and provides efficient access to these sorts of "data munging"
tasks that occupy much of a data scientist's time.

In this chapter, we will focus on the mechanics of using Series,

DataFrame, and related structures effectively. We will use examples


drawn from real datasets where appropriate, but these examples are not
necessarily the focus.

Installing and Using Pandas


Installation of Pandas on your system requires NumPy to be installed, and
if building the library from source, requires the appropriate tools to
compile the C and Python sources on which Pandas is built. Once Pandas
is installed, you can import it and check the version:


import pandas
pandas.__version__

Out[1]:

'0.18.1'

Just as we generally import NumPy under the alias np, we will import

Pandas under the alias pd:

In [2]:

import pandas as pd
Reminder about Built-In Documentation

For example, to display all the contents of the pandas namespace, you
can type

In [3]: pd.<TAB>

And to display Pandas's built-in documentation, you can use this:

In [4]: pd?

Data indexing and selection

In NumPy we used indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]),
masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and
combinations thereof (e.g., arr[:, [1, 5]]). Here we'll look at similar means
of accessing and modifying values in Pandas Series and DataFrame
objects. If you have used the NumPy patterns, the corresponding patterns
in Pandas will feel very familiar, though there are a few quirks to be
aware of.

Data Selection in Series

A Series object acts in many ways like a one-dimensional NumPy array,


and in many ways like a standard Python dictionary. If we keep these two
overlapping analogies in mind, it will help us to understand the patterns
of data indexing and selection in these arrays.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection
of keys to a collection of values:

EXAMPLE 1


PROGRAM                                  OUTPUT

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data                                     a    0.25
                                         b    0.50
                                         c    0.75
                                         d    1.00
                                         dtype: float64

data['b']                                0.5

'a' in data                              True

data.keys()                              Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())                       [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

data['e'] = 1.25
data                                     a    0.25
                                         b    0.50
                                         c    0.75
                                         d    1.00
                                         e    1.25
                                         dtype: float64

We can also use dictionary-like Python expressions and methods to
examine the keys/indices and values. Series objects can even be modified
with a dictionary-like syntax: just as you can extend a dictionary by
assigning to a new key, you can extend a Series by assigning to a new
index value.

This easy mutability of the objects is a convenient feature: under the

hood, Pandas is making decisions about memory layout and data copying

that might need to take place; the user generally does not need to worry

about these issues.


Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style

item selection via the same basic mechanisms as NumPy arrays – that is,

slices, masking, and fancy indexing. Examples of these are as follows:


EXAMPLE 2

PROGRAM                                  OUTPUT

data['a':'c']                            a    0.25
                                         b    0.50
                                         c    0.75
                                         dtype: float64

data[0:2]                                a    0.25
                                         b    0.50
                                         dtype: float64

data[(data > 0.3) & (data < 0.8)]        b    0.50
                                         c    0.75
                                         dtype: float64

# fancy indexing
data[['a', 'e']]                         a    0.25
                                         e    1.25
                                         dtype: float64

Among these, slicing may be the source of the most confusion. Notice that
when slicing with an explicit index (i.e., data['a':'c']), the final index is
included in the slice, while when slicing with an implicit index (i.e.,
data[0:2]), the final index is excluded from the slice.
Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion. For


example, if your Series has an explicit integer index, an indexing
operation such as data[1] will use the explicit indices, while a slicing
operation like data[1:3] will use the implicit Python-style index.


EXAMPLE 3

PROGRAM                                  OUTPUT

data = pd.Series(['a', 'b', 'c'],
                 index=[1, 3, 5])
data                                     1    a
                                         3    b
                                         5    c
                                         dtype: object

# explicit index when indexing
data[1]                                  'a'

# implicit index when slicing
data[1:3]                                3    b
                                         5    c
                                         dtype: object

Because of this potential confusion in the case of integer indexes, Pandas

provides some special indexer attributes that explicitly expose certain

indexing schemes. These are not functional methods, but attributes that

expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references

the explicit index:

EXAMPLE 4

PROGRAM                                  OUTPUT

data.loc[1]                              'a'

data.loc[1:3]                            1    a
                                         3    b
                                         dtype: object

The iloc attribute allows indexing and slicing that always references the

implicit Python-style index:


EXAMPLE 4

PROGRAM                                  OUTPUT

data.iloc[1]                             'b'

data.iloc[1:3]                           3    b
                                         5    c
                                         dtype: object

A third indexing attribute, ix, is a hybrid of the two, and for Series

objects is equivalent to standard []-based indexing. The purpose of the ix

indexer will become more apparent in the context of DataFrame objects,

which we will discuss in a moment.

One guiding principle of Python code is that "explicit is better than
implicit." The explicit nature of loc and iloc makes them very useful in
maintaining clean and readable code; especially in the case of integer
indexes, I recommend using them both to make code easier to read and
understand, and to prevent subtle bugs due to the mixed indexing/slicing
convention.

Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or

structured array, and in other ways like a dictionary of Series structures

sharing the same index. These analogies can be helpful to keep in mind

as we explore data selection within this structure.

DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of

related Series objects. Let's return to our example of areas and

populations of states:

EXAMPLE 4


PROGRAM

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

The individual Series that make up the columns of the DataFrame can be

accessed via dictionary-style indexing of the column name:

EXAMPLE 4

PROGRAM OUTPUT

data['area']

Equivalently, we can use attribute-style access with column names that

are strings:

EXAMPLE 4


PROGRAM                                  OUTPUT

data.area

data.area is data['area']                True

data.pop is data['pop']                  False

This attribute-style column access actually accesses the exact same
object as the dictionary-style access. Though this is a useful shorthand,
keep in mind that it does not work for all cases! For example, if the
column names are not strings, or if the column names conflict with
methods of the DataFrame, this attribute-style access is not possible. For
example, the DataFrame has a pop() method, so data.pop will point to
this rather than the "pop" column.

In particular, you should avoid the temptation to try column assignment

via attribute (i.e., use data['pop'] = z rather than data.pop = z).

Like with the Series objects discussed earlier, this dictionary-style syntax

can also be used to modify the object, in this case adding a new column:

EXAMPLE 4

PROGRAM

data['density'] = data['pop'] / data['area']
data


This shows a preview of the straightforward syntax of element-by-element
arithmetic between Series objects; we'll dig into this further in Operating
on Data in Pandas.

DataFrame as two-dimensional array

As mentioned previously, we can also view the DataFrame as an

enhanced two-dimensional array. We can examine the raw underlying

data array using the values attribute:

EXAMPLE 4

PROGRAM

data.values

OUTPUT

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

With this picture in mind, many familiar array-like operations can be
performed on the DataFrame itself. For example, we can transpose the
full DataFrame to swap rows and columns:

EXAMPLE 5

PROGRAM OUTPUT


data.T

When it comes to indexing of DataFrame objects, however, it is clear that

the dictionary-style indexing of columns precludes our ability to simply

treat it as a NumPy array. In particular, passing a single index to an array

accesses a row and passing a single "index" to a DataFrame accesses a

column:

EXAMPLE 5

PROGRAM                                  OUTPUT

data.values[0]                           array([  4.23967000e+05,   3.83325210e+07,
                                                  9.04139261e+01])

data['area']                             California    423967
                                         Florida       170312
                                         Illinois      149995
                                         New York      141297
                                         Texas         695662
                                         Name: area, dtype: int64

Thus for array-style indexing, we need another convention. Here Pandas

again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc

indexer, we can index the underlying array as if it is a simple NumPy

array (using the implicit Python-style index), but the DataFrame index

and column labels are maintained in the result:

EXAMPLE 5

PROGRAM OUTPUT


data.iloc[:3, :2]

Similarly, using the loc indexer we can index the underlying data in an

array-like style but using the explicit index and column names:

EXAMPLE 5

PROGRAM OUTPUT

data.loc[:'Illinois',
:'pop']

Keep in mind that for integer indices, the ix indexer is subject to the

same potential sources of confusion as discussed for integer indexed

Series objects.

2.2 Operating on data

One of the essential pieces of NumPy is the ability to perform quick
element-wise operations, both with basic arithmetic (addition,
subtraction, multiplication, etc.) and with more sophisticated operations
(trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs
introduced in "Computation on NumPy Arrays: Universal Functions" are
key to this. Pandas includes a couple of useful twists, however: for unary
operations like negation and trigonometric functions, these ufuncs will
preserve index and column labels in the output, and for binary operations
such as addition and multiplication, Pandas will automatically align
indices when passing the objects to the ufunc. This means that keeping
the context of data and combining data from different sources—both
potentially error-prone tasks with raw NumPy arrays—become essentially
foolproof with Pandas. We will additionally see that there are well-defined
operations between one-dimensional Series structures and
two-dimensional DataFrame structures.

Ufuncs: Index Preservation


Because Pandas is designed to work with NumPy, any NumPy ufunc will

work on Pandas Series and DataFrame objects. Let’s start by defining a


simple Series and DataFrame on which to demonstrate this:

EXAMPLE 5

PROGRAM                                  OUTPUT

import pandas as pd
import numpy as np
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df                                          A  B  C  D
                                         0  6  9  2  6
                                         1  7  4  3  7
                                         2  7  2  5  4

If we apply a NumPy ufunc on either of these objects, the result will be


another Pandas object with the indices preserved:
EXAMPLE 5

PROGRAM OUTPUT

np.exp(ser)


np.sin(df * np.pi / 4)

Or, for a slightly more complex calculation:

Any of the ufuncs discussed in "Computation on NumPy Arrays: Universal
Functions" can be used in a similar manner.

UFuncs: Index Alignment


For binary operations on two Series or DataFrame objects, Pandas will

align indices in the process of performing the operation.

Index alignment in Series


As an example, suppose we are combining two different data sources,
and find only the top three US states by area and the top three US states
by population:

EXAMPLE 5

PROGRAM

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area

The resulting array contains the union of indices of the two input arrays,
which we could determine using standard Python set arithmetic on these
indices:

EXAMPLE 5


PROGRAM                                  OUTPUT

area.index | population.index            Index(['Alaska', 'California', 'New York',
                                                'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked
with NaN, or "Not a Number," which is how Pandas marks missing data.
This index matching is implemented this way for any of Python's built-in
arithmetic expressions; any missing values are filled in with NaN by
default:

EXAMPLE 5

PROGRAM                                  OUTPUT

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B                                    0    NaN
                                         1    5.0
                                         2    9.0
                                         3    NaN
                                         dtype: float64

If using NaN values is not the desired behavior, we can modify the fill
value using appropriate object methods in place of the operators. For
example, calling A.add(B) is equivalent to calling A + B, but allows
optional explicit specification of the fill value for any elements in A or B
that might be missing:

EXAMPLE 6

PROGRAM OUTPUT

A.add(B, fill_value=0)

Index alignment in DataFrame


A similar type of alignment takes place for both columns and indices

when you are performing operations on DataFrames:


EXAMPLE 6

PROGRAM OUTPUT

A = pd.DataFrame(rng.randint(0,
20, (2, 2)),

columns=list('AB'))

B = pd.DataFrame(rng.randint(0,
10, (3, 3)),

columns=list('BAC'))

A + B

Notice that indices are aligned correctly irrespective of their order in the
two objects, and indices in the result are sorted. As was the case with
Series, we can use the associated object’s arithmetic method and pass
any desired fill_value to be used in place of missing entries. Here we’ll fill
with the mean of all values in A (which we compute by first stacking the
rows of A):

EXAMPLE 6

PROGRAM OUTPUT

fill = A.stack().mean()
A.add(B, fill_value=fill)


Table 3-1 lists Python operators and their equivalent Pandas object
methods.

Table 3-1. Mapping between Python operators and Pandas methods

Python operator    Pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()

Ufuncs: Operations Between DataFrame and Series

When you are performing operations between a DataFrame and a Series,


the index and column alignment is similarly maintained. Operations
between a DataFrame and a Series are similar to operations between a
two-dimensional and one-dimensional NumPy array. Consider one
common operation, where we find the difference of a two-dimensional
array and one of its rows:

EXAMPLE 6

PROGRAM                                  OUTPUT

A = rng.randint(10, size=(3, 4))
A                                        array([[1, 7, 5, 1],
                                                [4, 0, 9, 5],
                                                [8, 0, 9, 2]])

A - A[0]                                 array([[ 0,  0,  0,  0],
                                                [ 3, -7,  4,  4],
                                                [ 7, -7,  4,  1]])


According to NumPy's broadcasting rules, subtraction between a
two-dimensional array and one of its rows is applied row-wise. In Pandas,
the convention similarly operates row-wise by default:

EXAMPLE 6

PROGRAM

df = pd.DataFrame(A, columns=list('QRST'))
print(df)
df - df.iloc[0]

If you would instead like to operate column-wise, you can use the object
methods mentioned earlier, while specifying the axis keyword:

df.subtract(df['R'], axis=0)

Note that these DataFrame/Series operations, like the operations

discussed before, will automatically align indices between the two


elements:

EXAMPLE 6

PROGRAM OUTPUT


halfrow = df.iloc[0, ::2]

halfrow

EXAMPLE 6

PROGRAM OUTPUT

df - halfrow

This preservation and alignment of indices and columns means that


operations on data in Pandas will always maintain the data context,
which prevents the types of silly errors that might come up when you are
working with heterogeneous and/or misaligned data in raw NumPy
arrays.

3.10 Missing Data

The difference between data found in many tutorials and data in the real

world is that real-world data is rarely clean and homogeneous. In

particular, many interesting datasets will have some amount of data

missing. To make matters even more complicated, different data sources

may indicate missing data in different ways.

In this section, we will discuss some general considerations for missing

data, discuss how Pandas chooses to represent it, and demonstrate some

built-in Pandas tools for handling missing data in Python. Here and

throughout the book, we'll refer to missing data in general as null, NaN,
or NA values.

3.10.1 Pythonic missing data

The first sentinel value used by Pandas is None, a Python singleton object


that is often used for missing data in Python code. Because it is a Python

object, None cannot be used in any arbitrary NumPy/Pandas array, but

only in arrays with data type 'object' (i.e., arrays of Python objects):

EXAMPLE 1

PROGRAM                                  OUTPUT

import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
vals1                                    array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation

NumPy could infer for the contents of the array is that they are Python

objects. While this kind of object array is useful for some purposes, any

operations on the data will be done at the Python level, with much more

overhead than the typically fast operations seen for arrays with native

types:
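The cost of this Python-level fallback can be shown directly. The following is a
minimal sketch (not part of the original notes) that times a sum over an
object-dtype array against the same sum over a native int64 array; the array size
and timing approach are illustrative assumptions, and the exact numbers will vary
by machine:

import time
import numpy as np

# Same values, two dtypes: one array of Python objects, one native int64 array.
obj_arr = np.arange(1_000_000, dtype=object)
int_arr = np.arange(1_000_000, dtype=np.int64)

for label, arr in [('object', obj_arr), ('int64', int_arr)]:
    start = time.perf_counter()
    arr.sum()                      # the object array adds Python ints one by one
    print(label, round(time.perf_counter() - start, 4), 'seconds')

Note also that summing vals1 itself would raise a TypeError, since addition
between an integer and None is undefined.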

EXAMPLE 1

PROGRAM                                  OUTPUT

vals2 = np.array([1, np.nan, 3, 4])
vals2                                    array([ 1., nan, 3., 4.])

Notice that NumPy chose a native floating-point type for this array: this

means that unlike the object array from before, this array supports fast

operations pushed into compiled code. You should be aware that NaN is a

bit like a data virus–it infects any other object it touches. Regardless of

the operation, the result of arithmetic with NaN will be another NaN:

EXAMPLE 1

PROGRAM OUTPUT

1 + np.nan nan

0 * np.nan nan


Note that this means that aggregates over the values are well defined

(i.e., they don't result in an error) but not always useful:

EXAMPLE 1

PROGRAM OUTPUT

vals2.sum(), vals2.min(), vals2.max()    (nan, nan, nan)

NumPy does provide some special aggregations that will ignore these

missing values:

EXAMPLE 1

PROGRAM                                  OUTPUT

np.nansum(vals2), np.nanmin(vals2),
np.nanmax(vals2)                         (8.0, 1.0, 4.0)

Keep in mind that NaN is specifically a floating-point value; there is no

equivalent NaN value for integers, strings, or other types.

NaN and None in Pandas

NaN and None both have their place, and Pandas is built to handle the
two of them nearly interchangeably, converting between them where
appropriate:

EXAMPLE 1

PROGRAM OUTPUT

pd.Series([1, np.nan, 2, None])


For types that don't have an available sentinel value, Pandas

automatically type-casts when NA values are present. For example, if we

set a value in an integer array to np.nan, it will automatically be upcast

to a floating-point type to accommodate the NA:

EXAMPLE 1

PROGRAM OUTPUT

x = pd.Series(range(2), dtype=int)
x

x[0] = None

Notice that in addition to casting the integer array to floating point,

Pandas automatically converts the None to a NaN value. (Be aware that

there is a proposal to add a native integer NA to Pandas in the future; as

of this writing, it has not been included).

While this type of magic may feel a bit hackish compared to the more

unified approach to NA values in domain-specific languages like R, the

Pandas sentinel/casting approach works quite well in practice and in my

experience only rarely causes issues.

The following table lists the upcasting conventions in Pandas when NA

values are introduced:

Typeclass    Conversion when storing NAs    NA sentinel value
floating     No change                      np.nan
object       No change                      None or np.nan
integer      Cast to float64                np.nan
boolean      Cast to object                 None or np.nan

Keep in mind that in Pandas, string data is always stored with an object
dtype.

3.10.2 Operating on Null Values

As we have seen, Pandas treats None and NaN as essentially

interchangeable for indicating missing or null values. To facilitate this

convention, there are several useful methods for detecting, removing, and

replacing null values in Pandas data structures. They are:

∙ isnull(): Generate a Boolean mask indicating missing values
∙ notnull(): Opposite of isnull()
∙ dropna(): Return a filtered version of the data
∙ fillna(): Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration
of these routines.

3.10.3 Detecting null values

Pandas data structures have two useful methods for detecting null data:

isnull() and notnull(). Either one will return a Boolean mask over the

data. For example:

EXAMPLE 1

PROGRAM OUTPUT

data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
data[data.notnull()]


As mentioned in Data Indexing and Selection, Boolean masks can be

used directly as a Series or DataFrame index:

The isnull() and notnull() methods produce similar Boolean results for
DataFrames.

3.10.4 Dropping null values

In addition to the masking used before, there are the convenience

methods, dropna() (which removes NA values) and fillna() (which fills in

NA values). For a Series, the result is straightforward:

EXAMPLE 1

PROGRAM OUTPUT

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

df.dropna()

df.dropna(axis='columns')

We cannot drop single values from a DataFrame; we can only drop full
rows or full columns. Depending on the application, you might want one
or the other, so dropna() gives a number of options for a DataFrame.

By default, dropna() will drop all rows in which any null value is present.
Alternatively, you can drop NA values along a different axis; axis=1 drops
all columns containing a null value.
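As a rough sketch of those options (the frame below is just the df from the
example above, with an extra all-NA column added for illustration): how='all'
drops only rows or columns that are entirely NA, and thresh sets a minimum
number of non-null values a row or column must have to be kept.

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan                        # an extra column that is all NA

df.dropna(axis='columns', how='all')  # drops only the all-NA column
df.dropna(thresh=3)                   # keeps rows with at least 3 non-null values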


Filling null values

Sometimes rather than dropping NA values, you'd rather replace them

with a valid value. This value might be a single number like zero, or it

might be some sort of imputation or interpolation from the good values.

You could do this in-place using the isnull() method as a mask, but

because it is such a common operation Pandas provides the fillna()

method, which returns a copy of the array with the null values replaced.

Consider the following Series:

EXAMPLE 1

PROGRAM OUTPUT

data = pd.Series([1, np.nan, 2, None, 3],
                 index=list('abcde'))
data

We can fill NA entries with a single value, such as zero:

EXAMPLE 1

PROGRAM OUTPUT

data.fillna(0)

We can specify a forward-fill to propagate the previous value forward:

EXAMPLE 1

PROGRAM OUTPUT


data.fillna(method='ffill')

Or we can specify a back-fill to propagate the next values backward:

data.fillna(method='bfill')

For DataFrames, the options are similar, but we can also specify an axis
along which the fills take place:

df.fillna(method='ffill', axis=1)

Notice that if a previous value is not available during a forward fill, the NA
value remains.

3.11 Hierarchical indexing

Up to this point we've been focused primarily on one-dimensional and
two-dimensional data, stored in Pandas Series and DataFrame objects,
respectively. Often it is useful to go beyond this and store
higher-dimensional data–that is, data indexed by more than one or two
keys. While Pandas does provide Panel and Panel4D objects that natively
handle three-dimensional and four-dimensional data (see Aside: Panel
Data), a far more common pattern in practice is to make use of
hierarchical indexing (also known as multi-indexing) to incorporate
multiple index levels within a single index. In this way,


higher-dimensional data can be compactly represented within the
familiar one-dimensional Series and two-dimensional DataFrame objects.

In this section, we'll explore the direct creation of MultiIndex objects,

considerations when indexing, slicing, and computing statistics across

multiply indexed data, and useful routines for converting between simple

and hierarchically indexed representations of your data.

Begin with the standard imports:

import pandas as pd

import numpy as np

A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data

within a one-dimensional Series. For concreteness, we will consider a

series of data where each point has a character and numerical key.

The bad way

Suppose you would like to track data about states from two different

years. Using the Pandas tools we've already covered, you might be

tempted to simply use Python tuples as keys:

EXAMPLE 1

PROGRAM

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop


With this indexing scheme, you can straightforwardly index or slice the

series based on this multiple index:

EXAMPLE 2

PROGRAM OUTPUT

pop[('California', 2010):('Texas',
2000)]

But the convenience ends there. For example, if you need to select all

values from 2010, you'll need to do some messy (and potentially slow)

munging to make it happen:

EXAMPLE 3

PROGRAM OUTPUT

pop[[i for i in pop.index if i[1] ==


2010]]

This produces the desired result, but is not as clean (or as efficient for

large datasets) as the slicing syntax we've grown to love in Pandas.

The Better Way: Pandas MultiIndex

Fortunately, Pandas provides a better way. Our tuple-based indexing is

essentially a rudimentary multi-index, and the Pandas MultiIndex type

gives us the type of operations we wish to have. We can create a
multi-index from the tuples as follows:

EXAMPLE 4

PROGRAM OUTPUT


index = pd.MultiIndex.from_tuples(index)
index

Notice that the MultiIndex contains multiple levels of indexing–in this
case, the state names and the years, as well as multiple labels for each
data point which encode these levels.

If we reindex our series with this MultiIndex, we see the hierarchical
representation of the data:

EXAMPLE 5

PROGRAM OUTPUT

pop = pop.reindex(index)

pop

Here the first two columns of the Series representation show the multiple

index values, while the third column shows the data. Notice that some

entries are missing in the first column: in this multi-index representation,

any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply

use the Pandas slicing notation:

EXAMPLE 6

PROGRAM OUTPUT

pop[:, 2010]


The result is a singly indexed array with just the keys we're interested in.
This syntax is much more convenient (and the operation is much more
efficient!) than the home-spun tuple-based multi-indexing solution that
we started with. We'll now further discuss this sort of indexing operation
on hierarchically indexed data.

MultiIndex as extra dimension

You might notice something else here: we could easily have stored the

same data using a simple DataFrame with index and column labels. In

fact, Pandas is built with this equivalence in mind. The unstack() method

will quickly convert a multiply indexed Series into a conventionally

indexed DataFrame:

EXAMPLE 7

PROGRAM OUTPUT

pop_df = pop.unstack()
pop_df

Naturally, the stack() method provides the opposite operation:

pop_df.stack()

Seeing this, you might wonder why we would bother with hierarchical
indexing at all. The reason is simple: just as we were able to use
multi-indexing to represent two-dimensional data within a
one-dimensional Series, we can also use it to represent data of three or
more dimensions in a Series or DataFrame. Each extra level in a
multi-index represents an extra dimension of data; taking advantage of
this property


gives us much more flexibility in the types of data we can represent.

EXAMPLE 8

PROGRAM OUTPUT

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

3.12 Combining datasets

One essential feature offered by Pandas is its high-performance,
in-memory join and merge operations. If you have ever worked with
databases, you should be familiar with this type of data interaction. The
main interface for this is the pd.merge function, and we will see a few
examples of how it can work in practice.

3.12.1 Relational Algebra

The behavior implemented in pd.merge() is a subset of what is known as

relational algebra, which is a formal set of rules for manipulating

relational data, and forms the conceptual foundation of operations

available in most databases. The strength of the relational algebra

approach is that it proposes several primitive operations, which become

the building blocks of more complicated operations on any dataset. With

this lexicon of fundamental operations implemented efficiently in a

database or other program, a wide range of fairly complicated composite

operations can be performed.

3.12.2 Categories of Joins

The pd.merge() function implements a number of types of joins: the


one-to-one, many-to-one, and many-to-many joins. All three types of joins

are accessed via an identical call to the pd.merge() interface; the type of

join performed depends on the form of the input data. Here we will show

simple examples of the three types of merges, and discuss detailed

options further below.

3.12.2.1 One-to-one joins

Perhaps the simplest type of merge expression is the one-to-one join,
which is in many ways very similar to the column-wise concatenation
seen in Combining Datasets: Concat & Append. As a concrete example,
consider the following two DataFrames, which contain information on
several employees in a company:

EXAMPLE 1

PROGRAM

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)

To combine this information into a single DataFrame, we can use the

pd.merge() function:

EXAMPLE 2

PROGRAM OUTPUT


df3 = pd.merge(df1, df2)

df3

The pd.merge() function recognizes that each DataFrame has an

"employee" column, and automatically joins using this column as a key.

The result of the merge is a new DataFrame that combines the

information from the two inputs. Notice that the order of entries in each

column is not necessarily maintained: in this case, the order of the

"employee" column differs between df1 and df2, and the pd.merge()

function correctly accounts for this. Additionally, keep in mind that the

merge in general discards the index, except in the special case of merges
by index.

3.12.2.2 Many-to-one joins

Many-to-one joins are joins in which one of the two key columns

contains duplicate entries. For the many-to-one case, the resulting

DataFrame will preserve those duplicate entries as appropriate. Consider

the following example of a many-to-one join:

EXAMPLE 3

PROGRAM

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

3.12.2.3 Many-to-many joins


Many-to-many joins are a bit confusing conceptually, but are

nevertheless well defined. If the key column in both the left and right

array contains duplicates, then the result is a many-to-many merge. This

will be perhaps most clear with a concrete example. Consider the

following, where we have a DataFrame showing one or more skills

associated with a particular group. By performing a many-to-many join,

we can recover the skills associated with any individual person:

EXAMPLE 4

PROGRAM

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering',
                              'HR', 'HR'],
                    'skills': ['math', 'spreadsheets',
                               'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

These three types of joins can be used with other Pandas tools to

implement a wide array of functionality. But in practice, datasets are

rarely as clean as the one we're working with here. In the following section

we'll consider some of the options provided by pd.merge() that enable you

to tune how the join operations work.

Specification of the Merge Key


We've already seen the default behavior of pd.merge(): it looks for one or

more matching column names between the two inputs, and uses this as

the key. However, often the column names will not match so nicely, and

pd.merge() provides a variety of options for handling this.

3.12.3 The on keyword

Most simply, you can explicitly specify the name of the key column using

the on keyword, which takes a column name or a list of column names:

EXAMPLE 5

PROGRAM

display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

This option works only if both the left and right DataFrames have the

specified column name.

3.12.4 Specifying Set Arithmetic for Joins

All of the preceding examples have glossed over one important
consideration in performing a join: the type of set arithmetic used in the
join. This comes up when a value appears in one key column but not the
other. Consider this example:

EXAMPLE 6


PROGRAM

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['pepsi', 'coke']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
print(df6)
print(df7)
print(pd.merge(df6, df7))
pd.merge(df6, df7, how='inner')

Here we have merged two datasets that have only a single "name" entry in
common: Mary. By default, the result contains the intersection of the two
sets of inputs; this is what is known as an inner join.
Other options for the how keyword are 'outer', 'left', and 'right'. An outer
join returns a join over the union of the input columns, and fills in all
missing values with NAs

EXAMPLE 7

PROGRAM OUTPUT

display('df6', 'df7', "pd.merge(df6,


df7, how='outer')")


The left join and right join return joins over the left entries and right
entries, respectively. For example:

EXAMPLE 8

PROGRAM OUTPUT

display('df6', 'df7', "pd.merge(df6,


df7, how='left')")

print(pd.merge(df6, df7,

how='right'))

3.13 Aggregation and Grouping

3.13.1 Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization:
computing aggregations like sum(), mean(), median(), min(), and max(),
in which a single number gives insight into the nature of a potentially
large dataset. We will explore aggregations in Pandas, from simple
operations akin to what we've seen on NumPy arrays, to more
sophisticated operations based on the concept of a groupby.

Planets Data

Here we will use the Planets dataset, available via the Seaborn package. It
gives information on planets that astronomers have discovered around
other stars (known as extrasolar planets or exoplanets for short). It can be
downloaded with a simple Seaborn command:

EXAMPLE 1

PROGRAM                                  OUTPUT

import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape                            (1035, 6)


EXAMPLE 2

PROGRAM OUTPUT

planets.head()

3.13.2 Simple Aggregation in Pandas

Earlier we explored some of the data aggregations available for NumPy
arrays. As with a one-dimensional NumPy array, for a Pandas Series the
aggregates return a single value:

EXAMPLE 3

PROGRAM                                  OUTPUT

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser                                      0    0.374540
                                         1    0.950714
                                         2    0.731994
                                         3    0.598658
                                         4    0.156019
                                         dtype: float64

EXAMPLE 4

PROGRAM OUTPUT

ser.sum() 2.8119254917081569

ser.mean() 0.56238509834163142


For a DataFrame, by default the aggregates return results within each

column:

EXAMPLE 5

PROGRAM OUTPUT

df = pd.DataFrame({'A':
rng.rand(5), 'B': rng.rand(5)})

df

df.mean()

df.mean(axis='columns')

By specifying the axis argument, you can instead aggregate within each
row. Pandas Series and DataFrames include all of the common aggregates
mentioned in "Aggregations: Min, Max, and Everything in Between". In
addition, there is a convenience method describe() that computes several
common aggregates for each column and returns the result. Let's use this
on the Planets data, for now dropping rows with missing values:

EXAMPLE 6

PROGRAM OUTPUT

planets.dropna().describe()


Listing of Pandas aggregation methods

S.No   Aggregation        Description
1      count()            Total number of items
2      first(), last()    First and last item
3      mean(), median()   Mean and median
4      min(), max()       Minimum and maximum
5      std(), var()       Standard deviation and variance
6      mad()              Mean absolute deviation
7      prod()             Product of all items
8      sum()              Sum of all items

These are all methods of DataFrame and Series objects.


To go deeper into the data, however, simple aggregates are often not
enough. The next level of data summarization is the group by operation,
which allows you to quickly and efficiently compute aggregates on subsets
of data.

3.13.3 GroupBy

Simple aggregations can give you a flavor of your dataset, but often we
would prefer to aggregate conditionally on some label or index: this is
implemented in the so-called groupby operation. The name "group by"
comes from a command in the SQL database language, but it is perhaps
more illuminative to think of it in the terms first coined by Hadley
Wickham of Rstats fame: split, apply, combine.
Split, apply, combine
A canonical example of this split-apply-combine operation, where the

“apply” is a summation aggregation, is illustrated in Figure 3-1. Figure 3-1


makes clear what the GroupBy accomplishes:


• The split step involves breaking up and grouping a DataFrame


depending on the value of the specified key.

• The apply step involves computing some function, usually an aggregate,


transformation, or filtering, within the individual groups.

• The combine step merges the results of these operations into an output

array.

While we could certainly do this manually using some combination of the
masking, aggregation, and merging commands covered earlier, it's
important to realize that the intermediate splits do not need to be
explicitly instantiated. Rather, the GroupBy can (often) do this in a single
pass over the data, updating the sum, mean, count, min, or other
aggregate for each group along the way. The power of the GroupBy is that
it abstracts away these steps: the user need not think about how the
computation is done under the hood, but rather thinks about the
operation as a whole. As a concrete example, let's take a look at using
Pandas for the computation shown in Figure 3-1. We'll start by creating
the input DataFrame:

EXAMPLE 7

PROGRAM OUTPUT

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A',


'B', 'C'], 'data': range(6)}, columns=['key',
'data'])

df


We can compute the most basic split-apply-combine operation with the


groupby() method of DataFrames, passing the name of the desired key
column:

EXAMPLE 8

PROGRAM OUTPUT

df.groupby('key').sum() A 3

B 5

C 7

The sum() method is just one possibility here; you can apply virtually any

common Pandas or NumPy aggregation function, as well as virtually any


valid DataFrame operation, as we will see in the following discussion.

EXAMPLE 9

PROGRAM OUTPUT

import numpy as np
import pandas as pd
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'],
                  index=['AA', 'BB', 'cc', 'dd', 'ee', 'ff'])
df

Aggregation. We’re now familiar with GroupBy aggregations with sum(),


median(), and the like, but the aggregate() method allows for even more
flexibility. It can take a string, a function, or a list thereof, and compute
all the aggregates at once. Here is a quick example combining all these:

EXAMPLE 10

PROGRAM OUTPUT


df.groupby('key').aggregate(['min', np.median, max])

Another useful pattern is to pass a dictionary mapping column names to


operations to be applied on that column.
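A minimal sketch of that dictionary pattern (the particular min/max choices below
are illustrative, not from the original notes), using a DataFrame like the one in
EXAMPLE 9:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key':   ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Each key of the dictionary names a column; each value names the aggregation for it.
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})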

PIVOT TABLE

A pivot table is a table of grouped values that aggregates the individual items of a much bigger table.
A pivot table provides a summary of discrete categories, such as sums and averages, as well as various
other statistics of interest.
A pivot table serves as a very useful tool for exploring and analyzing your data, and makes it easy to
perform comparisons and view trends.

Loading the data


Let’s load the data into a Pandas DataFrame:

import pandas as pd
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
df = pd.read_csv(data_url)
df


Finding the mean values for each country

To start off, let’s find the mean of the various statistics for each country using the pivot_table() function:

pd.pivot_table(df,
               index='country',
               aggfunc='mean')

The index parameter specifies the index to use for the result of the function. The aggfunc parameter specifies
the function to apply on the numeric columns of the dataframe. The following figure shows the result and how
the various parameters dictate the outcome:


The default aggfunc value is mean (a Pandas function) if you do not specify its value. You can also supply a
NumPy function such as np.mean (passed as the function itself, not as a string), or any function that returns
an aggregated value:

import numpy as np
pd.pivot_table(df,
               index='country',
               aggfunc=np.mean)

Finding the mean GDP and mean, max, and min of life expectancies

From the previous result you see that it does not really make sense to calculate the mean of the year column.
Also, you might want to know the minimum, maximum, and average life expectancies of each country. To do
so, you can specify a dictionary for the aggfunc parameter and indicate what function to apply to which column:

import numpy as np

pd.pivot_table(df,
               index='country',
               aggfunc={'gdpPercap': np.mean,
                        'lifeExp': [np.mean, np.max, np.min]})


The above code snippet produces the following output:

Observe that since we did not specify the pop and year columns in the dictionary, they will no longer appear in
the result.

Finding the mean values of each country for each year

The index parameter also accepts a list of columns, which results in a multi-index dataframe. For
example, I want to know the mean GDP, life expectancies, and population of each country for each year from
1952 to 2007. I can do it like this:

pd.pivot_table(df,
               index=['country', 'year'],
               aggfunc='mean')

The result is a multi-index dataframe with country and year as the index, and the rest of the numeric fields as
columns:


Finding the mean values for each continent

If you want to find the mean values for each continent, simply set the index parameter to continent:

pd.pivot_table(df,
               index='continent')

You will now see the following result:


Finding the population of each country

If you want to know the mean population for each country from 1952 to 2007, set the index to country and
values to pop:

pd.pivot_table(df,
               index='country',
               values='pop',
               aggfunc='mean')

The following shows the use of the values parameter:

Finding the mean life expectancies for each continent

To find the mean life expectancies for each continent, set the index and values parameters as follows:

pd.pivot_table(df,
               index='continent',
               values='lifeExp',
               aggfunc='mean')


You will see the result as follows:

What if you want to flip the columns and rows of the result? Easy, change the index parameter to columns:

pd.pivot_table(df,
               columns='continent',
               values='lifeExp',
               aggfunc='mean')

The following figure shows the result and the use of the various parameters:

Finding the life expectancies of each country in the various continents

Next, we want to know the life expectancies of each country in each of the five continents. We could do this:

pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')

Now the life expectancies of each country will be displayed in the respective continent that the country belongs
to:

Notice the NaNs scattered in the result. If you do not want to see the NaNs, you can set the fill_value parameter
to fill them with some values, such as 0s:

pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean',
               fill_value=0)

You should now see 0s instead of NaNs:


Finding the mean life expectancies of each continent by year

Finally, let’s find the mean life expectancies of each continent and group them by year:

pd.pivot_table(df,
               index='year',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')

The figure below shows the result and the use of the various parameters:


Vectorized string operations


Vectorized string operations are an essential part of data analysis, especially when dealing with datasets that
have text data.

Traditionally, when dealing with string data in a dataset, programmers have to loop over the data and perform
operations on each element one at a time. This can be time-consuming, especially when dealing with large
datasets. Vectorized string operations solve this problem by allowing programmers to perform operations on
entire arrays of string data at once.
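To make the contrast concrete, here is a small sketch (the names below are made up for this illustration,
not taken from any dataset in these notes) showing a loop-based approach next to the equivalent vectorized call:

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# Loop-based approach: element-by-element work, and it needs a guard for missing values.
looped = [s.capitalize() if isinstance(s, str) else s for s in names]

# Vectorized approach: one call on the whole Series; missing values pass through unchanged.
vectorized = names.str.capitalize()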

Advantages of vectorized string operations

1. Speed: As mentioned earlier, vectorized string operations are faster than traditional string operations as they
allow operations to be performed on entire arrays of string data at once.

2. Code simplification: Using vectorized string operations can lead to simpler and more concise code, as
programmers no longer need to loop over the data and perform operations on each element one at a time.

3. Ease of use: Vectorized string operations are easy to use, and programmers don’t need to have advanced
knowledge of string manipulation to use them.

Operations that can be performed using vectorized string operations


1. Concatenation: Concatenation is the process of joining two or more strings together.

2. Splitting: Splitting is the process of dividing a string into multiple parts based on a specific delimiter.

3. Substring extraction: Substring extraction is the process of extracting a part of a string.

4. Case conversion: Case conversion is the process of converting the case of a string to uppercase or lowercase.

5. Search and replace: Search and replace is the process of finding a specific substring in a string and replacing
it with a different substring.

Load the titanic dataset

Example 1: Splitting

To split the name to First name and Last Name into separate columns we can use the vectorized str.split()
method:
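The worked screenshots from the original notes are not reproduced here. The following is a minimal sketch
under the assumption that the Titanic data is a Kaggle-style CSV named titanic.csv with a Name column
formatted as "Last, Title. First"; adjust the file name and column names to your copy of the dataset.

import pandas as pd

df = pd.read_csv('titanic.csv')                      # assumed file and layout

# Split each name once at the comma; expand=True returns the pieces as columns.
parts = df['Name'].str.split(',', n=1, expand=True)
df['LastName'] = parts[0].str.strip()
df['FirstName'] = parts[1].str.strip()
df[['Name', 'FirstName', 'LastName']].head()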


Example 2: Concatenation

To concatenate the first name and last name columns to create a full name column, we can use the vectorized
str.cat() method:
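A minimal sketch, continuing the same assumed titanic.csv and the FirstName/LastName columns derived in
Example 1:

import pandas as pd

df = pd.read_csv('titanic.csv')                      # assumed file, as in Example 1
parts = df['Name'].str.split(',', n=1, expand=True)
df['LastName'], df['FirstName'] = parts[0].str.strip(), parts[1].str.strip()

# str.cat() joins two string Series element-wise, with an optional separator.
df['FullName'] = df['FirstName'].str.cat(df['LastName'], sep=' ')
df['FullName'].head()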


Example 3: Substring extraction

To extract the title of each passenger from the name column, we can use the vectorized str.extract() method:
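A minimal sketch under the same assumption of a Name column such as "Braund, Mr. Owen Harris"; the
regular expression is an illustrative choice:

import pandas as pd

df = pd.read_csv('titanic.csv')                      # assumed file with a Name column

# The title ("Mr", "Mrs", "Miss", ...) sits between a space and a period,
# so one capture group extracts it for every row at once.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'].value_counts().head()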


Example 4: Replacing substrings

The str.replace() method can be used to replace specific substrings with other substrings within a string column.
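A minimal sketch with the same assumed Name column; the particular substitution ("Mr." to "Mister") is only
an example:

import pandas as pd

df = pd.read_csv('titanic.csv')                      # assumed file with a Name column

# Replace a literal substring everywhere it occurs in the column.
df['Name'] = df['Name'].str.replace('Mr.', 'Mister', regex=False)
df['Name'].head()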


Example 5: Filtering

The str.contains() method can be used to filter a dataframe based on whether a string column contains any of a
list of substrings.

Filter out all the passengers whose name starts with “B” and ends with “e”
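A minimal sketch of that filter with the same assumed Name column; str.contains() accepts a regular
expression, so the pattern below anchors the start and end of the string:

import pandas as pd

df = pd.read_csv('titanic.csv')                      # assumed file with a Name column

# Keep rows whose name starts with "B" and ends with "e";
# na=False treats missing names as non-matches.
mask = df['Name'].str.contains(r'^B.*e$', na=False)
df[mask].head()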


Example 6: Slicing

Vectorized string methods in Pandas also allow us to slice strings in a Series using the familiar syntax of
Python's built-in slicing notation: str[start:stop:step]. The start index is inclusive and the stop index is
exclusive, while the step argument specifies the stride or interval of the slice.

Extract the first 3 characters of each name

Extract the last 5 characters of each name


Reverse each name
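A minimal sketch of the three slices with the same assumed Name column:

import pandas as pd

df = pd.read_csv('titanic.csv')          # assumed file with a Name column

df['Name'].str[:3]       # first 3 characters of each name
df['Name'].str[-5:]      # last 5 characters of each name
df['Name'].str[::-1]     # each name reversed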

Example 7: Case Conversion

str.lower() method to convert all text to lowercase.

str.upper() method to convert all text to uppercase


str.capitalize() method to capitalize the first letter of the text

str.title() method to title case each name, which means to capitalize the first letter of each word.
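A minimal sketch of the four case conversions with the same assumed Name column:

import pandas as pd

df = pd.read_csv('titanic.csv')          # assumed file with a Name column

df['Name'].str.lower()        # all lowercase
df['Name'].str.upper()        # all uppercase
df['Name'].str.capitalize()   # only the first letter of the string capitalized
df['Name'].str.title()        # first letter of every word capitalized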
