UNIT II: EDA USING PYTHON
import pandas
pandas.__version__
Out[1]:
'0.18.1'
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd:
In [2]:
import pandas as pd
Reminder about Built-In Documentation
For example, to display all the contents of the pandas namespace, you can use the tab-completion feature:
In [3]: pd.<TAB>
And to display Pandas's built-in documentation, you can type:
In [4]: pd?
In the NumPy unit we looked at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar.
We'll start with the simple case of the one-dimensional Series object, which in many ways acts like a dictionary:
EXAMPLE 1
PROGRAM OUTPUT
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data['b']
0.5

'a' in data
True
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values. Series objects can even be modified with a dictionary-like syntax: just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues. A Series also builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:
EXAMPLE 2
PROGRAM OUTPUT
# slicing by explicit index
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64

# slicing by implicit integer index
data[0:2]
a    0.25
b    0.50
dtype: float64

# fancy indexing
data[['a', 'e']]
a    0.25
e    1.25
dtype: float64
Among these, slicing may be the source of the most confusion. Notice that
when slicing with an explicit index (i.e., data['a':'c']), the final index is
included in the slice, while when slicing with an implicit index (i.e.,
data[0:2]), the final index is excluded from the slice.
Indexers: loc, iloc, and ix
Consider, for example, a Series with an explicit integer index:
EXAMPLE 3
PROGRAM OUTPUT
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data.
First, the loc attribute allows indexing and slicing that always references the explicit index:
EXAMPLE 4
PROGRAM OUTPUT
data.loc[1]
'a'

data.loc[1:3]
1    a
3    b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
EXAMPLE 4
PROGRAM OUTPUT
data.iloc[1]
'b'

data.iloc[1:3]
3    b
5    c
dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and to prevent subtle bugs due to the mixed indexing/slicing convention. (Note that the ix indexer has been deprecated in newer versions of Pandas.)
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let's return to our example of the areas and populations of states:
EXAMPLE 4
PROGRAM OUTPUT
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
EXAMPLE 4
PROGRAM OUTPUT
data['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are strings:
EXAMPLE 4
PROGRAM OUTPUT
data.area
Though this is a useful shorthand, keep in mind that it does not work for all cases! For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. In particular, the DataFrame has a pop() method, so data.pop will point to this method rather than to the 'pop' column:
data.pop is data['pop']
False
Like with the Series objects discussed earlier, this dictionary-style syntax
can also be used to modify the object, in this case adding a new column:
EXAMPLE 4
PROGRAM OUTPUT
data['density'] = data['pop'] / data['area']
data
This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects; we'll dig into this further in Operating on Data in Pandas.
As mentioned earlier, we can also view the DataFrame as an enhanced two-dimensional array, and examine the raw underlying data array using the values attribute:
EXAMPLE 4
PROGRAM OUTPUT
data.values
array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
With this picture in mind, many familiar array-like observations can be done on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:
EXAMPLE 5
PROGRAM OUTPUT
data.T
Passing a single "index" to a DataFrame accesses a column:
EXAMPLE 5
PROGRAM OUTPUT
data['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
Thus, for array-style indexing of a DataFrame, Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:
EXAMPLE 5
PROGRAM OUTPUT
data.iloc[:3, :2]
Similarly, using the loc indexer we can index the underlying data in an
array-like style but using the explicit index and column names:
EXAMPLE 5
PROGRAM OUTPUT
data.loc[:'Illinois', :'pop']
Keep in mind that for integer indices, the ix indexer is subject to the same potential sources of confusion as discussed for integer-indexed Series objects.
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources – both potentially error-prone tasks with raw NumPy arrays – become essentially foolproof with Pandas.
EXAMPLE 5
PROGRAM OUTPUT
import pandas as pd
import numpy as np

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0    6
1    3
2    7
3    4
dtype: int64

df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df
   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4
If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:
PROGRAM OUTPUT
np.exp(ser)
Or, for a slightly more complex calculation:
np.sin(df * np.pi / 4)
Any of the ufuncs discussed in "Computation on NumPy Arrays: Universal Functions" can be used in a similar manner.
As another example, suppose we are combining two different data sources:
EXAMPLE 5
PROGRAM OUTPUT
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area
Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays,
which we could determine using standard Python set arithmetic on these
indices:
EXAMPLE 5
PROGRAM OUTPUT
area.index | population.index
Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data. This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:
EXAMPLE 5
PROGRAM OUTPUT
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
If using NaN values is not the desired behavior, we can modify the fill
value using appropriate object methods in place of the operators. For
example, calling A.add(B) is equivalent to calling A + B, but allows
optional explicit specification of the fill value for any elements in A or B
that might be missing:
EXAMPLE 6
PROGRAM OUTPUT
A.add(B, fill_value=0)
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames:
EXAMPLE 6
PROGRAM OUTPUT
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
A + B
Notice that indices are aligned correctly irrespective of their order in the
two objects, and indices in the result are sorted. As was the case with
Series, we can use the associated object’s arithmetic method and pass
any desired fill_value to be used in place of missing entries. Here we’ll fill
with the mean of all values in A (which we compute by first stacking the
rows of A):
EXAMPLE 6
PROGRAM OUTPUT
fill = A.stack().mean()
A.add(B, fill_value=fill)
Table 3-1 lists Python operators and their equivalent Pandas object methods.

Python operator    Pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
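As a quick, minimal sketch of this equivalence, the operator form and the method form produce identical aligned results, while the method form additionally accepts a fill_value:

import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# '*' and mul() compute the same aligned, element-wise product
print((A * B).equals(A.mul(B)))        # True

# The method form can substitute a value for missing entries before operating
print(A.mul(B, fill_value=1))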
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array; the convention is that the operation is applied row-wise by default:
EXAMPLE 6
PROGRAM OUTPUT
A = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list('QRST'))
print(df)
df - df.iloc[0]
If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:
EXAMPLE 6
PROGRAM OUTPUT
df.subtract(df['R'], axis=0)
Note that these DataFrame/Series operations will automatically align indices between the two elements:
EXAMPLE 6
PROGRAM OUTPUT
halfrow = df.iloc[0, ::2]
halfrow
EXAMPLE 6
PROGRAM OUTPUT
df - halfrow
HANDLING MISSING DATA
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous; in particular, many interesting datasets have some amount of data missing. In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. Here and throughout this unit, we'll refer to missing data in general as null, NaN, or NA values.
3.10.1 Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in arbitrary NumPy/Pandas arrays, but only in arrays with data type 'object' (i.e., arrays of Python objects):
EXAMPLE 1
PROGRAM OUTPUT
vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types. The other missing data representation, NaN (an acronym for "Not a Number"), is different: it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
EXAMPLE 1
PROGRAM OUTPUT
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
Notice that NumPy chose a native floating-point type for this array: this
means that unlike the object array from before, this array supports fast
operations pushed into compiled code. You should be aware that NaN is a
bit like a data virus–it infects any other object it touches. Regardless of
the operation, the result of arithmetic with NaN will be another NaN:
EXAMPLE 1
PROGRAM OUTPUT
1 + np.nan
nan

0 * np.nan
nan
Note that this means that aggregates over the values are well defined (i.e., they don't result in an error) but not always useful:
EXAMPLE 1
PROGRAM OUTPUT
vals2.sum(), vals2.min(), vals2.max()
(nan, nan, nan)
NumPy does provide some special aggregations that will ignore these
missing values:
EXAMPLE 1
PROGRAM OUTPUT
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
(8.0, 1.0, 4.0)
NaN and None both have their place, and Pandas is built to handle the
two of them nearly interchangeably, converting between them where
appropriate:
EXAMPLE 1
PROGRAM OUTPUT
pd.Series([1, np.nan, 2, None])
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to None, it will automatically be upcast to a floating-point type:
EXAMPLE 1
PROGRAM OUTPUT
x = pd.Series(range(2), dtype=int)
x
0    0
1    1
dtype: int64

x[0] = None
x
0    NaN
1    1.0
dtype: float64
Notice that Pandas automatically converted the None to a NaN value, and upcast the integer array to floating point. While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and only rarely causes issues.
Keep in mind that in Pandas, string data is always stored with an object
dtype.
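A one-line check illustrates this (a minimal sketch):

import pandas as pd

# Strings are stored as Python objects, so the Series dtype is 'object'
s = pd.Series(['apple', 'banana', 'cherry'])
print(s.dtype)   # object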
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures: isnull(), notnull(), dropna(), and fillna(). We will conclude this topic with a brief exploration and demonstration of these routines.
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
EXAMPLE 1
PROGRAM OUTPUT
data = pd.Series([1, np.nan, 'hello', None])

data.isnull()
0    False
1     True
2    False
3     True
dtype: bool

data[data.notnull()]
0        1
2    hello
dtype: object
The isnull() and notnull() methods produce similar Boolean results for
DataFrames.
EXAMPLE 1
PROGRAM OUTPUT
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

df.dropna()

df.dropna(axis='columns')
We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame. By default, dropna() will drop all rows in which any null value is present.
Alternatively, you can drop NA values along a different axis; axis=1 (or axis='columns') drops all columns containing a null value. But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.
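As a short sketch of this fine control, thresh=3 keeps only rows that contain at least three non-null values (using the same df as above):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Rows 0 and 2 contain only two non-null values each, so they are dropped
print(df.dropna(thresh=3))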
Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.
Consider the following Series:
EXAMPLE 1
PROGRAM OUTPUT
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
We can fill NA entries with a single value, such as zero:
data.fillna(0)
We can specify a forward fill to propagate the previous value forward:
EXAMPLE 1
PROGRAM OUTPUT
data.fillna(method='ffill')
Or we can specify a back-fill to propagate the next values backward:
data.fillna(method='bfill')
For DataFrames the options are similar, but we can also specify an axis along which the fills take place:
df.fillna(method='ffill', axis=1)
Notice that if a previous value is not available during a forward fill, the NA
value remains.
HIERARCHICAL INDEXING
Often it is useful to store higher-dimensional data – that is, data indexed by more than one or two keys. While Pandas does provide Panel and Panel4D objects that natively handle three- and four-dimensional data, a far more common pattern in practice is hierarchical indexing (also known as multi-indexing), which incorporates multiple index levels within a single index. In this section we'll explore the creation of MultiIndex objects; indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of data.
import pandas as pd
import numpy as np
Let's consider a series of data where each point has a character and numerical key. Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:
EXAMPLE 1
PROGRAM OUTPUT
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:
EXAMPLE 2
PROGRAM OUTPUT
pop[('California', 2010):('Texas', 2000)]
But the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to do it:
EXAMPLE 3
PROGRAM OUTPUT
pop[[i for i in pop.index if i[1] == 2010]]
This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we've grown to love in Pandas. Fortunately, Pandas provides a better way: our tuple-based index is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the operations we wish to have. We can create a multi-index from the tuples as follows:
EXAMPLE 4
PROGRAM OUTPUT
index = pd.MultiIndex.from_tuples(index)
index
If we reindex our series with this MultiIndex, we see the hierarchical representation of the data:
EXAMPLE 5
PROGRAM OUTPUT
pop = pop.reindex(index)
pop
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
EXAMPLE 6
PROGRAM OUTPUT
pop[:, 2010]
The result is a singly indexed array with just the keys we're interested in. This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution we started with. We'll now further discuss this sort of indexing operation on hierarchically indexed data.
You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
EXAMPLE 7
PROGRAM OUTPUT
pop_df = pop.unstack()
pop_df
Naturally, the stack() method provides the opposite operation:
pop_df.stack()
Seeing this, you might wonder why we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. For example, we may want to add another column of demographic data for each state at each year (say, population under 18):
EXAMPLE 8
PROGRAM OUTPUT
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
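With this extra column in place, the ufuncs and other functionality discussed earlier work with hierarchical indices as well. As a short sketch (reusing the pop_df just constructed), we can compute the fraction of people under 18 by year and reshape the result:

# Element-wise arithmetic between the two columns, then unstack() to
# view the result as a conventional year-by-state table
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()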
COMBINING DATASETS: MERGE AND JOIN
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function, and a few examples of how it works in practice follow. The pd.merge() function implements several types of joins: one-to-one, many-to-one, and many-to-many. All three are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data. Here we will show simple examples of the three types of merges.
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to column-wise concatenation. As a concrete example, consider the following two DataFrames, which contain information on several employees in a company:
EXAMPLE 1
PROGRAM OUTPUT
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1)
print(df2)
To combine this information into a single DataFrame, we can use the pd.merge() function:
EXAMPLE 2
PROGRAM OUTPUT
df3 = pd.merge(df1, df2)
df3
The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each "employee" column differs between df1 and df2, and the pd.merge() function correctly accounts for this. Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index.
Many-to-one joins are joins in which one of the two key columns contains duplicate entries; the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example:
EXAMPLE 3
PROGRAM OUTPUT
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right arrays contains duplicates, then the result is a many-to-many merge. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group:
EXAMPLE 4
PROGRAM OUTPUT
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering',
                              'HR', 'HR'],
                    'skills': ['math', 'spreadsheets',
                               'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
These three types of joins can be used with other Pandas tools to implement a wide array of functionality. But in practice, datasets are rarely as clean as the one we're working with here. In the following section we'll consider some of the options provided by pd.merge() that enable you to tune how the join operations work.
We've already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs, and uses this as the key. However, often the column names will not match so nicely, and pd.merge() provides a variety of options for handling this. Most simply, you can explicitly specify the name of the key column using the on keyword:
EXAMPLE 5
PROGRAM OUTPUT
pd.merge(df1, df2, on='employee')
This option works only if both the left and right DataFrames have the specified column name.
An important consideration in performing a merge is the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other. Consider this example:
EXAMPLE 6
PROGRAM OUTPUT
df6 = pd.DataFrame({'name':
['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
pd.DataFrame({'name': ['Mary',
'Joseph'],'drink': ['pepsi', 'coke']},
columns=['name', 'drink'])
df7)') print(df6)
print(df7)
print(pd.merge(df6, df7))
Here we have merged two datasets that have only a single "name" entry in
common: Mary. By default, the result contains the intersection of the two
sets of inputs; this is what is known as an inner join.
Other options for the how keyword are 'outer', 'left', and 'right'. An outer
join returns a join over the union of the input columns, and fills in all
missing values with NAs:
EXAMPLE 7
PROGRAM OUTPUT
print(pd.merge(df6, df7, how='outer'))
The left join and right join return joins over the left entries and right
entries, respectively. For example:
EXAMPLE 8
PROGRAM OUTPUT
print(pd.merge(df6, df7, how='left'))

print(pd.merge(df6, df7, how='right'))
AGGREGATION AND GROUPING
Planets Data
Here we will use the Planets dataset, available via the Seaborn package. It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets, or exoplanets for short). It can be downloaded with a simple Seaborn command:
EXAMPLE 1
PROGRAM OUTPUT
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
(1035, 6)
EXAMPLE 2
PROGRAM OUTPUT
planets.head()
As with a one-dimensional NumPy array, the aggregates of a Pandas Series return a single value. Consider a simple Series:
EXAMPLE 3
PROGRAM OUTPUT
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64
EXAMPLE 4
PROGRAM OUTPUT
ser.sum()
2.8119254917081569

ser.mean()
0.56238509834163142
For a DataFrame, by default the aggregates return results within each column:
EXAMPLE 5
PROGRAM OUTPUT
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

df.mean()
By specifying the axis argument, you can instead aggregate within each row:
df.mean(axis='columns')
Pandas Series and DataFrames include all of the common aggregates mentioned in "Aggregations: Min, Max, and Everything in Between". In addition, there is a convenience method describe() that computes several common aggregates for each column and returns the result. Let's use this on the Planets data, for now dropping rows with missing values:
EXAMPLE 6
PROGRAM OUTPUT
planets.dropna().describe()
The following table summarizes some other built-in Pandas aggregations:

S.NO  Aggregation         Description
1     count()             Total number of items
2     first(), last()     First and last item
3     mean(), median()    Mean and median
4     min(), max()        Minimum and maximum
5     std(), var()        Standard deviation and variance
6     mad()               Mean absolute deviation
7     prod()              Product of all items
8     sum()               Sum of all items
3.13.3 GroupBy
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation, which is best thought of in terms of split, apply, combine:
• The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an output array.
As a concrete example, let's take a look at using Pandas for the computation shown in Figure 3-1. We'll start by creating the input DataFrame:
EXAMPLE 7
PROGRAM OUTPUT
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
The most basic split-apply-combine operation can be computed with the groupby() method, passing the name of the desired key column, and then applying an aggregate such as sum():
EXAMPLE 8
PROGRAM OUTPUT
df.groupby('key').sum()
     data
key
A       3
B       5
C       7
The sum() method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid DataFrame operation.
The aggregate() method allows for even more flexibility: it can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is a quick example:
EXAMPLE 9
PROGRAM OUTPUT
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  index=['AA', 'BB', 'cc', 'dd', 'ee', 'ff'])
df
EXAMPLE 10
PROGRAM OUTPUT
df.groupby('key').aggregate(['min', np.median, max])
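Another useful pattern (a brief sketch reusing the df above) is to pass aggregate() a dictionary mapping column names to the operation to be applied to each column:

# Apply a different aggregate to each column within each group
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})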
PIVOT TABLE
A pivot table is a table of grouped values that aggregates the individual items of a much bigger table. A pivot table provides a summary of discrete categories, using aggregates such as sums and averages, as well as various other statistics of interest. A pivot table serves as a very useful tool for exploring and analyzing your data, and makes it easy to perform comparisons and view trends.
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
df = pd.read_csv(data_url)
df
To start off, let’s find the mean of the various statistics for each country using the pivot_table() function:
pd.pivot_table(df,
               index='country',
               aggfunc='mean')
The index parameter specifies the index to use for the result of the function. The aggfunc parameter specifies
the function to apply on the numeric columns of the dataframe. The following figure shows the result and how
the various parameters dictate the outcome:
The default aggfunc value is 'mean' (a Pandas function) if you do not specify its value. You can also supply a NumPy function such as np.mean, or any function that returns an aggregated value:
import numpy as np

pd.pivot_table(df,
               index='country',
               aggfunc=np.mean)
Finding the mean GDP and mean, max, and min of life expectancies
From the previous result you see that it does not really make sense to calculate the mean of the year column.
Also, you might want to know the minimum, maximum, and average life expectancies of each country. To do
so, you can specify a dictionary for the aggfunc parameter and indicate what function to apply to which column:
import numpy as np

pd.pivot_table(df,
               index='country',
               aggfunc={'gdpPercap': np.mean,
                        'lifeExp': [np.mean, np.max, np.min]})
Observe that since we did not specify the pop and year columns in the dictionary, they will no longer appear in
the result.
The index parameter also accepts a list of columns, which results in a multi-index dataframe. For example, to find the mean GDP, life expectancies, and population of each country for each year from 1952 to 2007:
pd.pivot_table(df,
               index=['country', 'year'],
               aggfunc='mean')
The result is a multi-index dataframe with country and year as the index, and the rest of the numeric fields as columns:
If you want to find the mean values for each continent, simply set the index parameter to continent:
pd.pivot_table(df,
               index='continent')
If you want to know the mean population for each continent from 1952 to 2007, set the index to continent and values to pop:
pd.pivot_table(df,
               index='continent',
               values='pop',
               aggfunc='mean')
To find the mean life expectancies for each continent, set the index and values parameters as follows:
pd.pivot_table(df,
               index='continent',
               values='lifeExp',
               aggfunc='mean')
What if you want to flip the columns and rows of the result? Easy: change the index parameter to columns:
pd.pivot_table(df,
               columns='continent',
               values='lifeExp',
               aggfunc='mean')
The following figure shows the result and the use of the various parameters:
Next, we want to know the life expectancies of each country in each of the five continents. We could do this:
pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')
Now the life expectancies of each country will be displayed under the continent that the country belongs to:
Notice the NaNs scattered in the result. If you do not want to see the NaNs, you can set the fill_value parameter
to fill them with some values, such as 0s:
pd.pivot_table(df,
               index='country',
               columns='continent',
               values='lifeExp',
               aggfunc='mean',
               fill_value=0)
Finally, let’s find the mean life expectancies of each continent and group them by year:
pd.pivot_table(df,
               index='year',
               columns='continent',
               values='lifeExp',
               aggfunc='mean')
The figure below shows the result and the use of the various parameters:
VECTORIZED STRING OPERATIONS
Traditionally, when dealing with string data in a dataset, programmers have to loop over the data and perform operations on each element one at a time. This can be time-consuming, especially when dealing with large datasets. Vectorized string operations solve this problem by allowing programmers to perform operations on entire arrays of string data at once.
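A minimal sketch makes the contrast concrete: a plain Python loop breaks on missing values, while the Pandas str accessor applies the operation across the whole Series at once:

import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# A loop such as [s.capitalize() for s in names] would raise an
# AttributeError on the None entry; the vectorized version skips it:
print(names.str.capitalize())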
The benefits of vectorized string operations include:
1. Speed: As mentioned earlier, vectorized string operations are faster than traditional string operations, as they allow operations to be performed on entire arrays of string data at once.
2. Code simplification: Using vectorized string operations can lead to simpler and more concise code, as
programmers no longer need to loop over the data and perform operations on each element one at a time.
3. Ease of use: Vectorized string operations are easy to use, and programmers don’t need to have advanced
knowledge of string manipulation to use them.
Some common vectorized string operations include:
1. Concatenation: Concatenation is the process of joining two or more strings together.
2. Splitting: Splitting is the process of dividing a string into multiple parts based on a specific delimiter.
3. Case conversion: Case conversion is the process of converting the case of a string to uppercase or lowercase.
4. Search and replace: Search and replace is the process of finding a specific substring in a string and replacing it with a different substring.
Example 1: Splitting
To split the name into First Name and Last Name columns, we can use the vectorized str.split() method:
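The original dataset is not shown here, so the sketch below assumes a hypothetical df with a Name column of 'First Last' strings:

import pandas as pd

df = pd.DataFrame({'Name': ['John Smith', 'Jane Doe', 'Mary Major']})

# expand=True returns the split pieces as separate DataFrame columns
parts = df['Name'].str.split(' ', expand=True)
df['First Name'] = parts[0]
df['Last Name'] = parts[1]
print(df)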
Example 2: Concatenation
To concatenate the first name and last name columns to create a full name column, we can use the vectorized
str.cat() method:
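A sketch under the same assumption of hypothetical First Name and Last Name columns:

import pandas as pd

df = pd.DataFrame({'First Name': ['John', 'Jane'],
                   'Last Name': ['Smith', 'Doe']})

# str.cat joins two string Series element-wise, with a separator
df['Full Name'] = df['First Name'].str.cat(df['Last Name'], sep=' ')
print(df)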
Example 3: Extraction
To extract the title of each passenger from the name column, we can use the vectorized str.extract() method:
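A sketch assuming hypothetical Titanic-style names, where the title sits between the comma and the period:

import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Cumings, Mrs. John Bradley',
                   'Heikkinen, Miss. Laina'])

# The capture group picks out the word that precedes a period
print(names.str.extract(r' ([A-Za-z]+)\.', expand=False))   # Mr, Mrs, Miss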
Example 4: Replacement
The str.replace() method can be used to replace specific substrings with other substrings within a string column.
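For example (a sketch with made-up values):

import pandas as pd

names = pd.Series(['Mr. Owen Harris', 'Mrs. Florence Briggs'])

# Replace the substring 'Mr.' with 'Mister' (the period is escaped
# so the pattern matches it literally)
print(names.str.replace(r'Mr\.', 'Mister', regex=True))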
Example 5: Filtering
The str.contains() method can be used to filter a dataframe based on whether a string column contains any of a
list of substrings.
For example, to filter out all the passengers whose name starts with "B" and ends with "e":
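A sketch using a hypothetical Name column:

import pandas as pd

df = pd.DataFrame({'Name': ['Bonnie', 'Braund', 'Blanche', 'Alice']})

# A regex anchored at both ends: starts with 'B', ends with 'e'
mask = df['Name'].str.contains(r'^B.*e$')
print(df[mask])          # Bonnie, Blanche

# Equivalently, combine the startswith/endswith predicates
print(df[df['Name'].str.startswith('B') & df['Name'].str.endswith('e')])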
Example 6: Slicing
Vectorized string methods in Pandas also allow us to slice strings in a Series using the familiar syntax of Python’s built-in slicing notation: str[start:stop:step]. The start index is inclusive and the stop index is exclusive, while the step argument specifies the stride or interval of the slice.
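For example (a minimal sketch):

import pandas as pd

s = pd.Series(['Monday', 'Tuesday', 'Wednesday'])

# First three characters of each string (the stop index is excluded)
print(s.str[0:3])    # Mon, Tue, Wed

# Every second character of each string
print(s.str[::2])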
Example 7: Case conversion
We can use the str.title() method to title-case each name, which means capitalizing the first letter of each word.
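For example (a minimal sketch):

import pandas as pd

names = pd.Series(['owen harris', 'florence briggs thayer'])

# Capitalize the first letter of each word
print(names.str.title())    # 'Owen Harris', 'Florence Briggs Thayer'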