Data Preprocessing Python Tome III
Note that the filtered result is itself a DataFrame, so you can work with its columns just like any
other DataFrame.
For example, let’s find the average grade for students who undertook more than the average amount
of study time.
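A minimal sketch of that query follows. The four-row DataFrame is hypothetical sample data; only the column names (Name, StudyHours, Grade) come from this document.

```python
import pandas as pd

# Hypothetical sample data; column names follow the df_students DataFrame used in the text
df_students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'StudyHours': [6.0, 10.0, 14.0, 18.0],
    'Grade': [40.0, 50.0, 70.0, 90.0],
})

# Mean study time across all students
mean_study = df_students['StudyHours'].mean()

# Filter to the students above that mean, then average their Grade column
mean_grade = df_students[df_students['StudyHours'] > mean_study]['Grade'].mean()

print(mean_study, mean_grade)  # 12.0 80.0
```

Because the filter returns a DataFrame, the `['Grade']` selection and `.mean()` call chain directly onto it.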
Let’s assume that the passing grade for the course is 60.
We can use that information to add a new column to the DataFrame, indicating whether or not
each student passed.
First, we’ll create a Pandas Series containing the pass/fail indicator (True or False), and then we’ll
concatenate that series as a new column (axis 1) in the DataFrame.
df_students
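The two steps described above (build a boolean Series, then concatenate it as a column) could be sketched as follows. The sample rows are hypothetical, and the code is an illustration under that assumption rather than the notebook's original cell.

```python
import pandas as pd

# Hypothetical sample data; column names follow the df_students DataFrame used in the text
df_students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'StudyHours': [6.0, 10.0, 14.0, 18.0],
    'Grade': [40.0, 50.0, 70.0, 90.0],
})

# Boolean Series: True where the grade reaches the passing mark of 60
passes = (df_students['Grade'] >= 60).rename('Pass')

# Concatenate the Series as a new column (axis=1) and keep the result
df_students = pd.concat([df_students, passes], axis=1)

print(df_students)
```

A direct assignment such as `df_students['Pass'] = df_students['Grade'] >= 60` would give the same column; `pd.concat` is used here only to mirror the description in the text.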
DataFrames are designed for tabular data, and you can use them to perform many of the kinds
of data analytics operations you can perform in a relational database, such as grouping and
aggregating tables of data.
For example, you can use the groupby method to group the student data into groups based on
the Pass column you added previously, and count the number of names in each group - in other
words, you can determine how many students passed and failed.
[ ]: print(df_students.groupby(df_students.Pass).Name.count())
You can aggregate multiple fields in a group using any available aggregation function. For example,
you can find the mean study time and grade for the groups of students who passed and failed the
course.
[ ]: print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
DataFrames are amazingly versatile, and make it easy to manipulate data. Many DataFrame
operations return a new copy of the DataFrame; so if you want to modify a DataFrame but keep
the existing variable, you need to assign the result of the operation to the existing variable. For
example, the following code sorts the student data into descending order of Grade, and assigns the
resulting sorted DataFrame to the original df_students variable.
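As a hedged illustration of that behavior, the sketch below uses `sort_values` on hypothetical sample rows. It first shows that sorting without assignment leaves the variable unchanged, and then reassigns the result.

```python
import pandas as pd

# Hypothetical sample data, deliberately not in Grade order
df_students = pd.DataFrame({
    'Name': ['Ben', 'Dev', 'Ana', 'Cho'],
    'Grade': [50.0, 90.0, 40.0, 70.0],
})

# sort_values returns a new DataFrame; df_students itself is untouched here
sorted_copy = df_students.sort_values('Grade', ascending=False)
order_before = list(df_students['Name'])

# Assign the result back to the existing variable to keep the sorted data
df_students = df_students.sort_values('Grade', ascending=False)
order_after = list(df_students['Name'])

print(order_before)  # ['Ben', 'Dev', 'Ana', 'Cho']
print(order_after)   # ['Dev', 'Cho', 'Ben', 'Ana']
```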
Well, that worked; but the chart could use some improvements to make it clearer what we’re looking
at.
Note that you used the pyplot module from Matplotlib to plot the chart. This module provides
many ways to improve the visual elements of the plot. For example, the following code:
• Specifies the color of the bar chart
• Adds a title to the chart (so we know what it represents)
• Adds labels to the X and Y axes (so we know which axis shows which data)
• Adds a grid (to make it easier to determine the values for the bars)
• Rotates the X markers (so we can read them)
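The five improvements listed above might look like the following sketch. The data, colors, and label text are assumptions chosen for illustration; the pyplot functions themselves (`bar`, `title`, `xlabel`, `ylabel`, `grid`, `xticks`) are standard Matplotlib calls.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; column names follow the df_students DataFrame used in the text
df_students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'Grade': [40.0, 50.0, 70.0, 90.0],
})

# Bar chart of grades by student, with an explicit bar color
plt.bar(x=df_students['Name'], height=df_students['Grade'], color='orange')

# Title and axis labels
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')

# Horizontal grid lines to make bar heights easier to read
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)

# Rotate the X tick labels so that long names stay readable
plt.xticks(rotation=90)
```

In a notebook the chart renders when the cell finishes; in a script you would call `plt.savefig(...)` or `plt.show()` with an interactive backend.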
A plot is technically contained within a Figure. In the previous examples, the figure was created
implicitly for you; but you can create it explicitly. For example, the following code creates a figure
with a specific size.
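A minimal sketch of creating the figure explicitly is shown here. The 8 x 3 inch size and the sample values are arbitrary assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical sample data
names = ['Ana', 'Ben', 'Cho', 'Dev']
grades = [40.0, 50.0, 70.0, 90.0]

# Create the Figure explicitly; figsize is (width, height) in inches
fig = plt.figure(figsize=(8, 3))

# Subsequent pyplot calls draw into this current figure
plt.bar(x=names, height=grades, color='orange')
plt.title('Student Grades')
```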
A figure can contain multiple subplots, each on its own axis.
For example, the following code creates a figure with two subplots - one is a bar chart showing
student grades, and the other is a pie chart comparing the number of passing grades to non-passing
grades.
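One way to build such a two-panel figure is sketched below using `plt.subplots`. The sample rows, including the Pass column, are hypothetical; note that the Pass column has to exist in the DataFrame before it can be counted.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, including a Pass column derived from a passing grade of 60
df_students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'Grade': [65.0, 50.0, 70.0, 90.0],
})
df_students['Pass'] = df_students['Grade'] >= 60

# One row, two columns of subplots; ax is an array holding one Axes per subplot
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Left: bar chart of grades
ax[0].bar(x=df_students['Name'], height=df_students['Grade'], color='orange')
ax[0].set_title('Grades')

# Right: pie chart of passing versus non-passing counts
pass_counts = df_students['Pass'].value_counts()
labels = ['Passed' if passed else 'Not passed' for passed in pass_counts.index]
ax[1].pie(pass_counts, labels=labels)
ax[1].set_title('Passing Grades')

# A title for the whole figure
fig.suptitle('Student Data')
```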
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Pass'
The above exception was the direct cause of the following exception:
KeyError: 'Pass'
Until now, you’ve used functions of the matplotlib.pyplot module to plot charts. However,
Matplotlib is so foundational to graphics in Python that many packages, including Pandas, provide
methods that abstract the underlying Matplotlib functions and simplify plotting. For example, the
DataFrame provides its own methods for plotting data, as shown in the following example to plot
a bar chart of study hours.
[42]: <AxesSubplot:xlabel='Name'>
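For reference, a call of roughly this shape produces an Axes whose X label is 'Name', as in the output above. The sample rows, color, and figure size are assumptions; `DataFrame.plot.bar` is the Pandas plotting method being illustrated.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import pandas as pd

# Hypothetical sample data; column names follow the df_students DataFrame used in the text
df_students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'StudyHours': [6.0, 10.0, 14.0, 18.0],
})

# Pandas wraps Matplotlib: the method returns the Axes object it drew on
ax = df_students.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6, 4))
```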
1.4 Getting started with statistical analysis
Now that you know how to use Python to manipulate and visualize data, you can start analyzing
it.
A lot of data science is rooted in statistics, so we’ll explore some basic statistical techniques.
Note: This is not intended to teach you statistics - that’s much too big a topic for this
notebook. It will however introduce you to some statistical concepts and techniques that
data scientists use as they explore data in preparation for machine learning modeling.
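# Get the variable to examine (assumed here to be the Grade column of df_students)
var_data = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var_data)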
The histogram for grades is a symmetric shape, where the most frequently occurring grades tend
to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.
Measures of central tendency
To understand the distribution better, we can examine so-called measures of central tendency,
which is a fancy way of describing statistics that represent the “middle” of the data. The goal
is to find a “typical” value. Common ways to define the middle of the data include:
• The mean: A simple average based on adding together all of the values in the sample set,
and then dividing the total by the number of samples.
• The median: The middle value when all of the sample values are sorted in order.
• The mode: The most commonly occurring value in the sample set*.
Let’s calculate these values, along with the minimum and maximum values for comparison, and
show them on the histogram.
*Of course, in some sample sets, there may be a tie for the most common value, in
which case the dataset is described as bimodal or even multimodal.
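To illustrate the footnote, the short sketch below uses a hypothetical Series with a tie. The Pandas `mode` method returns every tied value as a Series, sorted in ascending order, which is why code that wants a single number indexes the result with `[0]`.

```python
import pandas as pd

# Hypothetical grades in which 50 and 70 each occur twice (a bimodal sample)
grades = pd.Series([50, 50, 70, 70, 90])

# mode() returns all of the most common values, sorted ascending
modes = grades.mode()
print(modes.tolist())  # [50, 70]

# Indexing with [0] keeps only the smallest of the tied modes
first_mode = modes[0]
print(first_mode)  # 50
```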
# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]
print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(
    min_val, mean_val, med_val, mod_val, max_val))
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var)
Minimum:3.00
Mean:49.18
Median:49.50
Mode:50.00
Maximum:97.00
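The cell as reproduced above draws only the histogram. One way to mark the statistics on it is to add a vertical line for each value with `plt.axvline`, as in the sketch below. The sample values and line colors are assumptions, so the numbers differ from the output shown.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical grade values standing in for the notebook's var
var = pd.Series([3, 35, 50, 50, 62, 97], dtype=float)

# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]

# Create a Figure and plot a histogram
fig = plt.figure(figsize=(10, 4))
plt.hist(var)

# Add one dashed vertical line per statistic
markers = [(min_val, 'gray'), (mean_val, 'cyan'), (med_val, 'red'),
           (mod_val, 'yellow'), (max_val, 'gray')]
for value, color in markers:
    plt.axvline(x=value, color=color, linestyle='dashed', linewidth=2)

plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
```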
For the grade data, the mean, median, and mode all seem to be more or less in the middle of the
minimum and maximum, at around 50.
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a
box-and-whiskers plot). Let’s create one for the grade data.
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a box plot
plt.boxplot(var)
<ipython-input-45-3a87a30ff398>:14: UserWarning: Matplotlib is currently using
module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot
show the figure.
fig.show()
The box plot shows the distribution of the grade values in a different format to the histogram. The
box part of the plot shows where the inner two quartiles of the data reside - so in this case, half of
the grades are between approximately 36 and 63. The whiskers extending from the box show the
outer two quartiles; so the other half of the grades in this case are between 0 and 36 or 63 and 100.
The line in the box indicates the median value.
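The edges of the box can be checked numerically with the Pandas `quantile` method. The sketch below uses hypothetical values, so the quartiles differ from the approximate 36 and 63 read from the plot above.

```python
import pandas as pd

# Hypothetical grade values
grades = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=float)

# First quartile, median, and third quartile (linear interpolation by default)
q1 = grades.quantile(0.25)
median = grades.quantile(0.5)
q3 = grades.quantile(0.75)
print(q1, median, q3)  # 30.0 50.0 70.0

# Values that fall inside the box, that is, between the first and third quartiles
in_box = grades[(grades >= q1) & (grades <= q3)]
print(in_box.tolist())  # [30.0, 40.0, 50.0, 60.0, 70.0]
```

Note that Matplotlib's default whiskers stop at the most extreme data point within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual outlier point. The whiskers span the full outer quartiles only when no such outliers exist.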
It’s often useful to combine histograms and box plots, with the box plot’s orientation changed to
align it with the histogram. (In some ways, it can be helpful to think of the histogram as a “front
elevation” view of the distribution, and the box plot as a “plan” view of the distribution from
above.)
# Get statistics
min_val = var_data.min()
max_val = var_data.max()
mean_val = var_data.mean()
med_val = var_data.median()
mod_val = var_data.mode()[0]
print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
mean_val,