
Data Preprocessing Python Tome III

The document discusses analyzing student data using Pandas and Matplotlib in Python. It shows how to load data into a DataFrame, calculate summary statistics, filter rows, add new columns, sort values, group and aggregate data, and visualize it with bar plots, pie charts, and other plots. Key steps include finding the average grade of students who studied more than average, adding a "Pass" column, sorting by grade, grouping by pass/fail to count names and aggregate study hours and grades, and creating single and multi-plot figures to visualize the data.

Uploaded by

Elisée TEGUE

[32]:          Name  StudyHours  Grade
      1       Joann       11.50   50.0
      3       Rosie       16.00   97.0
      6    Frederic       11.50   53.0
      9    Giovanni       14.50   74.0
      10  Francesca       15.50   82.0
      11      Rajab       13.75   62.0
      14      Jenny       15.50   70.0
      19       Skye       12.00   52.0
      20     Daniel       12.50   63.0
      21      Aisha       12.00   64.0

Note that the filtered result is itself a DataFrame, so you can work with its columns just like any
other DataFrame.
For example, let’s find the average grade for students who undertook more than the average amount
of study time.

[ ]: # What was their mean grade?
df_students[df_students.StudyHours > mean_study].Grade.mean()
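To make the filtering step easier to follow, here is a minimal sketch that splits it into named pieces. The tiny df_demo table and the names mean_study_demo, above_average and mean_grade_above are made up for illustration; they are not part of the student dataset used in this notebook.

```python
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df_demo = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal', 'Dee'],
    'StudyHours': [8.0, 12.0, 14.0, 6.0],
    'Grade': [40.0, 60.0, 80.0, 30.0],
})

# Average study time across all students
mean_study_demo = df_demo['StudyHours'].mean()

# Boolean mask: True for students who studied more than the average
above_average = df_demo['StudyHours'] > mean_study_demo

# Keep only the masked rows, then average their Grade column
mean_grade_above = df_demo.loc[above_average, 'Grade'].mean()

print('Mean study hours:', mean_study_demo)
print('Mean grade of above-average studiers:', mean_grade_above)
```

The mask is an ordinary Series of True/False values, so you can inspect it on its own before using it to filter.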

Let’s assume that the passing grade for the course is 60.
We can use that information to add a new column to the DataFrame, indicating whether or not
each student passed.
First, we’ll create a Pandas Series containing the pass/fail indicator (True or False), and then we’ll
concatenate that series as a new column (axis 1) in the DataFrame.

[ ]: passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

df_students
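Concatenating a Series is one way to add the column; Pandas also lets you create it from the comparison directly. The sketch below, on a made-up df_base table, shows that DataFrame.assign gives the same result as the concat approach. The names df_base, via_concat and via_assign are illustrative only.

```python
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df_base = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal'],
    'Grade': [40.0, 60.0, 80.0],
})

PASSING_GRADE = 60  # the passing grade assumed in the text

# Approach used in the notebook: build a Series, then concatenate it as a column
passes = (df_base['Grade'] >= PASSING_GRADE).rename('Pass')
via_concat = pd.concat([df_base, passes], axis=1)

# Alternative: assign() returns a new DataFrame with the extra column
via_assign = df_base.assign(Pass=df_base['Grade'] >= PASSING_GRADE)

print(via_assign)
print('Same result:', via_concat.equals(via_assign))
```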

DataFrames are designed for tabular data, and you can use them to perform many of the same data
analytics operations you can do in a relational database, such as grouping and aggregating
tables of data.
For example, you can use the groupby method to group the student data into groups based on
the Pass column you added previously, and count the number of names in each group - in other
words, you can determine how many students passed and failed.

[ ]: print(df_students.groupby(df_students.Pass).Name.count())

You can aggregate multiple fields in a group using any available aggregation function. For example,
you can find the mean study time and grade for the groups of students who passed and failed the
course.

[ ]: print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
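If you want the count and the means in a single table, named aggregation with agg is one option. This is a sketch on made-up data; the output column names students, mean_study_hours and mean_grade are arbitrary labels chosen for the example.

```python
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df_demo = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal', 'Dee'],
    'StudyHours': [8.0, 12.0, 14.0, 6.0],
    'Grade': [40.0, 60.0, 80.0, 30.0],
})
df_demo['Pass'] = df_demo['Grade'] >= 60

# One row per group (False / True), one column per named aggregation
summary = df_demo.groupby('Pass').agg(
    students=('Name', 'count'),
    mean_study_hours=('StudyHours', 'mean'),
    mean_grade=('Grade', 'mean'),
)

print(summary)
```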

DataFrames are amazingly versatile, and make it easy to manipulate data. Many DataFrame
operations return a new copy of the DataFrame, so if you want to modify a DataFrame but keep
the existing variable, you need to assign the result of the operation to the existing variable. For
example, the following code sorts the student data into descending order of Grade, and assigns the
resulting sorted DataFrame to the original df_students variable.

[ ]: # Create a DataFrame with the data sorted by Grade (descending)
df_students = df_students.sort_values('Grade', ascending=False)

# Show the DataFrame
df_students
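The following sketch, on made-up data, shows what "returns a new copy" means in practice: the original DataFrame keeps its row order unless you reassign the result. The names df_demo, sorted_demo and renumbered are illustrative only.

```python
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df_demo = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal'],
    'Grade': [60.0, 80.0, 40.0],
})

# sort_values returns a new, sorted DataFrame; df_demo itself is unchanged
sorted_demo = df_demo.sort_values('Grade', ascending=False)

print(df_demo['Name'].tolist())      # original row order
print(sorted_demo['Name'].tolist())  # highest grade first

# The sorted copy keeps the old index labels; reset_index renumbers them
renumbered = sorted_demo.reset_index(drop=True)
print(renumbered.index.tolist())
```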

1.3 Visualizing data with Matplotlib


DataFrames provide a great way to explore and analyze tabular data, but sometimes a picture is
worth a thousand rows and columns. The Matplotlib library provides the foundation for plotting
data visualizations that can greatly enhance your ability to analyze the data.
Let’s start with a simple bar chart that shows the grade of each student.

[33]: # Ensure plots are displayed inline in the notebook
%matplotlib inline

from matplotlib import pyplot as plt

# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade)

# Display the plot
plt.show()

Well, that worked, but the chart could use some improvements to make it clearer what we’re looking
at.
Note that you used the pyplot module from Matplotlib to plot the chart. This module provides a whole
bunch of ways to improve the visual elements of the plot. For example, the following code:
• Specifies the color of the bar chart
• Adds a title to the chart (so we know what it represents)
• Adds labels to the X and Y axes (so we know which axis shows which data)
• Adds a grid (to make it easier to determine the values for the bars)
• Rotates the X markers (so we can read them)

[39]: # Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)

# Display the plot
plt.show()

A plot is technically contained within a Figure. In the previous examples, the figure was created
implicitly for you, but you can create it explicitly. For example, the following code creates a figure
with a specific size.

[40]: # Create a Figure
fig = plt.figure(figsize=(8,3))

# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)

# Show the figure
plt.show()
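As a rough sketch of the same idea, you can also create the Figure and its Axes together and call methods on the Axes object instead of on plt. The names and grades below are made up, and the Agg backend is selected only so that the sketch runs outside a notebook; in a notebook you would keep %matplotlib inline and finish with plt.show().

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so this runs as a plain script
from matplotlib import pyplot as plt

# Hypothetical sample data (not the notebook's dataset)
names = ['Ana', 'Ben', 'Cal']
grades = [40.0, 60.0, 80.0]

# Create the Figure and a single Axes explicitly
fig, ax = plt.subplots(figsize=(8, 3))

# Draw and customize the chart through the Axes object
bars = ax.bar(names, grades, color='orange')
ax.set_title('Student Grades')
ax.set_xlabel('Student')
ax.set_ylabel('Grade')
ax.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
ax.tick_params(axis='x', labelrotation=90)  # rotate the X tick labels
```

Working through the Axes object makes it explicit which plot each call affects, which helps once a figure holds more than one plot.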

A figure can contain multiple subplots, each on its own axis.
For example, the following code creates a figure with two subplots - one is a bar chart showing
student grades, and the other is a pie chart comparing the number of passing grades to non-passing
grades.

[41]: # Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of name vs grade on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
# Rotate the existing tick labels; calling set_xticklabels here raised a
# "FixedFormatter should only be used together with FixedLocator" warning
ax[0].tick_params(axis='x', labelrotation=90)

# Create a pie chart of pass counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())

# Add a title to the Figure
fig.suptitle('Student Data')

# Show the figure
plt.show()

Note: in the run captured for this document, this cell stopped at the line
pass_counts = df_students['Pass'].value_counts() with:

KeyError: 'Pass'

The earlier cells that add the Pass column are shown without an execution number ([ ]), so the
column most likely did not exist when this cell ran. Run those cells first.
Until now, you’ve used functions of the matplotlib.pyplot module to plot charts. However,
Matplotlib is so foundational to graphics in Python that many packages, including Pandas, provide
methods that abstract the underlying Matplotlib functions and simplify plotting. For example, the
DataFrame provides its own methods for plotting data, as shown in the following example to plot
a bar chart of study hours.

[42]: df_students.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6,4))

[42]: <AxesSubplot:xlabel='Name'>
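The value echoed by that cell is the Matplotlib Axes that Pandas drew on, so you can keep customizing it with the usual Axes methods. Here is a small sketch on made-up data; the Agg backend is an assumption that lets it run outside a notebook, and the title and label text are arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so this runs as a plain script
import pandas as pd

# Hypothetical sample data (not the notebook's dataset)
df_demo = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal'],
    'StudyHours': [8.0, 12.0, 14.0],
})

# DataFrame.plot.bar returns the Matplotlib Axes it drew on
ax = df_demo.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6, 4))

# Customize the returned Axes like any other Matplotlib plot
ax.set_title('Study Hours')
ax.set_ylabel('Hours')
ax.tick_params(axis='x', labelrotation=45)
```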

1.4 Getting started with statistical analysis
Now that you know how to use Python to manipulate and visualize data, you can start analyzing
it.
A lot of data science is rooted in statistics, so we’ll explore some basic statistical techniques.
Note: This is not intended to teach you statistics - that’s much too big a topic for this
notebook. It will however introduce you to some statistical concepts and techniques that
data scientists use as they explore data in preparation for machine learning modeling.

1.4.1 Descriptive statistics and data distribution


When examining a variable (for example a sample of student grades), data scientists are particularly
interested in its distribution (in other words, how are all the different grade values spread across
the sample). The starting point for this exploration is often to visualize the data as a histogram,
and see how frequently each value for the variable occurs.

[43]: # Get the variable to examine
var_data = df_students['Grade']

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a histogram
plt.hist(var_data)

# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the figure (with the inline backend, fig.show() only raises a
# "non-GUI backend" warning, so use plt.show() instead)
plt.show()

The histogram for grades is a symmetric shape, where the most frequently occurring grades tend
to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.
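To see the numbers behind a histogram, you can compute the bin counts directly; plt.hist uses numpy.histogram to bin the data. The sketch below uses made-up grades and fixes the range to 0-100 so the bin edges fall on multiples of 10. That range is an assumption for readability; the cell above lets the bins span the data's own minimum and maximum.

```python
import numpy as np

# Hypothetical grades (not the notebook's dataset)
grades_demo = np.array([3, 25, 36, 45, 48, 51, 52, 55, 63, 75, 97])

# Ten equal-width bins between 0 and 100
counts, bin_edges = np.histogram(grades_demo, bins=10, range=(0, 100))

# Print each bin with the number of grades that fall into it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print('{:5.1f} to {:5.1f}: {}'.format(left, right, count))
```

In this made-up sample the fullest bin is 50 to 60, which is the kind of peak in the middle that the histogram above displays visually.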

Measures of central tendency
To understand the distribution better, we can examine so-called measures of central tendency,
which is a fancy way of describing statistics that represent the “middle” of the data. The goal of
this is to try to find a “typical” value. Common ways to define the middle of the data include:
• The mean: A simple average based on adding together all of the values in the sample set,
and then dividing the total by the number of samples.
• The median: The value in the middle of the sample once all of the values are sorted in order.
• The mode: The most commonly occurring value in the sample set*.
Let’s calculate these values, along with the minimum and maximum values for comparison, and
show them on the histogram.
*Of course, in some sample sets, there may be a tie for the most common value - in
which case the dataset is described as bimodal or even multimodal.

[44]: # Get the variable to examine
var = df_students['Grade']

# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]

print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(
    min_val, mean_val, med_val, mod_val, max_val))

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a histogram
plt.hist(var)

# Add lines for the statistics
plt.axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
plt.axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
plt.axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
plt.axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
plt.axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the figure (plt.show() avoids the "non-GUI backend" warning that fig.show() raises inline)
plt.show()

Minimum:3.00
Mean:49.18
Median:49.50
Mode:50.00
Maximum:97.00

For the grade data, the mean, median, and mode all seem to be more or less in the middle of the
minimum and maximum, at around 50.
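One detail worth knowing: Series.mode() returns every value that ties for most common, so taking element [0] keeps only the smallest of them. A minimal sketch with a made-up bimodal sample (the name grades_demo is illustrative only):

```python
import pandas as pd

# Hypothetical sample in which 50 and 60 both occur twice
grades_demo = pd.Series([40, 50, 50, 60, 60, 70])

mean_val = grades_demo.mean()      # sum of the values divided by their count
median_val = grades_demo.median()  # middle of the sorted values
modes = grades_demo.mode()         # all most-common values, in ascending order

print('Mean:', mean_val)
print('Median:', median_val)
print('Modes:', modes.tolist())
print('First mode only:', modes[0])
```

If the distinction matters for your data, check the length of the result before picking a single mode.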
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a
box-and-whiskers plot). Let’s create one for the grade data.

[45]: # Get the variable to examine
var = df_students['Grade']

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a box plot
plt.boxplot(var)

# Add titles and labels
plt.title('Data Distribution')

# Show the figure (plt.show() avoids the "non-GUI backend" warning that fig.show() raises inline)
plt.show()

The box plot shows the distribution of the grade values in a different format from the histogram. The
box part of the plot shows where the inner two quartiles of the data reside - so in this case, half of
the grades are between approximately 36 and 63. The whiskers extending from the box show the
outer two quartiles, so the other half of the grades in this case are between the minimum (3) and 36
or between 63 and the maximum (97).
The line in the box indicates the median value.
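The values a box plot draws can also be computed numerically. The sketch below uses made-up grades and Series.quantile; the 1.5 x IQR limit is Matplotlib's default rule for how far whiskers may reach (its whis parameter), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical grades (not the notebook's dataset)
grades_demo = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Quartiles: the box spans q1 to q3, and the line inside it is the median
q1, median_val, q3 = grades_demo.quantile([0.25, 0.5, 0.75]).tolist()
iqr = q3 - q1  # interquartile range (the height of the box)

# Whiskers reach the most extreme data points within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
lower_whisker = grades_demo[grades_demo >= lower_limit].min()
upper_whisker = grades_demo[grades_demo <= upper_limit].max()

print('Box: {} to {}, median {}'.format(q1, q3, median_val))
print('Whiskers: {} to {}'.format(lower_whisker, upper_whisker))
```

Any values outside those limits would be drawn as individual outlier points rather than being covered by a whisker.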
It’s often useful to combine histograms and box plots, with the box plot’s orientation changed to
align it with the histogram (in some ways, it can be helpful to think of the histogram as a “front
elevation” view of the distribution, and the box plot as a “plan” view of the distribution from
above.)

[46]: # Create a function that we can re-use
def show_distribution(var_data):
    from matplotlib import pyplot as plt

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                             mean_val,
