Data Preprocessing Python Tome III

The document discusses analyzing student data using Pandas and Matplotlib in Python. It shows how to load data into a DataFrame, calculate summary statistics, filter rows, add new columns, sort values, group and aggregate data, and visualize it with bar plots, pie charts, and other plots. Key steps include finding the average grade of students who studied more than average, adding a "Pass" column, sorting by grade, grouping by pass/fail to count names and aggregate study hours and grades, and creating single and multi-plot figures to visualize the data.

Uploaded by

Elisée TEGUE


[32]:          Name  StudyHours  Grade
      1       Joann       11.50   50.0
      3       Rosie       16.00   97.0
      6    Frederic       11.50   53.0
      9    Giovanni       14.50   74.0
      10  Francesca       15.50   82.0
      11      Rajab       13.75   62.0
      14      Jenny       15.50   70.0
      19       Skye       12.00   52.0
      20     Daniel       12.50   63.0
      21      Aisha       12.00   64.0

Note that the filtered result is itself a DataFrame, so you can work with its columns just like any
other DataFrame.
For example, let’s find the average grade for students who undertook more than the average amount
of study time.

[ ]: # What was their mean grade?

df_students[df_students.StudyHours > mean_study].Grade.mean()
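The filter-then-aggregate pattern above can be tried in isolation. The sketch below is only an illustration: `df_demo` and its values are made up (they are not the course data), and `mean_study` is computed explicitly here because its definition is not shown in this section.

```python
import pandas as pd

# Illustrative data only (hypothetical students, not the course dataset)
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo", "Dev"],
    "StudyHours": [8.0, 10.0, 14.0, 16.0],
    "Grade": [40.0, 55.0, 70.0, 90.0],
})

# Average study time across all students
mean_study = df_demo["StudyHours"].mean()

# Boolean mask keeps only the rows above the average; the result is a DataFrame
above_avg = df_demo[df_demo["StudyHours"] > mean_study]

# Mean grade of the filtered students
mean_grade_above = above_avg["Grade"].mean()
print(mean_grade_above)
```

Splitting the expression into named steps makes it easier to inspect the intermediate filtered DataFrame before aggregating.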

Let’s assume that the passing grade for the course is 60.
We can use that information to add a new column to the DataFrame, indicating whether or not
each student passed.
First, we’ll create a Pandas Series containing the pass/fail indicator (True or False), and then we’ll
concatenate that series as a new column (axis 1) in the DataFrame.

[ ]: passes = pd.Series(df_students['Grade'] >= 60)

df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

df_students
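As an alternative to `pd.concat`, a column can also be added by direct assignment. The following is a minimal sketch on made-up data (`df_demo` and `PASSING_GRADE` are names introduced here for illustration). One practical difference: rerunning a concat cell appends a second column with the same name, whereas assignment simply overwrites the existing column.

```python
import pandas as pd

# Illustrative data only (hypothetical students)
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo"],
    "Grade": [40.0, 60.0, 75.0],
})

PASSING_GRADE = 60  # passing threshold stated in the text

# The comparison yields a boolean Series aligned on the index;
# assigning it creates (or overwrites) the "Pass" column
df_demo["Pass"] = df_demo["Grade"] >= PASSING_GRADE
print(df_demo)
```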

DataFrames are designed for tabular data, and you can use them to perform many of the data analytics operations you can do in a relational database, such as grouping and aggregating tables of data.
For example, you can use the groupby method to group the student data based on the Pass column you added previously, and count the number of names in each group. In other words, you can determine how many students passed and how many failed.

[ ]: print(df_students.groupby(df_students.Pass).Name.count())

You can aggregate multiple fields in a group using any available aggregation function. For example,
you can find the mean study time and grade for the groups of students who passed and failed the
course.

[ ]: print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
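The count and the means can also be produced in a single call with named aggregation. This is a sketch on made-up data; `df_demo` and the output column names (`students`, `mean_study_hours`, `mean_grade`) are chosen here for illustration.

```python
import pandas as pd

# Illustrative data only (hypothetical students)
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo", "Dev"],
    "StudyHours": [8.0, 10.0, 14.0, 16.0],
    "Grade": [40.0, 50.0, 70.0, 90.0],
})
df_demo["Pass"] = df_demo["Grade"] >= 60

# Named aggregation: output_column=(input_column, aggregation_function)
summary = df_demo.groupby("Pass").agg(
    students=("Name", "count"),
    mean_study_hours=("StudyHours", "mean"),
    mean_grade=("Grade", "mean"),
)
print(summary)
```

The resulting DataFrame has one row per group (False first, then True), which keeps related statistics side by side.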

DataFrames are amazingly versatile, and make it easy to manipulate data. Many DataFrame
operations return a new copy of the DataFrame, so if you want to modify a DataFrame but keep
the existing variable, you need to assign the result of the operation to the existing variable. For
example, the following code sorts the student data into descending order of Grade, and assigns the
resulting sorted DataFrame to the original df_students variable.

[ ]: # Create a DataFrame with the data sorted by Grade (descending)

df_students = df_students.sort_values('Grade', ascending=False)

# Show the DataFrame

df_students
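To see that sorting returns a new DataFrame rather than changing the original, consider this small sketch (the data and the names `df_demo` and `sorted_demo` are illustrative). It also resets the index after sorting, which is optional and shown only as one common follow-up step.

```python
import pandas as pd

# Illustrative data only (hypothetical students)
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo"],
    "Grade": [40.0, 75.0, 60.0],
})

# sort_values returns a sorted copy; df_demo itself is left untouched
sorted_demo = df_demo.sort_values("Grade", ascending=False)

print(df_demo["Name"].tolist())      # original row order
print(sorted_demo["Name"].tolist())  # order by descending grade

# Optional: renumber the rows 0..n-1 in the new order
sorted_demo = sorted_demo.reset_index(drop=True)
```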

1.3 Visualizing data with Matplotlib

DataFrames provide a great way to explore and analyze tabular data, but sometimes a picture is
worth a thousand rows and columns. The Matplotlib library provides the foundation for plotting
data visualizations that can greatly enhance your ability to analyze the data.
Let’s start with a simple bar chart that shows the grade of each student.

[33]: # Ensure plots are displayed inline in the notebook

%matplotlib inline

from matplotlib import pyplot as plt

# Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade)

# Display the plot

plt.show()

Well, that worked, but the chart could use some improvements to make it clearer what we’re looking
at.
Note that you used the pyplot module from Matplotlib to plot the chart. This module provides many
functions for improving the visual elements of the plot. For example, the following code:
• Specifies the color of the bar chart.
• Adds a title to the chart (so we know what it represents).
• Adds labels to the X and Y axes (so we know which axis shows which data).
• Adds a grid (to make it easier to determine the values for the bars).
• Rotates the X markers (so we can read them).

[39]: # Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart

plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)

# Display the plot

plt.show()

A plot is technically contained within a Figure. In the previous examples, the figure was created
implicitly for you, but you can create it explicitly. For example, the following code creates a figure
with a specific size.

[40]: # Create a Figure

fig = plt.figure(figsize=(8,3))

# Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart

plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)

# Show the figure

plt.show()
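The same chart can also be built through Matplotlib's object-oriented interface, where the Figure and its Axes are created together and every customization is a method call on the Axes. This is a sketch under assumptions: the names and grades are made up, and the `Agg` backend is selected only so the snippet also runs outside a notebook (omit that line when working inline).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running outside a notebook
from matplotlib import pyplot as plt

# Illustrative data only (hypothetical students)
names = ["Ana", "Ben", "Cleo"]
grades = [40.0, 75.0, 60.0]

# Create the Figure and a single Axes in one call
fig, ax = plt.subplots(figsize=(8, 3))

# Draw and customize the chart through the Axes object
ax.bar(names, grades, color="orange")
ax.set_title("Student Grades")
ax.set_xlabel("Student")
ax.set_ylabel("Grade")
ax.grid(color="#95a5a6", linestyle="--", linewidth=2, axis="y", alpha=0.7)
ax.tick_params(axis="x", labelrotation=90)
```

Working with an explicit Axes object makes it unambiguous which plot each call modifies, which matters once a figure contains more than one subplot.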

A figure can contain multiple subplots, each on its own axes.
For example, the following code creates a figure with two subplots: one is a bar chart showing
student grades, and the other is a pie chart comparing the number of passing grades to non-passing
grades.

[41]: # Create a figure for 2 subplots (1 row, 2 columns)

fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of name vs grade on the first axis

ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].tick_params(axis='x', labelrotation=90)  # rotate the existing tick labels (avoids the FixedFormatter warning)

# Create a pie chart of pass counts on the second axis

pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())

# Add a title to the Figure

fig.suptitle('Student Data')

# Show the figure

plt.show()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Pass'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-41-4eea5c60d58f> in <module>
      8
      9 # Create a pie chart of pass counts on the second axis
---> 10 pass_counts = df_students['Pass'].value_counts()
     11 ax[1].pie(pass_counts, labels=pass_counts)
     12 ax[1].set_title('Passing Grades')

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: 'Pass'

Note: this KeyError means that df_students had no Pass column when the cell was run. The earlier
cells that create that column are shown unexecuted ([ ]), so run them before this cell.
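A KeyError like the one above can be avoided by checking for the column before using it. The following is a minimal sketch on made-up data (`df_demo` is a hypothetical stand-in for the student DataFrame); it recreates the Pass column from Grade only when it is missing, using the passing grade of 60 stated earlier.

```python
import pandas as pd

# Illustrative data only (hypothetical students); note there is no "Pass" column yet
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo"],
    "Grade": [40.0, 75.0, 60.0],
})

# Recreate the column only if an earlier step did not add it
if "Pass" not in df_demo.columns:
    df_demo["Pass"] = df_demo["Grade"] >= 60

# Safe to index by 'Pass' now
pass_counts = df_demo["Pass"].value_counts()
print(pass_counts)
```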

Until now, you’ve used functions of the matplotlib.pyplot module to plot charts. However, Matplotlib
is so foundational to graphics in Python that many packages, including Pandas, provide
methods that abstract the underlying Matplotlib functions and simplify plotting. For example, the
DataFrame provides its own methods for plotting data, as shown in the following example to plot
a bar chart of study hours.

[42]: df_students.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6,4))

[42]: <AxesSubplot:xlabel='Name'>

1.4 Getting started with statistical analysis
Now that you know how to use Python to manipulate and visualize data, you can start analyzing
it.
A lot of data science is rooted in statistics, so we’ll explore some basic statistical techniques.
Note: This is not intended to teach you statistics - that’s much too big a topic for this
notebook. It will however introduce you to some statistical concepts and techniques that
data scientists use as they explore data in preparation for machine learning modeling.

1.4.1 Descriptive statistics and data distribution

When examining a variable (for example a sample of student grades), data scientists are particularly
interested in its distribution (in other words, how are all the different grade values spread across
the sample). The starting point for this exploration is often to visualize the data as a histogram,
and see how frequently each value for the variable occurs.

[43]: # Get the variable to examine

var_data = df_students['Grade']

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a histogram
plt.hist(var_data)

# Add titles and labels

plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the figure (fig.show() triggers a non-GUI backend warning with the inline backend)

plt.show()

The histogram for grades is a symmetric shape, where the most frequently occurring grades tend
to be in the middle of the range (around 50), with fewer grades at the extreme ends of the scale.
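The counts behind a histogram can be inspected directly. The sketch below uses NumPy's `np.histogram` on made-up grades (the values are illustrative, not the course data) with 10 equal-width bins, which is also Matplotlib's default bin count for `plt.hist`.

```python
import numpy as np

# Illustrative grades only (hypothetical values)
grades = np.array([3, 35, 42, 48, 50, 50, 52, 63, 70, 97], dtype=float)

# counts[i] is the number of values falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(grades, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

Printing the bins this way shows exactly which value ranges the bars of the histogram cover.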

Measures of central tendency

To understand the distribution better, we can examine so-called measures of central tendency, which
is a fancy way of describing statistics that represent the “middle” of the data. The goal is to
find a “typical” value. Common ways to define the middle of the data include:
• The mean: A simple average based on adding together all of the values in the sample set,
and then dividing the total by the number of samples.
• The median: The middle value when all of the sample values are sorted in order.
• The mode: The most commonly occurring value in the sample set*.
Let’s calculate these values, along with the minimum and maximum values for comparison, and
show them on the histogram.
*Of course, in some sample sets, there may be a tie for the most common value, in
which case the dataset is described as bimodal or even multimodal.
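The three measures, and the tie case for the mode, can be illustrated on a tiny made-up sample (the values below are hypothetical). Note that `Series.mode()` returns every tied value in sorted order, so indexing with `[0]` keeps only the smallest of them.

```python
import pandas as pd

# Illustrative sample only: 50 and 70 each occur twice, so the sample is bimodal
sample = pd.Series([50, 50, 60, 70, 70, 90])

mean_val = sample.mean()      # sum of the values divided by their count
median_val = sample.median()  # average of the two middle values for an even-sized sample
modes = sample.mode()         # all most-frequent values, sorted

print(mean_val, median_val, modes.tolist())
```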

[44]: # Get the variable to examine

var = df_students['Grade']

# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]

print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                         mean_val,
                                                                                         med_val,
                                                                                         mod_val,
                                                                                         max_val))

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a histogram
plt.hist(var)

# Add lines for the statistics

plt.axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
plt.axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
plt.axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
plt.axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
plt.axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

# Add titles and labels

plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the figure (fig.show() triggers a non-GUI backend warning with the inline backend)

plt.show()

Minimum:3.00
Mean:49.18
Median:49.50
Mode:50.00
Maximum:97.00

For the grade data, the mean, median, and mode all seem to be more or less in the middle of the
minimum and maximum, at around 50.
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a
box-and-whiskers plot). Let’s create one for the grade data.

[45]: # Get the variable to examine

var = df_students['Grade']

# Create a Figure
fig = plt.figure(figsize=(10,4))

# Plot a box plot
plt.boxplot(var)

# Add titles and labels

plt.title('Data Distribution')

# Show the figure (fig.show() triggers a non-GUI backend warning with the inline backend)

plt.show()


The box plot shows the distribution of the grade values in a different format from the histogram. The
box part of the plot shows where the inner two quartiles of the data reside, so in this case, half of
the grades are between approximately 36 and 63. The whiskers extending from the box show the
outer two quartiles, so the other half of the grades lie between the minimum (3) and about 36, or
between about 63 and the maximum (97).
The line in the box indicates the median value.
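The quartile boundaries that a box plot draws can be computed directly with `Series.quantile`. This is a sketch on a made-up sample (the values are illustrative); `iqr` is a name introduced here for the interquartile range, the width of the box.

```python
import pandas as pd

# Illustrative sample only (hypothetical values)
sample = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

# 25th, 50th and 75th percentiles: lower edge of the box, median line, upper edge
q1, median, q3 = sample.quantile([0.25, 0.5, 0.75])

# Interquartile range: the span containing the middle half of the data
iqr = q3 - q1
print(q1, median, q3, iqr)
```

By default, Matplotlib draws each whisker out to the furthest data point within 1.5 times the interquartile range of the box, and plots anything beyond that as an individual outlier point.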
It’s often useful to combine histograms and box plots, with the box plot’s orientation changed to
align it with the histogram. In some ways, it can be helpful to think of the histogram as a “front
elevation” view of the distribution, and the box plot as a “plan” view of the distribution from
above.

[46]: # Create a function that we can re-use

def show_distribution(var_data):
    from matplotlib import pyplot as plt

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                             mean_val,
