Data Preprocessing Python Tome I
Suppose a college takes a sample of student grades for a data science class.
Run the code in the cell below by clicking the ▶ Run button to see the data.
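The cell itself isn't reproduced in this extract, but judging from the output below it simply defines the list and prints it:

data = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
print(data)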
[50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
The data has been loaded into a Python list structure, which is a good data type for general data
manipulation, but not optimized for numeric analysis. For that, we’re going to use the NumPy
package, which includes specific data types and functions for working with numbers in Python.
Run the cell below to load the data into a NumPy array.
import numpy as np

grades = np.array(data)
print(grades)
[50 50 47 97 49 3 53 42 26 74 82 62 37 15 70 27 36 35 48 52 63 64]
Just in case you’re wondering about the differences between a list and a NumPy array, let’s
compare how these data types behave when we use them in an expression that multiplies them by
2.
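The comparison cell isn't shown in this extract; based on the output below, it would look something like this:

print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)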
<class 'list'> x 2: [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64, 50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
---
<class 'numpy.ndarray'> x 2: [100 100 94 194 98 6 106 84 52 148 164 124 74 30 140 54 72 70 96 104 126 128]
Note that multiplying a list by 2 creates a new list of twice the length, with the original sequence of list elements repeated. Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector, so we end up with an array of the same size in which each element has been multiplied by 2.
The key takeaway from this is that NumPy arrays are specifically designed to support mathematical
operations on numeric data - which makes them more useful for data analysis than a generic list.
You might have spotted that the class type for the NumPy array above is a numpy.ndarray. The nd indicates that this is a structure that can consist of multiple dimensions (it can have n dimensions). Our specific instance has a single dimension of student grades.
Run the cell below to view the shape of the array.
[6]: grades.shape
[6]: (22,)
The shape confirms that this array has only one dimension, which contains 22 elements (there
are 22 grades in the original list). You can access the individual elements in the array by their
zero-based ordinal position. Let’s get the first element (the one in position 0).
[7]: grades[0]
[7]: 50
Alright, now that you know your way around a NumPy array, it's time to perform some analysis of the grades data.
You can apply aggregations across the elements in the array, so let’s find the simple average grade
(in other words, the mean grade value).
[8]: grades.mean()
[8]: 49.18181818181818
So the mean grade is just around 50 - more or less in the middle of the possible range from 0 to
100.
Let’s add a second set of data for the same students, this time recording the typical number of
hours per week they devoted to studying.
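The cell that defines this second data set isn't included in this extract; the study-hours values below are reconstructed from the DataFrame shown later in the document, so treat this as a sketch:

# Typical weekly study hours for each student (same order as the grades)
study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
               13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]

# Create a 2-dimensional array (an array of arrays)
student_data = np.array([study_hours, grades])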
Now the data consists of a 2-dimensional array - an array of arrays. Let’s look at its shape.
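A sketch of the shape check and its expected result:

student_data.shape  # (2, 22) - two arrays of 22 elements each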
The student_data array contains two elements, each of which is an array containing 22 elements.
To navigate this structure, you need to specify the position of each element in the hierarchy. So
to find the first value in the first array (which contains the study hours data), you can use the
following code.
[11]: student_data[0][0]
[11]: 10.0
Now you have a multidimensional array containing both the students' study time and grade information, which you can use to compare data. For example, how does the mean study time compare to the mean grade?
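The cell that answers this isn't included here; a minimal sketch (variable names are illustrative) would be:

# Element 0 holds the study hours; element 1 holds the grades
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()
print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))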
To make it easier to work with both data sets by student name, we can load them into a Pandas DataFrame, a tabular structure with named columns:

import pandas as pd

df_students = pd.DataFrame({'Name': ['Dan','Joann','Pedro','Rosie','Ethan','Vicky',
                                     'Frederic','Jimmie','Rhonda','Giovanni','Francesca',
                                     'Rajab','Naiyana','Kian','Jenny','Jakeem','Helena',
                                     'Ismat','Anila','Skye','Daniel','Aisha'],
                            'StudyHours': student_data[0],
                            'Grade': student_data[1]})
df_students
Name StudyHours Grade
0 Dan 10.00 50.0
1 Joann 11.50 50.0
2 Pedro 9.00 47.0
3 Rosie 16.00 97.0
4 Ethan 9.25 49.0
5 Vicky 1.00 3.0
6 Frederic 11.50 53.0
7 Jimmie 9.00 42.0
8 Rhonda 8.50 26.0
9 Giovanni 14.50 74.0
10 Francesca 15.50 82.0
11 Rajab 13.75 62.0
12 Naiyana 9.00 37.0
13 Kian 8.00 15.0
14 Jenny 15.50 70.0
15 Jakeem 8.00 27.0
16 Helena 9.00 36.0
17 Ismat 6.00 35.0
18 Anila 10.00 48.0
19 Skye 12.00 52.0
20 Daniel 12.50 63.0
21 Aisha 12.00 64.0
Note that in addition to the columns you specified, the DataFrame includes an index to uniquely identify each row. We could have specified the index explicitly and assigned any kind of appropriate value (for example, an email address); but because we didn't specify an index, one has been created with a unique integer value for each row.
You can also get the data at a range of index values, like this:
df_students.loc[0:5]

Name StudyHours Grade
0 Dan 10.00 50.0
1 Joann 11.50 50.0
2 Pedro 9.00 47.0
3 Rosie 16.00 97.0
4 Ethan 9.25 49.0
5 Vicky 1.00 3.0
In addition to being able to use the loc method to find rows based on the index, you can use the
iloc method to find rows based on their ordinal position in the DataFrame (regardless of the index):
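The iloc cell itself isn't reproduced here; based on the comparison discussed next, it would be:

df_students.iloc[0:5]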
Look carefully at the iloc[0:5] results, and compare them to the loc[0:5] results you obtained
previously. Can you spot the difference?
The loc method returned rows with index labels in the range of values from 0 to 5 - which includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in the positions included in the range 0 to 5, and since integer ranges don't include the upper-bound value, this includes positions 0, 1, 2, 3, and 4 (five rows).
iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. So
for example, you can use it to find the values for the columns in positions 1 and 2 in row 0, like
this:
[17]: df_students.iloc[0,[1,2]]
Let’s return to the loc method, and see how it works with columns. Remember that loc is used to
locate data items based on index values rather than positions. In the absence of an explicit index
column, the rows in our dataframe are indexed as integer values, but the columns are identified by
name:
[18]: df_students.loc[0,'Grade']
[18]: 50.0
Here’s another useful trick. You can use the loc method to find indexed rows based on a filtering
expression that references named columns other than the index, like this:
[19]: df_students.loc[df_students['Name']=='Aisha']
Actually, you don’t need to explicitly use the loc method to do this - you can simply apply a
DataFrame filtering expression, like this:
[20]: df_students[df_students['Name']=='Aisha']
And for good measure, you can achieve the same results by using the DataFrame’s query method,
like this:
[21]: df_students.query('Name=="Aisha"')
The three previous examples underline an occasionally confusing truth about working with Pandas. Often, there are multiple ways to achieve the same results. Another example of this is the way you refer to a DataFrame column name. You can specify the column name as a named index value (as in the df_students['Name'] examples we've seen so far), or you can use the column as a property of the DataFrame, like this:
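That cell isn't included in this extract; an equivalent filter using the property syntax would be:

df_students[df_students.Name == 'Aisha']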
The Pandas read_csv function is used to load data from text files. As you can see in the example code, you can specify options such as the column delimiter and which row (if any) contains column headers. (In this case, the delimiter is a comma and the first row contains the column names; these are the default settings, so the parameters could have been omitted.)
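The read_csv cell isn't shown in this extract; a minimal sketch (the file path is a placeholder) would be:

# Load the student data from a comma-delimited text file
# ('grades.csv' is a hypothetical path; delimiter=',' and header=0
# match the defaults, so both parameters could be omitted)
df_students = pd.read_csv('grades.csv', delimiter=',', header=0)
df_students.head()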
1.2.3 Handling missing values
One of the most common issues data scientists need to deal with is incomplete or missing data. So
how would we know that the DataFrame contains missing values? You can use the isnull method
to identify which individual values are null, like this:
[24]: df_students.isnull()
Of course, with a larger DataFrame, it would be inefficient to review all of the rows and columns
individually; so we can get the sum of missing values for each column, like this:
[25]: df_students.isnull().sum()
[25]: Name 0
StudyHours 1
Grade 2
dtype: int64
So now we know that there’s one missing StudyHours value, and two missing Grade values.
To see them in context, we can filter the dataframe to include only rows where any of the columns
(axis 1 of the DataFrame) are null.
[26]: df_students[df_students.isnull().any(axis=1)]
[26]: Name StudyHours Grade
22 Bill 8.0 NaN
23 Ted NaN NaN
When the DataFrame is retrieved, the missing numeric values show up as NaN (not a number).
So now that we’ve found the null values, what can we do about them?
One common approach is to impute replacement values. For example, if the number of study hours
is missing, we could just assume that the student studied for an average amount of time and replace
the missing value with the mean study hours. To do this, we can use the fillna method, like this:
# Replace missing StudyHours values with the mean of the column
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students
Alternatively, it might be important to ensure that you only use data you know to be absolutely correct; so you can drop rows or columns that contain null values by using the dropna method. In this case, we'll remove rows (axis 0 of the DataFrame) where any of the columns contain null values, as in the sketch below.
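A cell along these lines produces the result shown in [28]:

# Remove any rows that still contain null values
df_students = df_students.dropna(axis=0, how='any')
df_students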
[28]: Name StudyHours Grade
0 Dan 10.00 50.0
1 Joann 11.50 50.0
2 Pedro 9.00 47.0
3 Rosie 16.00 97.0
4 Ethan 9.25 49.0
5 Vicky 1.00 3.0
6 Frederic 11.50 53.0
7 Jimmie 9.00 42.0
8 Rhonda 8.50 26.0
9 Giovanni 14.50 74.0
10 Francesca 15.50 82.0
11 Rajab 13.75 62.0
12 Naiyana 9.00 37.0
13 Kian 8.00 15.0
14 Jenny 15.50 70.0
15 Jakeem 8.00 27.0
16 Helena 9.00 36.0
17 Ismat 6.00 35.0
18 Anila 10.00 48.0
19 Skye 12.00 52.0
20 Daniel 12.50 63.0
21 Aisha 12.00 64.0
[31]: # Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()
# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()
[32]: # Get students who studied for more than the mean number of hours
df_students[df_students.StudyHours > mean_study]