Python Notes
What is NumPy
NumPy is a Python library that gives functionality comparable to mathematical tools such
as MATLAB and R. While NumPy significantly simplifies the user experience, it also offers
comprehensive mathematical functions.
What is Pandas
Pandas is an extremely popular Python library for data analysis and manipulation.
Pandas is like Excel for Python - providing easy-to-use functionality for data tables.
Suppose a college takes a sample of student grades for a data science class.
Run the code in the cell below by clicking the ► Run button to see the data.
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)
[50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
The data has been loaded into a Python list structure, which is a good data type for
general data manipulation, but not optimized for numeric analysis. For that, we're going
to use the NumPy package, which includes specific data types and functions for working
with numbers in Python.
Run the cell below to load the data into a NumPy array.
import numpy as np
grades = np.array(data)
print(grades)
[50 50 47 97 49 3 53 42 26 74 82 62 37 15 70 27 36 35 48 52 63 64]
Just in case you're wondering about the differences between a list and a NumPy array,
let's compare how these data types behave when we use them in an expression that
multiplies them by 2.
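A minimal sketch of that comparison, using the data list and grades array defined above:

# Multiplying a Python list repeats its elements
print(type(data), 'x 2:', data * 2)
# Multiplying a NumPy array performs an element-wise calculation
print(type(grades), 'x 2:', grades * 2)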
<class 'list'> x 2: [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36,
35, 48, 52, 63, 64, 50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35,
48, 52, 63, 64]
<class 'numpy.ndarray'> x 2: [100 100 94 194 98 6 106 84 52 148 164 124 74 30 140 54 72 70
96 104 126 128]
Note that multiplying a list by 2 creates a new list of twice the length with the original sequence of list
elements repeated. Multiplying a NumPy array on the other hand performs an element-wise
calculation in which the array behaves like a vector, so we end up with an array of the same size in
which each element has been multiplied by 2.
The key takeaway from this is that NumPy arrays are specifically designed to support mathematical
operations on numeric data - which makes them more useful for data analysis than a generic list.
You might have spotted that the class type for the numpy array above is a numpy.ndarray.
The nd indicates that this is a structure that can consist of multiple dimensions (it can
have n dimensions). Our specific instance has a single dimension of student grades.
grades.shape
(22,)
The shape confirms that this array has only one dimension, which contains 22 elements (there are 22
grades in the original list). You can access the individual elements in the array by their zero-based
ordinal position. Let's get the first element (the one in position 0).
grades[0]
50
Alright, now you know your way around a NumPy array, it's time to perform some analysis of the
grades data.
You can apply aggregations across the elements in the array, so let's find the simple average grade
(in other words, the mean grade value).
grades.mean()
49.18181818181818
So the mean grade is just around 50 - more or less in the middle of the possible range from 0 to 100.
Let's add a second set of data for the same students, this time recording the typical number of hours
per week they devoted to studying.
array([[ ... study hours values (first row) ... ],
       [50. , 50. , 47. , 97. , 49. , 3. , 53. , 42. , 26. , 74. , 82. , 62. , 37. , 15. , 70. , 27. , 36. , 35. , 48. , 52. , 63. , 64. ]])
Now the data consists of a 2-dimensional array - an array of arrays. Let's look at its shape.
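Assuming the combined array was stored in a variable named student_data (consistent with the code that follows), its shape can be checked like this:

student_data.shape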
(2, 22)
The student_data array contains two elements, each of which is an array containing 22 elements.
To navigate this structure, you need to specify the position of each element in the hierarchy. So to
find the first value in the first array (which contains the study hours data), you can use the following
code.
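A minimal sketch of that lookup, assuming the student_data array described above:

# First value of the first array (the study hours for the first student)
student_data[0][0]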
10.0
Now you have a multidimensional array containing both the student's study time and grade
information, which you can use to compare data. For example, how does the mean study time
compare to the mean grade?
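A sketch of that comparison, assuming the study hours are in the first row of student_data and the grades in the second:

# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()
print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))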
Run the following cell to import the Pandas library and create a DataFrame with three columns. The
first column is a list of student names, and the second and third columns are the NumPy arrays
containing the study time and grade data.
import pandas as pd
df_students = pd.DataFrame(
    {'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 'Rhonda',
              'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny', 'Jakeem', 'Helena',
              'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
'StudyHours':student_data[0],
'Grade':student_data[1]})
df_students
Note that in addition to the columns you specified, the DataFrame includes an index to uniquely
identify each row. We could have specified the index explicitly, and assigned any kind of appropriate
value (for example, an email address); but because we didn't specify an index, one has been created
with a unique integer value for each row.
You can use the DataFrame's loc method to retrieve data for a specific index value, like this.
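A minimal sketch that retrieves the row with index value 5:

# Get the data for index value 5
df_students.loc[5]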
Name Vicky
StudyHours 1.0
Grade 3.0
Name: 5, dtype: object
You can also get the data at a range of index values, like this:
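A sketch of that range lookup:

# Get the rows with index values from 0 to 5
df_students.loc[0:5]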
In addition to being able to use the loc method to find rows based on the index, you can use
the iloc method to find rows based on their ordinal position in the DataFrame (regardless of the
index):
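A sketch of the positional equivalent:

# Get the rows in ordinal positions 0 to 5
df_students.iloc[0:5]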
Look carefully at the iloc[0:5] results, and compare them to the loc[0:5] results you obtained
previously. Can you spot the difference?
The loc method returned rows with index labels in the range of values from 0 to 5 - which
includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in
the positions included in the range 0 to 5, and since integer ranges don't include the upper-bound
value, this includes positions 0, 1, 2, 3, and 4 (five rows).
iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. So for
example, you can use it to find the values for the columns in positions 1 and 2 in row 0, like this:
df_students.iloc[0,[1,2]]
StudyHours 10.0
Grade 50.0
Name: 0, dtype: object
Let's return to the loc method, and see how it works with columns. Remember that loc is used to
locate data items based on index values rather than positions. In the absence of an explicit index
column, the rows in our dataframe are indexed as integer values, but the columns are identified by
name:
df_students.loc[0,'Grade']
50.0
Here's another useful trick. You can use the loc method to find indexed rows based on a filtering
expression that references named columns other than the index, like this:
df_students.loc[df_students['Name']=='Aisha']
Actually, you don't need to explicitly use the loc method to do this - you can simply apply a
DataFrame filtering expression, like this:
df_students[df_students['Name']=='Aisha']
And for good measure, you can achieve the same results by using the DataFrame's query method,
like this:
df_students.query('Name=="Aisha"')
The three previous examples underline an occasionally confusing truth about working with Pandas.
Often, there are multiple ways to achieve the same results. Another example of this is the way you
refer to a DataFrame column name. You can specify the column name as a named index value (as in
the df_students['Name'] examples we've seen so far), or you can use the column as a property of
the DataFrame, like this:
df_students[df_students.Name == 'Aisha']
We constructed the DataFrame from some existing arrays. However, in many real-world scenarios,
data is loaded from sources such as files. Let's replace the student grades DataFrame with the
contents of a text file.
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()
The pandas read_csv function is used to load data from text files. As you can see in the
example code, you can specify options such as the column delimiter and which row (if any) contains
column headers (in this case, the delimiter is a comma and the first row contains the column names -
these are the default settings, so the parameters could have been omitted).
Handling missing values
One of the most common issues data scientists need to deal with is incomplete or missing data. So
how would we know that the DataFrame contains missing values? You can use the isnull method to
identify which individual values are null, like this:
df_students.isnull()
Of course, with a larger DataFrame, it would be inefficient to review all of the rows and columns
individually; so we can get the sum of missing values for each column, like this:
df_students.isnull().sum()
Name 0
StudyHours 1
Grade 2
dtype: int64
So now we know that there's one missing StudyHours value, and two missing Grade values.
To see them in context, we can filter the dataframe to include only rows where any of the columns
(axis 1 of the DataFrame) are null.
df_students[df_students.isnull().any(axis=1)]
When the DataFrame is retrieved, the missing numeric values show up as NaN (not a number).
So now that we've found the null values, what can we do about them?
One common approach is to impute replacement values. For example, if the number of study hours is
missing, we could just assume that the student studied for an average amount of time and replace
the missing value with the mean study hours. To do this, we can use the fillna method, like this:
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students
Alternatively, it might be important to ensure that you only use data you know to be absolutely
correct; so you can drop rows or columns that contain null values by using the dropna method. In
this case, we'll remove rows (axis 0 of the DataFrame) where any of the columns contain null values.
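A minimal sketch of that clean-up, assuming the default behavior of dropping a row when any value is missing:

# Remove any rows that contain missing values
df_students = df_students.dropna(axis=0, how='any')
df_students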
Now that we've cleaned up the missing values, we're ready to explore the data in the DataFrame.
Let's start by comparing the mean study hours and grades.
# Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()
# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()
OK, let's filter the DataFrame to find only the students who studied for more than the average amount
of time.
# Get students who studied for the mean or more hours
df_students[df_students.StudyHours > mean_study]
Note that the filtered result is itself a DataFrame, so you can work with its columns just like any other
DataFrame.
For example, let's find the average grade for students who undertook more than the average amount
of study time.
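A sketch of that calculation, using the mean_study value computed earlier:

# Mean grade for students who studied more than the average number of hours
df_students[df_students.StudyHours > mean_study].Grade.mean()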
66.7
Let's assume that the passing grade for the course is 60.
We can use that information to add a new column to the DataFrame, indicating whether or not each
student passed.
First, we'll create a Pandas Series containing the pass/fail indicator (True or False), and then we'll
concatenate that series as a new column (axis 1) in the DataFrame.
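A minimal sketch of those two steps, assuming the passing grade of 60 described above:

# Create a Series of True/False values indicating a grade of 60 or more
passes = pd.Series(df_students['Grade'] >= 60)
# Concatenate the series as a new 'Pass' column (axis 1)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)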
df_students
DataFrames are designed for tabular data, and you can use them to perform many of the kinds of
data analytics operations you can do in a relational database, such as grouping and aggregating tables
of data.
For example, you can use the groupby method to group the student data into groups based on
the Pass column you added previously, and count the number of names in each group - in other
words, you can determine how many students passed and failed.
print(df_students.groupby(df_students.Pass).Name.count())
You can aggregate multiple fields in a group using any available aggregation function. For example,
you can find the mean study time and grade for the groups of students who passed and failed the
course.
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
DataFrames are amazingly versatile, and make it easy to manipulate data. Many DataFrame
operations return a new copy of the DataFrame; so if you want to modify a DataFrame but keep the
existing variable, you need to assign the result of the operation to the existing variable. For example,
the following code sorts the student data into descending order of Grade, and assigns the resulting
sorted DataFrame to the original df_students variable.
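A minimal sketch of that sort:

# Sort by Grade (descending) and keep the result in the same variable
df_students = df_students.sort_values('Grade', ascending=False)
df_students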
NumPy arrays and Pandas DataFrames are the workhorses of data science in Python. They provide us with ways to load,
explore, and analyze tabular data. As we will see in subsequent modules, even advanced analysis
methods typically rely on NumPy and Pandas for these important roles.
In our next workbook, we'll take a look at how to create graphs and explore your data in more
interesting ways.
Visualize data
Data scientists visualize data to understand it better. This can mean looking at the raw
data, summary measures such as averages, or graphing the data. Graphs are a powerful
means of viewing data, as we can discern moderately complex patterns quickly without
needing to define mathematical summary measures.
While sometimes we know ahead of time what kind of graph will be most useful, other
times we use graphs in an exploratory way. To understand the power of data
visualization, consider the data below: the location (x,y) of a self-driving car. In its raw
form, it's hard to see any real patterns. The mean, or average, tells us that its path was
centred around x=0.2 and y=0.3, and the range of numbers appears to be between
about -2 and 2.
If we now plot Location-X over time, we can see that we appear to have some missing values
between times 7 and 12.
If we graph X vs Y, we end up with a map of where the car has driven. It’s instantly obvious that
the car has been driving in a circle, but at some point drove to the center of that circle.
Graphs aren't limited to 2D scatter plots like those above; they can also be used to explore other kinds
of data, such as proportions (shown through pie charts or stacked bar graphs), how data are spread
(with histograms or box-and-whisker plots), and how two data sets differ. Often, when we're trying
to understand raw data or results, we may experiment with different types of graphs until we
come across one that explains the data in a visually intuitive way.
Exercise - Visualize data with Matplotlib
import pandas as pd
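As a starting point, a simple bar chart of each student's grade can be plotted with Matplotlib's pyplot (a minimal sketch, assuming df_students holds the Name and Grade columns from earlier):

from matplotlib import pyplot as plt

# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade)

# Display the plot
plt.show()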
Well, that worked; but the chart could use some improvements to make it clearer what we're looking at.
Note that you used the pyplot class from Matplotlib to plot the chart. This class provides a whole bunch of ways to
improve the visual elements of the plot. For example, the following code:
Adds labels to the X and Y axes (so we know which axis shows which data)
Adds a grid (to make it easier to determine the values for the bars)
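A sketch of those improvements (the color, grid style, and label rotation are illustrative choices, not prescribed by the original):

# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)

# Display the plot
plt.show()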
A plot is technically contained within a Figure. In the previous examples, the figure was created
implicitly for you; but you can create it explicitly. For example, the following code creates a figure
with a specific size.
# Create a Figure
fig = plt.figure(figsize=(10,5))
For example, the following code creates a figure with two subplots - one is a bar chart showing
student grades, and the other is a pie chart comparing the number of passing grades to non-passing
grades.
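A minimal sketch of such a figure, assuming the df_students DataFrame with its Pass column from earlier (the styling choices are illustrative):

# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize=(10,4))

# Bar chart of grades on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].tick_params(axis='x', labelrotation=90)

# Pie chart of pass/fail counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=[str(k) for k in pass_counts.index])
ax[1].set_title('Passing Grades')

# Add a title to the Figure and show it
fig.suptitle('Student Data')
plt.show()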
A lot of data science is rooted in statistics, so we'll explore some basic statistical techniques.
Note: This is not intended to teach you statistics - that's much too big a topic for this notebook. It will
however introduce you to some statistical concepts and techniques that data scientists use as they
explore data in preparation for machine learning modeling.
When examining a variable (for example a sample of student grades), data scientists are particularly
interested in its distribution (in other words, how all the different grade values are spread across the
sample). The starting point for this exploration is often to visualize the data as a histogram, and see
how frequently each value for the variable occurs.
# Get the variable to examine (the grade data)
var_data = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var_data)
# Show the figure
plt.show()
The histogram for grades is a symmetric shape, where the most frequently occurring grades tend to be in the middle of
the range (around 50), with fewer grades at the extreme ends of the scale.
To understand the distribution better, we can examine so-called measures of central tendency; which is a fancy way of
describing statistics that represent the "middle" of the data. The goal of this is to try to find a "typical" value. Common
ways to define the middle of the data include:
The mean: A simple average based on adding together all of the values in the sample set, and then dividing the
total by the number of samples.
The median: The value in the middle of the range of all of the sample values.
The mode: The most commonly occurring value in the sample set*.
Let's calculate these values, along with the minimum and maximum values for comparison, and show them on the
histogram.
*Of course, in some sample sets, there may be a tie for the most common value - in which case the dataset is described
as bimodal or even multimodal.
# Get the variable to examine (the grade data)
var = df_students['Grade']
# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]
print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val, mean_val, med_val, mod_val, max_val))
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var)
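A sketch of how the statistics could be overlaid on the histogram (the colors are illustrative choices):

# Add vertical lines for each statistic
plt.axvline(x=min_val, color='gray', linestyle='dashed', linewidth=2)
plt.axvline(x=mean_val, color='cyan', linestyle='dashed', linewidth=2)
plt.axvline(x=med_val, color='red', linestyle='dashed', linewidth=2)
plt.axvline(x=mod_val, color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(x=max_val, color='gray', linestyle='dashed', linewidth=2)

# Add titles and labels, then show the figure
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()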
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a box-
and-whiskers plot). Let's create one for the grade data.
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a box plot
plt.boxplot(var)
For learning, it can be useful to combine histograms and box plots, with the box plot's orientation
changed to align it with the histogram (in some ways, it can be helpful to think of the histogram as a
"front elevation" view of the distribution, and the box plot as a "plan" view of the distribution from
above.)
# Get statistics
min_val = var_data.min()
max_val = var_data.max()
mean_val = var_data.mean()
med_val = var_data.median()
mod_val = var_data.mode()[0]
print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val, mean_val, med_val, mod_val, max_val))
# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (10,4))
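A sketch of how the two subplots could be filled in, assuming var_data still refers to the grade data:

# Plot the histogram on the top axis
ax[0].hist(var_data)
ax[0].set_ylabel('Frequency')

# Plot the box plot on the bottom axis, oriented horizontally
ax[1].boxplot(var_data, vert=False)
ax[1].set_xlabel('Value')

# Add a title to the Figure and show it
fig.suptitle('Data Distribution')
plt.show()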
To explore this distribution in more detail, you need to understand that statistics is fundamentally
about taking samples of data and using probability functions to extrapolate information about the
full population of data.
What does this mean? Samples refer to the data we have on hand - such as information about these
22 students' study habits and grades. The population refers to all possible data we could collect -
such as every student's grades and study habits across every educational institution throughout the
history of time. Usually we're interested in the population but it's simply not practical to collect all of
that data. Instead, we need to try estimate what the population is like from the small amount of data
(samples) that we have.
If we have enough samples, we can calculate something called a probability density function, which
estimates the distribution of grades for the full population.
The Pandas DataFrame class provides a helpful plot function to show this density.
def show_density(var_data):
    from matplotlib import pyplot as plt
    fig = plt.figure(figsize=(10,4))
    # Plot density
    var_data.plot.density()
    # Show the figure
    plt.show()

# Show the density of the grade data
show_density(df_students['Grade'])
As expected from the histogram of the sample, the density shows the characteristic "bell curve" of what statisticians call
a normal distribution with the mean and mode at the center and symmetric tails.
Summary
Well done! There were a number of new concepts in here, so let's summarise.
Here we have:
1. Made graphs with Matplotlib.
2. Seen how to customize these graphs.
3. Calculated basic statistics, such as the mean and median.
4. Looked at the spread of data using box plots and histograms.
5. Learned about samples versus populations.
6. Estimated what the population of grades might look like from a sample of grades.
In our next notebook we will look at spotting unusual data, and finding relationships between data.
Further Reading
To learn more about the Python packages you explored in this notebook, see the following documentation:
NumPy
Pandas
Matplotlib
Examine real world data
Data presented in educational material is often remarkably perfect, designed to show
students how to find clear relationships between variables. ‘Real world’ data is a bit less
simple.
Because of the complexity of ‘real world’ data, raw data has to be inspected for issues
before being used.
As such, best practice is to inspect the raw data and process it before use, which reduces
errors or issues, typically by removing erroneous data points or modifying the data into a
more useful form.
It's important to realize that most real-world data are influenced by factors that weren't
recorded at the time. For example, we might have a table of race-car track times
alongside engine sizes, but various other factors that weren't written down—such as the
weather—probably also played a role. If problematic, the influence of these factors can
often be reduced by increasing the size of the dataset.
In other situations data points that are clearly outside of what is expected—also known
as ‘outliers’—can sometimes be safely removed from analyses, though care must be
taken to not remove data points that provide real insights.
Another common issue in real-world data is bias. Bias refers to a tendency to select
certain types of values more frequently than others, in a way that misrepresents the
underlying population, or ‘real world’. Bias can sometimes be identified by exploring data
while keeping in mind basic knowledge about where the data came from.
Remember, real-world data will always have issues, but this is often a surmountable
problem. Remember to:
Check for missing values and badly recorded data.
Consider removing obvious outliers, taking care not to discard data points that provide real insights.
Consider what real-world factors might affect your analysis, and whether your dataset is large enough to reduce their influence.
Check whether the data might be biased, keeping in mind where it came from.
Last time, we looked at grades for our student data, and estimated from this sample what the full
population of grades might look like. Just to refresh, let's take a look at this data again.
Run the code below to print out the data and make a histogram + boxplot that show the grades for
our sample of students.
import pandas as pd
from matplotlib import pyplot as plt
# Load the student data again (as in the previous notebook) and drop rows with missing values
df_students = pd.read_csv('grades.csv', delimiter=',', header='infer')
df_students = df_students.dropna(axis=0, how='any')

# Recreate the Pass column (passing grade is 60 or more)
df_students['Pass'] = df_students['Grade'] >= 60

# Define a function that prints summary statistics for a pandas Series
# and shows its distribution as a histogram and box plot
def show_distribution(var_data):
    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]
    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val, mean_val, med_val, mod_val, max_val))
    # Show the distribution as a histogram (top) and box plot (bottom)
    fig, ax = plt.subplots(2, 1, figsize=(10,4))
    ax[0].hist(var_data)
    ax[1].boxplot(var_data, vert=False)
    plt.show()

show_distribution(df_students['Grade'])
As you might recall, our data had the mean and mode at the center, with data spread symmetrically
from there.
Now let's take a look at the distribution of the study hours data.
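A one-line sketch, using the show_distribution helper defined above:

show_distribution(df_students['StudyHours'])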
Note that the whiskers of the box plot only begin at around 6.0, indicating that the vast majority of
the first quarter of the data is above this value. The minimum is marked with an o, indicating that it is
statistically an outlier - a value that lies significantly outside the range of the rest of the distribution.
Outliers can occur for many reasons. Maybe a student meant to record "10" hours of study time, but
entered "1" and missed the "0". Or maybe the student was abnormally lazy when it comes to
studying! Either way, it's a statistical anomaly that doesn't represent a typical student. Let's see what
the distribution looks like without it.
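A sketch that excludes the single outlying value (the cut-off of 1 hour mirrors the filter used later for df_sample):

# Get the study hours, excluding the outlier
col = df_students[df_students.StudyHours > 1]['StudyHours']
show_distribution(col)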
When we have more data available, our sample becomes more reliable. This makes it easier to
consider outliers as being values that fall below or above percentiles within which most of the data
lie. For example, the following code uses the Pandas quantile function to exclude observations below
the 0.01 quantile (the 1st percentile - the value above which 99% of the data reside).
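A minimal sketch of that filter (the variable names q01 and col are illustrative; show_distribution is the helper defined earlier):

# Calculate the 0.01 quantile of StudyHours
q01 = df_students.StudyHours.quantile(0.01)
# Keep only the observations above that value, then show the distribution
col = df_students[df_students.StudyHours > q01]['StudyHours']
show_distribution(col)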
With the outliers removed, the box plot shows all data within the four quartiles. Note that the
distribution is not symmetric like it is for the grade data though - there are some students with very
high study times of around 16 hours, but the bulk of the data is between 7 and 13 hours; the few
extremely high values pull the mean towards the higher end of the scale.
def show_density(var_data):
    fig = plt.figure(figsize=(10,4))
    # Plot density
    var_data.plot.density()
    # Show the figure
    plt.show()
This kind of distribution is called right skewed. The mass of the data is on the left side of the distribution, creating a long
tail to the right because of the values at the extreme high end, which pull the mean to the right.
Measures of variance
So now we have a good idea where the middle of the grade and study hours data distributions are. However, there's
another aspect of the distributions we should examine: how much variability is there in the data? Common measures, demonstrated in the sketch after this list, include:
Range: The difference between the maximum and minimum. There's no built-in function for this, but it's easy to
calculate using the min and max functions.
Variance: The average of the squared difference from the mean. You can use the built-in var function to find
this.
Standard Deviation: The square root of the variance. You can use the built-in std function to find this.
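A minimal sketch that calculates all three measures for the Grade and StudyHours columns:

# Calculate range, variance, and standard deviation for both numeric columns
for col_name in ['Grade', 'StudyHours']:
    col = df_students[col_name]
    rng = col.max() - col.min()
    var = col.var()
    std = col.std()
    print('\n{}:\n - Range: {:.2f}\n - Variance: {:.2f}\n - Std.Dev: {:.2f}'.format(col_name, rng, var, std))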
When working with a normal distribution, the standard deviation works with the particular
characteristics of a normal distribution to provide even greater insight. Run the cell below to see the
relationship between standard deviations and the data in the normal distribution.
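The annotation code below assumes the grade data, its density curve, and its mean (m) and standard deviation (s) have already been prepared; a minimal setup sketch, using SciPy's gaussian_kde to approximate the density, might look like this:

import scipy.stats as stats

# Get the grade data
col = df_students['Grade']

# Approximate its probability density function
density = stats.gaussian_kde(col)

# Plot the density curve
col.plot.density()

# Get the standard deviation and mean
s = col.std()
m = col.mean()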
# Annotate 1 stdev
x1 = [m-s, m+s]
y1 = density(x1)
plt.plot(x1,y1, color='magenta')
plt.annotate('1 std (68.26%)', (x1[1],y1[1]))
# Annotate 2 stdevs
x2 = [m-(s*2), m+(s*2)]
y2 = density(x2)
plt.plot(x2,y2, color='green')
plt.annotate('2 std (95.45%)', (x2[1],y2[1]))
# Annotate 3 stdevs
x3 = [m-(s*3), m+(s*3)]
y3 = density(x3)
plt.plot(x3,y3, color='orange')
plt.annotate('3 std (99.73%)', (x3[1],y3[1]))
plt.axis('off')
plt.show()
The horizontal lines show the percentage of data within 1, 2, and 3 standard deviations of the mean (plus or minus).
Approximately 68.26% of values fall within one standard deviation from the mean.
Approximately 95.45% of values fall within two standard deviations from the mean.
Approximately 99.73% of values fall within three standard deviations from the mean.
So, since we know that the mean grade is 49.18, the standard deviation is 21.74, and distribution of grades is
approximately normal; we can calculate that 68.26% of students should achieve a grade between 27.44 and 70.92.
The descriptive statistics we've used to understand the distribution of the student data variables are the basis of
statistical analysis; and because they're such an important part of exploring your data, there's a built-in describe method
of the DataFrame object that returns the main descriptive statistics for all numeric columns.
df_students.describe()
Comparing data
Now that you know something about the statistical distribution of the data in your dataset, you're
ready to examine your data to identify any apparent relationships between variables.
First of all, let's get rid of any rows that contain outliers so that we have a sample that is
representative of a typical class of students. We identified that the StudyHours column contains some
outliers with extremely low values, so we'll remove those rows.
df_sample = df_students[df_students['StudyHours']>1]
df_sample
Comparing numeric and categorical variables
The data includes two numeric variables (StudyHours and Grade) and two categorical variables
(Name and Pass). Let's start by comparing the numeric StudyHours column to the
categorical Pass column to see if there's an apparent relationship between the number of hours
studied and a passing grade.
To make this comparison, let's create box plots showing the distribution of StudyHours for each
possible Pass value (true and false).
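A minimal sketch of that comparison, using the DataFrame's boxplot method:

# Box plots of StudyHours for each Pass value
df_sample.boxplot(column='StudyHours', by='Pass', figsize=(8,5))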
Comparing the StudyHours distributions, it's immediately apparent (if not particularly surprising) that
students who passed the course tended to study for more hours than students who didn't. So if you
wanted to predict whether or not a student is likely to pass the course, the amount of time they
spend studying may be a good predictive feature.
Now let's compare two numeric variables. We'll start by creating a bar chart that shows both grade
and study hours.
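A minimal sketch of that chart:

# Bar chart of grade and study hours for each student
df_sample.plot(x='Name', y=['Grade', 'StudyHours'], kind='bar', figsize=(8,5))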
A common technique when dealing with numeric data in different scales is to normalize the data so
that the values retain their proportional distribution, but are measured on the same scale. To
accomplish this, we'll use a technique called MinMax scaling that distributes the values proportionally
on a scale of 0 to 1. You could write the code to apply this transformation; but the Scikit-
Learn library provides a scaler to do it for you.
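A minimal sketch of that normalization, assuming scikit-learn is installed (df_normalized is the name used in the correlation calculation below):

from sklearn.preprocessing import MinMaxScaler

# Get a scaler object
scaler = MinMaxScaler()

# Create a new DataFrame for the scaled values
df_normalized = df_sample[['Name', 'Grade', 'StudyHours']].copy()

# Normalize the numeric columns to the 0-1 range
df_normalized[['Grade', 'StudyHours']] = scaler.fit_transform(df_normalized[['Grade', 'StudyHours']])

# Plot the normalized values
df_normalized.plot(x='Name', y=['Grade', 'StudyHours'], kind='bar', figsize=(8,5))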
So there seems to be a correlation between study time and grade; and in fact, there's a
statistical correlation measurement we can use to quantify the relationship between these columns.
df_normalized.Grade.corr(df_normalized.StudyHours)
0.9117666413789677
The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship.
Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high
values of the other), while values below 0 indicate a negative correlation (high values of one variable
tend to coincide with low values of the other). In this case, the correlation value is close to 1; showing
a strongly positive correlation between study time and grade.
Note: Data scientists often quote the maxim "correlation is not causation". In other words, as
tempting as it might be, you shouldn't interpret the statistical correlation as explaining why one of
the values is high. In the case of the student data, the statistics demonstrate that students with high
grades tend to also have high amounts of study time; but this is not the same as proving that they
achieved high grades because they studied a lot. The statistic could equally be used as evidence to
support the nonsensical conclusion that the students studied a lot because their grades were going to
be high.
Another way to visualise the apparent correlation between two numeric columns is to use
a scatter plot.
# Create a scatter plot
df_sample.plot.scatter(title='Study Time vs Grade', x='StudyHours', y='Grade')
Again, it looks like there's a discernible pattern in which the students who studied the most hours are also the students
who got the highest grades.
We can see this more clearly by adding a regression line (or a line of best fit) to the plot that shows the general trend in
the data. To do this, we'll use a statistical technique called least squares regression.
Cast your mind back to when you were learning how to solve linear equations in school, and recall that the slope-
intercept form of a linear equation looks like this:
y=mx+b
In this equation, y and x are the coordinate variables, m is the slope of the line, and b is the y-intercept (where the line
goes through the Y-axis).
In the case of our scatter plot for our student data, we already have our values for x (StudyHours) and y (Grade), so we
just need to calculate the intercept and slope of the straight line that lies closest to those points. Then we can form a
linear equation that calculates a new y value on that line for each of our x (StudyHours) values - to avoid confusion, we'll
call this new y value f(x) (because it's the output from a linear equation function based on x). The difference between
the original y (Grade) value and the f(x) value is the error between our regression line and the actual Grade achieved by
the student. Our goal is to calculate the slope and intercept for a line with the lowest overall error.
Specifically, we define the overall error by taking the error for each point, squaring it, and adding all the squared errors
together. The line of best fit is the line that gives us the lowest value for the sum of the squared errors - hence the
name least squares regression.
Fortunately, you don't need to code the regression calculation yourself - the SciPy package includes a stats class that
provides a linregress method to do the hard work for you. This returns (among other things) the coefficients you need
for the slope equation - slope (m) and intercept (b) based on a given pair of variable samples you want to compare.
from scipy import stats

# Create a data frame containing the two variables to compare
df_regression = df_sample[['Grade', 'StudyHours']].copy()
# Get the regression slope (m) and intercept (b) for StudyHours vs Grade
m, b, r, p, se = stats.linregress(df_regression['StudyHours'], df_regression['Grade'])
# Use the function (mx + b) to calculate f(x) for each x (StudyHours) value
df_regression['fx'] = (m * df_regression['StudyHours']) + b
# Calculate the error between f(x) and the actual y (Grade) value
df_regression['error'] = df_regression['fx'] - df_regression['Grade']
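A sketch of how the regression line could be plotted over the original scatter plot:

# Plot the original data and overlay the regression line
df_regression.plot.scatter(x='StudyHours', y='Grade', title='Study Time vs Grade')
plt.plot(df_regression['StudyHours'], df_regression['fx'], color='red')
plt.show()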
The slope and intercept coefficients calculated for the regression line are shown above the plot.
The line is based on the f(x) values calculated for each StudyHours value. Run the following cell to see a table that
includes the original StudyHours and Grade values, the calculated f(x) value, and the error between f(x) and the
actual Grade value.
Some of the errors, particularly at the extreme ends, are quite large (up to over 17.5 grade points); but in general, the
line is pretty close to the actual grades.
# Show the original x,y values, the f(x) value, and the error
df_regression[['StudyHours', 'Grade', 'fx', 'error']]
Using the regression coefficients for prediction
Now that you have the regression coefficients for the study time and grade relationship, you can use
them in a function to estimate the expected grade for a given amount of study.
study_time = 14
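A minimal sketch of such a function (estimate_grade is an illustrative name; m and b are the coefficients from linregress):

# Estimate a grade from study hours using the slope (m) and intercept (b)
def estimate_grade(study_hours):
    prediction = m * study_hours + b
    # A grade can't be less than 0 or more than 100
    return max(0, min(100, prediction))

print('Studying for {} hours per week may result in a grade of {:.0f}'.format(study_time, estimate_grade(study_time)))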
So by applying statistics to sample data, you've determined a relationship between study time and grade; and
encapsulated that relationship in a general function that can be used to predict a grade for a given amount of study
time.
This technique is in fact the basic premise of machine learning. You can take a set of sample data that includes one or
more features (in this case, the number of hours studied) and a known label value (in this case, the grade achieved) and
use the sample data to derive a function that calculates predicted label values for any given set of features.
Summary
Further Reading
To learn more about the Python packages you explored in this notebook, see the following documentation:
NumPy
Pandas
Matplotlib
Summary
In this module, you learned how to use Python to explore, visualize, and manipulate data. Data exploration is at the core
of data science, and is a key element in data analysis and machine learning.
Machine learning is a subset of data science that deals with predictive modeling. In other words, machine learning uses
data to create predictive models, in order to predict unknown values. You might use machine learning to predict how
much food a supermarket needs to order, or to identify plants in photographs.
Machine learning works by identifying relationships between data values that describe characteristics of something—
its features, such as the height and color of a plant—and the value we want to predict—the label, such as the species of
plant. These relationships are built into a model through a training process.
If the exercises in this module have inspired you to try exploring data for yourself, why not take on the challenge of a
real world dataset containing flight records from the US Department of Transportation? You'll find the challenge in
the 01 - Flights Challenge.ipynb notebook!
Note
The time to complete this optional challenge is not included in the estimated time for this module - you can spend as
little or as much time on it as you like!