lecture2-supervised-learning slides
For each patient we have access to measurements from their medical record and
an estimate of diabetes risk.
We are interested in understanding how the measurements affect an individual's
diabetes risk.
Three Components of A Supervised Machine Learning
Problem
At a high level, a supervised machine learning problem has the following structure:
We will use the UCI Diabetes Dataset; it's a toy dataset that's often used to demonstrate
machine learning algorithms.
For each patient we have access to a measurement of their body mass index (BMI)
and a quantitative diabetes risk score (from 0-400).
We are interested in understanding how BMI affects an individual's diabetes risk.
In [2]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
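The cell that loads the data and forms the train/test split is not shown in these slides; a minimal sketch of what it could look like (the particular split into "initial" and "new" patients below is an assumption made for illustration):

diabetes = datasets.load_diabetes(as_frame=True)
diabetes_X, diabetes_y = diabetes.data, diabetes.target

# For this first example, keep only the BMI column and hold out 20 patients as "new" patients.
diabetes_X_train = diabetes_X.loc[:, ['bmi']].iloc[:-20]
diabetes_y_train = diabetes_y.iloc[:-20]
diabetes_X_test = diabetes_X.loc[:, ['bmi']].iloc[-20:]
diabetes_y_test = diabetes_y.iloc[-20:]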
We could assume that risk is a linear function of BMI. In other words, for some unknown
𝜃0 , 𝜃1 ∈ ℝ , we have
𝑦 = 𝜃1 ⋅ 𝑥 + 𝜃0 ,
where 𝑥 is the BMI (also called the independent variable), and 𝑦 is the diabetes risk score
(the dependent variable).
Note that 𝜃1 , 𝜃0 are the slope and the intercept of the line that relates 𝑥 to 𝑦 . We call them
parameters.
We can visualize this for a few values of 𝜃1 , 𝜃0 .
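A quick sketch of such a visualization (the parameter values below are arbitrary, chosen only for illustration):

x_bmi = np.linspace(-0.1, 0.15, 100)  # a range of (normalized) BMI values
for theta0, theta1 in [(150, 400), (150, 800), (100, 400)]:
    # Plot the line y = theta1 * x + theta0 for this choice of parameters.
    plt.plot(x_bmi, theta1 * x_bmi + theta0, label='theta0=%d, theta1=%d' % (theta0, theta1))
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.legend()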
We will see many algorithms for this task. For now, let's call the
sklearn.linear_model library to find a 𝜃1 , 𝜃0 that fit the data well.
In [6]: from sklearn import linear_model
from sklearn.metrics import mean_squared_error
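# The cell that actually fits the model is not shown in these slides; a sketch of
# the missing step, assuming diabetes_X_train / diabetes_y_train hold the training
# BMIs and risk scores as above:
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)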
# The coefficients
print('Slope (theta1): \t', regr.coef_[0])
print('Intercept (theta0): \t', regr.intercept_)
plt.scatter(diabetes_X_train, diabetes_y_train)
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.legend(['Initial patients', 'New patients'])
Let's now look at what a general supervised learning problem looks like.
Recall: Three Components of A Supervised Machine
Learning Problem
At a high level, a supervised machine learning problem has the following structure:
Previously, we only looked at the patients' BMI, but this dataset actually records many
additional measurements.
The UCI dataset contains many additional data columns besides bmi , including age, sex,
and blood pressure. We can ask sklearn to give us more information about this dataset.
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled
by the standard deviation times `n_samples` (i.e. the sum of squares of each
column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
In [12]: diabetes_X.iloc[0]
Note that these attributes in the above example have been mean-centered at zero and
rescaled so that the squared values of each column sum to one.
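We can check this directly (a quick sketch):

# Each column is (approximately) mean zero and its squared values sum to one.
print(diabetes_X.mean(axis=0))
print((diabetes_X ** 2).sum(axis=0))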
Training Dataset: Features
Often, an input object has many attributes, and we want to use these attributes to define
more complex descriptions of the input.
Is the patient old and a man? (Useful if old men are at risk).
Is the BMI above the obesity threshold?
In practice, the terms feature and attribute are often used interchangeably. We will follow
this convention and use attribute only when there is ambiguity between features and attributes.
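As a sketch, such derived features can be computed directly from the raw attribute columns; the cutoff values below are illustrative assumptions (recall that the columns are normalized), not values from the slides:

# Hypothetical derived (binary) features built from the raw attributes.
old_and_male = ((diabetes_X['age'] > 0.05) & (diabetes_X['sex'] > 0)).astype(int)
high_bmi = (diabetes_X['bmi'] > 0.0).astype(int)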
Features: Discrete vs. Continuous
Features can be either discrete or continuous. We will see later that they may be handled
differently by ML algorithms.
The BMI feature that we have seen earlier is an example of a continuous feature.
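The plotting cell is not reproduced in these slides; a quick sketch of how such a feature could be visualized:

# Histogram of the (normalized) BMI values, a continuous feature.
diabetes_X['bmi'].plot.hist(bins=30)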
Other features take on one of a finite number of discrete values. The sex column is an
example of a categorical feature.
In this example, the dataset has been pre-processed such that the two values happen to be
0.05068012 and -0.04464164 .
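The cell that produced the output below is not shown; one way to list the distinct values is a sketch like this:

# The sex column takes on only two distinct (normalized) values.
print(diabetes_X['sex'].unique())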
[ 0.05068012 -0.04464164]
Training Dataset: Targets
For each patient, we are interested in predicting a quantity of interest, the target. In our
example, this is the patient's diabetes risk.
Formally, when (𝑥(𝑖) , 𝑦(𝑖) ) form a training example, each 𝑦(𝑖) ∈ 𝒴 is a target. We call 𝒴 the
target space.
We plot the distribution of risk scores below.
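The cell producing that plot is not reproduced here; a minimal sketch, together with a hypothetical binarization of the risk score into the low-risk / high-risk labels used by the code below (the cutoff of 150 is an illustrative assumption, not a value from the slides):

# Distribution of risk scores in the training set.
plt.hist(diabetes_y_train, bins=30)
plt.xlabel('Diabetes Risk')

# Hypothetical discretization into two classes (the threshold is a placeholder).
diabetes_y_train_discr = (diabetes_y_train > 150).astype(int)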
# Visualize it
plt.scatter(diabetes_X_train[diabetes_y_train_discr==0], diabetes_y_train[diabetes_y_train_discr==0], marker='o', s=80, facecolors='none', edgecolors='g')
plt.scatter(diabetes_X_train[diabetes_y_train_discr==1], diabetes_y_train[diabetes_y_train_discr==1], marker='o', s=80, facecolors='none', edgecolors='r')
plt.legend(['Low-Risk Patients', 'High-Risk Patients'])
In [18]: # Create logistic regression object (note: this is actually a classification algorithm!)
clf = linear_model.LogisticRegression()
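# The fit/predict step is not shown in these slides; a sketch of what it could
# look like, assuming the binarized labels diabetes_y_train_discr defined earlier:
clf.fit(diabetes_X_train, diabetes_y_train_discr)
diabetes_y_train_pred = clf.predict(diabetes_X_train)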
# Visualize it
plt.scatter(diabetes_X_train[diabetes_y_train_discr==0], diabetes_y_train[diabetes_y_train_discr==0], marker='o', s=140, facecolors='none', edgecolors='g')
plt.scatter(diabetes_X_train[diabetes_y_train_discr==1], diabetes_y_train[diabetes_y_train_discr==1], marker='o', s=140, facecolors='none', edgecolors='r')
plt.scatter(diabetes_X_train[diabetes_y_train_pred==0], diabetes_y_train[diabetes_y_train_pred==0], color='g', s=20)
plt.scatter(diabetes_X_train[diabetes_y_train_pred==1], diabetes_y_train[diabetes_y_train_pred==1], color='r', s=20)
plt.legend(['Low-Risk Patients', 'High-Risk Patients', 'Low-Risk Predictions', 'High-Risk Predictions'])
Often, models have parameters 𝜃 ∈ Θ living in a set Θ . We will then write the model as
𝑓𝜃 : 𝒳 → 𝒴
to denote that it's parametrized by 𝜃 .
Model Class: Notation
Formally, the model class is a set
ℳ ⊆ {𝑓 ∣ 𝑓 : 𝒳 → 𝒴}
of possible models that map input features to targets.
When the models 𝑓𝜃 are parametrized by parameters 𝜃 ∈ Θ living in some set Θ , we can
also write
ℳ = {𝑓𝜃 ∣ 𝑓𝜃 : 𝒳 → 𝒴; 𝜃 ∈ Θ}.
Model Class: Example
One simple approach is to assume that 𝑥 and 𝑦 are related by a linear model of the form
𝑦 = 𝜃0 + 𝜃1 ⋅ 𝑥1 + 𝜃2 ⋅ 𝑥2 + … + 𝜃𝑑 ⋅ 𝑥𝑑
where 𝑥 is a featurized input and 𝑦 is the target.
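As a small numerical sketch of such a model (the values below are arbitrary, chosen only for illustration):

theta = np.array([152.0, 500.0, -80.0])   # [theta_0, theta_1, theta_2]
x = np.array([0.05, -0.04])               # a featurized input with d = 2 features
y_hat = theta[0] + theta[1:] @ x          # theta_0 + theta_1*x_1 + theta_2*x_2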
# Sketch: comparing two example target vectors with the mean squared error.
y1 = np.array([1, 2, 3, 4])
y2 = np.array([-1, 1, 3, 5])
print('Mean squared error: %.2f' % mean_squared_error(y1, y2))
Intuitively, the learning algorithm should return the function in the model class that best "fits" the data in the training dataset. Using the mean squared error as the measure of fit, this amounts to solving
\min_{\theta \in \mathbb{R}^{d+1}} \; \frac{1}{2n} \sum_{i=1}^{n} \left( f_\theta(x^{(i)}) - y^{(i)} \right)^2
We can easily measure the quality of the fit on the training set and the test set.
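For instance, using the mean_squared_error function imported earlier (a sketch, assuming a fitted model regr such as the BMI model from before):

print('Training set MSE: %.2f' % mean_squared_error(diabetes_y_train, regr.predict(diabetes_X_train)))
print('Test set MSE: %.2f' % mean_squared_error(diabetes_y_test, regr.predict(diabetes_X_test)))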
Let's run the above algorithm on our diabetes dataset.
The algorithm returns a predictive model. We can visualize its predictions below.
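The fitting cell itself is not reproduced here; a minimal sketch, assuming diabetes_X_train and diabetes_X_test now contain all ten feature columns:

# Fit ordinary least squares on all features and compute predictions.
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_train_pred = regr.predict(diabetes_X_train)
diabetes_y_test_pred = regr.predict(diabetes_X_test)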
In [55]: # visualize the results
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(diabetes_X_train.loc[:, ['bmi']], diabetes_y_train)
plt.scatter(diabetes_X_test.loc[:, ['bmi']], diabetes_y_test, color='red', marker='o')
# plt.scatter(diabetes_X_train.loc[:, ['bmi']], diabetes_y_train_pred, color='black', linewidth=1)
plt.plot(diabetes_X_test.loc[:, ['bmi']], diabetes_y_test_pred, 'x', color='red', mew=3, markersize=8)
plt.legend(['Model', 'Prediction', 'Initial patients', 'New patients'])
The predictive model captures the relationship between inputs and targets; for instance, it
can be used to predict targets for new, previously unseen inputs.